Bug 437346 – Problem for creating RTF file for Japanese Language




Summary:	Problem for creating RTF file for Japanese Language


Status:	RESOLVED FIXED

Product:	doxygen
Classification:	Other
Component:	doxywizard
Version:	1.5.6
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Dimitri van Heesch
QA Contact:	Dimitri van Heesch

URL:
Whiteboard:

Duplicates:	166535 643068 (view as bug list)
Depends on:
Blocks:

Reported:	2007-05-10 04:21 UTC by T.M
Modified:	2011-08-14 14:04 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Doxygen configuration file and source code (4.56 KB, application/octet-stream) 2007-05-15 01:25 UTC, T.M
Japanese RTF source set. (5.75 KB, application/octet-stream) 2011-08-06 15:47 UTC, hiroa
sample RTF multibyte patch. (28.52 KB, patch) 2011-08-06 15:59 UTC, hiroa

Description T.M 2007-05-10 04:21:19 UTC

Please describe the problem:
I tried to create an RTF file with doxygen, but it fails to
create one when Japanese is specified as the output language.


Steps to reproduce:
(Using doxywizard)
1. Choose "Japanese" for OUTPUT_LANGUAGE in the Project tab.
2. Check "GENERATE_RTF" in the RTF tab.
3. Use the simple C source code file for input 
   (it doesn't include any Japanese characters)
--- source file ----
void	main( void )
{
	printf( "Hello! World" );
}
--------------------
4. After specifying "Working directory", push the "Start" button.

Actual results:
Doxygen reports the error below, and the generated RTF file appears to be malformed.
======
Error: RTF integrity test failed at line 117 of D:/doxygen/rtf/refman.rtf due to a bracket mismatch.
       Please try to create a small code example that produces this error 
       and send that to dimitri@stack.nl.
*** Doxygen has finished
=====



Expected results:
A valid RTF file is created.

Does this happen every time?
Yes.

Other information:
Looking at the malformed RTF file, I found a '}' character in the title string.
I think it may cause this problem (e.g. "83}\").
== in RTF file ===================
{\title {\comment TEST \'83\'8A\'83t\'83@\'83\'8C\'83\'93\'83X\'83}\'83j\'83\'85\'83A\'83\'8B {\s17\sa60\sb30\widctlpar\qj \fs22\cgrid 
==================================
(This line is for "Reference Manual" in English)

If I choose "Japanese-en" for OUTPUT_LANGUAGE, the RTF file is created
successfully.



My OS: Windows XP Professional SP2 (Japanese)

Comment 1 Dimitri van Heesch 2007-05-13 11:33:45 UTC

Did you set INPUT_ENCODING to the correct value? If not, the input is assumed to be encoded as UTF-8. You probably need to set it to EUC-JP, SHIFT_JIS, or EUC-JISX0213 in your case. Does this solve your problem?

Comment 2 T.M 2007-05-14 02:59:13 UTC

(In reply to comment #1)
I tried several combinations for INPUT_ENCODING and DOXYFILE_ENCODING,
but they also failed. (same error was produced)

INPUT_ENCODING  DOXYFILE_ENCODING  result
------------------------------------------------
UTF-8           UTF-8              fail
SHIFT_JIS       UTF-8              fail
EUC-JP          UTF-8              fail
SHIFT_JIS       SHIFT_JIS          fail
EUC-JP          EUC-JP             fail
UTF-8           SHIFT_JIS          fail
------------------------------------------------

Comment 3 T.M 2007-05-15 01:25:13 UTC

Created attachment 88188
Doxygen configuration file and source code

This file includes the configuration file, the source file, 
and the RTF file generated by doxygen.
It may help to reproduce this problem.

Comment 4 T.M 2008-09-09 00:27:26 UTC

(In reply to comment #0)
This problem occurs when the second byte of a multibyte character
is a special RTF character, such as '}'(0x7D), '{'(0x7B) or '\'(0x5C).
For example, the multibyte code 0x837D is converted to "\'83}" by
the current software, and the bare '}' character breaks the RTF file.
I think the output should be "\'83\}" or "\'83\'7D".
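
A minimal sketch of the difference, assuming plain std::string instead of
QByteArray. The helper names (appendHex, escapeHighBytesOnly,
escapeWithTrailByte) are hypothetical and are not part of doxygen; the
second function only illustrates the "\'83\'7D" form suggested above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: append one byte as an RTF hex escape, e.g. \'83.
static void appendHex(std::string &out, unsigned char c)
{
  char buf[8];
  std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
  out += buf;
}

// Mirrors the reported behaviour: only bytes above 0x80 are escaped,
// so a trail byte such as 0x7D ('}') is written unescaped.
std::string escapeHighBytesOnly(const std::string &in)
{
  std::string out;
  for (unsigned char c : in)
  {
    if (c > 0x80) appendHex(out, c);
    else          out += static_cast<char>(c);
  }
  return out;
}

// Also hex-escapes the single byte that follows a byte above 0x80.
// Assumption: every byte above 0x80 starts a two-byte character.
std::string escapeWithTrailByte(const std::string &in)
{
  std::string out;
  bool trail = false; // true when the current byte is a trail byte
  for (unsigned char c : in)
  {
    if (trail)
    {
      appendHex(out, c);
      trail = false;
    }
    else if (c > 0x80)
    {
      appendHex(out, c);
      trail = true;
    }
    else
    {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

With the two bytes 0x83 0x7D as input, the first function leaves a bare '}'
in the output, while the second emits both bytes as hex escapes.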

If I change one of the functions in the source file 'rtfgen.cpp'
to write the second byte of a multibyte character in hex format,
it seems to work well. I confirmed this only for Japanese,
so I'm not sure whether this modification causes problems
for other languages.

=== file: rtfgen.cpp ==========================
void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  uint mb_flag = 0;                           // <-  Add
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mb_flag==1)                 // <- Add (mb_flag==1)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mb_flag = 1 - mb_flag;                  // <- Add
    }
    else
    {
        enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}
=========================================================

I hope this information is helpful to resolve the problem.

Comment 5 Dimitri van Heesch 2008-10-12 11:10:44 UTC

Thanks for the feedback, I plan to change the postProcess function like this:

void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  bool mbFlag=FALSE;
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mbFlag)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mbFlag=c>0x80;
    }
    else
    {
      enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}

Do you see any issues with this? The idea is to escape one character below 0x80 after a sequence of one or more characters above 0x80.

Comment 6 Dimitri van Heesch 2008-12-27 14:12:42 UTC

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.5.8. Please verify if this is indeed the case and reopen the
bug if you think it is not fixed (include any additional information that you
think can be relevant).

Comment 7 hiroa 2011-08-06 15:47:28 UTC

Created attachment 193349
Japanese RTF source set.

OS is Windows.
INPUT_ENCODING is UTF-8.
OUTPUT_LANGUAGE is Japanese.

Incorrect output (NG = wrong, OK = expected):
NG:構'90ｬ索引, OK:構成索引  defined in translator_jp.h
NG:機能'82P,   OK:機能１    defined in enum.h line 10.
NG??: \\,      OK??:\\\\    defined in enum.h line 9.

Comment 8 hiroa 2011-08-06 15:59:41 UTC

Created attachment 193350
sample RTF multibyte patch.

I want to use the Japanese RTF output (cp932).

The multi-byte encoding in the RTF generator has been broken from version 1.5.8 through 1.7.4.

In version 1.5.8, if the second byte is greater than 0x80, the third byte is also encoded unintentionally.
Since version 1.6.3, a second byte of 0x5C is not escaped, so a bare '\' appears in the output. As a result, the text is rendered incorrectly.

I have made a sample patch.

Code Pages Supported by Windows
http://msdn.microsoft.com/ja-jp/goglobal/bb964654.aspx
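
The version 1.5.8 behaviour described above can be illustrated with a small
sketch. It assumes the rule "keep escaping as long as the previous byte was
above 0x80" and uses std::string; the function name escapeAfterHighRun is
hypothetical and this is not the actual doxygen source.

```cpp
#include <cstdio>
#include <string>

// Sketch of the rule: escape a byte if it is above 0x80, or if the
// previous byte was above 0x80.
std::string escapeAfterHighRun(const std::string &in)
{
  std::string out;
  bool prevHigh = false; // was the previous byte above 0x80?
  for (unsigned char c : in)
  {
    if (c > 0x80 || prevHigh)
    {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
      out += buf;
      prevHigh = c > 0x80;
    }
    else
    {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

For 0x83 0x7D followed by a closing '}', the closing brace is kept. For
0x90 0xAC followed by '}', the second byte is above 0x80, so the closing
brace (the third byte) is hex-escaped as well and the RTF group is no longer
closed.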

Comment 9 Dimitri van Heesch 2011-08-06 16:26:59 UTC

Hi Hiroa,

Thanks for your patch. I plan to introduce a more generic solution for RTF encoding, using the following change to encodeForOutput:

// note: function is not reentrant!
static void encodeForOutput(FTextStream &t,const QCString &s)
{
  QCString encoding;
  bool converted=FALSE;
  int l = s.length();
  static QByteArray enc;
  if (l*4>(int)enc.size()) enc.resize(l*4); // worst case
  encoding.sprintf("CP%s",theTranslator->trRTFansicp().data());
  if (!encoding.isEmpty())
  {
    // convert from UTF-8 back to the output encoding
    void *cd = portable_iconv_open(encoding,"UTF-8");
    if (cd!=(void *)(-1))
    {
      size_t iLeft=l;
      size_t oLeft=enc.size();
      const char *inputPtr = s.data();
      char *outputPtr = enc.data();
      if (!portable_iconv(cd, &inputPtr, &iLeft, &outputPtr, &oLeft))
      {
        enc.resize(enc.size()-oLeft);
        converted=TRUE;
      }
      portable_iconv_close(cd);
    }
  }
  if (!converted) // if we did not convert anything, copy as is.
  {
    memcpy(enc.data(),s.data(),l);
    enc.resize(l);
  }
  uint i;
  for (i=0;i<enc.size();i++)
  {
    uchar c = (uchar)enc.at(i);
    if (c>=0x80)
    {
      char esc[10];
      sprintf(esc,"\\'%X",c);
      t << esc;

      // write 2nd byte
      i++;
      if (i<enc.size())
      {
        uchar c2 = (uchar)enc.at(i);
        sprintf(esc,"\\'%X",c2);
        t << esc;
      }

      if (((uchar)c&0xE0)==0xE0)
      {
        // write 3rd byte
        i++;
        if (i<enc.size())
        {
          uchar c3 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c3);
          t << esc;
        }
      }
      if (((uchar)c&0xF0)==0xF0)
      {
        // write 4th byte
        i++;
        if (i<enc.size())
        {
          uchar c4 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c4);
          t << esc;
        }
      }
    }
    else
    {
      t << (char)c;
    }
  }
}

Can you check if this also works for you?

Comment 10 hiroa 2011-08-07 03:09:39 UTC

Hi Dimitri,

I read the source code, but I could not tell which code page you assumed.
It handles Japanese incorrectly (and perhaps Chinese and Korean as well).
If there is nothing wrong with the original patch, I think it is better not to change it.

The loop for (i=0;i<enc.size();i++) needs to process the bytes according to the output code page of the RTF file.

Most character sets standardized before Unicode are either a Single Byte Character Set (SBCS: single-byte characters only) or a Double Byte Character Set (DBCS: single-byte and double-byte characters mixed).

cp932 (DBCS: Japanese Shift-JIS)
cp936 (DBCS: Simplified Chinese GBK)
cp949 (DBCS: Korean)
cp950 (DBCS: Traditional Chinese Big5)
cp1252,1251 etc...(SBCS: LatinI, Cyrillic...)

These are similar but not identical, so there is probably no generic solution.

---
Concretely, 0xB1 0x5C is one character in GBK, but two characters in Shift-JIS.
cp936 :盶(U+76F6)
cp932 :ｱ(U+FF71) \(U+005C)   Sorry, 0x5C is displayed as '\' here for Shift-JIS; it is a backslash in ASCII.

The "\'B1\'5C" encoding is correct for GBK,
but in Shift-JIS the 0x5C is a separate second character and must not be hex-encoded.
---
In addition, 0x91 0x5C is one character in both GBK and Shift-JIS, but two characters in Latin-I (cp1252).

cp936 :慭(U+616D)
cp932 :曾(U+66FE)
cp1252:‘(U+2018) \(U+005C)
---

That is why the code page detection could be made faster, but please do not substantially change the part of the original patch that switches the processing for each language.
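
A minimal sketch of code-page-dependent escaping, to make the examples above
concrete. The names isLeadByte and escapeForCodePage are hypothetical, and
the lead-byte ranges are the commonly documented ones for these Windows code
pages, stated here as assumptions; this is not the attached patch.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Assumed lead-byte ranges per code page. Code pages not listed here
// are treated as single-byte (SBCS).
static bool isLeadByte(int codePage, unsigned char c)
{
  switch (codePage)
  {
    case 932: // Japanese Shift-JIS
      return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    case 936: // Simplified Chinese GBK
    case 949: // Korean
    case 950: // Traditional Chinese Big5
      return c >= 0x81 && c <= 0xFE;
    default:
      return false;
  }
}

// Hex-escape bytes >= 0x80; after a lead byte, hex-escape the trail byte
// too, whatever its value. Other bytes are copied unchanged.
std::string escapeForCodePage(int codePage, const std::string &in)
{
  std::string out;
  char buf[8];
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80)
    {
      out += in[i];
      continue;
    }
    std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
    out += buf;
    if (isLeadByte(codePage, c) && i + 1 < in.size())
    {
      unsigned char c2 = static_cast<unsigned char>(in[++i]);
      std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c2));
      out += buf;
    }
  }
  return out;
}
```

Under these assumptions, 0xB1 0x5C becomes "\'B1\'5C" for cp936 but keeps
the 0x5C as a separate byte for cp932, and 0x91 0x5C becomes "\'91\'5C" for
cp932 and cp936 but not for cp1252.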

Comment 11 Dimitri van Heesch 2011-08-07 07:34:38 UTC

Hi Hiroa,

Thanks for your explanation. I see now why my proposed patch is wrong. I will use your patch instead. Thanks a lot for your help.

Comment 12 Dimitri van Heesch 2011-08-07 07:55:48 UTC

*** Bug 643068 has been marked as a duplicate of this bug. ***

Comment 13 Dimitri van Heesch 2011-08-07 07:58:37 UTC

*** Bug 166535 has been marked as a duplicate of this bug. ***

Comment 14 Dimitri van Heesch 2011-08-14 14:04:44 UTC

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.7.5. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information
that you think can be relevant.