GNOME Bugzilla – Bug 437346
Problem for creating RTF file for Japanese Language
Last modified: 2011-08-14 14:04:44 UTC
Please describe the problem:
I tried to create an RTF file with doxygen, but it fails when the output language is set to Japanese.
Steps to reproduce (using doxywizard):
1. Choose "Japanese" for OUTPUT_LANGUAGE in the Project tab.
2. Check "GENERATE_RTF" in the RTF tab.
3. Use a simple C source file as input (it does not include any Japanese characters):
--- source file ----
void main( void )
{
  printf( "Hello! World" );
}
--------------------
4. After specifying "Working directory", push the "Start" button.
Actual results:
Doxygen produces the error below and the generated RTF file appears to be incorrect.
======
Error: RTF integrity test failed at line 117 of D:/doxygen/rtf/refman.rtf due to a bracket mismatch.
Please try to create a small code example that produces this error and send that to dimitri@stack.nl.
*** Doxygen has finished
=====
Expected results:
A well-formed RTF file is created.
Does this happen every time?
Yes.
Other information:
When I look at the broken RTF file, I find a '}' character in the title string (as in "83}\"). I think this may cause the problem.
== in RTF file ===================
{\title {\comment TEST \'83\'8A\'83t\'83@\'83\'8C\'83\'93\'83X\'83}\'83j\'83\'85\'83A\'83\'8B {\s17\sa60\sb30\widctlpar\qj \fs22\cgrid
==================================
(This line corresponds to "Reference Manual" in English.)
If I choose "Japanese-en" for OUTPUT_LANGUAGE, the RTF file is created successfully.
My OS: Windows XP Professional SP2 (Japanese)
Did you set INPUT_ENCODING to the correct value? If not, the input is assumed to be encoded as UTF-8. You probably need to set it to EUC-JP, SHIFT_JIS, or EUC-JISX0213 in your case. Does this solve your problem?
(In reply to comment #1)
I tried several combinations of INPUT_ENCODING and DOXYFILE_ENCODING, but they all failed with the same error.
INPUT_ENCODING  DOXYFILE_ENCODING  result
------------------------------------------------
UTF-8           UTF-8              fail
SHIFT_JIS       UTF-8              fail
EUC-JP          UTF-8              fail
SHIFT_JIS       SHIFT_JIS          fail
EUC-JP          EUC-JP             fail
UTF-8           SHIFT_JIS          fail
------------------------------------------------
Created attachment 88188 [details] Doxygen configuration file and source code This file includes the configuration file, the source file, and the RTF file generated by doxygen. It may help to reproduce this problem.
(In reply to comment #0)
This problem occurs when the second byte of a multibyte character is a character that is special in RTF, such as '}' (0x7D), '{' (0x7B) or '\' (0x5C).
For example, the multibyte code 0x837D is converted to "\'83}" by the current software, and the bare '}' breaks the RTF file. I think the output should be "\'83\}" or "\'83\'7D".
If I change one function in the source file 'rtfgen.cpp' so that the second byte of a multibyte character is also written in hex format, it seems to work well. I confirmed this only for Japanese, so I am not sure whether this modification causes problems for other languages.
=== file: rtfgen.cpp ==========================
void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  uint mb_flag = 0; // <- Add
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mb_flag==1) // <- Add (mb_flag==1)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mb_flag = 1 - mb_flag; // <- Add
    }
    else
    {
      enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}
=========================================================
I hope this information helps to resolve the problem.
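For illustration, the difference between the two behaviours described above can be sketched as a standalone program fragment. This is not doxygen code: it uses std::string instead of QByteArray, the function names escapeHighBytesOnly and escapeBytePairs are hypothetical, and the first function is only an assumed stand-in for the unpatched behaviour (escape bytes above 0x80, copy everything else).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Append one byte as an RTF hex escape such as \'83.
// Uses %02X (always two digits) where the snippet above uses %X.
static void appendHexEscape(std::string &out, unsigned char c)
{
  char buf[8];
  std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
  out += buf;
}

// Assumed stand-in for the unpatched behaviour: only bytes above 0x80 are
// escaped, so a second byte such as 0x7D is copied through as a bare '}'.
std::string escapeHighBytesOnly(const std::string &in)
{
  std::string out;
  for (unsigned char c : in)
  {
    if (c > 0x80) appendHexEscape(out, c);
    else          out += static_cast<char>(c);
  }
  return out;
}

// Same idea as the mb_flag change: the byte that follows an escaped high byte
// is written in hex as well. This assumes that every byte above 0x80 starts a
// two-byte character, which is not true for every code page.
std::string escapeBytePairs(const std::string &in)
{
  std::string out;
  bool secondBytePending = false;
  for (unsigned char c : in)
  {
    if (c > 0x80 || secondBytePending)
    {
      appendHexEscape(out, c);
      secondBytePending = !secondBytePending;
    }
    else
    {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

With the two bytes 0x83 0x7D as input, escapeHighBytesOnly returns \'83} (the stray closing brace from the report), while escapeBytePairs returns \'83\'7D.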
Thanks for the feedback, I plan to change the postProcess function like this:
void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  bool mbFlag=FALSE;
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mbFlag)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mbFlag=c>0x80;
    }
    else
    {
      enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}
Do you see issues with this? The idea is to escape one character <0x80 after a sequence of one or more >0x80 characters.
This bug was previously marked ASSIGNED, which means it should be fixed in doxygen version 1.5.8. Please verify if this is indeed the case and reopen the bug if you think it is not fixed (include any additional information that you think can be relevant).
Created attachment 193349 [details]
Japanese RTF source set.
OS is Windows. INPUT_ENCODING is UTF-8. OUTPUT_LANGUAGE is Japanese.
Incorrect output:
NG: 構'90ャ索引, OK: 構成索引 (defined in translator_jp.h)
NG: 機能'82P, OK: 機能1 (defined in enum.h, line 10)
NG??: \\, OK??: \\\\ (defined in enum.h, line 9)
Created attachment 193350 [details] [review]
sample RTF multibyte patch.
I want to use the Japanese RTF output (cp932). The multibyte encoding in the RTF generator has been broken from version 1.5.8 through 1.7.4.
In version 1.5.8, if the second byte is greater than 0x80, the third byte is escaped unintentionally.
Since version 1.6.3, a second byte of 0x5C is not escaped, so a bare '\' appears in the output and the text is rendered incorrectly.
I made a sample patch.
Code Pages Supported by Windows
http://msdn.microsoft.com/ja-jp/goglobal/bb964654.aspx
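The first failure mode can be reproduced outside doxygen with a standalone version of the mbFlag logic proposed earlier in this thread. This is a sketch under assumptions: it uses std::string instead of QByteArray, the name escapeWithStickyFlag is hypothetical, and the thread implies but does not show that this logic is what shipped in 1.5.8.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Append one byte as an RTF hex escape such as \'8A (two hex digits).
static void putHexEscape(std::string &out, unsigned char c)
{
  char buf[8];
  std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
  out += buf;
}

// Standalone copy of the mbFlag logic: after any byte above 0x80, the next
// byte is escaped too, and the flag stays set while bytes remain above 0x80.
std::string escapeWithStickyFlag(const std::string &in)
{
  std::string out;
  bool mbFlag = false;
  for (unsigned char c : in)
  {
    if (c > 0x80 || mbFlag)
    {
      putHexEscape(out, c);
      mbFlag = c > 0x80; // still set after a second byte above 0x80
    }
    else
    {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

For the cp932 bytes 0x83 0x8A followed by the RTF control word \par, the result is \'83\'8A\'5Cpar: the second byte leaves the flag set, so the backslash that starts the control word is escaped as a third byte and the control word is lost.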
Hi Hiroa,
Thanks for your patch. I plan to introduce a more generic solution for RTF encoding, using the following change to encodeForOutput:
// note: function is not reentrant!
static void encodeForOutput(FTextStream &t,const QCString &s)
{
  QCString encoding;
  bool converted=FALSE;
  int l = s.length();
  static QByteArray enc;
  if (l*4>(int)enc.size()) enc.resize(l*4); // worst case
  encoding.sprintf("CP%s",theTranslator->trRTFansicp().data());
  if (!encoding.isEmpty())
  {
    // convert from UTF-8 back to the output encoding
    void *cd = portable_iconv_open(encoding,"UTF-8");
    if (cd!=(void *)(-1))
    {
      size_t iLeft=l;
      size_t oLeft=enc.size();
      const char *inputPtr = s.data();
      char *outputPtr = enc.data();
      if (!portable_iconv(cd, &inputPtr, &iLeft, &outputPtr, &oLeft))
      {
        enc.resize(enc.size()-oLeft);
        converted=TRUE;
      }
      portable_iconv_close(cd);
    }
  }
  if (!converted) // if we did not convert anything, copy as is.
  {
    memcpy(enc.data(),s.data(),l);
    enc.resize(l);
  }
  uint i;
  for (i=0;i<enc.size();i++)
  {
    uchar c = (uchar)enc.at(i);
    if (c>=0x80)
    {
      char esc[10];
      sprintf(esc,"\\'%X",c);
      t << esc;
      // write 2nd byte
      i++;
      if (i<enc.size())
      {
        uchar c2 = (uchar)enc.at(i);
        sprintf(esc,"\\'%X",c2);
        t << esc;
      }
      if (((uchar)c&0xE0)==0xE0)
      {
        // write 3rd byte
        i++;
        if (i<enc.size())
        {
          uchar c3 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c3);
          t << esc;
        }
      }
      if (((uchar)c&0xF0)==0xF0)
      {
        // write 4th byte
        i++;
        if (i<enc.size())
        {
          uchar c4 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c4);
          t << esc;
        }
      }
    }
    else
    {
      t << (char)c;
    }
  }
}
Can you check if this also works for you?
Hi Dimitri,
I read the source code, but I did not understand which code page you assumed. It handles Japanese (and perhaps Chinese and Korean) incorrectly. I think it is better not to deviate from the original patch unless it contains a mistake.
The loop for (i=0;i<enc.size();i++) has to be processed according to the output code page of the RTF file.
Most character sets that predate the Unicode standard are either a Single Byte Character Set (single-byte characters only) or a Double Byte Character Set (a mix of single-byte and double-byte characters):
cp932 (DBCS: Japanese Shift-JIS)
cp936 (DBCS: Simplified Chinese GBK)
cp949 (DBCS: Korean)
cp950 (DBCS: Traditional Chinese Big5)
cp1252, cp1251, etc. (SBCS: Latin I, Cyrillic, ...)
Because these are similar but not identical, there is probably no generic solution.
---
Concretely, 0xB1 0x5C is a single character in GBK but two characters in Shift-JIS.
cp936: 盶(U+76F6)
cp932: ア(U+FF71) \(U+005C)
(Sorry, 0x5C is displayed as '\' in Shift-JIS; it is the backslash in ASCII.)
The encoding "\'B1\'5C" is correct for GBK, but in Shift-JIS the 0x5C is a separate second character and must not be encoded.
---
In addition, 0x91 0x5C is a single character in both GBK and Shift-JIS but two characters in Latin I.
cp936: 慭(U+616D)
cp932: 曾(U+66FE)
cp1252: ‘(U+2018) \(U+005C)
---
That is why the code page check could be made faster, but please do not greatly change the part of the original patch that switches the processing per language.
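The code page dependence described above can be sketched as follows. This is not the attached patch, which is not shown in this thread. The names isDbcsLeadByte and escapeForCodePage are hypothetical, and the lead-byte ranges are the commonly documented Windows ones, treated here as assumptions to be checked against the code page tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Returns true if c starts a two-byte character in the given code page.
// Ranges are assumptions taken from the usual Windows code page tables.
bool isDbcsLeadByte(int codePage, unsigned char c)
{
  switch (codePage)
  {
    case 932: // Japanese Shift-JIS; 0xA1-0xDF are single-byte katakana
      return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    case 936: // Simplified Chinese GBK
    case 949: // Korean
    case 950: // Traditional Chinese Big5
      return c >= 0x81 && c <= 0xFE;
    default:  // SBCS code pages such as 1251 and 1252 have no lead bytes
      return false;
  }
}

// Append one byte as an RTF hex escape such as \'B1 (two hex digits).
static void emitHexEscape(std::string &out, unsigned char c)
{
  char buf[8];
  std::snprintf(buf, sizeof(buf), "\\'%02X", static_cast<unsigned>(c));
  out += buf;
}

// Escapes bytes >= 0x80, and also escapes the following byte only when the
// code page says the first byte is a lead byte. Other bytes pass through.
std::string escapeForCodePage(const std::string &in, int codePage)
{
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80)
    {
      out += in[i];
      continue;
    }
    emitHexEscape(out, c);
    if (isDbcsLeadByte(codePage, c) && i + 1 < in.size())
    {
      emitHexEscape(out, static_cast<unsigned char>(in[++i]));
    }
  }
  return out;
}
```

With the bytes 0xB1 0x5C as input, code page 936 gives \'B1\'5C, while code page 932 gives \'B1 followed by an untouched backslash, which matches the two cases in the comment above. With 0x91 0x5C, code page 932 escapes both bytes and code page 1252 escapes only the first.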
Hi Hiroa, Thanks for your explanation. I see now why my proposed patch is wrong. I will use your patch instead. Thanks a lot for your help.
*** Bug 643068 has been marked as a duplicate of this bug. ***
*** Bug 166535 has been marked as a duplicate of this bug. ***
This bug was previously marked ASSIGNED, which means it should be fixed in doxygen version 1.7.5. Please verify if this is indeed the case. Reopen the bug if you think it is not fixed and please include any additional information that you think can be relevant.