GNOME Bugzilla – Bug 564462
Tabs (in input) and multibytes characters
Last modified: 2013-05-19 12:36:12 UTC
Hello, I want to report a bug in the Doxygen feature that replaces tabs in input source code with a computed number of spaces, according to the TAB_SIZE configuration option. The bug appears when a line of input source code contains multibyte characters (for example, accented French letters inside C/C++ character strings) followed by tabs. In that case the number of spaces inserted to replace the tabs is computed incorrectly, resulting in misaligned code in the generated output. This makes me think that column positions are probably computed by counting bytes rather than characters when processing source code lines, which is not appropriate when those lines contain multibyte characters, such as any character outside the 0x20 - 0x7f ASCII code range in a UTF-8 string. This is not a very important bug, but it is a little irritating, and I think it should be quite easy to fix. Gingko
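To illustrate the difference described in the report, here is a minimal sketch of tab expansion that advances the column once per character instead of once per byte. It is not Doxygen's code: the function name expandTabs and its signature are hypothetical, the input is assumed to be valid UTF-8, and every character is assumed to occupy exactly one column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not taken from Doxygen): expands tabs in a UTF-8
// encoded line. The column advances once per character, not once per byte.
std::string expandTabs(const std::string &line, int tabSize)
{
  assert(tabSize > 0);
  std::string out;
  int col = 0;
  for (std::size_t i = 0; i < line.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(line[i]);
    if (c == '\t')
    {
      // Pad with spaces up to the next multiple of tabSize.
      int spaces = tabSize - (col % tabSize);
      out.append(static_cast<std::size_t>(spaces), ' ');
      col += spaces;
    }
    else
    {
      out.push_back(line[i]);
      // UTF-8 continuation bytes have the form 10xxxxxx; they belong to the
      // character started by an earlier byte, so they do not add a column.
      if ((c & 0xC0) != 0x80) ++col;
    }
  }
  return out;
}
```

With TAB_SIZE 4, a line consisting of "é", a tab and "x" gets three spaces here, because "é" counts as one column. A byte-based count would treat "é" as two columns and insert only two spaces, which is the kind of misalignment reported above.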
You are correct about the byte counting. For my convenience, can you attach a self-contained example (source + config file in a zip) to this bug report so that I can reproduce the problem?
Created attachment 124820: Sample file (zipped) for bug #564462. OK, this is the sample that you asked for. Contents: sample.cpp, Doxyfile. Best regards, Gingko
Reopening as the requested information has been provided.
Created attachment 239073: PATCH: count multi-byte characters in source code output correctly. The problem was indeed the byte counting. The special characters are converted to UTF-8 characters in util.cpp and are printed correctly, but because each character consists of multiple bytes, those bytes were counted separately. Corrected the output for HTML, man, RTF and XML. The output for LaTeX / PDF already looks correct.
Created attachment 239184: PATCH: extend to all defined UTF-8 characters. This patch extends the previous patch. The previous patch supported only the UTF-8 "characters" starting with a byte as set by Doxygen. With this patch all currently valid UTF-8 characters are supported (the valid UTF-8 "characters" were taken from http://en.wikipedia.org/wiki/UTF-8).
Hi Albert, can you make a patch with only the last set of changes? Note that the function nextUtf8CharPosition() in util.cpp contains a somewhat more compact way to find the next character in a UTF-8 byte stream.
Created attachment 239194: PATCH: count multi-byte characters in source code output correctly, based on comment 6 from Dimitri. Combined both changes into one patch (making the two earlier patches obsolete). Also incorporated the remark regarding nextUtf8CharPosition by creating an analogous function in util.cpp.
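For readers unfamiliar with the technique discussed in the last few comments, the following is a minimal sketch of finding the next character in a UTF-8 byte stream from the lead byte alone. It is neither the attached patch nor Doxygen's nextUtf8CharPosition(); the names utf8SequenceLength and nextCharIndex are hypothetical, and invalid lead bytes are simply treated as one-byte characters so that the scan always advances.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence whose first
// byte is `lead`. Returns 1 for ASCII and for bytes that cannot start a
// sequence, so a caller stepping through a string always makes progress.
std::size_t utf8SequenceLength(unsigned char lead)
{
  if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
  return 1;                            // stray continuation or invalid byte
}

// Hypothetical helper: index of the character that follows the one starting
// at `pos`, clamped to the end of the string for truncated sequences.
std::size_t nextCharIndex(const std::string &s, std::size_t pos)
{
  if (pos >= s.size()) return s.size();
  std::size_t next = pos + utf8SequenceLength(static_cast<unsigned char>(s[pos]));
  return next < s.size() ? next : s.size();
}
```

Stepping through a line with nextCharIndex and incrementing a column counter once per step gives a character count rather than a byte count, which is what the tab computation needs.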
Thanks, I'll include the patch in the next subversion update.
This bug was previously marked ASSIGNED, which means it should be fixed in doxygen version 1.8.4. Please verify whether this is indeed the case. Reopen the bug if you think it is not fixed, and please include any additional information that you think may be relevant.