Bug 564462 – Tabs (in input) and multibytes characters




Summary:	Tabs (in input) and multibytes characters


Status:	RESOLVED FIXED

Product:	doxygen
Classification:	Other
Component:	general
Version:	1.5.7.1
Hardware:	Other Windows

Importance:	Normal minor
Target Milestone:	---
Assigned To:	Dimitri van Heesch
QA Contact:	Dimitri van Heesch


Reported:	2008-12-14 10:40 UTC by Gingko
Modified:	2013-05-19 12:36 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Sample file (zipped) for bug #564462 (3.42 KB, application/octet-stream) 2008-12-16 19:40 UTC, Gingko
PATCH: count multi-byte characters in source code output correctly (5.53 KB, patch) 2013-03-17 20:15 UTC, albert
PATCH: extend to all defined UTF-8 characters (9.51 KB, patch) 2013-03-18 19:21 UTC, albert
PATCH: count multi-byte characters in source code output correctly, based on comment 6 from Dimitri (5.45 KB, patch) 2013-03-18 20:39 UTC, albert

Description Gingko 2008-12-14 10:40:22 UTC

Hello,

I want to report a bug in the Doxygen feature that replaces tabs in input source code with a computed number of spaces, according to the TAB_SIZE configuration option.

The bug appears if your input source code contains multibyte characters (for example, accented French letters inside C/C++ character strings) followed by tabs on the same line.

Because of this bug, the number of spaces inserted to replace these tabs is not computed correctly, resulting in misaligned code in the generated output.

This makes me think that column positions are probably computed by counting bytes rather than characters when processing source code lines, which is not appropriate if these lines include multibyte characters, that is, any character outside the 7-bit ASCII range (above 0x7F) in a UTF-8 string.
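To illustrate the difference between byte counting and character counting, here is a minimal sketch of tab expansion that advances the column once per UTF-8 character. It is not Doxygen's code: the names expandTabs and isUtf8Continuation are made up for this example, and it assumes well-formed UTF-8 input where every character occupies one display column (no combining or double-width characters).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Hypothetical helper, not taken from Doxygen.
static bool isUtf8Continuation(unsigned char c)
{
  return (c & 0xC0) == 0x80;
}

// Expands tabs in one line of UTF-8 text. The column advances once per
// character (once per non-continuation byte), not once per byte.
std::string expandTabs(const std::string &line, int tabSize)
{
  assert(tabSize > 0);
  std::string out;
  int col = 0;
  for (unsigned char c : line)
  {
    if (c == '\t')
    {
      // Pad with spaces up to the next tab stop.
      int spaces = tabSize - (col % tabSize);
      out.append(static_cast<std::size_t>(spaces), ' ');
      col += spaces;
    }
    else
    {
      out.push_back(static_cast<char>(c));
      if (!isUtf8Continuation(c)) col++;
    }
  }
  return out;
}
```

With a tab size of 4, a line starting with "é" (two bytes in UTF-8) followed by a tab gets three spaces of padding here, the same as a line starting with "a". Incrementing the column for every byte would give only two spaces, which is the kind of misalignment described above.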

This is not a very important bug, but it is still a little irritating, and I think it should be quite easy to fix.

Gingko

Comment 1 Dimitri van Heesch 2008-12-16 18:59:01 UTC

You are correct about the byte counting. For my convenience: can you attach a self-contained example (source + config file in a zip) to this bug report so that I can reproduce the problem?

Comment 2 Gingko 2008-12-16 19:40:43 UTC

Created attachment 124820
Sample file (zipped) for bug #564462

Ok. This is the sample that you asked for.

Content :
  sample.cpp
  Doxyfile

Best regards,

Gingko

Comment 3 Tobias Mueller 2009-05-30 15:04:27 UTC

Reopening as the requested information has been provided.

Comment 4 albert 2013-03-17 20:15:36 UTC

Created attachment 239073
PATCH: count multi-byte characters in source code output correctly

The problem was indeed the byte counting: the special characters are converted to UTF-8 in util.cpp. These characters are printed correctly, but because each character consists of multiple bytes, the bytes were counted separately. Corrected the output for HTML, man, RTF and XML. The output for LaTeX / PDF already looks correct.

Comment 5 albert 2013-03-18 19:21:25 UTC

Created attachment 239184
PATCH: extend to all defined UTF-8 characters

This patch extends the previous patch. The previous patch supported only the UTF-8 "characters" starting with a byte as set by Doxygen. With this patch, all currently valid UTF-8 characters are supported (valid UTF-8 "characters" taken from http://en.wikipedia.org/wiki/UTF-8).

Comment 6 Dimitri van Heesch 2013-03-18 19:31:05 UTC

Hi Albert,

Can you make a patch with only the last set of changes?

Note that the function nextUtf8CharPosition() in util.cpp contains a somewhat more compact way to find the next character in a UTF-8 byte stream.
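For readers unfamiliar with the technique, one common way to step through a UTF-8 byte stream is to read the sequence length from the lead byte. The sketch below is an assumption about the general approach, not a copy of nextUtf8CharPosition() from util.cpp; the names utf8SequenceLength, nextCharPos and countChars are hypothetical, and no validation of the byte sequences is attempted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte c, read from
// its high bits. Bytes that match no lead pattern (continuation bytes, 0xF8
// and above) return 1 so that a scanner always makes progress.
static std::size_t utf8SequenceLength(unsigned char c)
{
  if (c < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;
}

// Byte index of the character following the one that starts at pos,
// clamped to the end of the string for truncated sequences.
std::size_t nextCharPos(const std::string &s, std::size_t pos)
{
  assert(pos < s.size());
  std::size_t next = pos + utf8SequenceLength(static_cast<unsigned char>(s[pos]));
  return next < s.size() ? next : s.size();
}

// Counts characters rather than bytes by hopping from lead byte to lead byte.
std::size_t countChars(const std::string &s)
{
  std::size_t n = 0;
  for (std::size_t pos = 0; pos < s.size(); pos = nextCharPos(s, pos)) n++;
  return n;
}
```

A column counter built on such a helper advances by one for each hop, regardless of how many bytes the character occupies.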

Comment 7 albert 2013-03-18 20:39:57 UTC

Created attachment 239194
PATCH: count multi-byte characters in source code output correctly, based on comment 6 from Dimitri

Made one patch of both changes (making both earlier patches obsolete). Also incorporated the remark regarding nextUtf8CharPosition and created an analogous function in util.cpp for this.

Comment 8 Dimitri van Heesch 2013-03-20 19:09:01 UTC

Thanks, I'll include the patch in the next subversion update.

Comment 9 Dimitri van Heesch 2013-05-19 12:36:12 UTC

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.4. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information
that you think can be relevant.