Bug 705910 – Indexing and searching cannot treat non ASCII identifiers


Summary:	Indexing and searching cannot treat non ASCII identifiers


Status:	RESOLVED OBSOLETE

Product:	doxygen
Classification:	Other
Component:	general
Version:	1.8.6-GIT
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Dimitri van Heesch
QA Contact:	Dimitri van Heesch

URL:
Whiteboard:	[moved_to_github]

Depends on:
Blocks:

Reported:	2013-08-13 13:24 UTC by Suzumizaki-Kimitaka
Modified:	2018-07-30 10:41 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
fix indexing and built-in searching for non ASCII identifiers (33.69 KB, application/octet-stream) 2013-08-13 13:24 UTC, Suzumizaki-Kimitaka
The updated and fixed patch (42.39 KB, patch) 2013-09-10 10:49 UTC, Suzumizaki-Kimitaka
The html documents pair, to show official fix (on the git) cannot solve the problem. (1.22 MB, application/x-zip-compressed) 2013-10-22 02:26 UTC, Suzumizaki-Kimitaka
The new patch against current origin/HEAD (3.09 KB, text/plain) 2013-10-26 15:07 UTC, Suzumizaki-Kimitaka
Update sample project (1.40 MB, application/x-zip-compressed) 2013-10-29 01:42 UTC, Suzumizaki-Kimitaka
Updated patch for 1.8.6 release (9.74 KB, patch) 2013-12-27 05:40 UTC, Suzumizaki-Kimitaka

Description Suzumizaki-Kimitaka 2013-08-13 13:24:02 UTC

Created attachment 251489 [details]
fix indexing and built-in searching for non ASCII identifiers

I have already made a patch; please apply it.
Details and notes on regression testing are below.

Details:
a1) Because the indices are grouped by the first byte (octet) of the UTF-8 encoding, entries whose names start with non-ASCII characters are grouped incorrectly. For example, U+0080 - U+00BF go to group 0xC2 and U+0800 - U+0FFF go to group 0xE0. Of course, they should be grouped by character, just as ASCII characters are: 'A' under 'A', 'B' under 'B'.

a2) The index group headers "- A -", "- B -", "- C -", ... are shown correctly for ASCII ONLY. For non-ASCII characters, every header is shown as "- <?> -", because a byte in the range 0xC0-0xFF that is NOT followed by 0x80-0xBF is an invalid UTF-8 sequence.

b1) The built-in JavaScript search doesn't work with non-ASCII entries. The entries in the database are escaped as UTF-8, but the words entered in the search box are escaped as (broken) UTF-16.

JavaScript/ECMAScript can now handle Unicode directly, so no escaping is needed except when naming the files in the "search" folder. Those file names end in a hexadecimal suffix that represents the common first character of their entries. My patch derives that suffix from the Unicode code point instead of the UTF-8 lead byte.

b2) The built-in PHP search deletes non-ASCII characters from the search box every time.
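
Editorial illustration of points a1) and a2): a minimal sketch, not the code from the attached patch, of how an index could take its group from the first code point rather than the first UTF-8 byte. The names decodeFirst and groupLabel are hypothetical and do not come from doxygen. The decoder checks only the lead byte and the continuation bytes; it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the first code point of a UTF-8 string.
// On success returns true, stores the code point in cp and the number of
// bytes it occupies in len. On failure returns false; cp and len are then
// unspecified. Hypothetical helper, not a doxygen function.
bool decodeFirst(const std::string &s, std::uint32_t &cp, std::size_t &len)
{
    if (s.empty()) return false;
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else return false;               // stray continuation byte or invalid lead
    if (s.size() < len) return false; // sequence is cut off
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return false; // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    return true;
}

// Text for a group header: the complete first character, never a lone lead
// byte, so a header such as "- <?> -" is avoided for valid input.
std::string groupLabel(const std::string &name)
{
    std::uint32_t cp = 0;
    std::size_t len = 0;
    return decodeFirst(name, cp, len) ? name.substr(0, len) : std::string("?");
}
```

With this sketch, a name starting with U+3042 (bytes E3 81 82) gets the code point 0x3042 as its group key and the full three-byte character as its header, instead of being filed under the byte 0xE3.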

To fix these problems:
1) Some new functions are added to utils.h/cpp.
2) For indexing, index.cpp is fixed.
3) For searching, search_js.h, search.js, search_functions.php, search_functions_php.h and searchindex.cpp are fixed.

Note for regression test on Microsoft Windows:
I am sorry to say that the patch I posted earlier (Bug 705219) did not pass the regression tests.

I could not run the tests at that time because I could not easily run xmllint.
The binary distribution of xmllint does not work; it requires an old (and apparently 'correct') version of iconv.dll, so I had to build it from the libxml2 source code. Even now I can build only part of libxml2, but that is enough to get xmllint.exe.

As of today, Git SHA-1: 83fc120e5575446b1161e9ffb8168d55c423f7ac fails test 12. I believe my patch here does not cause any other tests to fail.

Regards,
Suzumizaki-Kimitaka

Comment 1 Dimitri van Heesch 2013-08-22 14:29:43 UTC

Thanks for your patch, but I think it requires more thought.

I now see some loops like these in the code:

  for (p=0;p<=MAX_UNICODE_CODEPOINT;p++)

where MAX_UNICODE_CODEPOINT is 0x10FFFF

Performance-wise, this is not good, especially since in 99.9% of the iterations nothing is done other than checking whether something needs to be done. If you already use a hash/map, it is better to just iterate over it.

Do you want to make an improved patch, or do you want me to improve it myself?
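
Editorial illustration of the performance point above: a minimal sketch, assuming the groups are kept in a std::map keyed by code point. doxygen at this time used its own qtools container classes, so the types and the names IndexGroups, addEntry, and usedGroupsInOrder are hypothetical. Iterating over the map touches only the groups that exist, rather than all 0x110000 possible code points, and an ordered map still yields them in ascending code point order.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using CodePoint = std::uint32_t;
// Ordered map: keys are visited in ascending code point order.
using IndexGroups = std::map<CodePoint, std::vector<std::string>>;

// File a name under the code point of its first character
// (the code point is assumed to have been decoded by the caller).
void addEntry(IndexGroups &groups, CodePoint first, const std::string &name)
{
    groups[first].push_back(name);
}

// Collect the code points that actually have entries. The loop runs once
// per existing group, not once per possible Unicode code point.
std::vector<CodePoint> usedGroupsInOrder(const IndexGroups &groups)
{
    std::vector<CodePoint> order;
    order.reserve(groups.size());
    for (const auto &entry : groups) {
        order.push_back(entry.first);
    }
    return order;
}
```

An unordered hash would also avoid the full scan, but it would need a separate sort wherever output has to follow code point order.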

Comment 2 Suzumizaki-Kimitaka 2013-08-22 16:16:26 UTC

I'm sorry, but I would like you to improve it, because I don't know which qtools class I should use.

As you say, in some cases we can simply use an iterator, but in the others we seem to need to guarantee iteration in code point order.

Regards,
Suzumizaki-Kimitaka

Comment 3 Suzumizaki-Kimitaka 2013-08-31 12:31:32 UTC

Hello, have you started converting the loops to iterators?
If not, I will try.

Please tell me whether I should try or just wait for your work.

(I want to make a patch for another issue, but it seems better to resolve this problem first.)

Regards,
Suzumizaki-Kimitaka

Comment 4 Suzumizaki-Kimitaka 2013-09-10 10:49:48 UTC

Created attachment 254582 [details] [review]
The updated and fixed patch

Hello.
I found a bug similar to Bug 707278 in the previous patch, and
I have fixed the iterator problem raised here.

I made the new patch against SHA-1: 1e373422387e8c1131f887efb47cf3da6459e2ac.
The previous one is obsolete.

Please apply the new one.

Regards,
Suzumizaki-Kimitaka

Comment 5 Dimitri van Heesch 2013-09-15 18:15:38 UTC

Thanks, I've just pushed a somewhat reworked version of your patch to GitHub.

Comment 6 Suzumizaki-Kimitaka 2013-10-22 02:26:52 UTC

Created attachment 257809 [details]
The html documents pair, to show official fix (on the git) cannot solve the problem.

Sorry to say, Dimitri, your reworked version (mentioned in comment 5) breaks some functionality.
Please read the HTML documents contained in the attachment to this comment, and tell me how you plan to proceed.
failed_html was generated with your version, and correct_html with mine.

Regards,
Suzumizaki-Kimitaka

Comment 7 Suzumizaki-Kimitaka 2013-10-26 15:07:02 UTC

Created attachment 258178 [details]
The new patch against current origin/HEAD

Okay, I made a new patch, so you now have another option.
The new patch targets SHA-1: 74815268dd88f2cfb4473462cef3c33eebd5516a

Note that I found one more bug and fixed it in this patch as well:
doxygen at the current origin/HEAD distinguishes upper- and lowercase in identifiers.
I will make a new sample project zip like the one I posted before.

Regards,
Suzumizaki-Kimitaka

Comment 8 Suzumizaki-Kimitaka 2013-10-29 01:42:54 UTC

Created attachment 258383 [details]
Update sample project

The updated version of the HTML document pair, showing that the official fix (in git) does not solve the problem (see comments 6 and 7).

Comment 9 Dimitri van Heesch 2013-12-24 18:59:58 UTC

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.6. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information 
that you think can be relevant (preferably in the form of a self-contained example).

Comment 10 Suzumizaki-Kimitaka 2013-12-27 05:40:48 UTC

Created attachment 264918 [details] [review]
Updated patch for 1.8.6 release

As I said before, the work on this issue is not finished.
(Note again that this is NOT my fault! The rework mentioned in comment 5 failed.)

Here is an updated patch; in fact, only the line positions in the target files have been adjusted.

Regards,
Suzumizaki-Kimitaka

Comment 11 André Klapper 2018-07-30 10:41:06 UTC

As discussed in https://github.com/doxygen/doxygen/pull/734 , Doxygen has moved its issue tracking to 

   https://github.com/doxygen/doxygen/issues

All Doxygen tickets in GNOME Bugzilla have been migrated to Github. You can subscribe and participate in the new ticket in Github. You can find the corresponding Github ticket by searching for its Bugzilla ID (number) in Github.

Hence I am closing this GNOME Bugzilla ticket.
Please use the corresponding ticket in Github instead. Thanks a lot!