GNOME Bugzilla – Bug 579756
Unicode Normalization is broken in Indexer and/or Search
Last modified: 2010-05-25 08:31:32 UTC
When searching for a string in NFD, with decomposed characters, the string in NFC is not matched in files where it should be. Combining diacritics seem to behave as word separators, but they are not. The following should match each other:
"école" U+0065 U+0301 U+0063 U+006F U+006C U+0065
"école" U+00E9 U+0063 U+006F U+006C U+0065
However, querying for the first (NFD) matches "e" and "cole" but not the complete word itself.
Created attachment 133467 [details] testcase NFC
Created attachment 133468 [details] testcase NFD Once indexed, the NFD string "école" in the file doesn't match the query for "école", and vice versa with the NFC testcase file.
Can the reporter of this bug set the _version_, please, so we can target bugs to work on more easily? It also gives us an idea of whether these bugs are likely to be obsolete. If you don't set the version, we are less likely to look at them. Sorry for the spam, but I don't want to say this 88 times on each bug :)
(In reply to comment #3) The version I have right now is 0.6.95 and the bug is still there.
Looking at 0.7.X, it might still be a problem. In src/libtracker-extract/tracker-utils.c, tracker_text_normalize() only considers the following as word characters, and breaks words at any other character:
G_UNICODE_LOWERCASE_LETTER "Letter, Lowercase" (Ll)
G_UNICODE_MODIFIER_LETTER "Letter, Modifier" (Lm)
G_UNICODE_OTHER_LETTER "Letter, Other" (Lo)
G_UNICODE_TITLECASE_LETTER "Letter, Titlecase" (Lt)
G_UNICODE_UPPERCASE_LETTER "Letter, Uppercase" (Lu)
However, Marks should also be considered word characters. Whether the text is normalized (NFC) or not, Marks are unavoidable: for example, the Lingala word "mbɔ́tɛ" will have the non-spacing mark U+0301. Latin, Cyrillic, Hebrew, Arabic, Indic scripts, etc. use combining diacritics, and it's just wrong to split words at them. So:
G_UNICODE_COMBINING_MARK "Mark, Spacing Combining" (Mc)
G_UNICODE_ENCLOSING_MARK "Mark, Enclosing" (Me)
G_UNICODE_NON_SPACING_MARK "Mark, Nonspacing" (Mn)
should probably be treated as word characters, _not_ as word-breaking characters.
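For illustration, a minimal sketch in Python (not the Tracker code, which is C): it shows that the combining accent in the NFD form falls in a Mark category rather than a Letter category, and that a word-character predicate covering both Letter and Mark categories keeps the NFD word intact:

```python
import unicodedata

def is_word_char(ch):
    # Accept Letter categories (Ll, Lm, Lo, Lt, Lu) AND Mark categories
    # (Mn, Mc, Me), instead of letters only.
    return unicodedata.category(ch)[0] in ("L", "M")

nfd = "e\u0301cole"  # "école" in NFD: U+0065 U+0301 U+0063 U+006F U+006C U+0065

# U+0301 COMBINING ACUTE ACCENT is category 'Mn', not a letter
print(unicodedata.category("\u0301"))            # prints "Mn"
# With Marks accepted, every code point of the NFD word is a word character
print(all(is_word_char(c) for c in nfd))         # prints "True"
```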
strip_word() in src/libtracker-fts/tracker-parser.c uses unac_string_utf16(). However, unac_string_utf16() does not strip all accents as expected (unless I misunderstand what it's supposed to do): NFC "école" will become "ecole", but NFD "école" with U+0301 will not be changed.
That code hasn't changed in the 0.7 rewrite. Setting version to trunk.
The word-break detection algorithm is explained in Unicode Annex #29 (Unicode Text Segmentation), available online at: http://unicode.org/reports/tr29/#Default_Word_Boundaries

This provides all the rules needed to properly detect what a word is, without depending on NFC or NFD normalization. A full implementation of this algorithm is available in the GNU PDF library sources (which I actually wrote some years ago): http://pastebin.com/5SxgChuM

It also seems to be available in the pretty new GNU libunistring: http://www.gnu.org/software/libunistring

GLib doesn't seem to provide a method to properly detect words in a Unicode string, although it would be quite useful.
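As a rough illustration of why a UAX #29-style segmenter is normalization-independent, here is a heavily simplified Python sketch. It implements only the spirit of rule WB4 (combining marks extend the current word rather than starting a new one), nowhere near the full rule set:

```python
import unicodedata

def words(text):
    """Naive word segmentation: letters form words, marks extend them."""
    out, cur = [], ""
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            cur += ch                       # a letter starts/continues a word
        elif cat in ("Mn", "Mc", "Me") and cur:
            cur += ch                       # WB4-style: a mark never breaks a word
        else:
            if cur:
                out.append(cur)             # anything else ends the current word
            cur = ""
    if cur:
        out.append(cur)
    return out

# One word per form, regardless of normalization:
print(words("e\u0301cole"))   # prints "['école']" (NFD, 6 code points)
print(words("\u00e9cole"))    # prints "['école']" (NFC, 5 code points)
```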
> In src/libtracker-extract/tracker-utils.c, tracker_text_normalize() only
> considers the following as word characters, and breaks words with any other
> character:

tracker_text_normalize() is going to be removed as per bug #616845. Anyway, the plain-text file extractor doesn't currently use tracker_text_normalize(), so the problem is not there.
> strip_word() in src/libtracker-fts/tracker-parser.c uses unac_string_utf16().
>
> However unac_string_utf16 does not strip all accents as expected (unless I
> misunderstand what it's supposed to do).
> "école" will become "ecole", but "école" with U+0301 will not be changed.

Without UNAC stripping, the issue is still there.
Really seems a problem of the parser_next() method in tracker-parser.c, which doesn't treat combining marks as part of the word.

* One solution would be to perform NFC normalization on the whole input string, and then use parser_next() as-is. The drawback is that if the input string has 30000 words and we only want 1000, we would be normalizing too much.

* Another solution would be to use a Unicode-based word-breaker for all strings, CJK and non-CJK alike, without the need to 'detect' the type of string in advance (the current method, btw, will not work properly if the string has both CJK and non-CJK characters). This would mean either always using pango_next(), or trying another word-breaking algorithm like the one in libunistring to see if it's faster than the pango version.
You really want to use UNAC stripping for this, and indeed for anything that contains accents. It's a nightmare trying to search for stuff without it.

The parser performs NFC normalisation, so I don't understand why it's a problem for the parser (surely a bug in GLib's Unicode implementation, if it is indeed a bug?). Indeed, the parser passes the broken word to UNAC, which strips the accent, so I'm pretty sure the parser's word breaking is fine.

It's extremely unlikely that a string would contain both CJK and non-CJK, so it's not worth slowing down the parser for this extreme corner case IMO. Of course, if you could show via benchmarks that any slowdown was negligible, then that's a different matter.
> You really want to use UNAC stripping for this, and indeed for anything that
> contains accents. It's a nightmare trying to search for stuff without it.

Yeah, UNAC works ok, but only when there's a proper input string :-)

> The parser performs NFC normalisation, so I don't understand why it's a
> problem for the parser (surely a bug in GLib's Unicode implementation, if it
> is indeed a bug?).

The issue is that normalization is done after the word has been detected. I just added some debug logs in tracker_parser_process_word(), which is the one actually doing the NFC normalization, and this is what I get when feeding it the NFD form of 'école' (U+0065 U+0301 U+0063 U+006F U+006C U+0065):

27 Apr 2010, 17:28:03: (null): ORIGINAL word: 'e' (65)
27 Apr 2010, 17:28:03: (null): After NFC normalization: 'e' (65)
27 Apr 2010, 17:28:03: (null): ORIGINAL word: 'cole' (63:6F:6C:65)
27 Apr 2010, 17:28:03: (null): After NFC normalization: 'cole' (63:6F:6C:65)

So by the time the word is processed and normalized, parser_next() has already split it in two parts, as it takes U+0301 as a word separator.

> Indeed, the parser passes the broken word to UNAC, which strips the accent,
> so I'm pretty sure the parser's word breaking is fine.

Well, I don't think so, based on the example above. If I feed it the NFC form of 'école' (U+00E9 U+0063 U+006F U+006C U+0065), these are the logs I get:

27 Apr 2010, 17:32:21: (null): ORIGINAL word: 'école' (C3:A9:63:6F:6C:65)
27 Apr 2010, 17:32:21: (null): After UNAC stripping: 'ecole' (65:63:6F:6C:65)
27 Apr 2010, 17:32:21: (null): After NFC normalization: 'ecole' (65:63:6F:6C:65)

In this case, parser_next() properly detects the word, UNAC strips the accent, and then it gets normalized.
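The splitting seen in these logs is easy to reproduce with a sketch tokenizer that, like the parser apparently does, only accepts Letter categories (Python used purely for illustration; this is not the actual parser_next() code):

```python
import unicodedata

def letters_only_tokens(text):
    """Split on anything that is not in a Letter category (Ll/Lm/Lo/Lt/Lu)."""
    out, cur = [], ""
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            cur += ch
        else:
            if cur:
                out.append(cur)   # U+0301 (category Mn) lands here and breaks the word
            cur = ""
    if cur:
        out.append(cur)
    return out

print(letters_only_tokens("e\u0301cole"))  # prints "['e', 'cole']" -- the broken NFD case
print(letters_only_tokens("\u00e9cole"))   # prints "['école']"     -- NFC survives intact
```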
> It's extremely unlikely that a string would contain both CJK and non-CJK, so
> it's not worth slowing down the parser for this extreme corner case IMO. Of
> course, if you could show via benchmarks that any slowdown was negligible,
> then that's a different matter.

I think it would be worth trying libunistring's word-breaker to see how fast or slow it is compared to the pango version. A proper word-breaker doesn't depend on the normalization form of the input string, I believe.
Just found that there is still another issue: UNAC doesn't strip accents from NFD strings. So another thing to do would be to first NFC-normalize and only then perform UNAC stripping.
This issue is now addressed in the "parser-unicode-libs-review" branch in GNOME git. When using either the libunistring- or libicu-based parsers, the normalization of the input string doesn't affect the parsing process.
Moving "Indexer" component bugs to "General" since "Indexer" refers to the old 0.6 architecture
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
I'm testing 0.8.7. Tracker is currently re-indexing. I still get wrong matches for the NFD query. As before, NFD "école" is broken up at U+0301 as if it were not a word character, and "cole" is a match. The NFC query yields different results.
The changes are not in 0.8.7 (stable), they're in 0.9.5 (unstable) or git master, and only if compiled with libunistring (preferred) or libicu parsers (--with-unicode-support configure option)
Oops, sorry for the misunderstanding. It works great. Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in the history; they should be handled as the same history item. Should I open a bug report for that?
> It works great.

Nice!

> Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in
> the history; they should be handled as the same history item. Should I open a
> bug report for that?

Hum... I'm not really sure anyone other than you and me will actually be testing t-s-t with the same word in NFD and NFC forms :-)

Storing search items always with the same normalization seems good in some situations (like avoiding duplicated items in the history, even if they are not byte-per-byte duplicates), but maybe it's not a good idea, because you would actually be modifying the real string the user inserted, so it's not history any more, it's... "normalized history".

Anyway, feel free to open a new bug report about that :-)
> > Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in
> > the history; they should be handled as the same history item. Should I open
> > a bug report for that?
>
> Hum... I'm not really sure anyone other than you and me will actually be
> testing t-s-t with the same word in NFD and NFC forms :-)
>
> Storing search items always with the same normalization seems good in some
> situations (like avoiding duplicated items in the history, even if they are
> not byte-per-byte duplicates), but maybe it's not a good idea, because you
> would actually be modifying the real string the user inserted, so it's not
> history any more, it's... "normalized history".

Just as a reference: as per bug 619504, history will include items normalized in NFC, so this issue would be solved.