GNOME Bugzilla – Bug 579756
Unicode Normalization is broken in Indexer and/or Search
Last modified: 2010-05-25 08:31:32 UTC
When searching for a string in NFD, with decomposed characters, the string in NFC is not matched in files where it should be. Combining diacritics seem to behave as word separators, but they are not. The following should match each other:
"école" U+0065 U+0301 U+0063 U+006F U+006C U+0065
"école" U+00E9 U+0063 U+006F U+006C U+0065
However, querying for the first (NFD) matches "e" and "cole" but not the complete word itself.
Created attachment 133467 [details] testcase NFC
Created attachment 133468 [details] testcase NFD Once indexed, the NFD string "école" in the file doesn't match the query for "école", and vice versa with the NFC testcase file.
Can the reporter of this bug set the _version_, please, so we can target bugs to work on more easily? It also gives us an idea of whether these bugs are likely to be obsolete. If you don't set the version, we are less likely to look at them. Sorry for the spam, but I don't want to say this 88 times on each bug :)
(In reply to comment #3) The version I have right now is 0.6.95 and the bug is still there.
Looking at 0.7.X, it might still be a problem. In src/libtracker-extract/tracker-utils.c, tracker_text_normalize() only considers the following as word characters, and breaks words at any other character:
G_UNICODE_LOWERCASE_LETTER "Letter, Lowercase" (Ll)
G_UNICODE_MODIFIER_LETTER "Letter, Modifier" (Lm)
G_UNICODE_OTHER_LETTER "Letter, Other" (Lo)
G_UNICODE_TITLECASE_LETTER "Letter, Titlecase" (Lt)
G_UNICODE_UPPERCASE_LETTER "Letter, Uppercase" (Lu)
However, Marks should also be considered word characters. Whether the text is normalized (NFC) or not, Marks are unavoidable: for example, the Lingala word "mbɔ́tɛ" will have the non-spacing mark U+0301. Latin, Cyrillic, Hebrew, Arabic, Indic scripts, etc. use combining diacritics, and it's just wrong to split words at them. So:
G_UNICODE_COMBINING_MARK "Mark, Spacing Combining" (Mc)
G_UNICODE_ENCLOSING_MARK "Mark, Enclosing" (Me)
G_UNICODE_NON_SPACING_MARK "Mark, Nonspacing" (Mn)
should probably be treated as word characters, _not_ as word-breaking characters.
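For illustration, a minimal sketch in Python (not the Tracker code, which is C): it shows that the combining accent in the NFD form falls in a Mark category rather than a Letter category, and that a word-character predicate covering both Letter and Mark categories keeps the NFD word intact:

```python
import unicodedata

def is_word_char(ch):
    # Accept Letter categories (Ll, Lm, Lo, Lt, Lu) AND Mark categories
    # (Mn, Mc, Me), instead of letters only.
    return unicodedata.category(ch)[0] in ("L", "M")

nfd = "e\u0301cole"  # "école" in NFD: U+0065 U+0301 U+0063 U+006F U+006C U+0065

# U+0301 COMBINING ACUTE ACCENT is category 'Mn', not a letter
print(unicodedata.category("\u0301"))            # prints "Mn"
# With Marks accepted, every code point of the NFD word is a word character
print(all(is_word_char(c) for c in nfd))         # prints "True"
```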
strip_word() in src/libtracker-fts/tracker-parser.c uses unac_string_utf16(). However, unac_string_utf16() does not strip all accents as expected (unless I misunderstand what it's supposed to do): NFC "école" will become "ecole", but NFD "école" with U+0301 will not be changed.
That code hasn't changed in the 0.7 rewrite. Setting version to trunk.
The word-break detection algorithm is explained in Unicode Annex #29 (Unicode Text Segmentation), available online at: http://unicode.org/reports/tr29/#Default_Word_Boundaries

This provides all the rules needed to properly detect what a word is, without depending on NFC or NFD normalization. A full implementation of this algorithm is available in the GNU PDF library sources (which I actually wrote some years ago): http://pastebin.com/5SxgChuM

It also seems to be available in the pretty new GNU libunistring: http://www.gnu.org/software/libunistring

GLib doesn't seem to provide a method to properly detect words in a Unicode string, although it would be quite useful.
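As a rough illustration of why a UAX #29-style segmenter is normalization-independent, here is a heavily simplified Python sketch. It implements only the spirit of rule WB4 (combining marks extend the current word rather than starting a new one), nowhere near the full rule set:

```python
import unicodedata

def words(text):
    """Naive word segmentation: letters form words, marks extend them."""
    out, cur = [], ""
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            cur += ch                       # a letter starts/continues a word
        elif cat in ("Mn", "Mc", "Me") and cur:
            cur += ch                       # WB4-style: a mark never breaks a word
        else:
            if cur:
                out.append(cur)             # anything else ends the current word
            cur = ""
    if cur:
        out.append(cur)
    return out

# One word per form, regardless of normalization:
print(words("e\u0301cole"))   # prints "['école']" (NFD, 6 code points)
print(words("\u00e9cole"))    # prints "['école']" (NFC, 5 code points)
```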
> In src/libtracker-extract/tracker-utils.c, tracker_text_normalize() only
> considers the following as word characters, and breaks words with any other
> character:

tracker_text_normalize() is going to be removed as per bug #616845. Anyway, the plain-text file extractor doesn't currently use tracker_text_normalize(), so the problem is not there.
> strip_word() in src/libtracker-fts/tracker-parser.c uses unac_string_utf16().
>
> However unac_string_utf16 does not strip all accents as expected (unless I
> misunderstand what it's supposed to do).
> "école" will become "ecole", but "école" with U+0301 will not be changed.

Without UNAC stripping, the issue is still there.
Really seems a problem of the parser_next() method in tracker-parser.c, which doesn't treat combining marks as part of the word.

* One solution would be to perform NFC normalization on the whole input string, and then use parser_next() as-is. The drawback is that if the input string has 30000 words and we only want 1000, we would be normalizing too much.

* Another solution would be to use a Unicode-based word-breaker for all strings, CJK and non-CJK alike, without the need to 'detect' the type of string in advance (the current method, btw, will not work properly if the string has both CJK and non-CJK characters). This would mean either always using pango_next(), or trying another word-breaking algorithm like the one in libunistring to see if it's faster than the pango version.
You really want to use UNAC stripping for this, and indeed for anything that contains accents. It's a nightmare trying to search for stuff without it.

The parser performs NFC normalisation, so I don't understand why it's a problem for the parser (surely a bug in GLib's Unicode implementation, if it is indeed a bug?). Indeed, the parser passes the broken word to UNAC, which strips the accent, so I'm pretty sure the parser's word breaking is fine.

It's extremely unlikely that a string would contain both CJK and non-CJK, so it's not worth slowing down the parser for this extreme corner case IMO. Of course, if you could show via benchmarks that any slowdown was negligible, then that's a different matter.
> You really want to use UNAC stripping for this, and indeed for anything that
> contains accents. It's a nightmare trying to search for stuff without it.

Yeah, UNAC works ok, but only when there's a proper input string :-)

> The parser performs NFC normalisation, so I don't understand why it's a
> problem for the parser (surely a bug in GLib's Unicode implementation, if it
> is indeed a bug?).

The issue is that normalization is done after the word has been detected. I just added some debug logs in tracker_parser_process_word(), which is the one actually doing the NFC normalization, and this is what I get when feeding it the NFD form of 'école' (U+0065 U+0301 U+0063 U+006F U+006C U+0065):

27 Apr 2010, 17:28:03: (null): ORIGINAL word: 'e' (65)
27 Apr 2010, 17:28:03: (null): After NFC normalization: 'e' (65)
27 Apr 2010, 17:28:03: (null): ORIGINAL word: 'cole' (63:6F:6C:65)
27 Apr 2010, 17:28:03: (null): After NFC normalization: 'cole' (63:6F:6C:65)

So by the time the word is processed and normalized, parser_next() has already split it in two parts, as it takes U+0301 as a word separator.

> Indeed, the parser passes the broken word to UNAC, which strips the accent,
> so I'm pretty sure the parser's word breaking is fine.

Well, I don't think so, based on the example above. If I feed it the NFC form of 'école' (U+00E9 U+0063 U+006F U+006C U+0065), these are the logs I get:

27 Apr 2010, 17:32:21: (null): ORIGINAL word: 'école' (C3:A9:63:6F:6C:65)
27 Apr 2010, 17:32:21: (null): After UNAC stripping: 'ecole' (65:63:6F:6C:65)
27 Apr 2010, 17:32:21: (null): After NFC normalization: 'ecole' (65:63:6F:6C:65)

In this case, parser_next() properly detects the word, UNAC strips the accent, and then it gets normalized.
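The splitting seen in these logs is easy to reproduce with a sketch tokenizer that, like the parser apparently does, only accepts Letter categories (Python used purely for illustration; this is not the actual parser_next() code):

```python
import unicodedata

def letters_only_tokens(text):
    """Split on anything that is not in a Letter category (Ll/Lm/Lo/Lt/Lu)."""
    out, cur = [], ""
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            cur += ch
        else:
            if cur:
                out.append(cur)   # U+0301 (category Mn) lands here and breaks the word
            cur = ""
    if cur:
        out.append(cur)
    return out

print(letters_only_tokens("e\u0301cole"))  # prints "['e', 'cole']" -- the broken NFD case
print(letters_only_tokens("\u00e9cole"))   # prints "['école']"     -- NFC survives intact
```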
> It's extremely unlikely that a string would contain both CJK and non-CJK, so
> it's not worth slowing down the parser for this extreme corner case IMO. Of
> course, if you could show via benchmarks that any slowdown was negligible,
> then that's a different matter.

I think it would be worth trying libunistring's word-breaker to see how fast or slow it is compared to the pango version. A proper word-breaker doesn't depend on the normalization form of the input string, I believe.
Just found that there is still another issue: UNAC doesn't strip accents from NFD strings. So another thing to do would be to first NFC-normalize and only then perform UNAC stripping.
This issue is now addressed in the "parser-unicode-libs-review" branch in GNOME git. When using either the libunistring- or libicu-based parsers, the normalization of the input string doesn't affect the parsing process.
Moving "Indexer" component bugs to "General" since "Indexer" refers to the old 0.6 architecture
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
I'm testing 0.8.7. Tracker is currently re-indexing. I still get wrong matches for the NFD query. As before, NFD "école" is broken up at U+0301 as if it were not a word character, and "cole" is a match. The NFC query yields different results.
The changes are not in 0.8.7 (stable), they're in 0.9.5 (unstable) or git master, and only if compiled with libunistring (preferred) or libicu parsers (--with-unicode-support configure option)
Oops, sorry for the misunderstanding. It works great. Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in the history; they should be handled as the same history item. Should I open a bug report for that?
> It works great.

Nice!

> Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in
> the history; they should be handled as the same history item. Should I open a
> bug report for that?

Hum... I'm not really sure anyone other than you and me will actually be testing t-s-t with the same word in NFD and NFC forms :-)

Storing search items always with the same normalization seems good in some situations (like avoiding duplicated items in the history, even if they are not byte-per-byte duplicates), but maybe it's not a good idea, because you would actually be modifying the real string the user inserted, so it's not history any more, it's... "normalized history".

Anyway, feel free to open a new bug report about that :-)
> > Just noticed one thing: in tracker-search-tool, NFD and NFC both show up in
> > the history; they should be handled as the same history item. Should I open
> > a bug report for that?
>
> Hum... I'm not really sure anyone other than you and me will actually be
> testing t-s-t with the same word in NFD and NFC forms :-)
>
> Storing search items always with the same normalization seems good in some
> situations (like avoiding duplicated items in the history, even if they are
> not byte-per-byte duplicates), but maybe it's not a good idea, because you
> would actually be modifying the real string the user inserted, so it's not
> history any more, it's... "normalized history".

Just as a reference: as per bug 619504, history will include items normalized in NFC, so this issue would be solved.