GNOME Bugzilla – Bug 421678
search should normalize strings
Last modified: 2007-04-02 17:55:54 UTC
If there is a string in a file with a precomposed character, like "école" with <U+00E9 LATIN SMALL LETTER E WITH ACUTE>, and the searched string is "école" with <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT>, no match is found. The other way around should work too. Equivalent Unicode strings should match. Search should use g_utf8_normalize() with G_NORMALIZE_NFD, aka G_NORMALIZE_DEFAULT, before comparing strings.
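A minimal sketch of the requested behaviour, using Python's standard unicodedata module as a stand-in for g_utf8_normalize() (the helper names equivalent and contains are made up for illustration and are not part of Gnumeric or goffice):

```python
import unicodedata

def equivalent(a: str, b: str, form: str = "NFD") -> bool:
    """True if the two strings are equal after normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def contains(haystack: str, needle: str, form: str = "NFD") -> bool:
    """Substring search on normalized copies of both strings."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

precomposed = "\u00e9cole"   # U+00E9 followed by "cole"
decomposed = "e\u0301cole"   # U+0065 U+0301 followed by "cole"
```

Without normalization the two strings compare unequal; with it, either one finds the other. One side effect to keep in mind: under NFD, a search for a bare "e" would also hit the base letter of a decomposed "é".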
Could you supply a test file with a few of these examples?
Created attachment 85328 sample gnumeric file with precomposed and composed equivalent strings Here's a sample file with pairs of precomposed and decomposed (base letter plus combining mark) strings. Gnumeric should treat the two elements of each pair as the same string, so searching for one should match the other. The function g_utf8_normalize() can be used before comparing strings. I'd suggest using G_NORMALIZE_DEFAULT = G_NORMALIZE_NFD by default.
The pattern is now normalized (in goffice). The text being searched is a good deal more complicated, at least in the search-and-replace case. Ideally we need to be able to map positions in the searched text back into the original text. It isn't clear to me how we can do that.
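One possible way to keep that mapping, sketched in Python rather than against the goffice code: normalize the text one cluster at a time (a base character plus the combining marks that follow it) and remember which original span each normalized character came from. The function names are hypothetical. The sketch also assumes that cluster-by-cluster NFD gives the same result as normalizing the whole string, which should hold for typical text but may not for a few unusual characters.

```python
import unicodedata

def normalize_with_map(text):
    """Return (nfd_text, origin), where origin[i] is the (start, end)
    span in `text` of the cluster that produced nfd_text[i]."""
    pieces = []
    origin = []
    i = 0
    n = len(text)
    while i < n:
        # A cluster is one character plus any following combining marks.
        j = i + 1
        while j < n and unicodedata.combining(text[j]) != 0:
            j += 1
        piece = unicodedata.normalize("NFD", text[i:j])
        pieces.append(piece)
        origin.extend([(i, j)] * len(piece))
        i = j
    return "".join(pieces), origin

def find_in_original(text, pattern):
    """Find `pattern` in `text` modulo normalization and return the
    matching (start, end) span in the ORIGINAL text, or None."""
    norm, origin = normalize_with_map(text)
    pat = unicodedata.normalize("NFD", pattern)
    if not pat:
        return None
    pos = norm.find(pat)
    if pos < 0:
        return None
    start = origin[pos][0]
    end = origin[pos + len(pat) - 1][1]
    return start, end
```

A replace step could then splice the replacement into the original string at the returned span, leaving the text outside the match untouched. Note that the span is widened to whole clusters, so a match that covers only part of a cluster would still replace the entire cluster.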
I note that the pairs do not even look the same. Have you reported that against pango?
(In reply to comment #4)
> I note that the pairs do not even look the same. Have you reported that
> against pango?
That's related to Bug 322234, but if your font has OpenType tables positioning diacritics it should work. (DejaVu Sans Mono Book >=2.15 does)
Interestingly, the LEN function can tell the difference between the one-char and the two-char versions (both ours and Excel's). That has an interesting effect: if we do search-and-replace as...
    n = normalize (src);
    if (match (n, pattern)) {
        dst = replace_as_needed (n, ...);
        store dst;
    }
...then search and replace will imply normalization whenever there is a match. In other words, if we replace "x" by "y" in =LEN("<pair>"x) we would get =LEN("<combined>"y) and the result would go down by 1. I don't know how big a problem that would be in practice, though.
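The length effect can be reproduced with a small Python sketch. It assumes the normalization composes characters (NFC), since that is what makes a two-char pair collapse into one; search_and_replace is a hypothetical helper, not the goffice routine, and Python's len() stands in for LEN by counting code points.

```python
import unicodedata

def search_and_replace(src, pattern, replacement, form="NFC"):
    """Replace on a normalized copy; store the result only on a match."""
    n = unicodedata.normalize(form, src)
    p = unicodedata.normalize(form, pattern)
    if p and p in n:
        return n.replace(p, replacement)  # result is normalized as a side effect
    return src                            # no match: original left untouched

pair = "e\u0301"                  # two code points: e + combining acute
before = pair + "x"               # length 3
after = search_and_replace(before, "x", "y")
```

Here before has length 3 and after has length 2, although only "x" was replaced by "y". A string with no match keeps its original form and length.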
Morten: You could always normalize strings, with NFC for better compatibility with legacy.
I was thinking of normalizing all strings on input (from keyboard or files), but I cannot do that if I want to remain Excel-compatible for the LEN function. I would find it surprising that replacing "x" by "y" would change a string's length.
XL does not seem to normalize for the purpose of the SEARCH function. In particular, "?" seems to match one Unicode character only. It doesn't normalize for the GUI search either, but I don't see any reason why we should not do so.
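As an analogy only (this uses Python's fnmatch wildcards, not Excel's SEARCH, and wildcard_match is a made-up helper), the following sketch shows how a "?" that matches exactly one code point treats the two spellings differently unless both sides are composed first:

```python
import fnmatch
import unicodedata

def wildcard_match(text, pattern, normalize=False):
    """Match shell-style wildcards; optionally compose both sides first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
        pattern = unicodedata.normalize("NFC", pattern)
    return fnmatch.fnmatchcase(text, pattern)

one_char = "\u00e9cole"    # precomposed
two_char = "e\u0301cole"   # e + combining acute
```

Without normalization, "?cole" matches the one-char version but not the two-char one. Composing with NFC makes both match; decomposing with NFD would instead make neither match, so the choice of form matters for single-character wildcards.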
This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.