Bug 421678 – search should normalize strings

Summary:	search should normalize strings


Status:	RESOLVED FIXED

Product:	Gnumeric
Classification:	Applications
Component:	General
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Jody Goldberg
QA Contact:	Jody Goldberg

URL:
Whiteboard:

Depends on:
Blocks:	423036

Reported:	2007-03-22 22:17 UTC by Denis Jacquerye
Modified:	2007-04-02 17:55 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
sample gnumeric file with precomposed and composed equivalent strings (2.14 KB, application/x-gnumeric) 2007-03-26 18:51 UTC, Denis Jacquerye

Description Denis Jacquerye 2007-03-22 22:17:50 UTC

If a file contains a string with a precomposed character, like "école" with <U+00E9 LATIN SMALL LETTER E WITH ACUTE>, and the searched string is "école" with <U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT>, no match is found. The other way around should work too.

Equivalent Unicode strings should match.

Search should use g_utf8_normalize() with G_NORMALIZE_NFD aka G_NORMALIZE_DEFAULT before comparing strings.
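
A minimal sketch of the comparison suggested above, written in Python for brevity. Here unicodedata.normalize("NFD", ...) stands in for g_utf8_normalize() with G_NORMALIZE_NFD, and the helper name `equivalent` is hypothetical; this is not Gnumeric or goffice code.

```python
import unicodedata

# The two spellings of "école" from the report.
PRECOMPOSED = "\u00e9cole"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301cole"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def equivalent(a, b):
    """Compare two strings after bringing both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

A plain comparison of the two constants reports them as different, while `equivalent(PRECOMPOSED, DECOMPOSED)` treats them as the same string.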

Comment 1 Morten Welinder 2007-03-23 14:59:11 UTC

Could you supply a test file with a few of these examples?

Comment 2 Denis Jacquerye 2007-03-26 18:51:12 UTC

Created attachment 85328
sample gnumeric file with precomposed and composed equivalent strings

Here's a sample file with a pair of precomposed and composed character strings.

Gnumeric should consider either element of each pair as the other. So searching for one should match the other. 

The function g_utf8_normalize() can be used before comparing strings. I'd suggest using G_NORMALIZE_DEFAULT = G_NORMALIZE_NFD by default.

Comment 3 Morten Welinder 2007-03-26 22:16:18 UTC

The pattern is now normalized (in goffice).

The text being searched is a good deal more complicated, at least in the
search-and-replace case.  Ideally we need to be able to map positions in
the searched text back into the original text.  It isn't clear to me how
we can do that.
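
One possible way to keep such a mapping, sketched in Python under stated assumptions: the text is normalized one cluster at a time (a base character plus any following combining marks), and every normalized character remembers the span of the original text it came from. The function names `normalize_with_map` and `find_in_original` are hypothetical, unicodedata.normalize("NFD", ...) stands in for g_utf8_normalize(), and the cluster split ignores the rare characters whose decomposition begins with a combining mark. This is not what goffice does.

```python
import unicodedata


def normalize_with_map(text):
    """Return (nfd_text, spans); spans[i] is the (start, end) slice of
    `text` that produced character i of nfd_text."""
    pieces = []
    spans = []
    start = 0
    while start < len(text):
        # A cluster is one character plus the combining marks that follow it.
        end = start + 1
        while end < len(text) and unicodedata.combining(text[end]) != 0:
            end += 1
        piece = unicodedata.normalize("NFD", text[start:end])
        pieces.append(piece)
        spans.extend([(start, end)] * len(piece))
        start = end
    return "".join(pieces), spans


def find_in_original(text, pattern):
    """Find `pattern` in `text` modulo normalization and return the
    (start, end) slice of the ORIGINAL text, or None if there is no match."""
    norm_text, spans = normalize_with_map(text)
    norm_pattern = unicodedata.normalize("NFD", pattern)
    if not norm_pattern:
        return None
    pos = norm_text.find(norm_pattern)
    if pos < 0:
        return None
    first = spans[pos]
    last = spans[pos + len(norm_pattern) - 1]
    return first[0], last[1]
```

With this, a replacement can be spliced into the original string at the returned slice, so text outside the match keeps its original form. A match that begins or ends inside a cluster is widened to the whole cluster, which may or may not be the desired behavior.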

Comment 4 Morten Welinder 2007-03-27 17:23:29 UTC

I note that the pairs do not even look the same.  Have you reported that
against pango?

Comment 5 Denis Jacquerye 2007-03-27 17:51:32 UTC

(In reply to comment #4)
> I note that the pairs do not even look the same.  Have you reported that
> against pango?
That's related to Bug 322234 but if your font has OpenType tables positioning diacritics it should work. (DejaVu Sans Mono Book >=2.15 does)

Comment 6 Morten Welinder 2007-03-27 18:34:32 UTC

Interestingly, the LEN function can tell the difference between the
one-char and the two-char versions.  (Both us and Excel).  That has
an interesting effect: if we do search-and-replace as...

  n = normalize (src);
  if (match (n, pattern)) {
    dst = replace_as_needed (n, ...);
    store dst;
  }

then search and replace will imply normalization when there is a match.
In other words, if we replace "x" by "y" in

   =LEN("<pair>"x)

we would get

   =LEN("<combined>"y)

and the result would go down by 1.  I don't know how big a problem that
would be in practice, though.
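
The effect can be reproduced with a short Python sketch. It assumes that normalize() composes (NFC), since the comment describes the pair becoming a combined character; `search_and_replace` is a hypothetical name that follows the pseudocode above, with Python's len() playing the role of LEN.

```python
import unicodedata


def search_and_replace(src, pattern, replacement, form="NFC"):
    """Normalize first, then replace; src is returned untouched if nothing matches."""
    n = unicodedata.normalize(form, src)
    p = unicodedata.normalize(form, pattern)
    if p in n:
        # The stored result is built from the normalized text, so the
        # normalization leaks into parts of the string that did not match.
        return n.replace(p, replacement)
    return src


src = "e\u0301x"                         # the two-char pair followed by "x"
dst = search_and_replace(src, "x", "y")  # replace "x" by "y"
```

Here `len(src)` is 3 and `len(dst)` is 2, even though the replacement only swapped one character for another.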

Comment 7 Denis Jacquerye 2007-03-27 19:22:22 UTC

Morten: You could always normalize strings, with NFC for better compatibility with legacy.

Comment 8 Morten Welinder 2007-03-27 20:18:47 UTC

I was thinking of normalizing all strings on input (from keyboard or files),
but I cannot do that if I want to remain Excel compatible for the
LEN function.

I would find it surprising that replacing "x" by "y" would change a string's
length.

Comment 9 Morten Welinder 2007-04-02 13:41:12 UTC

XL does not seem to normalize for the purpose of the SEARCH function.
In particular, "?" seems to match one Unicode character only.

It doesn't normalize for the gui search either, but I don't see any reason
why we should not do so.
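
The difference a single-character wildcard makes can be sketched in Python. The standard fnmatch module is used here only as a stand-in for a "?" wildcard that matches exactly one Unicode character; it is not Excel's SEARCH and not Gnumeric's search code, and `wildcard_match` is a hypothetical helper. NFC is assumed so that a base letter and its accent collapse into one character where a precomposed form exists.

```python
import fnmatch
import unicodedata


def wildcard_match(text, pattern, normalize=True):
    """Match text against a pattern where "?" stands for one character."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
        pattern = unicodedata.normalize("NFC", pattern)
    return fnmatch.fnmatchcase(text, pattern)
```

Without normalization, "?cole" matches the precomposed "école" but not the decomposed one, because the decomposed form starts with two characters. With normalization both spellings match. Sequences that have no precomposed form would still count as more than one character.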

Comment 10 Morten Welinder 2007-04-02 17:55:54 UTC

This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.