Bug 616836 – Use libunistring's u8_normalize() instead of GLib's g_utf8_normalize()



Summary:	Use libunistring's u8_normalize() instead of GLib's g_utf8_normalize()


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	General
Version:	0.9.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-indexer
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-04-26 10:56 UTC by Aleksander Morgado
Modified:	2010-05-20 17:02 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Aleksander Morgado 2010-04-26 10:56:24 UTC

GLib's Unicode normalization functions rely heavily on heap allocation: g_utf8_normalize() returns a newly allocated string on every call.

libunistring's normalization functions do not have to allocate the output buffer themselves: the caller can pass in a buffer, so even stack-allocated memory can be used to perform the normalization:
http://www.gnu.org/software/libunistring/manual/libunistring.html#Normalization-of-strings

Thus, linking against libunistring to perform Unicode normalization could noticeably improve the performance of the parsing operations.

Also:
 * The whole text could be normalized in one pass instead of word by word.
 * The same libunistring-based approach could be applied to the case folding done just before normalization.

Note: libicu is probably also a good choice instead of libunistring:
http://bugs.icu-project.org/trac/browser/icu/trunk/source/common/unicode/unorm2.h

Comment 1 Aleksander Morgado 2010-05-11 12:41:32 UTC

This issue is now addressed in the "parser-unicode-libs-review" branch in GNOME git.

Both the libunistring and libicu options are provided.

Comment 2 Martyn Russell 2010-05-17 13:33:49 UTC

Moving "Indexer" component bugs to "General", since "Indexer" refers to the old 0.6 architecture.

Comment 3 Martyn Russell 2010-05-20 17:02:15 UTC

This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.