GNOME Bugzilla – Bug 675660
Any search will fail with error about libicu
Last modified: 2012-07-30 13:32:36 UTC
Whatever I type in the search box, tracker-needle shows an empty list and prints many errors like this on the console: (tracker-needle:25809): Tracker-WARNING **: Error initializing libicu support: 'U_ILLEGAL_ARGUMENT_ERROR' I tried tracker-search and got the same result: (tracker-search:25833): Tracker-WARNING **: Error initializing libicu support: 'U_ILLEGAL_ARGUMENT_ERROR' I use Arch Linux, x64, with the latest updates (Tracker 0.14.1). tracker-stats shows that indexing is done.
Got this problem too on a Gentoo system.
Hi, I have a fix for this. While many applications treat UTF-8 and utf8 as the same, tracker-needle and tracker-search only accept UTF-8. Running: LANG=fr_FR.utf8 tracker-needle makes the search fail. But running: LANG=fr_FR.UTF-8 tracker-needle makes the search work.
Sorry, that's not a 'fix' but rather a workaround ...
What locales are supported with locale -a ?
my case: locale -a C de_DE de_DE@euro de_DE.iso88591 de_DE.iso885915@euro de_DE.utf8 deutsch en_US en_US.iso88591 en_US.utf8 german portuguese POSIX pt_PT pt_PT.iso88591 pt_PT.utf8 also i have icu 49.1.1 installed and i think it's the cause of such problems.
$ locale -a C en_US en_US.iso88591 en_US.utf8 français french fr_FR fr_FR@euro fr_FR.iso88591 fr_FR.iso885915@euro fr_FR.utf8 POSIX I'm surprised to see 'utf8' here ! That may be why, when I choose my locale from a GUI, then 'utf8' is used instead of 'UTF-8'.
(In reply to comment #5) Can you confirm that when running LANG=de_DE.UTF-8 tracker-needle there is no more error ?
(In reply to comment #7) > (In reply to comment #5) > > Can you confirm that when running > LANG=de_DE.UTF-8 tracker-needle > there is no more error ? same thing here (tracker-needle:2400): Tracker-WARNING **: Error initializing libicu support: 'U_ILLEGAL_ARGUMENT_ERROR' also tracker makes gnome-shell and empathy crash.
(In reply to comment #8) > (In reply to comment #7) > > (In reply to comment #5) > > > > Can you confirm that when running > > LANG=de_DE.UTF-8 tracker-needle > > there is no more error ? > > same thing here (tracker-needle:2400): Tracker-WARNING **: Error initializing > libicu support: 'U_ILLEGAL_ARGUMENT_ERROR' > > also tracker makes gnome-shell and empathy crash. What does LANG=de_DE.UTF-8 TRACKER_VERBOSITY=3 tracker-needle | grep TRACKER_LOCALE give? Are any locale variables set to something else? I had to add some LC_* variables too. Running plain locale may help show which LC_* variables are not correct.
(In reply to comment #8) > (In reply to comment #7) > > (In reply to comment #5) > > > > Can you confirm that when running > > LANG=de_DE.UTF-8 tracker-needle > > there is no more error ? > > same thing here (tracker-needle:2400): Tracker-WARNING **: Error initializing > libicu support: 'U_ILLEGAL_ARGUMENT_ERROR' > > also tracker makes gnome-shell and empathy crash. That's quite a bold claim which I would say is unfounded and untrue. Please provide evidence as to how we're making those crash.
Forget it, Martyn; that was caused by a mistake of mine. Sorry for the trouble, but the warning is still there.
According to comments in the Gentoo bugzilla (https://bugs.gentoo.org/show_bug.cgi?id=426276): > Bernd Feige 2012-07-20 10:27:46 UTC > > Update: Setting *all of* LANG=C LC_CTYPE=C LC_NUMERIC=C both tracker-store and tracker-search work (needed to remove ~/.local/share/tracker/data though to get any matches despite a quite sizable tracker-store.journal...) > > Bernd Feige 2012-07-20 12:37:23 UTC > > Update 2: I think I found the reason for the less-than-overwhelming response now: The problem only occurs with "mixed" LC_* settings such as my own. > > When not touching LC_NUMERIC (i.e. unset LC_NUMERIC) everything is fine also using LANG=de_DE.UTF-8. > > Now I'm sure that a relatively recent change caused this; could have either been > > sys-libs/glibc-2.15-r2 > > or > > dev-libs/icu-49.1.2
Created attachment 219332: proposed patch. I think I see what happened. In the FTS parser, tracker_parser_reset() calls ubrk_open(UBRK_WORD, setlocale (LC_ALL, NULL), ...). Depending on your locale setup, setlocale(LC_ALL, NULL) as implemented by glibc can easily return a string more than 200 characters long. ICU uses a fixed-size buffer (welcome to the year 1990!) to process locale strings. The size of this buffer is ULOC_FULLNAME_CAPACITY bytes, where ULOC_FULLNAME_CAPACITY is defined as 157. In the chain of calls from ubrk_open() (BreakIterator::createInstance → BreakIterator::makeInstance → BreakIterator::buildInstance → ures_open → uloc_getBaseName → _canonicalize → u_terminateChars), when supplied an overly long locale definition, ICU detects that the string does not fit in a ULOC_FULLNAME_CAPACITY-size buffer and, rather than corrupting memory, reports an error, which in turn causes tracker_parser_reset() to fail. Ideally, ICU ought to make its buffers bigger. However, since ULOC_FULLNAME_CAPACITY is a part of ICU's API and ABI (it's defined in unicode/uloc.h) and is used in over 100 places in the ICU source, this is unlikely to happen in the near future. Fortunately, tracker can easily work around the problem by calling ubrk_open(UBRK_WORD, setlocale (LC_CTYPE, NULL), ...), since that should be sufficient for detecting word boundaries, and the LC_CTYPE definition is certainly less than 157 characters.
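To make the size argument concrete, here is a minimal sketch in plain C that uses only libc, not ICU. The constant ASSUMED_ULOC_FULLNAME_CAPACITY, the helpers locale_name_fits_icu() and ctype_locale_name(), and the SAMPLE_COMPOSITE_LOCALE string are all hypothetical names introduced for illustration. The sample string only imitates the kind of composite value glibc can return from setlocale(LC_ALL, NULL) on a "mixed" setup and was not captured from a real system. Real code would include unicode/uloc.h and use ULOC_FULLNAME_CAPACITY directly.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: mirrors ULOC_FULLNAME_CAPACITY as quoted in this report (157).
 * Real code should take the value from <unicode/uloc.h> instead. */
#define ASSUMED_ULOC_FULLNAME_CAPACITY 157

/* Illustrative only: imitates a glibc composite locale string for a setup
 * where LC_NUMERIC differs from the other categories. */
static const char SAMPLE_COMPOSITE_LOCALE[] =
    "LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;"
    "LC_COLLATE=de_DE.UTF-8;LC_MONETARY=de_DE.UTF-8;"
    "LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=de_DE.UTF-8;"
    "LC_ADDRESS=de_DE.UTF-8;LC_TELEPHONE=de_DE.UTF-8;"
    "LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=de_DE.UTF-8";

/* Hypothetical helper: true if `name` plus its terminating NUL fits in a
 * buffer of ASSUMED_ULOC_FULLNAME_CAPACITY bytes. */
static bool locale_name_fits_icu(const char *name)
{
    assert(name != NULL);
    return strlen(name) < ASSUMED_ULOC_FULLNAME_CAPACITY;
}

/* Hypothetical helper: query only the LC_CTYPE category, which names a
 * single locale, and fall back to "C" if the query fails. */
static const char *ctype_locale_name(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL ? name : "C";
}
```

With these helpers, locale_name_fits_icu(SAMPLE_COMPOSITE_LOCALE) is false, while a single name such as "de_DE.UTF-8" or the value of ctype_locale_name() fits comfortably. The length check is only a diagnostic; it does not replace passing a single-category locale name to ubrk_open().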
After thinking about this a bit further: calling ubrk_open(UBRK_WORD, setlocale(LC_ALL, NULL), ...) was not merely bad practice for some complex locale setups, but *conceptually wrong* in the first place. ubrk_open expects the name of just a single locale (e.g. "en_US.UTF-8"), not the full definition of your various locale variables and their values as returned by glibc's setlocale(LC_ALL, NULL).
I agree that LC_CTYPE sounds like the best choice here. The setlocale manpage is not exactly clear that setlocale (LC_ALL, NULL) may return more than a single locale but it definitely does in some setups.
commit 48713ba26af38a15a97fc7ebb0828cd287ef2447 Author: Alexandre Rostovtsev <tetromino@gentoo.org> Date: Fri Jul 20 10:46:33 2012 -0400 libtracker-fts: ICU cannot handle complex locale descriptions ubrk_open expects the name of just a single locale (e.g. "en_US.UTF-8"), not the full definition of your various locale variables and their values as returned by glibc's setlocale(LC_ALL, NULL). Instead, limit ourselves to LC_CTYPE, since after all, that's all we need to determine word boundaries. Fixes GB#675660.
*** Bug 676989 has been marked as a duplicate of this bug. ***