GNOME Bugzilla – Bug 666749
Empty window in LANG=ko_KR.UTF-8
Last modified: 2014-03-21 12:52:55 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=770024

Description of problem:
gnome-documents shows an empty window when started with LANG=ko_KR.UTF-8. With LANG=en_US.UTF-8 or LANG=ja_JP.UTF-8, it shows many document icons.

Version-Release number of selected component (if applicable):
0.2.1-1.fc16.x86_64

How reproducible:
always

Steps to Reproduce:
1. $ LANG=ko_KR.UTF-8 gnome-documents

---
gjs-1.30.0-1.fc16.x86_64
tracker-0.12.8-2.fc16.x86_64
-> tracker

This looks like a bug in tracker's implementation of the fn:starts-with function, which we use to filter out files from URIs we're not interested in.

As a testcase, run the following commands:

tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "
Results:
[ a lot of results ]

LANG=ko_KR.UTF-8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "
Results:
None

Replacing fn:starts-with with fn:contains fixes the bug on my machine; I will probably commit such a workaround to gnome-documents if we don't manage to get this fixed in a better way before 3.4.
Actually reassigning.
This issue still happens in tracker 0.14.0
(In reply to comment #1)
> -> tracker
>
> This looks like a bug in tracker's implementation of the fn:starts-with
> function, which we use to filter out files from URIs we're not interested in.
>
> As a testcase, run the following commands
>
> tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER
> (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "
> Results:
> [ a lot of results ]
>
> LANG=ko_KR.UTF-8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo
> FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "
> Results:
> None
>
> Replacing fn:starts-with with fn:contains fixes the bug on my machine; I will
> probably commit such a workaround to gnome-documents if we don't manage to get
> this fixed in a better way before 3.4.

Hello Cosimo, I just tested this with master and it seems to work fine, but I don't see any differences in master from the 0.14 branch which would relate to this.

This is what I get:

$ LANG=ko_KR.UTF-8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/martyn/Pictures\")) } "|wc -l
1000

$ tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/martyn/Pictures\")) } "|wc -l
1000

I was considering this being a collation issue, but I am unable to reproduce the issue locally. Is there anything special about the file names (i.e. are they non-ASCII)?
(In reply to comment #4)
> Hello Cosimo, I just tested this with master and it seems to work fine, but I
> don't see any differences in master from the 0.14 branch which would relate to
> this.
>
> This is what I get:
>
> $ LANG=ko_KR.UTF-8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a
> nmm:Photo FILTER
> (fn:starts-with(nie:url(?u), \"file:///home/martyn/Pictures\")) } "|wc -l
> 1000
>
> $ tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER
> (fn:starts-with(nie:url(?u), \"file:///home/martyn/Pictures\")) } "|wc -l
> 1000

Weird; with the same test query I get (using Tracker 0.14)

LANG=ko_KR.UTF-8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "|wc -l
3

(which is "Results", "None", and newline)

But without setting the locale to ko_KR.UTF-8 I get:

tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc/Pictures\")) } "|wc -l
292

I can try to test with git master, but I don't see any relevant commits either... Is there any information I can provide to debug this further? Which distribution are you using (I'm on Fedora)? Maybe this is triggered by a different configuration in the underlying locale plumbing between distros?
*** Bug 673224 has been marked as a duplicate of this bug. ***
I just tested this on Fedora 16:

tracker-0.12.10-1.fc16.x86_64
gnome-documents-0.2.1-1.fc16.x86_64

[julas@snowball2 ~]$ LANG=pl_PL.utf8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/julas/Obrazy\")) } " | wc -l
3
[julas@snowball2 ~]$ LANG=en_US.utf8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/julas/Obrazy\")) } " | wc -l
3766

and on Fedora 17:

tracker-0.14.0-1.fc17.x86_64
gnome-documents-0.4.0.1-1.fc17.x86_64

[julas@branched Obrazy]$ LANG=pl_PL.utf8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/julas/Obrazy\")) } " | wc -l
3
[julas@branched Obrazy]$ LANG=en_US.utf8 tracker-sparql -q "SELECT ?u nie:url(?u) WHERE {?u a nmm:Photo FILTER (fn:starts-with(nie:url(?u), \"file:///home/julas/Obrazy\")) } " | wc -l
4
To be fair, I tested on my laptop with Ubuntu on it. Testing here with F16 (desktop), I get the same results (using tracker-0.12.10-1.fc16.x86_64) in the en_US, pl_PL and ko_KR locales we've been bouncing around here. I know that the locale affects the collation and hence the sorting of results generally, but I wouldn't expect a different number of results.

Out of curiosity, what Unicode backend are you using? Presumably libunistring, as is used on my system here:

$ rpm -qR tracker-0.12.10-1.fc16.x86_64|grep -i unistring
libunistring.so.0()(64bit)

Do the files you search for have interesting names at all? I wonder if I lack the material to test with at this end.
(In reply to comment #9)
> Out of curiosity, what unicode backend are you using? Presumably libunistring
> as is used for my system here:
>
> $ rpm -qR tracker-0.12.10-1.fc16.x86_64|grep -i unistring
> libunistring.so.0()(64bit)

Same here, tracker 0.14 is compiled against the unistring backend on F17.

> Do the files you search for have interesting names at all? I wonder if I lack
> the material to test with this end?

I think the problem doesn't lie in the names of the files, but it's in the way the FILTER directive is processed; as a data point supporting this, these two (almost equivalent) queries, both with Korean locale, give two completely different results:

$ LANG=ko_KR.utf-8 tracker-sparql -q "SELECT ?u WHERE { ?u a rdfs:Resource }" | wc -l
4938

$ LANG=ko_KR.utf-8 tracker-sparql -q "SELECT ?u WHERE { ?u a rdfs:Resource FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc\")) }" | wc -l
3
(In reply to comment #10)
> I think the problem doesn't lie in the names of the files, but it's in the way
> the FILTER directive is processed; as a data point supporting this, these two
> (almost equivalent) queries, both with Korean locale, give two completely
> different results:
>
> $ LANG=ko_KR.utf-8 tracker-sparql -q "SELECT ?u WHERE { ?u a rdfs:Resource }" |
> wc -l
> 4938
>
> $ LANG=ko_KR.utf-8 tracker-sparql -q "SELECT ?u WHERE { ?u a rdfs:Resource
> FILTER (fn:starts-with(nie:url(?u), \"file:///home/cosimoc\")) }" | wc -l
> 3

Indeed. I discussed this with Jürg in the #tracker room today and I checked the code too. The reason fn:starts-with doesn't behave the same way as fn:contains is that (AFAICS) one uses GLOB and the other uses the BETWEEN keyword in SQL. Now, we do something fancy by comparing between 'A' and 'B'+TRACKER_COLLATION_LAST_CHAR. We do this for collation reasons. What makes this complicated is that TRACKER_COLLATION_LAST_CHAR is not the same for all backends. You can see this in the Tracker source directory:

$ git grep COLLATION_LAST_CHAR . | grep -i define
src/libtracker-data/tracker-collation.h:#define TRACKER_COLLATION_LAST_CHAR ((gunichar) 0x10fffd)
src/libtracker-data/tracker-collation.h:#define TRACKER_COLLATION_LAST_CHAR ((gunichar) 0x9fa5)

One is for libunistring and the other for libicu. So we switch depending on the implementation we were built with (according to configure).

One way this would cause your situation is if TRACKER_COLLATION_LAST_CHAR is *not* the last character any more, or if it's now sorted into a position which breaks things for us.

What I wonder is, if you switch to libicu in your build of Tracker, does this change anything for you?

I've CCd Aleksander on this bug so he can comment. He is our resident Unicode specialist :) and may correct me on what I've said above and provide some additional input.
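The BETWEEN range described in the comment above can be illustrated outside Tracker. What follows is only a minimal sketch using Python's sqlite3 module, not Tracker's actual code. The table, the sample URLs and both collation functions are made up for illustration. The "ignoring" collation is an assumed model of a locale collation that gives the sentinel character no weight; the thread does not establish that this is what happens under ko_KR. The sketch also assumes that fn:starts-with is the function translated to BETWEEN and fn:contains the one translated to GLOB.

```python
import sqlite3

# One of the two TRACKER_COLLATION_LAST_CHAR values quoted in the thread.
LAST_CHAR = "\U0010fffd"
# Hypothetical URI prefix and sample data, for illustration only.
PREFIX = "file:///home/user/Pictures"
URLS = [
    "file:///home/user/Pictures/a.jpg",
    "file:///home/user/Pictures/b.png",
    "file:///home/user/Documents/c.odt",
]


def codepoint_cmp(a, b):
    """Plain code point order: LAST_CHAR sorts after every character used in URLS."""
    return (a > b) - (a < b)


def ignoring_cmp(a, b):
    """Assumed faulty collation: LAST_CHAR carries no weight at all."""
    return codepoint_cmp(a.replace(LAST_CHAR, ""), b.replace(LAST_CHAR, ""))


conn = sqlite3.connect(":memory:")
conn.create_collation("codepoint", codepoint_cmp)
conn.create_collation("ignoring", ignoring_cmp)
conn.execute("CREATE TABLE files (url TEXT)")
conn.executemany("INSERT INTO files VALUES (?)", [(u,) for u in URLS])


def count_between(conn, collation, prefix):
    """Prefix match expressed as a range: prefix <= url <= prefix + LAST_CHAR."""
    sql = f"SELECT COUNT(*) FROM files WHERE url COLLATE {collation} BETWEEN ? AND ?"
    return conn.execute(sql, (prefix, prefix + LAST_CHAR)).fetchone()[0]


def count_glob(conn, prefix):
    """Substring match with GLOB, which does not depend on the collation."""
    sql = "SELECT COUNT(*) FROM files WHERE url GLOB ?"
    return conn.execute(sql, ("*" + prefix + "*",)).fetchone()[0]


print(count_between(conn, "codepoint", PREFIX))  # 2
print(count_between(conn, "ignoring", PREFIX))   # 0
print(count_glob(conn, PREFIX))                  # 2
```

With a collation where the sentinel really sorts last, the range catches every URL under the prefix. Once the sentinel stops sorting last, the upper bound collapses onto the prefix itself and the range is empty, while the GLOB form is unaffected. That would be consistent with the fn:contains workaround helping, at the cost of matching the prefix anywhere in the URL rather than only at the start.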
(In reply to comment #11)
> Indeed. I discussed this with Jürg in the #tracker room today and I checked the
> code too. The reason fn:starts-with doesn't work the same way but it does with
> fn:contains is because (AFAICS) one uses a GLOB and the other uses BETWEEN
> keywords in SQL. Now, we do something fancy by comparing between 'A' and
> 'B'+TRACKER_COLLATION_LAST_CHAR. We do this for collation reasons. What makes
> this complicated is, the TRACKER_COLLATION_LAST_CHAR is not the same for all
> backends. You can see this in the Tracker source directory:
>
> $ git grep COLLATION_LAST_CHAR . | grep -i define
> src/libtracker-data/tracker-collation.h:#define TRACKER_COLLATION_LAST_CHAR
> ((gunichar) 0x10fffd)
> src/libtracker-data/tracker-collation.h:#define TRACKER_COLLATION_LAST_CHAR
> ((gunichar) 0x9fa5)
>
> One is for libunistring and the other for libicu. So we switch depending on the
> implementation we were built with (according to configure).
>
> One way this would cause your situation is if the TRACKER_COLLATION_LAST_CHAR
> is *not* the last character any more or it's now sorted into a position which
> breaks things for us.
>
> What I wonder is, if you switch to libicu in your build of Tracker, does this
> change anything for you?

Martyn, thanks for the investigation and the time you are spending on this.
I have now tested with Tracker git master rebuilt with the libicu backend, and I can confirm that your analysis is right: with that backend I get the correct number of results when testing with the Korean locale.
*** Bug 676368 has been marked as a duplicate of this bug. ***
Martyn, I took the liberty of raising the importance of this report to blocker, since it renders applications such as Documents completely unusable for users with non-English locales. Do you suggest just switching to libicu as the default backend?
This thread on the libunistring mailing list talks about the issue, but it didn't get any reply about when libunistring will provide a non-strcoll() UTS#10-based collation:
http://lists.gnu.org/archive/html/bug-libunistring/2010-11/msg00008.html

If switching to libicu, just note that it will make the FTS parsing much slower due to extra conversions to/from UTF-16. But of course, if that is the only way to have a proper collation...

Just wondering, can't we re-work Jürg's fix in order to handle these new cases with libunistring? Maybe by providing a custom collation method which would always treat 0x10fffd as the last char and call libunistring's collator internally?
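The custom collation suggested above, one that always treats 0x10fffd as the last character and otherwise defers to the underlying collator, could look roughly like the following sketch. It is written in Python purely for illustration and is not Tracker or libunistring code. `with_last_char`, `ignoring_cmp` and `in_prefix_range` are hypothetical names, and `ignoring_cmp` is a stand-in for a base collator that mishandles the sentinel. Cutting strings at a code point index is a simplification that a real collation (with contractions or ignorable characters) may not permit.

```python
# Sentinel value taken from the thread; everything else here is illustrative.
LAST_CHAR = "\U0010fffd"
PREFIX = "file:///home/user/Pictures"  # hypothetical URI prefix


def ignoring_cmp(a, b):
    """Stand-in base collator that gives LAST_CHAR no weight (assumed failure mode)."""
    a, b = a.replace(LAST_CHAR, ""), b.replace(LAST_CHAR, "")
    return (a > b) - (a < b)


def with_last_char(base_cmp):
    """Wrap base_cmp so that LAST_CHAR always sorts after any other character."""
    def cmp(a, b):
        i, j = a.find(LAST_CHAR), b.find(LAST_CHAR)
        if i < 0 and j < 0:
            # No sentinel involved: defer entirely to the base collator.
            return base_cmp(a, b)
        # Compare the text before the earliest sentinel with the base collator.
        cut = min(k for k in (i, j) if k >= 0)
        head = base_cmp(a[:cut], b[:cut])
        if head != 0:
            return head
        a_has = a[cut:cut + 1] == LAST_CHAR
        b_has = b[cut:cut + 1] == LAST_CHAR
        if a_has and b_has:
            # Both have the sentinel here: continue with the remainders.
            return cmp(a[cut + 1:], b[cut + 1:])
        # Only one side has the sentinel at this position: that side sorts last.
        return 1 if a_has else -1
    return cmp


def in_prefix_range(cmp, url, prefix):
    """Range test equivalent to: url BETWEEN prefix AND prefix + LAST_CHAR."""
    return cmp(prefix, url) <= 0 and cmp(url, prefix + LAST_CHAR) <= 0


wrapped = with_last_char(ignoring_cmp)
print(in_prefix_range(ignoring_cmp, "file:///home/user/Pictures/a.jpg", PREFIX))
print(in_prefix_range(wrapped, "file:///home/user/Pictures/a.jpg", PREFIX))
print(in_prefix_range(wrapped, "file:///home/user/Documents/c.odt", PREFIX))
```

Any two-argument comparator returning a negative, zero or positive number can be passed as `base_cmp`; Python's `locale.strcoll` is one candidate, although its results depend on the active locale, which is why the sketch uses a deterministic stand-in.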
(In reply to comment #14)
> Martyn, I felt free to raise the importance of this report to blocker, since it
> renders applications such as Documents completely unusable for users with
> non-english locales.
>
> Do you suggest to just switch to libicu as the default backend?

Cosimo, thanks for raising it. I wasn't actually sure of the best way forward here. After consideration, perhaps the best approach is twofold...

1. We attempt to patch the issue Aleksander mentions on the unistring mailing list. This may end up meaning we fix strcoll(), since I believe libunistring is using that under the hood.
2. We try to fix it in Tracker as Aleksander suggests for the short term.

I am slightly concerned that libicu is the less perfect choice of the two because of the reasons Aleksander pointed out around performance. We've seen some bugs reported against libicu use recently too:

https://bugzilla.gnome.org/show_bug.cgi?id=675660
https://bugzilla.gnome.org/show_bug.cgi?id=676989

Though, I suspect these are related to incorrectly set up environments:

https://bbs.archlinux.org/viewtopic.php?id=140435

I have a suspicion fixing this bug would resolve the above issues:

https://bugzilla.gnome.org/show_bug.cgi?id=676209

We could default to libicu over libunistring (in the order of discovery in configure.ac). That would likely help here. But I would like to see bugs/patches related to other components in the stack filed/created (i.e. for glibc, perhaps Tracker improvements, and for libunistring).

(In reply to comment #15)
> This thread in the libunistring mailing list talks about the issue; but didn't
> get any reply about when libunistring will provide a non-strcoll() UTS#10-based
> collation:
> http://lists.gnu.org/archive/html/bug-libunistring/2010-11/msg00008.html

Is it worth asking again?

> If switching to libicu, just note that it will make the FTS parsing much slower
> due to extra conversions to/from UTF-16. But of course, if that is the only way
> to have a proper collation...

Any idea how much slower? 1/2 speed? Of course it depends on your end machine.

> Just wondering, can't we re-work Jürg's fix in order to handle these new cases
> with libunistring? Maybe providing a custom collation method which would treat
> 0x10fffd really as the last char always and calling libunistring's collator
> internally?

I've CCd Jürg. Any comments, Jürg?
I am not 100% certain, but gnome-contacts-3.4.1-1.fc17.x86_64 might be suffering from this problem too. It is a bigger issue since in GNOME 3.4 you need it to group Empathy contacts.
gnome-contacts 3.4.0-1 seems to work on my system where gnome-documents doesn't.
I doubt it's related - Contacts 3.4 does not use Tracker in any way AFAIK.
This bug still happens in GNOME 3.6 Beta.
tracker-0.14.2-2.fc18.x86_64
gnome-documents-3.5.90-2.fc18.x86_64
Sangu, sadly, nothing has changed here. My comment #16 suggests approaches to fix/improve this situation, but we're at a bit of a stalemate here.
gnome-documents has a (performance-reducing) workaround in place now: http://git.gnome.org/browse/gnome-documents/commit/?id=29b6bc7d2db52955117a3340bd2ff5434b39dc56
And at least for Fedora, we'll get tracker built against icu. Maybe that's worth recommending on distributor-list.
I was hoping to patch master and release a 0.14.3 some time soon with this too. Thanks Matthias.
Dropping off the blocker list, workaround is in place.
*** Bug 684640 has been marked as a duplicate of this bug. ***
I've now defaulted to icu for the Unicode support in master to try to avoid this problem. Release 0.14.3 should also have this change.
*** Bug 679316 has been marked as a duplicate of this bug. ***
Cosimo, did you tell me on IRC some time ago that this is no longer reproducible, i.e. possibly because of an ICU bug fix?
It was actually me. I found that sometime after comment 23 an unrelated change to the tracker package switched it back to libunistring in Fedora. But that did not cause this bug to reappear. I tried the reproducers on this bug, but could not make it fail. While I have switched the package back to use libicu, it might be so that something was fixed somewhere.
Thanks Rishi, I wonder if we can close the bug, that's all. Any problem with me closing as OBSOLETE?
Let's close this as OBSOLETE.
Hi Martyn, what is the current state of this issue? Is it OK now to use libunistring, or is libicu still recommended? /me wonders which library I should use for the Debian package
Michael, so we defaulted to libicu in master to improve the situation (see above comments), but Rishi is saying that Fedora defaulted to libunistring, but now they don't. So everyone (me included) seems to go for libicu, and it makes more sense to me because we also have MP3 encoding detection with libicu. However, it's not clear if it's fixed with libicu. So I don't think that answers your question, but that's where we are. If Rishi could verify this bug is obsolete for libicu, that would certainly give you a more concrete way forward here.
Using libicu was originally the suggested fix or workaround for this bug. See comment 27 and a few above it. It appears that it now works properly with libunistring, but I don't know why.