GNOME Bugzilla – Bug 172690
Localised collating order borks up filenames
Last modified: 2011-02-04 16:18:41 UTC
Try creating a few files like this: mkdir t && cd t touch event.C event.h eventgenerator.C eventgenerator.h If the locale is different from LANG=C, the file selector chooses to sort the file names as event.C eventgenerator.C eventgenerator.h event.h instead of event.C event.h eventgenerator.C eventgenerator.h (the attached screenshot shows this for a Danish locale, but I know the same happens for a en_US locale). This does not seem right. I've asked a knowledgeable guy on the libc-locales mailing list, and according to him it is a feature of the localised collation mechanism: http://article.gmane.org/gmane.comp.lib.glibc.locales/96 http://article.gmane.org/gmane.comp.lib.glibc.locales/97 I guess the problem is that the dot has a different meaning in the context of file names so the file selector should try to do something clever like splitting the extension from the basename and sort the basenames first.
Created attachment 39703 [details] Screenshot of file selector with weird ordering
This is not a bug, but what is to be expected, given that the Danish locale follows the Danish orthograpy (Retskrivningsordbogen, 2001) and the Dansish standard DS 377. There it is prescribed that "." is to be ignored in the first pass of comparison.
It may still be the case that sorting files properly in the file chooser requires us to deviate from orthographically correct sorting, and use a diffent sorting, a la a) first sort by the name part (ie everything up to the first .) b) then sort by the extension (ie everything after the first .)
Windows has two settings: sort with the locale's rules on the full filename, and sort by splitting the filename into numeric and non-numeric sections. With the former, you'd get file10name.txt file1name.txt file2name.txt file3name.txt With the latter, you'd get file1name.txt file2name.txt file3name.txt file10name.txt Grep for "intuitive" here: http://www.winxpsolution.com/Tweakuixppro.aspx
Whatever we do, it needs to be exactly the same thing that Nautilus does.
I have started looking into this. The problem with the dot can easily be solved by making a special collate_key_from_file_name which splits the filename up in non-dotted parts, put these through g_utf8_collate_key and finally assemble the pieces again with the '.' replaced with '\1' (i.e. integer 1 casted to char, not '1'). If this works well in practice, we could put the function in glib and have Nautilus use it too. Getting numeric sorting (without decimal separators) right is a bit more difficult without also replacing the strcmp comparison of the sort_key's. However, with a clever enough hack, I think it might be doable too. I will think more about this and try to get something running when I get some more spare time.
I figured an algorithm out and finally got around to implementing it. Attached is a patch for GTK+ 2.6 that introduces a new collate_key generation function and uses it. The tests/testfilechooser happily sorts both filenames with dots and numbers correctly with this. So what's next? Should I prepare a patch for glib?
Created attachment 46621 [details] [review] Patch for GTK+ 2.6
The next step would be to get agreement from the nautilus side on what sorting algorithm to use. Maybe send a mail to nautilus-list@gnome.org
I agree that this seems to be a better collation order for filenames. I would like to have a g_utf8_collate_key_for_filename() (or similar) function that i can use in nautilus.
Does this algorithm handle leading zeros correctly? It doesn't seem so. It sorts "01" before "10", but not "02" before "3".
The "obvious" fix for that is to drop the initial zeros, but that means the sort order of "00" and "000" is undefined.
I think not generating superdigits for leading zeros will fix it. I'll check it.
<alex> matthias: how do you mean <alex> matthias: i was thinking moving the leading zeros to after the normal digits would work <alex> matthias: but it causes issues with other things following the digits <alex> matthias: I'm trying to find a better way.. <matthias> alex: I thought that n superdigits are basically saying: sort me as an n-digit number <alex> matthias: yeah <matthias> alex: so if you want to ignore leading zeros and sort 003 as among the single-digit numbers, just don't generate superdigits for the leading zeros <alex> How would you mean we convert "01" <matthias> alex: \2:01 I guess <matthias> ah, doesn't work, since it would be sorted before eg \2:2 <alex> so, "3" would be "\2:3" and "02" "\2:02" <alex> exactly <alex> i think this will work: <matthias> so you need subdigits... <alex> "0000123" -> \2:::123\4 <alex> and "123" -> \2:::123\0 <alex> actually, his code used (d-1) superdigits <alex> so, just :: would work too <matthias> alex: could work, but you get into trouble with more than 30 leading zeros... <matthias> alex: which can be safely ignored... <alex> matthias: how is 30 a proble? <alex> 255 i can see <matthias> alex: your suffix looks like a regular character then <matthias> but maybe thats not a problem, since we only special-case numbers at the end anyway... <alex> its unlikely to be a problem due to the leading \2 <alex> which causes this to be sorted before all other non-numbers <matthias> ah, ok <alex> and for numbers, if they don't have as many digits we won't compare them due to the superdigits <alex> it has a problem though... <alex> file1b -> file\21\0a <alex> file01a -> file\21\1b <alex> Shouldn't file01a sort before file1b? <matthias> does the algorithm acually look at numbers inside name parts, or only at the end ? * matthias checks <matthias> it looks at all numbers <alex> in order to get file01a file1a file01b file1b to sort right (all ending with b after those ending with a) you really have to collate 01 and 1 identically <alex> hmmm <alex> no, you don't <matthias> alex: you could try to append your leading-zeros-counter at the very end of the string <alex> you can append *after* the string markmc matthias <alex> matthias: exactly <alex> you have to keep track of the case with multiple numbers in a string right though <alex> so, you get a bunch of leading-zero counters at the end <matthias> yep <alex> but that *should* work <matthias> collation is always fun <alex> the only question now is interaction with the system collation key <alex> what if it generates \2 chars? <alex> glibc generates for "1" : 37, 1, 16, 1, 2 <matthias> alex thats true, does glibcs collation function reserve some low/high bytes for such marker purposes ? <alex> i have no idea <alex> matthias: i'm not sure what to do about this. Since glibc converts 'a' to "46, 1, 16, 1, 2" there is a problem with the code converting '.' to \1. I.e. "a.1" would sort before "a" <matthias> alex: would it, really ? the code doesn't use a literal a, but the collate key for a, so shouldn't it produce <matthias> something that still sorts after 46,1,16,1 ? <alex> oh, i see what you mean. It wouldn't use "a" , but "46, 1, 16, 1, 2" <alex> But its still sort of sketchy <matthias> yes, it would be better to return a sequence of keys for the various parts <matthias> but we have to return a single string <alex> what if we use a number sentinel that is very unlikely to be in a glibc collation key <alex> like \1\1\1 <matthias> you mean after each number inside the string, or at the end before the leading-zeros counts <matthias> ? <alex> instead of turning "123" into \2::123" you'd turn it into "\1\1\2::123" <alex> increases the likelyhood of this sorting before anything glibc produces <alex> I worry a bit about the glibc stuff though <alex> strxfrm("aa") != strxfrm("a") + strxfrm("a") <alex> so, the splitting up may cause things to sort wrongly <matthias> but we only split up where we want to treat parts specially, not in the middle of something we want to treat as a "collation unit" <matthias> so the main worry is to find markers for insertion between the parts which cannot be confused with glibc's markers <alex> yeah. <alex> i think a string of \1 should work <alex> it'll make pretty sure the following numerical part will be sorted at the very front.
Created attachment 46955 [details] [review] GLib patch Here is a glib version of the patch, which additionally tries to handle leading zeros in numbers.
Created attachment 46957 [details] [review] New version This version actually skips the leading zeros, skips the utf8 stuff (not needed to find ascii chars in utf8) and adds a sentinel that hopefully avoids conflicts with libcs collation keys.
Raising priority and confirming because of the patches.
I've committed the patch now. Note that there are still some corner cases where the sorting order is undefined, e.g. a01b2 vs a1b02