Bug 172690 – Localised collating order borks up filenames

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 172690 - Localised collating order borks up filenames


Summary:	Localised collating order borks up filenames


Status:	RESOLVED FIXED

Product:	gtk+
Classification:	Platform
Component:	Widget: GtkFileChooser
Version:	2.6.x
Hardware:	Other Linux

Importance:	High normal
Target Milestone:	---
Assigned To:	gtk-bugs
QA Contact:	gtk-bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-04-05 10:24 UTC by Ole Laursen
Modified:	2011-02-04 16:18 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Screenshot of file selector with weird ordering (26.97 KB, image/png) 2005-04-05 10:25 UTC, Ole Laursen		Details
Patch for GTK+ 2.6 (4.50 KB, patch) 2005-05-18 21:16 UTC, Ole Laursen	none	Details \| Review
GLib patch (5.88 KB, patch) 2005-05-27 15:16 UTC, Matthias Clasen	none	Details \| Review
New version (6.29 KB, patch) 2005-05-27 16:08 UTC, Alexander Larsson	none	Details \| Review

Description Ole Laursen 2005-04-05 10:24:35 UTC

Try creating a few files like this:

  mkdir t && cd t
  touch event.C event.h eventgenerator.C eventgenerator.h

If the locale is different from LANG=C, the file selector chooses to sort the
file names as

  event.C
  eventgenerator.C
  eventgenerator.h
  event.h

instead of

  event.C
  event.h
  eventgenerator.C
  eventgenerator.h

(the attached screenshot shows this for a Danish locale, but I know the same
happens for a en_US locale). This does not seem right. I've asked a
knowledgeable guy on the libc-locales mailing list, and according to him it is a
feature of the localised collation mechanism:

  http://article.gmane.org/gmane.comp.lib.glibc.locales/96
  http://article.gmane.org/gmane.comp.lib.glibc.locales/97

I guess the problem is that the dot has a different meaning in the context of
file names so the file selector should try to do something clever like splitting
the extension from the basename and sort the basenames first.

Comment 1 Ole Laursen 2005-04-05 10:25:17 UTC

Created attachment 39703 [details]
Screenshot of file selector with weird ordering

Comment 2 Keld Simonsen 2005-04-06 09:28:49 UTC

This is not a bug, but what is to be expected, given that the Danish locale
follows the Danish orthograpy (Retskrivningsordbogen, 2001) and the Dansish
standard DS 377. There it is prescribed that "." is to be ignored in the first
pass of comparison.

Comment 3 Matthias Clasen 2005-04-06 12:24:00 UTC

It may still be the case that sorting files properly in the file chooser
requires us to deviate from orthographically correct sorting, and use a diffent
sorting,
a la

a) first sort by the name part (ie everything up to the first .)
b) then sort by the extension (ie everything after the first .)

Comment 4 Federico Mena Quintero 2005-04-15 21:13:55 UTC

Windows has two settings:  sort with the locale's rules on the full filename,
and sort by splitting the filename into numeric and non-numeric sections.

With the former, you'd get
file10name.txt
file1name.txt
file2name.txt
file3name.txt

With the latter, you'd get
file1name.txt
file2name.txt
file3name.txt
file10name.txt

Grep for "intuitive" here:
http://www.winxpsolution.com/Tweakuixppro.aspx

Comment 5 Federico Mena Quintero 2005-04-15 21:14:21 UTC

Whatever we do, it needs to be exactly the same thing that Nautilus does.

Comment 6 Ole Laursen 2005-05-08 21:23:57 UTC

I have started looking into this. The problem with the dot can easily be solved
by making a special collate_key_from_file_name which splits the filename up in
non-dotted parts, put these through g_utf8_collate_key and finally assemble the
pieces again with the '.' replaced with '\1' (i.e. integer 1 casted to char, not
'1').

If this works well in practice, we could put the function in glib and have
Nautilus use it too.

Getting numeric sorting (without decimal separators) right is a bit more
difficult without also replacing the strcmp comparison of the sort_key's.
However, with a clever enough hack, I think it might be doable too. I will think
 more about this and try to get something running when I get some more spare time.

Comment 7 Ole Laursen 2005-05-18 21:13:42 UTC

I figured an algorithm out and finally got around to implementing it. Attached
is a patch for GTK+ 2.6 that introduces a new collate_key generation function
and uses it. The tests/testfilechooser happily sorts both filenames with dots
and numbers correctly with this.

So what's next? Should I prepare a patch for glib?

Comment 8 Ole Laursen 2005-05-18 21:16:12 UTC

Created attachment 46621 [details] [review]
Patch for GTK+ 2.6

Comment 9 Matthias Clasen 2005-05-19 13:05:18 UTC

The next step would be to get agreement from the nautilus side on what sorting
algorithm to use. Maybe send a mail to nautilus-list@gnome.org

Comment 10 Alexander Larsson 2005-05-27 09:12:56 UTC

I agree that this seems to be a better collation order for filenames. I would
like to have a g_utf8_collate_key_for_filename() (or similar) function that i
can use in nautilus.

Comment 11 Alexander Larsson 2005-05-27 13:45:13 UTC

Does this algorithm handle leading zeros correctly?
It doesn't seem so.

It sorts "01" before "10", but not "02" before "3".

Comment 12 Alexander Larsson 2005-05-27 13:46:45 UTC

The "obvious" fix for that is to drop the initial zeros, but that means the sort
order of "00" and "000" is undefined.

Comment 13 Matthias Clasen 2005-05-27 14:09:37 UTC

I think not generating superdigits for leading zeros will fix it.
I'll check it.

Comment 14 Alexander Larsson 2005-05-27 15:15:33 UTC

<alex> matthias: how do you mean
<alex> matthias: i was thinking moving the leading zeros to after the normal
digits would work
<alex> matthias: but it causes issues with other things following the digits
<alex> matthias: I'm trying to find a better way..
<matthias> alex: I thought that n superdigits are basically saying: sort me as
an n-digit number
<alex> matthias: yeah
<matthias> alex: so if you want to ignore leading zeros and sort 003 as among
the single-digit numbers, just don't generate superdigits for the leading zeros
<alex> How would you mean we convert "01"
<matthias> alex: \2:01  I guess
<matthias> ah, doesn't work, since it would be sorted before eg \2:2
<alex> so, "3" would be "\2:3" and "02" "\2:02"
<alex> exactly
<alex> i think this will work:
<matthias> so you need subdigits...
<alex> "0000123" -> \2:::123\4
<alex> and "123" -> \2:::123\0
<alex> actually, his code used (d-1) superdigits
<alex> so, just :: would work too
<matthias> alex: could work, but you get into trouble with more than 30 leading
zeros...
<matthias> alex: which can be safely ignored...
<alex> matthias: how is 30 a proble?
<alex> 255 i can see
<matthias> alex: your suffix looks like a regular character then
<matthias> but maybe thats not a problem, since we only special-case numbers at
the end anyway...
<alex> its unlikely to be a problem due to the leading \2
<alex> which causes this to be sorted before all other non-numbers
<matthias> ah, ok
<alex> and for numbers, if they don't have as many digits we won't compare them
due to the superdigits
<alex> it has a problem though...
<alex> file1b  -> file\21\0a
<alex> file01a -> file\21\1b
<alex> Shouldn't file01a sort before file1b?
<matthias> does the algorithm acually look at numbers inside name parts, or only
at the end ?
* matthias checks
<matthias> it looks at all numbers
<alex> in order to get file01a file1a file01b file1b to sort right (all ending
with b after those ending with a) you really have to collate 01 and 1 identically
<alex> hmmm
<alex> no, you don't
<matthias> alex: you could try to append your leading-zeros-counter at the very
end of the string
<alex> you can append *after* the string
 markmc matthias 
<alex> matthias: exactly
<alex> you have to keep track of the case with multiple numbers in a string
right though
<alex> so, you get a bunch of leading-zero counters at the end
<matthias> yep
<alex> but that *should* work
<matthias> collation is always fun
<alex> the only question now is interaction with the system collation key
<alex> what if it generates \2 chars?
<alex> glibc generates for "1" : 37, 1, 16, 1, 2
<matthias> alex thats true, does glibcs collation function reserve some low/high
bytes for such marker purposes ?
<alex> i have no idea
<alex> matthias: i'm not sure what to do about this. Since glibc converts 'a' to
"46, 1, 16, 1, 2" there is a problem with the code converting '.' to \1. I.e.
"a.1" would sort before "a"
<matthias> alex: would it, really ? the code doesn't use a literal a, but the
collate key for a, so shouldn't it produce
<matthias> something that still sorts after 46,1,16,1 ?
<alex> oh, i see what you mean. It wouldn't use "a" , but "46, 1, 16, 1, 2"
<alex> But its still sort of sketchy
<matthias> yes, it would be better to return a sequence of keys for the various
parts
<matthias> but we have to return a single string
<alex> what if we use a number sentinel that is very unlikely to be in a glibc
collation key
<alex> like \1\1\1
<matthias> you mean after each number inside the string, or at the end before
the leading-zeros counts
<matthias> ?
<alex> instead of turning "123" into \2::123" you'd turn it into "\1\1\2::123"
<alex> increases the likelyhood of this sorting before anything glibc produces
<alex> I worry a bit about the glibc stuff though
<alex> strxfrm("aa") != strxfrm("a") + strxfrm("a")
<alex> so, the splitting up may cause things to sort wrongly
<matthias> but we only split up where we want to treat parts specially, not in
the middle of something we want to treat as a "collation unit"
<matthias> so the main worry is to find markers for insertion between the parts
which cannot be confused with glibc's markers
<alex> yeah.
<alex> i think a string of \1 should work
<alex> it'll make pretty sure the following numerical part will be sorted at the
very front.

Comment 15 Matthias Clasen 2005-05-27 15:16:12 UTC

Created attachment 46955 [details] [review]
GLib patch

Here is a glib version of the patch, which additionally tries to handle 
leading zeros in numbers.

Comment 16 Alexander Larsson 2005-05-27 16:08:36 UTC

Created attachment 46957 [details] [review]
New version

This version actually skips the leading zeros, skips the utf8 stuff (not needed
to find ascii chars in utf8) and adds a sentinel that hopefully avoids
conflicts with libcs collation keys.

Comment 17 Christian Kirbach 2005-06-07 15:38:30 UTC

Raising priority and confirming because of the patches.

Comment 18 Matthias Clasen 2005-06-08 05:25:32 UTC

I've committed the patch now. Note that there are still some corner cases where
the sorting order is undefined, e.g. a01b2 vs a1b02