Bug 793747 - g_utf8_collate_key_for_filename() corner cases with digits
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: i18n
Version: 2.54.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Duplicates: 754777
Depends on:
Blocks: 355152
 
 
Reported: 2018-02-23 12:19 UTC by Paul
Modified: 2018-05-24 20:14 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Screenshot of Nautilus sorting the test files by name (7.31 KB, image/png)
2018-02-23 12:19 UTC, Paul
Respect traditional C/POSIX collation when overridden through specific environment variables (3.25 KB, patch)
2018-03-07 09:10 UTC, Paul

Description Paul 2018-02-23 12:19:20 UTC
Created attachment 368820
Screenshot of Nautilus sorting the test files by name

Moved here from the relevant Nautilus bug: https://gitlab.gnome.org/GNOME/nautilus/issues/264

Create some test files as follows:

$ touch 000001000010-0.jpg 000001000010-A.jpg 000001A00010-0.jpg 000003BBF000-0.jpg 00003bA1A000-0.jpg 00003BD22000-0.jpg 0000A4AC3000-0.jpg 000100001 000100001.jpg 000200001

View them at the command line and in Nautilus:

$ ls -1
000001000010-0.jpg
000001000010-A.jpg
000001A00010-0.jpg
000003BBF000-0.jpg
00003bA1A000-0.jpg
00003BD22000-0.jpg
0000A4AC3000-0.jpg
000100001
000100001.jpg
000200001
$ nautilus .
[see attached screenshot]

ls sorts files as one might expect. It is not case sensitive (unless you use a case sensitive locale, e.g. LANG=C), but sorts alphabetically.
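As a rough illustration of the two orderings mentioned above, the Python sketch below sorts the same test names with a case-insensitive key and with plain code-point order. `str.casefold` is only an assumed stand-in for locale-aware collation (real `strcoll()` rules are more involved); for these particular names it happens to reproduce the `ls` listing, while plain code-point order approximates LANG=C.

```python
# Test file names from the report.
names = [
    "000001000010-0.jpg", "000001000010-A.jpg", "000001A00010-0.jpg",
    "000003BBF000-0.jpg", "00003bA1A000-0.jpg", "00003BD22000-0.jpg",
    "0000A4AC3000-0.jpg", "000100001", "000100001.jpg", "000200001",
]

# Assumption: case folding approximates a case-insensitive locale
# such as en_AU.UTF-8 for these ASCII-only names.
locale_like = sorted(names, key=str.casefold)

# Plain code-point order, roughly what LANG=C gives ("ASCIIbetical").
c_like = sorted(names)

for name in locale_like:
    print(name)
```

The two results differ only in where `00003bA1A000-0.jpg` and `00003BD22000-0.jpg` land, because uppercase `B` sorts before lowercase `b` in code-point order.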

Nautilus sorts the files in a bizarre order, regardless of which locale is used. Weird behaviours include:
* Longer but otherwise equal filenames sort before shorter ones
* Sometimes ignores runs of zeros, but not punctuation
* Seems to detect runs of digits and push them to the end

The actual behaviour is very complex and difficult to predict, though it must follow some internal logic. The end result is that files don't sort in any reasonable order. This impacts several GNOME applications, such as Eye of GNOME and Nautilus. Other applications, like Transmission, respect the locale.
Comment 1 Philip Withnall 2018-02-23 15:05:44 UTC
What’s the output of the `locale` command for you?
Comment 2 Jan Alexander Steffens (heftig) 2018-02-23 16:58:08 UTC
g_utf8_collate_key_for_filename is probably the source of the weirdness here. It has a rather large comment:

  /*
   * How it works:
   *
   * Split the filename into collatable substrings which do
   * not contain [.0-9] and special-cased substrings. The collatable 
   * substrings are run through the normal g_utf8_collate_key() and the 
   * resulting keys are concatenated with keys generated from the 
   * special-cased substrings.
   *
   * Special cases: Dots are handled by replacing them with '\1' which 
   * implies that short dot-delimited substrings are before long ones, 
   * e.g.
   * 
   *   a\1a   (a.a)
   *   a-\1a  (a-.a)
   *   aa\1a  (aa.a)
   * 
   * Numbers are handled by prepending to each number d-1 superdigits 
   * where d = number of digits in the number and SUPERDIGIT is a 
   * character with an integer value higher than any digit (for instance 
   * ':'). This ensures that single-digit numbers are sorted before 
   * double-digit numbers which in turn are sorted separately from 
   * triple-digit numbers, etc. To avoid strange side-effects when 
   * sorting strings that already contain SUPERDIGITs, a '\2'
   * is also prepended, like this
   *
   *   file\21      (file1)
   *   file\25      (file5)
   *   file\2:10    (file10)
   *   file\2:26    (file26)
   *   file\2::100  (file100)
   *   file:foo     (file:foo)
   * 
   * This has the side-effect of sorting numbers before everything else (except
   * dots), but this is probably OK.
   *
   * Leading digits are ignored when doing the above. To discriminate
   * numbers which differ only in the number of leading digits, we append
   * the number of leading digits as a byte at the very end of the collation
   * key.
   *
   * To try avoid conflict with any collation key sequence generated by libc we
   * start each switch to a special cased part with a sentinel that hopefully
   * will sort before anything libc will generate.
   */
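To make the scheme in that comment easier to follow, here is a minimal Python sketch of it. This is not GLib's C code: `filename_key`, `TOKEN` and `SUPERDIGIT` are hypothetical names, the non-numeric substrings are passed through unchanged instead of going through `g_utf8_collate_key()`, the sentinel mentioned in the last paragraph of the comment is omitted, and keeping one `0` for an all-zero run is an assumption.

```python
import re

SUPERDIGIT = ":"  # any character that sorts after '9'
# A token is a run of ASCII digits, a single dot, or a run of anything else.
TOKEN = re.compile(r"[0-9]+|\.|[^.0-9]+")


def filename_key(name: str) -> str:
    """Build a sort key following the scheme described in the comment."""
    parts = []
    zero_counts = []
    for token in TOKEN.findall(name):
        if token == ".":
            parts.append("\x01")  # dots become '\1'
        elif token[0] in "0123456789":
            # Assumption: an all-zero run keeps a single '0'.
            digits = token.lstrip("0") or "0"
            zero_counts.append(len(token) - len(digits))
            # '\2', then d-1 superdigits, then the digits themselves.
            parts.append("\x02" + SUPERDIGIT * (len(digits) - 1) + digits)
        else:
            # Stand-in for g_utf8_collate_key(): raw text, no locale rules.
            parts.append(token)
    # Leading-zero counts go at the very end of the key.
    return "".join(parts) + "".join(chr(n) for n in zero_counts)


print(sorted(["file10", "file5", "file100", "file1", "file26"], key=filename_key))
print(sorted(["000100001", "000100001.jpg"], key=filename_key))
```

In this sketch the trailing leading-zero marker makes `000100001.jpg` sort before `000100001`, which is consistent with the first symptom in the report, although the sketch alone does not prove that GLib behaves this way for the same reason.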
Comment 3 Paul 2018-02-23 23:17:45 UTC
Wow, that's a lot of assumptions. Presumably the original developer had in mind filenames like "Product Report 1.05.odt", which this algorithm does work for, but it makes a pig's breakfast of alphanumerics, especially hex (photos, emails, UUIDed files, etc.) and base-64 (e.g. YouTube videos). It's also completely non-i18n: only Western Arabic numbers are handled, not Arabic, Chinese, Korean, etc.
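The i18n point can be illustrated with a small Python sketch (not GLib code): an ASCII-only digit class, like the `[.0-9]` split described in the comment, does not see Arabic-Indic digits as a number, whereas Unicode-aware handling does. The variable names are hypothetical.

```python
import re

ascii_run = re.compile(r"[0-9]+")  # ASCII-only, like the [.0-9] split
unicode_run = re.compile(r"\d+")   # Unicode decimal digits (str patterns)

western = "42"
arabic_indic = "\u0664\u0662"  # ARABIC-INDIC DIGIT FOUR, DIGIT TWO

ascii_sees_western = ascii_run.fullmatch(western) is not None
ascii_sees_arabic = ascii_run.fullmatch(arabic_indic) is not None
unicode_sees_arabic = unicode_run.fullmatch(arabic_indic) is not None
arabic_value = int(arabic_indic)  # int() accepts Unicode decimal digits

print(ascii_sees_western, ascii_sees_arabic, unicode_sees_arabic, arabic_value)
```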

We have the extremely refined collation behaviour of glibc, but before we call it we run through a big mandatory for(switch(munge_numbers())).

Any idea why this was done in glib instead of being given to glibc?

> What’s the output of the `locale` command for you?

LANGUAGE=en_AU:en, LC_ALL not set, everything else en_AU.UTF-8. If I launch Nautilus with LANG=C, files actually do sort asciibetically *after* the number/period munging happens. But the munging doesn't honour locale, and there's no way to disable it.
Comment 4 Paul 2018-03-06 11:30:38 UTC
@Philip Withnall, has the NEEDINFO been answered to your satisfaction? (I can't set the status back to NEW for some reason.)

In summary, it's not a locale problem; rather, g_utf8_collate_key_for_filename() has an incorrect design.
Comment 5 Philip Withnall 2018-03-06 12:18:25 UTC
*** Bug 754777 has been marked as a duplicate of this bug. ***
Comment 6 Philip Withnall 2018-03-06 12:20:39 UTC
(In reply to Paul from comment #4)
> @Philip Withnall, has the NEEDINFO been answered to your satisfaction? (I
> can't set the status back to NEW for some reason.)

Yes, thanks.

This is fairly low priority compared to quite a few of the other GLib bugs. Thankfully, we can fix it by modifying the implementation of g_utf8_collate_key_for_filename() without breaking API, since the API documentation for it doesn’t tie it to a specific algorithm. I suggest a way forward here is for Nautilus to fork a copy of g_utf8_collate_key_for_filename(), modify it to improve the heuristics, test it for a bit, and then we can fold the modified version back into GLib (with unit tests).
Comment 7 Paul 2018-03-07 09:10:16 UTC
Created attachment 369400
Respect traditional C/POSIX collation when overridden through specific environment variables

Patch for glib2.0-2.54.1 as distributed in Ubuntu 17.10.

Per the API docs, this patch has no effect unless the user explicitly selects a collation environment variable (e.g. LC_COLLATE) corresponding to a C, POSIX, or equivalent locale. In that case, the collation will honour the traditional ASCIIbetical ordering.

Supports Linux, various Unices, and Windows. (macOS collation happens in the Carbon API.)
Comment 8 Paul 2018-03-07 09:10:42 UTC
Good point. In fact, the API documentation doesn't even mandate treating runs of digits as numbers; it's just a "we would like". We could wrap the current algorithm in a GSettings test, but nothing in glib seems to use GSettings. So our best option is to munge only when collation is not C or POSIX (or the UTF-8 variants thereof).

I've attached a small patch that implements this. By the way, I'm pretty sure this patch also fixes a (statistically unlikely) GString overflow bug.
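The decision "munge only when collation is not C or POSIX" could be sketched roughly as follows in Python. This is not the attached patch: `collation_is_c_like` is a hypothetical helper, the `LC_ALL`, `LC_COLLATE`, `LANG` precedence follows the usual POSIX convention, and treating an unset environment as "not explicitly selected" is an assumption based on the description above.

```python
import os


def collation_is_c_like(env=None) -> bool:
    """Return True if the user explicitly selected C/POSIX collation.

    Hypothetical helper, not GLib API. The first non-empty variable in
    POSIX precedence order decides the result.
    """
    if env is None:
        env = os.environ
    for var in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(var)
        if value:  # empty values are treated as unset
            base = value.split(".", 1)[0]  # "C.UTF-8" -> "C"
            return base in ("C", "POSIX")
    # Assumption: nothing set means no explicit choice, keep munging.
    return False


print(collation_is_c_like({"LANG": "en_AU.UTF-8", "LC_COLLATE": "C"}))
print(collation_is_c_like({"LANG": "en_AU.UTF-8"}))
```

A caller would then skip the number and dot munging and fall back to plain collation only when this returns true.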
Comment 9 GNOME Infrastructure Team 2018-05-24 20:14:45 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1344.