GNOME Bugzilla – Bug 793747
g_utf8_collate_key_for_filename() corner cases with digits
Last modified: 2018-05-24 20:14:45 UTC
Created attachment 368820
Screenshot of Nautilus sorting the test files by name

Moved here from the relevant Nautilus bug: https://gitlab.gnome.org/GNOME/nautilus/issues/264

Create some test files as follows:

$ touch 000001000010-0.jpg 000001000010-A.jpg 000001A00010-0.jpg 000003BBF000-0.jpg 00003bA1A000-0.jpg 00003BD22000-0.jpg 0000A4AC3000-0.jpg 000100001 000100001.jpg 000200001

View them at the command line and in Nautilus:

$ ls -1
000001000010-0.jpg
000001000010-A.jpg
000001A00010-0.jpg
000003BBF000-0.jpg
00003bA1A000-0.jpg
00003BD22000-0.jpg
0000A4AC3000-0.jpg
000100001
000100001.jpg
000200001
$ nautilus .
[see attached screenshot]

ls sorts files as one might expect. It is not case sensitive (unless you use a case sensitive locale, e.g. LANG=C), but sorts alphabetically.

Nautilus sorts the files in a bizarre order, regardless of which locale is used. Weird behaviours include:

* Longer but otherwise equal filenames sort before shorter ones
* Sometimes ignores runs of zeros, but not punctuation
* Seems to detect runs of digits and push them to the end

The actual behaviour is very complex and difficult to predict, though it must follow some internal logic. The end result is that files don't sort in any reasonable order.

This impacts several Gnome applications, such as Eye of Gnome and Nautilus. Other applications, like Transmission, respect locale.
What’s the output of the `locale` command for you?
g_utf8_collate_key_for_filename is probably the source of the weirdness here. It has a rather large comment:

/*
 * How it works:
 *
 * Split the filename into collatable substrings which do
 * not contain [.0-9] and special-cased substrings. The collatable
 * substrings are run through the normal g_utf8_collate_key() and the
 * resulting keys are concatenated with keys generated from the
 * special-cased substrings.
 *
 * Special cases: Dots are handled by replacing them with '\1' which
 * implies that short dot-delimited substrings are before long ones,
 * e.g.
 *
 *   a\1a   (a.a)
 *   a-\1a  (a-.a)
 *   aa\1a  (aa.a)
 *
 * Numbers are handled by prepending to each number d-1 superdigits
 * where d = number of digits in the number and SUPERDIGIT is a
 * character with an integer value higher than any digit (for instance
 * ':'). This ensures that single-digit numbers are sorted before
 * double-digit numbers which in turn are sorted separately from
 * triple-digit numbers, etc. To avoid strange side-effects when
 * sorting strings that already contain SUPERDIGITs, a '\2'
 * is also prepended, like this
 *
 *   file\21      (file1)
 *   file\25      (file5)
 *   file\2:10    (file10)
 *   file\2:26    (file26)
 *   file\2::100  (file100)
 *   file:foo     (file:foo)
 *
 * This has the side-effect of sorting numbers before everything else (except
 * dots), but this is probably OK.
 *
 * Leading digits are ignored when doing the above. To discriminate
 * numbers which differ only in the number of leading digits, we append
 * the number of leading digits as a byte at the very end of the collation
 * key.
 *
 * To try avoid conflict with any collation key sequence generated by libc we
 * start each switch to a special cased part with a sentinel that hopefully
 * will sort before anything libc will generate.
 */
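To make the quoted comment concrete, here is a minimal Python sketch of the scheme it describes. It is a simplified model, not GLib's C implementation: the name model_key is hypothetical, ordinary text runs are left unchanged where GLib would call g_utf8_collate_key(), the sentinel prefix is omitted, and "leading digits" is read as leading zeros.

```python
import re

SUPERDIGIT = ":"  # any character that compares higher than '9'

# One token per match: a run of ASCII digits, a single dot, or other text.
TOKEN = re.compile(r"(?P<num>[0-9]+)|(?P<dot>\.)|(?P<text>[^.0-9]+)")


def model_key(name: str) -> str:
    """Build a sort key following the comment's description (simplified)."""
    parts = []
    zero_counts = []  # leading-zero counts, appended at the very end
    for match in TOKEN.finditer(name):
        if match.group("dot"):
            # Dots become '\1' so short dot-delimited parts sort first.
            parts.append("\x01")
        elif match.group("num"):
            run = match.group("num")
            digits = run.lstrip("0") or "0"
            leading_zeros = len(run) - len(digits)
            # '\2', then d-1 superdigits, then the number itself.
            parts.append("\x02" + SUPERDIGIT * (len(digits) - 1) + digits)
            if leading_zeros:
                zero_counts.append(chr(leading_zeros))
        else:
            # Assumption: real code would collate this run via libc.
            parts.append(match.group("text"))
    return "".join(parts) + "".join(zero_counts)


names = ["file10", "file1", "file100", "file5", "file26"]
print(sorted(names, key=model_key))
```

In this model, 000100001.jpg gets a smaller key than 000100001, because the '\1' that stands in for the dot compares lower than the leading-zero byte appended to the shorter name. That is consistent with the "longer but otherwise equal filenames sort before shorter ones" observation in the report, although the real function may differ in its details.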
Wow, that's a lot of assumptions. Presumably the original developer had in mind filenames like "Product Report 1.05.odt", which this algorithm does work for, but it makes a pig's breakfast of alphanumerics, especially hex (photos, emails, UUIDed files, etc.) and base-64 (e.g. YouTube videos). It's also completely non-i18n: only Western Arabic numbers are handled, not Arabic, Chinese, Korean, etc.

We have the extremely refined collation behaviour of glibc, but before we call it we run through a big mandatory for(switch(munge_numbers())). Any idea why this was done in glib instead of being given to glibc?

> What’s the output of the `locale` command for you?

LANGUAGE=en_AU:en, LC_ALL not set, everything else en_AU.UTF-8.

If I launch Nautilus with LANG=C, files actually do sort asciibetically *after* the number/period munging happens. But the munging doesn't honour locale, and there's no way to disable it.
@Philip Withnall, has the NEEDINFO been answered to your satisfaction? (I can't set the status back to NEW for some reason.) In summary, it's not a locale problem, rather g_utf8_collate_key_for_filename() has an incorrect design.
*** Bug 754777 has been marked as a duplicate of this bug. ***
(In reply to Paul from comment #4)
> @Philip Withnall, has the NEEDINFO been answered to your satisfaction? (I
> can't set the status back to NEW for some reason.)

Yes, thanks.

This is fairly low priority compared to quite a few of the other GLib bugs. Thankfully, we can fix it by modifying the implementation of g_utf8_collate_key_for_filename() without breaking API, since the API documentation for it doesn’t tie it to a specific algorithm.

I suggest a way forward here is for Nautilus to fork a copy of g_utf8_collate_key_for_filename(), modify it to improve the heuristics, test it for a bit, and then we can fold the modified version back into GLib (with unit tests).
Created attachment 369400
Respect traditional C/POSIX collation when overridden through specific environment variables

Patch for glib2.0-2.54.1 as distributed in Ubuntu 17.10.

Per the API docs, this patch has no effect unless the user explicitly selects a collation environment variable (e.g. LC_COLLATE) corresponding to a C, POSIX, or equivalent locale. In that case, the collation will honour the traditional ASCIIbetical ordering.

Supports Linux, various Unices, and Windows. (macOS collation happens in the Carbon API.)
Good point, in fact the API documentation doesn't even mandate treating runs of digits as numbers; it's just a "we would like". We could wrap the current algorithm in a GSetting test, but nothing in glib seems to use GSettings. So our best option is to munge only when collation is not C or POSIX (or the UTF-8 variants thereof). I've attached a small patch that implements this. By the way, I'm pretty sure this patch also fixes a (statistically unlikely) gstring overflow bug.
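As a rough illustration of the "munge only when collation is not C or POSIX" idea, the check could look something like the Python sketch below. This is not the attached patch: the helper names collation_locale and uses_c_collation are hypothetical, and the sketch assumes the usual POSIX precedence of LC_ALL over LC_COLLATE over LANG, treating an unset environment as "not explicitly overridden".

```python
import os
from typing import Mapping, Optional


def collation_locale(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the locale name that governs collation, or '' if none is set."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_COLLATE, which overrides LANG.
    for var in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(var, "")
        if value:
            return value
    return ""


def uses_c_collation(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only if the user explicitly selected C/POSIX collation."""
    name = collation_locale(env)
    # Strip codeset and modifier, so "C.UTF-8" counts as "C".
    base = name.split(".", 1)[0].split("@", 1)[0]
    return base in ("C", "POSIX")


# Hypothetical environment: LC_COLLATE overrides LANG for collation.
print(uses_c_collation({"LC_COLLATE": "C", "LANG": "en_AU.UTF-8"}))
```

A caller would skip the digit and dot munging when uses_c_collation() returns True and fall back to plain byte-wise ordering, leaving the existing behaviour untouched for every other locale.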
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1344.