GNOME Bugzilla – Bug 699340
sort filenames case-insensitively
Last modified: 2015-01-12 23:35:31 UTC
In the directory comparison and source control views, Meld lists filenames in alphabetical case-sensitive order: [A-Z] all come before [a-z]. Instead, it should order them case-insensitively for consistency with Nautilus.
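To illustrate the difference being reported, here is a minimal Python sketch. The filenames are made up for the example, and `str.casefold` is only one possible way to get a case-insensitive key; this is not Meld's actual sorting code.

```python
# Hypothetical filenames, chosen only to show the two orderings.
names = ["meld.1", "Makefile", "Meld1", "meld1"]

# Default comparison is by code point, so every [A-Z] name sorts before [a-z].
case_sensitive = sorted(names)

# Folding case in the key gives a case-insensitive order; names whose
# folded keys are equal keep their input order because sorted() is stable.
case_insensitive = sorted(names, key=str.casefold)

print(case_sensitive)
print(case_insensitive)
```

Note that this simple key still does not reproduce whatever order Nautilus uses; it only removes the upper-before-lower split.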
I went looking for what Nautilus does, and found the following order for these files:
Makefile
meld.1
meld.2
meld1
Meld.2
Meld1
...which isn't any sane order that I can recognise. Maybe GTK has some utility functions for crazy GTK-specific sort.
Hm. I'm curious enough that I'm willing to go hunt down the code in Nautilus that determines this order. Stay tuned.
This was tested on a GNOME 2 box, by the way; I'll have a look at GNOME 3 later, just in case the sort order has changed.
OK, it appears that Nautilus actually asks GLib for the sort order. Specifically it calls g_file_info_get_sort_order () - this happens in update_info_internal() in nautilus-file.c. I think it then breaks ties by doing some extra checks in compare_by_display_name(), also in nautilus-file.c. So presumably we should do the same if we want the same sort order as Nautilus. Not sure whether we need to do the tie-breaking. If so, perhaps that code itself should really move into GLib.
I just had a look at this. The get_sort_order stuff defined in gio is just property access, and doesn't appear to do any actual work. The real stuff happens in nautilus_file_set_display_name(), and relies on g_utf8_collate_key_for_filename() which does some magic in order to get better file orderings. Unfortunately, the collate_key stuff isn't bound in PyGTK 2. I'm assuming that we can get at this through GI, so assigning this as a GTK 3 thing.
A while back I saw a "case-insensitive" sort algorithm that roughly mapped each character to a lower-case byte and appended one extra tag byte per character at the end of the string. This could up to double the length of the string being sorted (exactly double if there were no non-ASCII characters like é, assuming UTF-8 encoding). The tags at the end ensured an order something like eéèêëEÉÈÊË for words that were otherwise identical. This encoding was only designed to work for character sets based on the Latin alphabet. Then there are Arabic and related alphabets, which are often combined with Latin-based text. For many other non-Latin alphabets, sorting by UTF-16 would probably be a good alternative, since they generally don't use diacritical marks like accents. Don't know how much this comment helps ...
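A rough Python sketch of the tagging idea described above, as I understand it. This is not the original algorithm: the name `tagged_key`, the accent ranking table, and the use of tuples instead of appended bytes are all assumptions made for illustration, and it only handles a handful of Latin accents.

```python
import unicodedata

# Assumed ranking of combining marks (acute, grave, circumflex, diaeresis),
# chosen so that the tie-break order comes out as e é è ê ë.
ACCENT_RANK = {"\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}


def tagged_key(text):
    """Return (primary, tags): lower-cased base letters, then per-character tags."""
    primary = []
    tags = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        # Unaccented letters rank 0; marks not in the table rank last (5).
        accent = ACCENT_RANK.get(marks[0], 5) if marks else 0
        primary.append(base.lower())
        # Lower case sorts before upper case, then by accent rank.
        tags.append((int(base.isupper()), accent))
    return ("".join(primary), tuple(tags))


print("".join(sorted("ËÊÈÉEëêèée", key=tagged_key)))
print(sorted(["Meld", "méld", "meld"], key=tagged_key))
```

The primary string is compared first, so the tags only matter for words that are identical once case and accents are ignored.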
I've just spent a while playing around with this, and I'm going to just close it as WONTFIX. I think the Nautilus ordering is fine, but in the context of a programmer's tool like Meld, I'm going to say that the traditional case-based ordering is going to be strongly preferred by most Meld users. The only thing I really feel like we're missing from the GLib-based sort is correct numeric ordering for foo.2 vs foo.10, but I can live with that. It doesn't help that the GLib-defined ordering is extra weird in that only the initial sort is case insensitive; the secondary sort (i.e., e vs é vs E) puts capitalised letters *last*. This isn't necessarily my final word on the matter, but I just don't feel like the alternative sort ordering is a good fit right now. In particular, if the current sort gets Unicode-y ordering wrong then I'll definitely reconsider, but from my quick tests with Latin alphabets it seemed fine to me.
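For reference, the numeric ordering mentioned above (foo.2 before foo.10) can be approximated in plain Python while keeping the traditional case-sensitive order for everything else. This is only a sketch; `natural_key` is a hypothetical helper, not something Meld or GLib provides.

```python
import re


def natural_key(name):
    """Split a name into text and digit runs so digit runs compare as integers."""
    # With a capturing group, re.split alternates text and digit runs:
    # even indices are text (possibly empty), odd indices are digits,
    # so ints are only ever compared with ints and strings with strings.
    parts = re.split(r"([0-9]+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = ["foo.10", "foo.2", "foo.1"]
plain_order = sorted(names)
natural_order = sorted(names, key=natural_key)

print(plain_order)
print(natural_order)
```

Text runs are still compared by code point here, so upper-case names continue to sort before lower-case ones.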
OK - I'm the one who filed this bug in the first place and I can live with Kai's decision. Thanks for thinking about this at least! :)