GNOME Bugzilla – Bug 717960
Natural sorting of photo titles
Last modified: 2014-11-27 14:40:59 UTC
---- Reported by jim@yorba.org 2011-09-02 12:50:00 -0700 ---- Original Redmine bug id: 4091 Original URL: http://redmine.yorba.org/issues/4091 Searchable id: yorba-bug-4091 Original author: Jim Nelson Original description: Currently Shotwell sorts in lexiographic order, so the following photo titles sort like this: filename-1 filename-10 filename-11 filenames-2 It makes more sense to order like this filename-1 filename-2 filename-10 filename-11 Suggested downstream: https://bugs.launchpad.net/ubuntu/+source/shotwell/+bug/839734 ---- Additional Comments From shotwell-maint@gnome.bugs 2013-05-16 14:44:00 -0700 ---- ### History #### #1 Updated by Jim Nelson 11 months ago * **Target version** set to _0.14.0_ #### #2 Updated by Jim Nelson 11 months ago * **Category** set to _ux_ #### #3 Updated by Jim Nelson 10 months ago * **Target version** changed from _0.14.0_ to _0.15.0_ #### #4 Updated by Jim Nelson 8 months ago * **Target version** changed from _0.15.0_ to _0.16.0_ #### #5 Updated by Jim Nelson 6 months ago * **Target version** deleted (<strike>_0.16.0_</strike>) --- Bug imported by chaz@yorba.org 2013-11-25 21:54 UTC --- This bug was previously known as _bug_ 4091 at http://redmine.yorba.org/show_bug.cgi?id=4091 Unknown version " in product shotwell. Setting version to "!unspecified". Unknown milestone "unknown in product shotwell. Setting to default milestone for this product, "---". Setting qa contact to the default for this product. This bug either had no qa contact or an invalid one. Resolution set on an open status. Dropping resolution
I think I fixed this in https://github.com/tobiatesan/shotwell/tree/natural_sort Beware, though: my main purpouse for doing this was getting my feet wet with Vala (coming from C++) and Shotwell, so it is highly likely ugly and/or not very Vala-idiomatic. Feel free to address any concerns.
Created attachment 287000 [details] [review] The output of git --diff for the branch linked in my previous comment
Review of attachment 287000 [details] [review]: This is a good start to the problem. There is definitely some room for improvement here, but you're on the right track: * Since NaturalCmp is pure static with no instantiation, it makes sense to make this a namespace rather than a class. Our coding guidelines are a touch different for namespaces: there's no indentation for the namespace decl and the methods within, i.e.: namespace NaturalCmp { public int compare() { } } * "static const" is unnecessary in Vala; all consts are static. * Be aware of Yorba's coding guidelines (https://wiki.gnome.org/Apps/Geary/CodingConventions). In particular, we use spaces between operators, i.e. "AFIRST = 1", not "AFIRST=1". (And so on.) * In GLib (and therefore Vala), all strings are UTF-8, not ASCII. The problem is, if you do this: char c = str[1]; you're getting the second byte, not second UTF-8 character, in the string. (It's possible, but not guaranteed, they are the same.) To remedy this, look at string.get_index_of_nth_char(), string.get_next_char(), string.next_char(), and others. * Also note that those methods deal in the unichar type, not char. That's another important aspect of UTF-8 support. Since this is language-sensitive, this comparator should be dealing strictly in unichars, not chars. (That means no array dereferencing, i.e no str[0].) Also, you can replace is_number() with unichar.isdigit(). * The compare() method nests deep. Is there any way to refactor it so the if/else doesn't go so deep? I suspect this method could be coded to look something like this: int compare(string a, string b) { // check easy cases for a & b, i.e. zero-length strings. // loop both int index_a = 0, index_b = 0; unichar unichar_a, unichar_b; for (;;) { bool a_eos = !a.get_next_char(ref index_a, out unichar_a)); bool b_eos = !b.get_next_char(ref index_b, out unichar_b)); if (a_eos || b_eos) // return proper result when strings are not same length, or if they are // if unichar_a or unichar_b is numeric, pull in entire number for comparison if (unichar_a.isdigit()) number = read_number(ref index_a); if (unichar_b.isdigit()) number = read_number(ref index_b); // do unichar comparisons, returning result when terminal case found } } I know it's more complicated than this, but this is the approach I think is easier to maintain. * In particular, since this is a comparator, it would be really good if it wasn't recursive. We don't have to optimize quite yet, but recursion is a performance killer, especially if the task at hand is walking an array of bytes. These are my thoughts about what I'm seeing here. Since Shotwell doesn't have a test framework, we'll need to do something about the test code, but we'll deal with that later.
Review of attachment 287000 [details] [review]: Hi Jim, thanks for taking the time to review my patch. I incorporated your suggestions: https://github.com/tobiatesan/shotwell/compare/natural_sort I refactored it in a slightly different fashion than what you proposed, but it's still two-braces deep at most, and iterative. If you really want me to stick to your template, just ask and I'll do it quick. There is a specific use case that I'm having trouble with, though. I'm assuming that I should treat non-arabic numerals just as arabic numerals - as outlined in http://en.wikipedia.org/wiki/Numerals_in_Unicode Now, I'm not exactly fluent in Mandarin, but I'm pretty much positive that 十, 十一 are chinese numerals. However, '十'.isdigit() seems to return false (I'd assume it would return true based on http://references.valadoc.org/#!api=glib-2.0/unichar.isdigit), and (as a consequence) "十一" is not correctly read as 11. I would expect all the asserts() I commented out in the unit tests to pass, but... they don't. That you know of, is this a known bug in isdigit(), am I misunderstanding http://references.valadoc.org/#!api=glib-2.0/unichar.isdigit or am I just doing things wrong in general?
This looks much better. You didn't have to follow my template exactly, but I feel like you found a way forward that is easier to follow. Regarding the isdigit() issue, that call is mapped to g_unichar_isdigit(). If you think there's a bug with it, I would recommend filing a bug with the GLib project, as that's their code. We shouldn't be working around this in Shotwell unless it's vital to operation. We need to find a place for the test code. I don't see any code for building it, do you have a custom Makefile (or something) for that? We should find a place for all that to live, I don't think src/ is a good location. I'm thinking a test/ directory. Other people should be able to (in the least) cd to that directory, type make, and run the test. We could consider expanding that in the future to more tests. That's a big question we've considered in the past and not made any progress on.
Hi Jim, thanks for re-reviewing my patch :) > If you think there's a bug with it, I would recommend filing a bug with the > GLib project, as that's their code. Guess I will file a bug then. I felt it necessary to have another pair of eyes - especially a pair of /experienced/ eyes - on it before doing that, though, to rule out any obvious misunderstanding on my part. > We need to find a place for the test code. I don't see any code for building > it, do you have a custom Makefile (or something) for that? Well, for quick testing I just use this and do "make test": https://github.com/tobiatesan/NaturalCmp/blob/master/Makefile Obviously it's not a solution, but I wouldn't know about setting up a project-wide test environment. If you so like I can just move the tests into test/ or remove them altogether (they will always be on my github anyway if somebody needs them later on).
Eh, that's probably intended beaviour. Chinese numerals are /not/ the plain old arabic numbers in the decimal system rewritten with different glyphs, which seems to be the only thing isDigit recognizes. That's even in the example. My bad.
Ok, good, mystery solved. Yes, please make a /test directory and add the Makefile to that directory. When you're ready, let me know and I'll take a final look to pull into master.
Thanks a ton, Jim - updated my branch accordingly. Help yourself :) (If this can save you a couple of git pull and git merge, I opened a Github pull request as well)
Ok, this is better. Getting another look over the code, I spotted some more problems, unfortunately. Like so many things with Unicode, things aren't as simple as they are with ASCII. One thing I didn't notice before was the else block where you perform direct unichar comparisons for ordering, i.e.: a.get_char(0) > b.get_char(0) Unfortunately, this doesn't work for all languages. I don't have an example at hand, but I suspect there's a test case we could devise where this would fail. Also unfortunately, GLib doesn't offer a unichar comparison function for lexiographic comparisons, only string functions. So, the unichars would need to be converted back to strings for comparisons. Not fun. Second, I have concerns about performance. While I don't have any numbers handy, I see a lot of string copies and string iteration going on here. One thing we could try with your test program is comparing 100,000 natural sort comparisons with strcmp(). Your function doesn't have to be the same speed, but it would be interesting to know what level of difference we're talking about. We have users with 50,000 photos and more, and so this function will be hammered pretty hard. There's also something more complicated here. We really shouldn't be using strcmp() in the original code, we should be using collation keys, which provide lexiographic ordering (see g_utf8_collate). That's when I stumbled on an API I forgot about: g_utf8_collate_key_for_filename. Although these are not filenames being compared, this function does almost exactly what you're attempting to achieve here. The only issue is the way dots are handled, but I bet that's not an issue for most anyone. So, let's step back a second and see if that would work instead of your new function. It means using strcmp() to compare the result of each title collated with this function. There might be performance issues with this as well, but those can be managed by storing the collation key in the MediaSource objects. I know this means reversing a lot of work you've done so far, but I think this might be the better answer going forward. Can you take a look and see how it works out?
Sure, during the weekend I'll try to use g_utf8_collate_key_for_filename instead. If I have some spare time maybe I'll benchmark & profile the whole thing to see just how heavier than strcmp() it is. Thanks!
See how you like this: https://github.com/tobiatesan/shotwell/compare/saner_sorting https://github.com/GNOME/shotwell/pull/2 I have yet to benchmark it, but I would be /very very/ surprised if my class had any performance advantage over the method from GLib.
Well, here is the deal. I run gprof on both branches with a ~5000 elements library (not 50000, mind you!) and in both cases the overhead seems to be almost nil. In particular, my natural_cmp branch yelds this: 0.00 0.30 0.00 143062 0.00 0.00 natural_cmp_compare I'm gonna look into the problem with a.get_char(0) > b.get_char(0) later tonight, but performance doesn't seem THAT scary to me. While we are at it, I did this to get gprof data: https://github.com/tobiatesan/shotwell/compare/profiling Se if maybe it turn out useful in general.
I'm sorry - is there a good reason why a.get_char(0) > b.get_char(0) can't be replaced with public static unichar min (unichar a, unichar b) from http://references.valadoc.org/#!api=glib-2.0/unichar instead? "public static unichar min (unichar a, unichar b) Calculates the minimum of a and b." it sounds like what's needed, and I would certainly expect it to be unicode-aware and non-latin-characters-aware, although the documentation doesn't say that explicitly. I've done that in https://github.com/tobiatesan/shotwell/compare/natural_sort_with_minmax and, well, at the very least it works as expected and gprof output looks, again, very nice. Relevant line: 0.00 0.30 0.00 143062 0.00 0.00 natural_cmp_compare (BTW what I was actually doing with the profiling session was opening my ~5000 elements library and cycling through ascending/descending title sorting with some interspersed date sorting and folder switching to clear the catch, etc) Later in the weekend I'll write a little testbed, generate some sample data and do some actual measurements pitting each method side by side, but in the meanwhile performance doesn't seem to be an obstacle either way, /empirically/.
Okay: https://github.com/tobiatesan/cmpcomp/blob/master/test.vala This is probably not super precise, but should be precise enough. I get this, which is consistent with naturalcmp being O(2): len collate us natural us strcmp us pairs tested -------------------------------------------------------------- 1 627 416 216 256 3 859 865 211 256 6 1098 1325 224 256 11 1609 2006 231 256 20 2100 3077 258 256 37 3130 5505 322 256 70 4185 9797 308 256 135 6209 21655 336 256 264 10463 56335 341 256 Both ways are (probably somewhat obviously) heavier than plain old strcmp over a pair of ascii strings. Things start to be a bit worrying after n=100, but I don't think that a 100 char long title is a realistic case - that must be like 1% of all titles around. To get the idea, this is what 100 characters look like: "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore" It doubles around 64 chars, and 64 chars look like this: "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do" Aka a long-ish title. I don't have data, but I think most titles will be in the 6-20 chars range, which is where we have mostly comparable performance between naturalcmp and collate, HOWEVER they both are terrible compared to strcmp: len collate us natural us strcmp us pairs tested -------------------------------------------------------------- 1 627 416 216 256 3 859 865 211 256 6 1098 1325 224 256 11 1609 2006 231 256 20 2100 3077 258 256 Bottom line, I guess this is your design decision in the end: true natural sorting at a price, almost-natural-sorting at a lower price or just strcmp. Perhaps having natural sorting as an user-selectable option in the preferences could be a compromise? Also I guess there is room to optimize NaturalCmp a bit more to bring it a bit closer to collate - notice that collate /seems/ to look kind of quadratic (or anyway something between nlogn and n^2) with big enough inputs. This assuming my methodology makes sense and that there isn't an error in the actual code.
Actually, it just occurred to me that given the fact that strcmp is SO much lighter - I think it would be optimal if we could get away with calling collate() once and caching it in the object and then just doing strcmp Which is what is happening in https://github.com/tobiatesan/shotwell/compare/saner_sorting. P.s.: Sorry for the flooding.
No problem with the flooding. I appreciate how seriously you're taking this issue. Because Shotwell deals in large collections of objects, performance is important for us. I'm on vacation right now so I can't do a full run-through of your numbers. I'll be back next week and can look at them then. I guess my question is, what approach are you recommending here? You've run a lot of numbers here and done some thinking on this, obviously. What do you feel is the best approach? It's important to balance code complexity vs. performance vs. correctness.
Well, I guess things are like this: * generating a collate key + running strcmp on it later seems much, much cheaper, especially when you have to reuse the same collate key /some/ number of times (I did not look into which sorting algorithm we use here, but let's say /some/ number of times). So +1 for saner_sorting. * the collate function is mantained by the nice GLib people, plus takes care of any weird locales and special cases. +1 for saner_sorting * However, comparing doesn't seem THAT important looking at the gprof output. Then again I'm a MSVC guy and it is very possible that I have screwed up with gprof. So it's actually +2/k for saner_sorting, with some k. * There is the issue of the dot. The average user won't notice it, where "average" probably means > 95% of them. But somewhere, somehow the probability that some guy has dots in his titles and stuff gets weird is P=1 in the limit. I'm not sure if it's acceptable. it's the kind of thing that irks me in principle and drives me berserk when I encounter it in practice. So what did I do: I cloned the glibc code, and had a *very* quick look at the code. The function seems to just handle numbers, the dot and then call plain g_utf8_collate_key for doing the actual grunt work. In the weekend I can probably either hack around it so that it it doesn't do its special thing with the dot, /or/ hack around NaturalCmp so that it uses g_utf8_collate_key - either way ending up with a function that generates a collate key after taking care of the numbers and NOT doing anything special with the dot - and we have the best of both worlds. The choice would be then clear cut. So +inf for this future, hypotetical lump of code. /If/ for some reason this turns out to be impossible (which it /shouldn't/ but you never know) I guess it would be a matter of deciding whether we fear more an angry user with dots in his titles or a dozen very angry users with huge libraries, and if we can't make up our minds, stockpiling canned soup, locking the doors and tossing a coin. We'll see in a couple of days when I get around to doing it. In the meantime, thanks again for the help & guidance and be sure to enjoy your vacation!
Ooops, sorry, that was actually: ***I cloned the glibc code, and had a *very* quick look at the code for g_utf8_collate_key_for_filename.
Okay, here it is: https://github.com/tobiatesan/shotwell/commits/saner_sorting The goodies are in this commit: https://github.com/tobiatesan/shotwell/commit/35cbf3285fac165e53dd2959f18d6f99dc2bec85 The good news is that it works as expected, judging from the attached unit tests. It is O(n), and it does more or less what g_utf8_collate_key_for_filename minus the funky stuff with the dots. Compare: https://github.com/GNOME/glib/blob/master/glib/gunicollate.c#L501 The bad news is that I seem to have a pre-existing, unrelated problem with caching the generated key (regardless if I use this or g_utf8_something). I've run a git bisect on my branch and found out much to my chagrin that this commit breaks stuff: https://github.com/tobiatesan/shotwell/commit/ea6ad671dbb7023bd73152863f3775f3370010f8 Before that commit, "asdf", "b20", "b100" appear in this order. After applying that commit, I have "asdf", "b100", "b20", aka good ol' ASCII sorting. I have spent a couple of hours trying to figure out what's going on to no avail, there must be something that eludes me, probably in the order things are called. I'll have another look one of these evenings, although I suspect it's something very, very obvious that I can't see because I'm new to the codebase and I'm making some incorrect assumption (which gdb has yet to prove wrong on, it seems).
Have you made any progress on the order-of-operations issue? I might be able to offer some advice if you have any questions or let me know what you're finding.
Sorry, I still haven't got around to spend some (more) time on it. I trust I will be able to do so this weekend. I also trust it's in its essence a trivial problem (I mean, I'm only trying to store a string there, I'd say there's only so many things that can go wrong). I'm going from memory here, but last time I checked, /after/ applying that specific commit (b7023bd73), sorting was apparently ASCII-like and, looking into it with gdb, my collation keys where either empty strings or... weird characters, but in a meaningful /sequence/, as if some random offset were applied to each byte/character. Worry not, though - if I can't get around to understanding it I will eventually resort to pestering you on the mailing list. Thank you!
Actually, I was off track. Turns out that my function works fine in the test environment, but not inside Shotwell. Specifically, while the resulting preprocessed string (the output of my function) is of course identical, the result of calling collate_key() on it is /different/ - and different even when there is /no dot/, by a couple of trailing bytes, from the output of g_utf8_collate_key_for_filename. I suspect there is something locale-related going on. Please have patience while I sort this out.
Yes. Adding GLib.Intl.setlocale(GLib.LocaleCategory.ALL, ""); on top of test.vala is sufficient to break a number of tests. Silly, silly me.
I was missing a sentinel. Fixed: https://github.com/tobiatesan/shotwell/compare/saner_sorting
Are you ready for me to review this? Can you produce a patch?
Created attachment 290785 [details] [review] Patch for natural sorting using collate
Here you are, please.
Great work! Thanks for sticking with this. This is something Shotwell's needed for some time now. Pushed to master, commit 68f3f8
Wow. Just wow. All of a sudden I know how it feels to know that somebody, somewhere, in a timezone far far away, just might be screaming *at me* after running apt-get upgrade :) Thank you for your guidance and encouragement, really. I'll be super swamped till Christmas, but I hope I can attack some more tickets sometime soon :)