GNOME Bugzilla – Bug 724194
improve hostname asciiification
Last modified: 2014-02-21 12:58:05 UTC
The test-hostname test is failing on FreeBSD. Specifically: "Müllers Computer" gets converted to "mllers-computer" instead of the expected "mullers-computer".
I think this smells of a broken conversion to ASCII. It should be removing the diacritic, not the whole letter.
I *suspect* (but I still have to check) that my test environment doesn't have proper UTF-8 locales installed on the system. I think we should probably test for this brokenness in a direct way (ie: via iconv) and disable this testcase in that case. We do this for a few translation-related tests in GLib, for example. I'll look into it further.
Created attachment 268956
hostname-helper: decompose before transliteration
iconv() is not always clever enough to know how to do transliteration, so when faced with characters like 'ü' the only thing it can do is to drop them. This is the case on FreeBSD, for example.
We can give iconv() some help by doing a normalisation pass on the input string first. This normalises the 'ü' into its decomposed form of 'u' with a separate '¨' combining character.
This has been checked to produce the desired result on FreeBSD and Linux with a non-German locale (ie: 'ü' -> 'u').
Unfortunately, this breaks Linux in German locales. iconv() there doesn't seem clever enough to handle 'ü' -> 'ue' when the 'ü' has been decomposed, and we end up with 'u'. Another solution is probably required.
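To illustrate the decomposition idea from the patch description, here is a minimal Python sketch. It is not the code in the attachment, and it does not call iconv() at all: it only shows that NFD normalisation splits 'ü' into 'u' plus a combining diaeresis (U+0308), after which the combining mark can be dropped on its own. The helper name strip_diacritics is made up for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    # such as U+0308 COMBINING DIAERESIS.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # "\u00fc" is the precomposed 'ü'.
    print(strip_diacritics("M\u00fcllers Computer"))
```

Note that this approach can only ever give 'ü' -> 'u'. It has no way of producing the German 'ue', which is the limitation described above.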
The only thing that comes to mind is that we could try to do both approaches and take the longer one as the result... that's starting to get a bit ugly, though...
So I've been thinking a lot more about this bug. It's pretty interesting.
On Linux, right now, in German locales, if someone copies the string "Müller's PC" (composed form, with 'ü' as the single code point U+00FC) into g-c-c, then they get muellers-pc as their hostname. But if they copy the string "Müller's PC" (decomposed form, 'u' followed by the combining diaeresis U+0308), then they get mullers-pc. (You can test for yourself with 'LANG=de_DE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT'.)
Those two strings look the same (and indeed they are supposed to be treated equivalently) but one of them is in composed form and one is in decomposed form. The current algorithm treats them differently. So even on Linux, this code is currently broken -- we should do a normalisation step before conversion.
This is arguably a bug in iconv, and maybe GLib should be doing a normalisation step first in order to avoid it. That's probably too ugly to do from within g_convert() (which doesn't always take Unicode as input). So maybe we should have a nicer API for the relatively common case of "convert this UTF-8-with-accents string to its ASCII equivalent as per the given (or system) locale". This would definitely be nice to use from g_str_tokenize_and_fold() for the 'ascii alternates' as well.
Thinking more about the original issue, though, even if we pick a particular normalisation (say composed, since this is what works nicely on Linux), FreeBSD's system iconv is deficient. We might be able to use decomposition as a way to trick it into giving us 'ü' -> 'u' but we're never going to get 'ü' -> 'ue' in German.
It seems that we're not the first to come across this issue, either: http://lists.freebsd.org/pipermail/freebsd-bugs/2013-December/054593.html but we're in a slightly better situation in that GLib never advertised that //TRANSLIT is supported on g_convert(). Arguably, gnome-control-center is at fault here for trying to use an undocumented feature that only works on Linux.
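The composed/decomposed distinction is easy to miss because both strings render identically. A small Python sketch (illustrative only, the function name is invented here) makes the difference visible and shows that normalising both sides to one form removes it:

```python
import unicodedata

COMPOSED = "M\u00fcller's PC"      # 'ü' as the single code point U+00FC
DECOMPOSED = "Mu\u0308ller's PC"   # 'u' followed by U+0308 COMBINING DIAERESIS


def same_after_normalisation(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(COMPOSED == DECOMPOSED)                          # False
    print(len(COMPOSED), len(DECOMPOSED))                  # 11 12
    print(same_after_normalisation(COMPOSED, DECOMPOSED))  # True
```

Any converter that looks at code points one at a time will see different input for the two strings unless a normalisation step like this runs first.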
Another solution could be to require GLib to link against GNU libiconv on FreeBSD (which is available via the package manager) in preference to using the deficient system version.
I think I'm leaning toward a combination of all of these as the proper solution:
- g-c-c should not use //TRANSLIT
- in fact, g-c-c should not call g_convert() at all
- we should have a proper "UTF8 to ASCII with transliteration" API in GLib
- this new API would handle issues of normalisation
- GLib should preferentially use GNU libiconv on systems that have deficient libc implementations
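As a rough idea of what a "UTF8 to ASCII with transliteration" helper that handles normalisation could look like, here is a Python sketch. Everything in it is an assumption made for illustration: the name to_ascii, the tiny German override table, and the '?' fallback are invented here and are not taken from GLib or from any patch on this bug.

```python
import unicodedata

# Hypothetical per-language overrides, keyed by composed characters.
# A real table would be far larger; this one only covers German.
LOCALE_OVERRIDES = {
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
        "\u00df": "ss",
    },
}


def to_ascii(text: str, locale: str = "C") -> str:
    """Transliterate text to ASCII for the given locale (illustrative sketch)."""
    language = locale.split("_")[0]          # "de_DE.UTF-8" -> "de"
    table = LOCALE_OVERRIDES.get(language, {})
    out = []
    # Always compose first, so composed and decomposed input behave the same.
    for ch in unicodedata.normalize("NFC", text):
        if ch in table:
            out.append(table[ch])            # locale-specific rule, e.g. 'ü' -> 'ue'
            continue
        # Generic fallback: decompose and drop the combining marks.
        for part in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(part):
                continue
            out.append(part if ord(part) < 128 else "?")
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("M\u00fcller's PC", "de_DE.UTF-8"))
    print(to_ascii("Mu\u0308ller's PC", "de_DE.UTF-8"))
    print(to_ascii("M\u00fcller's PC", "en_US.UTF-8"))
```

The design point is the ordering: normalise to composed form, try the locale-specific rule, and only then fall back to decompose-and-strip. That way both input forms give the same result, and the German 'ue' is not lost.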
So it turns out that GNU libiconv is actually worse than the native BSD one. It converts 'ü' into literally '"u'. I'm not sure how this is _ever_ supposed to be helpful... It's starting to seem like the only good iconv on earth is gconv.
Created attachment 269462
hostname-helper: use GLib transliteration API
Use GLib's new transliteration API to avoid the shortcomings of iconv(). Also solve an existing problem where strings entered in decomposed forms wouldn't be properly transliterated. Always normalise before attempting transliteration.
Review of attachment 269462: Looks good. Please don't forget to bump the glib required version in configure.ac as well.
Attachment 269462 pushed as 4736b03 - hostname-helper: use GLib transliteration API
Done, and version bumped. Thanks.