Bug 724194 – improve hostname asciiification


Summary:	improve hostname asciiification


Status:	RESOLVED FIXED

Product:	gnome-control-center
Classification:	Core
Component:	shell
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Control-Center Maintainers
QA Contact:	Control-Center Maintainers

URL:
Whiteboard:

Depends on:	710142
Blocks:

Reported:	2014-02-11 23:47 UTC by Allison Karlitskaya (desrt)
Modified:	2014-02-21 12:58 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
hostname-helper: decompose before transliteration (2.69 KB, patch) 2014-02-12 19:30 UTC, Allison Karlitskaya (desrt)	none
hostname-helper: use GLib transliteration API (1.42 KB, patch) 2014-02-17 19:39 UTC, Allison Karlitskaya (desrt)	committed

Description Allison Karlitskaya (desrt) 2014-02-11 23:47:36 UTC

The test-hostname test is failing on FreeBSD.  Specifically:

  Müllers Computer

gets converted to

  mllers-computer

instead of the expected

  mullers-computer

Comment 1 Bastien Nocera 2014-02-12 11:38:21 UTC

I think this smells of a broken conversion to ASCII. It should be removing the diacritic, not the whole letter.

Comment 2 Allison Karlitskaya (desrt) 2014-02-12 12:22:33 UTC

I *suspect* (but I still have to check) that my test environment doesn't have proper UTF-8 locales installed on the system.  I think we should probably test for this brokenness in a direct way (i.e. via iconv) and disable this testcase in that case.

We do this for a few translation-related tests in GLib, for example.

I'll look into it further.

Comment 3 Allison Karlitskaya (desrt) 2014-02-12 19:30:28 UTC

Created attachment 268956
hostname-helper: decompose before transliteration

iconv() is not always clever enough to know how to do transliteration,
so when faced with characters like 'ü' the only thing it can do is to
drop them.  This is the case on FreeBSD, for example.

We can give iconv() some help by doing a normalisation pass on the input
string first.  This normalises the 'ü' into its decomposed form of 'u'
with a separate '¨' combining character.

This has been checked to produce the desired result on FreeBSD and Linux
with a non-German locale (ie: 'ü' -> 'u').

Unfortunately, this breaks Linux in German locales.  iconv() there
doesn't seem clever enough to handle 'ü' -> 'ue' when the 'ü' has been
decomposed, and we end up with 'u'.  Another solution is probably
required.
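
The decomposition step described in this patch can be sketched outside of iconv(). The following is a minimal Python illustration, not the patch itself: it assumes the standard `unicodedata` module, and the helper name `strip_diacritics` is hypothetical. It decomposes the input to NFD and then drops the combining marks, which is the 'ü' -> 'u' result reported above for FreeBSD and non-German Linux locales.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks such as U+0308.

    Hypothetical helper for illustration; it only mimics the effect
    described in the patch and does not call iconv().
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # "\u00fc" is the composed 'ü'.
    print(strip_diacritics("M\u00fcllers Computer"))
```

As the patch description notes, this approach can never produce the German 'ü' -> 'ue' mapping, because the locale-specific rule is lost once the letter has been split into 'u' plus a combining mark.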

Comment 4 Allison Karlitskaya (desrt) 2014-02-12 19:34:01 UTC

The only thing that comes to mind is that we could try to do both approaches and take the longer one as the result... that's starting to get a bit ugly, though...

Comment 5 Allison Karlitskaya (desrt) 2014-02-14 13:28:02 UTC

So I've been thinking a lot more about this bug. It's pretty interesting.

On Linux, right now, in German locales, if someone copies this string:

"Müller's PC"

into g-c-c then they get muellers-pc as their hostname.

but if they copy this string (the same text in decomposed form, 'u' followed by the combining diaeresis U+0308):

"Müller's PC"

then they get mullers-pc.

(you can test for yourself with 'LANG=de_DE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT')

Those two strings look the same (and indeed they are supposed to be treated equivalently) but one of them is in composed form and one is in decomposed form. The current algorithm treats them differently.

So even on Linux, this code is currently broken -- we should do a normalisation step before conversion.
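
To make the composed/decomposed distinction concrete, here is a small Python sketch using the standard `unicodedata` module. The variable names are illustrative only. It shows that the two renderings of the same name are different code point sequences, and that a normalisation pass makes them identical before any conversion is attempted.

```python
import unicodedata

# Composed form: 'ü' is the single code point U+00FC.
composed = "M\u00fcller's PC"
# Decomposed form: 'u' followed by U+0308 COMBINING DIAERESIS.
decomposed = "Mu\u0308ller's PC"

# The strings render identically but compare unequal.
print(composed == decomposed)

# After normalising both to the same form (NFC here), they are equal.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))
```

Any conversion that runs on the raw input sees two different strings, which is why the result depends on how the text happened to be encoded when it was copied.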

This is arguably a bug in iconv and maybe GLib should be doing a normalisation step first in order to avoid it. That's probably too ugly to do from within g_convert() (which doesn't always take Unicode as input).

So maybe we should have a nicer API for the relatively common case of "convert this UTF-8-with-accents string to its ASCII equivalent as per the given (or system) locale". This would definitely be nice to use from g_str_tokenize_and_fold() for the 'ascii alternates' as well.

Thinking more about the original issue, though, even if we pick a particular normalisation (say composed, since this is what works nicely on Linux), FreeBSD's system iconv is deficient. We might be able to use decomposition as a way to trick it into giving us 'ü' -> 'u' but we're never going to get 'ü' -> 'ue' in German.

It seems that we're not the first to come across this issue, either:

http://lists.freebsd.org/pipermail/freebsd-bugs/2013-December/054593.html

but we're in a slightly better situation in that GLib never advertised that //TRANSLIT is supported on g_convert(). Arguably, gnome-control-center is at fault here for trying to use an undocumented feature that only works on Linux.

Another solution could be to require GLib to link against GNU libiconv on FreeBSD (which is available via the package manager) in preference to using the deficient system version.

I think I'm leaning toward a combination of all of these as the proper solution:

- g-c-c should not use //TRANSLIT

- in fact, g-c-c should not call g_convert() at all

- we should have a proper "UTF8 to ASCII with transliteration" API in GLib

- this new API would handle issues of normalisation

- GLib should preferentially use GNU libiconv on systems that have deficient libc implementations

Comment 6 Allison Karlitskaya (desrt) 2014-02-16 16:12:19 UTC

So it turns out that GNU libiconv is actually worse than the native BSD one.  It converts 'ü' into literally '"u'.  I'm not sure how this is _ever_ supposed to be helpful...

It's starting to seem like the only good iconv on earth is gconv.

Comment 7 Allison Karlitskaya (desrt) 2014-02-17 19:39:28 UTC

Created attachment 269462
hostname-helper: use GLib transliteration API

Use GLib's new transliteration API to avoid the shortcomings of iconv().

Also solve an existing problem where strings entered in decomposed forms
wouldn't be properly transliterated.  Always normalise before attempting
transliteration.
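
The "always normalise, then transliterate" idea can be sketched as follows. This is a rough Python illustration under stated assumptions, not GLib's implementation: the function `to_ascii`, the `LOCALE_RULES` table, and the '?' fallback are all hypothetical, and the table lists only a handful of German rules to show the shape of the approach.

```python
import unicodedata

# Hypothetical per-locale overrides, keyed by language code. A real
# implementation would need far larger tables; these are assumptions.
LOCALE_RULES = {
    "de": {
        "\u00e4": "ae",  # ä
        "\u00f6": "oe",  # ö
        "\u00fc": "ue",  # ü
        "\u00df": "ss",  # ß
    },
}


def to_ascii(text: str, locale: str = "C") -> str:
    """Normalise to NFC, then transliterate each character to ASCII."""
    # Normalising first makes composed and decomposed input equivalent.
    text = unicodedata.normalize("NFC", text)
    rules = LOCALE_RULES.get(locale.split("_")[0], {})
    out = []
    for ch in text:
        if ch in rules:
            out.append(rules[ch])
            continue
        # Fallback: decompose and keep only the ASCII base characters.
        base = unicodedata.normalize("NFD", ch)
        ascii_part = "".join(c for c in base if ord(c) < 128)
        out.append(ascii_part if ascii_part else "?")
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("M\u00fcller's PC", "de_DE.UTF-8"))
    print(to_ascii("Mu\u0308ller's PC", "de_DE.UTF-8"))
    print(to_ascii("M\u00fcller's PC"))
```

Because the locale rules are applied after normalisation, the composed and decomposed spellings of the same name give the same result, and the locale-specific mapping is not lost the way it is when decomposed text is handed to iconv().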

Comment 8 Bastien Nocera 2014-02-21 10:19:26 UTC

Review of attachment 269462:

Looks good.

Please don't forget to bump the glib required version in configure.ac as well.

Comment 9 Allison Karlitskaya (desrt) 2014-02-21 12:58:02 UTC

Attachment 269462 pushed as 4736b03 - hostname-helper: use GLib transliteration API

Done, and version bumped.

Thanks