GNOME Bugzilla – Bug 114068
proposal for change in G_BROKEN_FILENAMES API
Last modified: 2011-02-18 16:07:18 UTC
Example of the situation: you use ISO-8859-2 file names on your filesystem, so you have to set G_BROKEN_FILENAMES. Then you copy files from the latest Red Hat, whose names are in UTF-8, and those names are mangled.

My suggestion for G_BROKEN_FILENAMES behavior is to evaluate the "brokenness" of each file name:
- If the locale is non-UTF-8: try to interpret every name as UTF-8; if that fails, try the locale-specific encoding.
- If the locale is UTF-8: try to interpret every name as UTF-8; if that fails, try to decode it using the value of the G_BROKEN_FILENAMES environment variable as the encoding name (e.g. ISO-8859-2).

This behavior could significantly help people planning to move their file systems to UTF-8.

Another option is to have two variables (and functions): G_BROKEN_FILENAMES and G_LOCALE_FILENAMES (or G_GUESS_FILENAME_CHARSET). That would not affect the existing API.

Do you think some of these ideas are acceptable?
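The rule proposed above can be sketched as follows. Python is used here only for brevity (GLib itself is C). The function name `decode_filename` and its parameters are hypothetical and not part of any GLib API; `fallback_charset` stands in for the value that G_BROKEN_FILENAMES would carry under this proposal.

```python
def decode_filename(raw, locale_charset, fallback_charset=None):
    """Hypothetical sketch of the proposal: UTF-8 first, then one fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Pick the second candidate depending on whether the locale is UTF-8.
    locale_is_utf8 = locale_charset.lower().replace("-", "") == "utf8"
    second = fallback_charset if locale_is_utf8 else locale_charset
    if second is None:
        raise ValueError("file name is not valid UTF-8 and no fallback is set")
    return raw.decode(second)


# A name written by an ISO-8859-2 system, read under a UTF-8 locale.
name = decode_filename(b"\xbe\xe1ba", "UTF-8", fallback_charset="ISO-8859-2")
```

The sketch only covers the file-name-to-UTF-8 direction; it says nothing about how a name would be converted back.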
What I think I'd like to see is G_FILENAME_ENCODING that takes precedence over G_BROKEN_FILENAMES when set.
Here is an (untested) patch for G_FILENAME_ENCODING.
Created attachment 19048: patch
I think it would be nice to have some value for G_FILENAME_ENCODING that means "encoding of the locale", maybe '@locale'. G_BROKEN_FILENAMES was, in retrospect, a poor choice of name since it offended people. Other than that, the patch looks fine to me.

We should file another bug (probably for 2.6) to add g_filename_get_display_name(), see:
http://mail.gnome.org/archives/gtk-devel-list/2003-October/msg00058.html
since G_FILENAME_ENCODING won't really fix Stanislav's problem here.

What you'd probably want to do is make it possible to have a list of encodings in G_FILENAME_ENCODING, where the first is used for g_filename_to_utf8() but g_filename_get_display_name() tries them in sequence, so you could have:

  G_FILENAME_ENCODING=@locale,iso-8859-2
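To illustrate the list idea, here is a minimal sketch in Python (for brevity only; it is not the attached patch). The helper names `filename_charsets` and `display_name` are hypothetical, and the handling of an unset variable and of names that fit no candidate are assumptions of this sketch.

```python
import locale
import os


def filename_charsets(environ=os.environ, locale_charset=None):
    """Return the charsets named by G_FILENAME_ENCODING, in order."""
    if locale_charset is None:
        locale_charset = locale.getpreferredencoding(False)
    value = environ.get("G_FILENAME_ENCODING", "")
    names = [n.strip() for n in value.split(",") if n.strip()]
    # "@locale" is replaced by the charset of the current locale.
    # An unset variable yields an empty list (assumption of this sketch).
    return [locale_charset if n == "@locale" else n for n in names]


def display_name(raw, charsets):
    """Try each charset in sequence; the result is for display only."""
    for cs in charsets:
        try:
            return raw.decode(cs)
        except (UnicodeDecodeError, LookupError):
            continue
    # Nothing fits: show replacement characters rather than fail (assumption).
    return raw.decode("utf-8", errors="replace")


charsets = filename_charsets({"G_FILENAME_ENCODING": "@locale,iso-8859-2"},
                             locale_charset="UTF-8")
shown = display_name(b"\xbe\xe1ba", charsets)
```

In this arrangement only `charsets[0]` would be used for the exact conversion, while the display helper is free to walk the whole list.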
I have encountered the identical problem in two other places:
1) Rhythmbox and MP3 tags: in Czech, depending on OS and version, people use tags in CP1250 (Windows), ISO-8859-2 (various Linuxes) or UTF-8 (latest RH Linux).
2) GIMP and layer names between Windows and Linux (at least when importing from 1.2).

Both are exactly the same problem as the original report, and both ask for a system-wide solution. It is relatively easy to recognize the charset in these cases: given a proper testing order, we pick the first conversion that does not fail.

So my other API idea is:

New variable (or default to a hardwired locale-based table):
  G_GUESS_CHARSET=UTF-8,ISO-8859-2,CP1250
If it is not set, the default locale-based table will be used.

New constant table, based on locale (in most cases without the charset name; collecting this will require submissions from people in other countries):
  const char *g_guessed_charsets[][2] = {
    { "cs_CZ", "UTF-8,ISO-8859-2,CP1250" },
    { "de_DE", "UTF-8,ISO-8859-15,ISO-8859-1,CP????" },
    { "cn_CN", ?????? },
  };

And new functions:
  gbool g_guess_charset(gstring)
or
  g_guessed_charset_to_utf8()
or so.

Applied to the file name problem:
G_FILENAME_ENCODING: If set, use this encoding for file system names. If unset, use the locale charset.
G_BROKEN_FILENAMES: If set, use g_guessed_charset_to_utf8() for conversion from file names (slower). If unset, use g_filename_to_utf8() for conversion from file names (faster).
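A minimal sketch of the lookup described above, in Python for brevity. The table holds only the entry the comment gives in full; the function name `guess_charset_list` and the "UTF-8" default for unknown locales are assumptions of this sketch, not something the proposal specifies.

```python
# Only cs_CZ is spelled out completely in the proposal; other rows were left open.
GUESSED_CHARSETS = {
    "cs_CZ": "UTF-8,ISO-8859-2,CP1250",
}


def guess_charset_list(locale_name, environ):
    """Return candidate charsets: G_GUESS_CHARSET if set, else the table."""
    override = environ.get("G_GUESS_CHARSET")
    if override:
        return override.split(",")
    # Strip a trailing charset or modifier such as ".UTF-8" or "@euro".
    lang = locale_name.split(".")[0].split("@")[0]
    # Unknown locales fall back to UTF-8 only (assumption of this sketch).
    return GUESSED_CHARSETS.get(lang, "UTF-8").split(",")


candidates = guess_charset_list("cs_CZ.UTF-8", {})
```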
If you run into places where strings are being passed around without a defined encoding, *FLAME HARD* and get it fixed. Any GLib guessing API is just a won't-work-well workaround. (What if you load up someone else's XCF...) A locale => encoding to guess table could be useful in some cases, especially if there is more text... guessing file *contents* encoding is more practical. There's a table in appendix D of http://freedesktop.org/Standards/desktop-entry-spec that could be used, though it should probably be updated for ISO-8859-15, etc. [ BTW, g_filename_get_display_name() is already there as bug 96531 ]
If it is a software bug, I can flame hard; but when it is a gap in the standard, it is a big problem. People get files from different sources with different encodings of ID3 tags, GIMP-1.2 layer names, CDDB titles etc., with no chance to really fix anything (except the bad standard; but old files will stay around). The only solution is a charset guesser. Fortunately, in many languages the simple heuristic "the first valid one is correct" works well (if the charset order is correct).

Proposal for creating the list of guessed charsets:

This table contains the list of charsets used for guessing strings in a certain language. While guessing, the program tries the character sets from this list and assumes that the proper charset is the first one that does not fail. Remember this when deciding on the charset order. Note that this algorithm is unable to detect a charset if the string is also valid in any of the previously listed charsets.

Maybe the following form will result in a smaller table:
  { "cs_CZ,sk_SK,pl_PL,ro_RO", "UTF-8,ISO-8859-2,CP1250"}
  { "de_DE,fr_FR,...", "UTF-8,ISO-8859-1,CP????"}

Note that a more sophisticated charset guesser library already exists: http://trific.ath.cx/software/enca/
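The ordering caveat can be made concrete with a small Python sketch (Python only for illustration; `first_valid` is a hypothetical helper). It relies on the fact that Python's iso-8859-2 codec accepts every byte value, so a charset listed after it is never reached.

```python
def first_valid(raw, charsets):
    """Return (charset, text) for the first charset that decodes raw."""
    for cs in charsets:
        try:
            return cs, raw.decode(cs)
        except UnicodeDecodeError:
            continue
    return None, None


# The Czech word "saty" with a caron on the s, as a CP1250 system writes it.
raw = "\u0161aty".encode("cp1250")

# ISO-8859-2 listed first: it accepts the bytes, so CP1250 is never tried
# and the first letter comes out as a control character.
charset, text = first_valid(raw, ["utf-8", "iso-8859-2", "cp1250"])

# CP1250 listed first: this name is recovered correctly.
charset2, text2 = first_valid(raw, ["utf-8", "cp1250", "iso-8859-2"])
```

Swapping the order only moves the problem, since most ISO-8859-2 names are also valid CP1250; this is the kind of case where a statistical guesser is needed.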
Committed, with the preparations for g_filename_get_display_name() outlined by Owen: G_FILENAME_ENCODING can be a list, and @locale is recognized.
I'm sorry for joining this discussion late. I don't know why this bug is closed. The reporter's suggestion was to "guess the filename's encoding between UTF-8 and a locale-dependent encoding (list)", but there is no guessing routine in the function get_filename_charset. The function just permanently selects the "first" candidate in G_FILENAME_ENCODING.

The filename charset should not be static; it should be checked (or guessed) before every conversion, because differently encoded file names may exist on the same system. For example, this routine is needed in g_filename_to/from_utf8:

  if G_FILENAME_ENCODING=UTF-8,ISO-8859-2,CP1250 (cs_CZ)

  foreach encoding in UTF-8 ISO-8859-2 CP1250
  do
    if converting from/to UTF-8 with $encoding succeeds
      the filename encoding is $encoding; exit
    else
      continue with the next encoding
  done
  if every attempt fails
    tag the name as "invalid filename"
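The per-name loop requested above could look like this in Python (for illustration only; `listing_to_utf8` and the "(invalid filename)" marker are hypothetical names). Each name is guessed independently, so names in different encodings can sit in the same directory.

```python
def listing_to_utf8(raw_names, encodings):
    """Guess each raw name separately; tag names that fit no encoding."""
    result = []
    for raw in raw_names:
        for enc in encodings:
            try:
                result.append((enc, raw.decode(enc)))
                break
            except UnicodeDecodeError:
                continue
        else:
            # Every candidate failed: tag the name instead of guessing further.
            result.append((None, "(invalid filename)"))
    return result


# One UTF-8 name and one ISO-8859-2 name from the same directory.
names = [b"\xc5\xbe\xc3\xa1ba", b"\xbe\xe1ba"]
decoded = listing_to_utf8(names, ["utf-8", "iso-8859-2", "cp1250"])
```

This covers only the direction from file name to UTF-8; the loop gives no way to know which encoding to use when converting a UTF-8 string back into a file name.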
(In my previous comment, "g_filename_to/from_utf8" should read "g_filename_to_utf8".)

g_filename_from_utf8 is a separate problem, because I am not alone on the network. Suppose I use UTF-8 file names locally and, thanks to a good guessing system, have no file name encoding problems. I connect to an FTP server with Nautilus through gnome-vfs. The FTP server's file names are not in UTF-8, but I can read them thanks to the guessing system. (Currently I cannot read those names, because this bug was closed in error.) Now I want to upload a file to the FTP server by drag and drop. Which file name encoding should be used for the upload? How can we determine that the remote server's file name encoding is not UTF-8?

GTK 2.0's position, "let's use UTF-8 file names", is not bad. But it only works well when most other people also use UTF-8 file names. Almost all existing systems do not (including every MS Windows machine). So if we really want a UTF-8 utopia, GTK should at least provide a usable way to coexist with the native-encoding world.

Personally, I prefer Stanislav Brabec's solution to the file-name-to-UTF-8 problem (a list of candidate encodings per locale). But I have no good idea for the UTF-8-to-file-name direction, especially for remote file names.
Remote filenames are beyond the scope of g_filename_from/to_utf8.
OK, the remote file name encoding problem is not a good issue to discuss here. However, my first comment is about local file names. I am in the ko_KR locale and cannot see EUC-KR and UTF-8 file names together, even though this bug is marked resolved. There is no guessing routine at all.
g_filename_to/from_utf8() have to allow exact round trips, so no guessing is possible. For guessing, g_filename_get_display_name() is proposed in bug 96531.
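A small Python sketch (illustration only, with hypothetical helper names) of why guessing conflicts with exact round trips: a name whose ISO-8859-2 bytes also happen to be valid UTF-8 is decoded differently by a guesser, and converting the result back no longer yields the original bytes.

```python
def guess_decode(raw, charsets):
    """Decode raw with the first charset that accepts it."""
    for cs in charsets:
        try:
            return raw.decode(cs)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate charset fits")


FILENAME_CHARSET = "iso-8859-2"

# Two ISO-8859-2 letters whose bytes also form one valid UTF-8 character.
raw = b"\xc3\xa9"

# Fixed encoding: decoding and re-encoding returns the original bytes.
fixed = raw.decode(FILENAME_CHARSET).encode(FILENAME_CHARSET)

# Guessing: UTF-8 matches first, so the bytes written back differ from raw
# and would no longer name the same file.
guessed = guess_decode(raw, ["utf-8", FILENAME_CHARSET]).encode(FILENAME_CHARSET)
```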