Bug 114068 – proposal for change in G_BROKEN_FILENAMES API

Summary:	proposal for change in G_BROKEN_FILENAMES API


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	2.2.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

Reported:	2003-05-30 23:15 UTC by Stanislav Brabec
Modified:	2011-02-18 16:07 UTC

GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
patch (2.57 KB, patch) 2003-08-08 23:34 UTC, Matthias Clasen

Description Stanislav Brabec 2003-05-30 23:15:00 UTC

Example situation: you use ISO-8859-2 file names on your filesystem, so
you have to set G_BROKEN_FILENAMES. Now you copy files with UTF-8 names
from the latest Red Hat. Their names are mangled.

My suggestion for G_BROKEN_FILENAMES behavior is to evaluate filename
"brokenness":

  - If the locale is non-UTF-8:
    Try to recognize every name as UTF-8; if that fails, treat it as
    locale-specific.
  - If the locale is UTF-8:
    Try to recognize every name as UTF-8; if that fails, try to decode it
    using the value of the G_BROKEN_FILENAMES envvar as the charset name
    (e.g. ISO-8859-2).

This behavior could significantly help people planning to move their file
systems to UTF-8.
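
The "try UTF-8 first, then fall back" rule could be sketched as follows. This is only an illustration under stated assumptions: is_valid_utf8() and pick_filename_charset() are hypothetical helper names, not GLib functions, and the fallback charset is simply passed in by the caller instead of being read from the locale or from an environment variable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the len bytes at s form well-formed UTF-8
   (no overlong forms, no surrogates, nothing above U+10FFFF). */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;            /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded code point and its minimum value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;       /* stray continuation byte or invalid lead */

        if (len - i <= extra) return false;   /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* Hypothetical helper: names that are valid UTF-8 are treated as UTF-8,
   everything else is assumed to be in fallback_charset. */
const char *pick_filename_charset(const char *name, const char *fallback_charset)
{
    if (is_valid_utf8((const unsigned char *)name, strlen(name)))
        return "UTF-8";
    return fallback_charset;
}
```

With "ISO-8859-2" as the fallback, the UTF-8 bytes C5 BE for "ž" are reported as UTF-8, while the single ISO-8859-2 byte BE is reported as ISO-8859-2. Pure ASCII names are valid in both, so the choice does not matter for them.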


There is also the option of having two variables (and functions):

G_BROKEN_FILENAMES and G_LOCALE_FILENAMES (or G_GUESS_FILENAME_CHARSET).

That would not affect the existing API.

Do you think any of these ideas are acceptable?

Comment 1 Owen Taylor 2003-05-30 23:17:58 UTC

What I think I'd like to see is G_FILENAME_ENCODING that
takes precedence over G_BROKEN_FILENAMES when set.

Comment 2 Matthias Clasen 2003-08-08 23:33:14 UTC

Here is an (untested) patch for G_FILENAME_ENCODING.

Comment 3 Matthias Clasen 2003-08-08 23:34:24 UTC

Created attachment 19048
patch

Comment 4 Owen Taylor 2003-11-05 16:49:09 UTC

I think it would be nice to have some value for G_FILENAME_ENCODING
that means "encoding of locale" - maybe '@locale'.
G_BROKEN_FILENAMES was, in retrospect, a poor choice of name
since it offended people.

Other than that, the patch looks fine to me.

We should file another bug (probably for 2.6) to add 
g_filename_get_display_name() - see:

http://mail.gnome.org/archives/gtk-devel-list/2003-October/msg00058.html

G_FILENAME_ENCODING alone won't really fix Stanislav's problem
here. What you'd probably want to do is make it possible
to have a list of encodings in G_FILENAME_ENCODING, where the
first is used for g_filename_to_utf8(), but 
g_filename_get_display_name() tries them in sequence, so you
could have:

 G_FILENAME_ENCODING=@locale,iso-8859-2
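
A minimal sketch of how such a list value might be split, assuming a comma-separated list in which '@locale' stands for the locale's charset. parse_encoding_list(), MAX_CHARSETS and MAX_NAME are hypothetical names for this illustration, not part of GLib or of the attached patch, and the locale charset is passed in by the caller rather than queried from the system.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHARSETS 8    /* assumed upper bound on list entries */
#define MAX_NAME     32   /* assumed upper bound on a charset name */

/* Split a comma-separated encoding list into out[], replacing the
   special entry "@locale" with locale_charset. Empty entries are
   skipped. Returns the number of entries stored. */
size_t parse_encoding_list(const char *value, const char *locale_charset,
                           char out[][MAX_NAME], size_t max_out)
{
    size_t n = 0;
    const char *p = value;

    while (*p != '\0' && n < max_out) {
        size_t len = strcspn(p, ",");   /* length of the current entry */
        if (len > 0) {
            if (len == 7 && strncmp(p, "@locale", 7) == 0)
                snprintf(out[n], MAX_NAME, "%s", locale_charset);
            else
                snprintf(out[n], MAX_NAME, "%.*s", (int)len, p);
            n++;
        }
        p += len;
        if (*p == ',')
            p++;                        /* step over the separator */
    }
    return n;
}
```

For the value "@locale,iso-8859-2" in a UTF-8 locale this yields two entries, "UTF-8" and "iso-8859-2". Under the scheme described above, the first would be used for g_filename_to_utf8() and the whole list for display-name guessing.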

Comment 5 Stanislav Brabec 2003-11-05 17:25:54 UTC

I have encountered the identical problem in two other places:

1) rhythmbox and MP3 tags - in Czech, depending on OS and version,
people are using tags in CP1250 (Windows), ISO-8859-2 (misc Linuxes),
UTF-8 (latest RH Linux).

2) GIMP and layer names between Windows and Linux (at least if
importing from 1.2).

Both are exactly the same problem as the original report, and both ask
for a system-wide solution.

It's relatively easy to recognize the charset in these cases: given a
proper testing order, we pick the first conversion that does not fail.

So my other API idea is:

New variable:
G_GUESS_CHARSET=UTF-8,ISO-8859-2,CP1250
If not set, a default hardwired locale-based table will be used.

New constant table, keyed by locale (in most cases without the charset
name; collecting this will require submissions from people in other
countries):
static const char *g_guessed_charsets[][2] = {
  { "cs_CZ", "UTF-8,ISO-8859-2,CP1250" },
  { "de_DE", "UTF-8,ISO-8859-15,ISO-8859-1,CP????" },
  { "cn_CN", "??????" },
};

And new functions:
gbool g_guess_charset(gstring) or g_guessed_charset_to_utf8() or so

Adapting this to the filename problem:

G_FILENAME_ENCODING: If set, use this encoding for file system names.
If unset, use the locale charset.

G_BROKEN_FILENAMES: If set, use g_guessed_charset_to_utf8() for
converting from filenames (slower). If unset, use g_filename_to_utf8()
for converting from filenames (faster).
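
The locale-based table from this comment could be looked up roughly like this. It is a sketch only: struct charset_guess, guess_table and guessed_charsets_for_locale() are hypothetical names, the table holds just the one complete entry given above, and returning "UTF-8" for unknown locales is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

struct charset_guess {
    const char *locale;    /* language_TERRITORY, without a charset suffix */
    const char *charsets;  /* comma-separated list, in testing order */
};

/* Only the entry that is fully specified in the proposal is listed. */
const struct charset_guess guess_table[] = {
    { "cs_CZ", "UTF-8,ISO-8859-2,CP1250" },
};

/* Look up the guessing order for a locale name such as "cs_CZ" or
   "cs_CZ.UTF-8". Any ".charset" or "@modifier" suffix is ignored.
   Falls back to plain "UTF-8" (an assumption) for unknown locales. */
const char *guessed_charsets_for_locale(const char *locale)
{
    size_t len = strcspn(locale, ".@");
    size_t n = sizeof guess_table / sizeof guess_table[0];

    for (size_t i = 0; i < n; i++) {
        const char *key = guess_table[i].locale;
        if (strlen(key) == len && strncmp(key, locale, len) == 0)
            return guess_table[i].charsets;
    }
    return "UTF-8";
}
```

A G_GUESS_CHARSET value, when set, would simply take the place of the table lookup.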

Comment 6 Owen Taylor 2003-11-05 17:42:42 UTC

If you run into places where strings are being passed
around without a defined encoding, *FLAME HARD* and
get it fixed. Any GLib guessing API is just a 
won't-work-well workaround. (What if you load up someone
else's XCF...)

A locale => encoding to guess table could be useful in 
some cases, especially if there is more text... guessing
file *contents* encoding is more practical. There's
a table in appendix D of 
http://freedesktop.org/Standards/desktop-entry-spec
that could be used, though it should probably be updated 
for ISO-8859-15, etc.

[ BTW, g_filename_get_display_name() is already there as 
  bug 96531 ]

Comment 7 Stanislav Brabec 2003-11-05 18:02:04 UTC

If it is a software bug, I can flame hard, but when it is a gap in the
standard, it is a big problem.

People are getting files from different sources with different
encodings of ID3 tags, GIMP-1.2 layer names, CDDB titles etc., with no
chance to really fix anything (except the bad standard; but old files
will stay around). The only solution is a charset guesser.

Hopefully, in many languages the simple heuristic "the first valid one
is correct" works well (if the charset order is correct).

Proposal for creating the list of guessed charsets:

This table contains the list of charsets used for string guessing for
a certain language. While guessing, the program tries the character sets
from this list and assumes that the proper charset is the first one that
does not fail. Keep this in mind when deciding on the charset order. Note
that this algorithm is unable to detect a charset if the string is also
valid in any of the previously listed charsets.

Maybe the following form would result in a smaller table:

{ "cs_CZ,sk_SK,pl_PL,ro_RO", "UTF-8,ISO-8859-2,CP1250"}
{ "de_DE,fr_FR,...", "UTF-8,ISO-8859-1,CP????"}

Note that a more sophisticated charset guesser library already exists:
http://trific.ath.cx/software/enca/

Comment 8 Matthias Clasen 2003-11-05 23:16:45 UTC

Committed, with the preparations for g_filename_get_display_name()
outlined by Owen: G_FILENAME_ENCODING can be a list, and @locale is
recognized.

Comment 9 Eungkyu Song 2004-05-18 22:05:51 UTC

I'm sorry for joining this discussion late.

I don't know why this bug is closed. The reporter's suggestion was "guess the
filename's encoding between UTF-8 and locale dependent encoding (list)", but
there is no guessing routine in the function get_filename_charset. The function
just selects G_FILENAME_ENCODING's "first" candidate permanently. The filename
charset should not be static; it should be checked (or guessed) before every
conversion, because differently encoded filenames may exist on the same system.

For example, this routine is needed in the function g_filename_to/from_utf8.

If G_FILENAME_ENCODING=UTF-8,ISO-8859-2,CP1250 (cs_CZ):

for encoding in UTF-8 ISO-8859-2 CP1250
do
    if converting $encoding to/from UTF-8 succeeds
        exit: the filename encoding is $encoding
    else
        continue with the next encoding
done

if all attempts fail
    tag the name as "invalid filename"
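
The loop above might look like this in C for the to-UTF-8 direction. This is a sketch that uses POSIX iconv() directly instead of GLib; convert_to_utf8() and first_valid_charset() are hypothetical names, and the charset names are assumed to be ones the system's iconv knows.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from `charset` to UTF-8.
   Returns 0 on success, -1 if the charset is unknown, the input is
   not valid in that charset, or the result does not fit in `out`. */
int convert_to_utf8(const char *charset, const char *in,
                    char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;          /* iconv does not modify the input */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left = out_size - 1;  /* keep room for the terminator */

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}

/* Try each charset in order and return the first one for which the
   conversion succeeds; `out` then holds the UTF-8 form of `name`.
   Returns NULL if every attempt fails ("invalid filename"). */
const char *first_valid_charset(const char *const *charsets, size_t n,
                                const char *name, char *out, size_t out_size)
{
    for (size_t i = 0; i < n; i++) {
        if (convert_to_utf8(charsets[i], name, out, out_size) == 0)
            return charsets[i];
    }
    return NULL;
}
```

With the order UTF-8, ISO-8859-2, CP1250, a name containing the lone byte BE fails the UTF-8 attempt and is accepted as ISO-8859-2. Since ISO-8859-2 assigns a character to every high byte, the CP1250 entry is unlikely ever to be reached in this order, which is the limitation of "first valid wins" noted in comment 7.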

Comment 10 Eungkyu Song 2004-05-18 22:56:51 UTC

(In the previous comment, g_filename_to/from_utf8 should read g_filename_to_utf8.)

g_filename_from_utf8 is a different problem, because other machines are
involved.

Let's suppose that I'm using UTF-8 filenames and there is no filename
encoding problem, thanks to a good guessing system. I'm connected to an FTP
server with Nautilus through gnome-vfs. The FTP server's filename encoding is
not UTF-8, but I can read the names thanks to the good guessing system.
(Currently I can't read those filenames, because this bug was wrongly closed.)
I want to upload a file to the FTP server by drag and drop.

Which filename encoding should be used for the upload?
How can we determine that the remote server's filename encoding is not UTF-8?

GTK 2.0's claim, "let's use UTF-8 filenames", is not bad. But it only works
well when most other people also use UTF-8 filenames. Almost no existing
system uses UTF-8 filenames (including every MS Windows machine).

So, if we really want a UTF-8 utopia, GTK should at least provide a usable way
to live together with the native-encoding world.

Personally, I prefer Stanislav Brabec's solution to the filename-to-UTF-8
problem (a list of available encodings per locale). But I don't have a good
idea for the filename-from-UTF-8 direction (especially for remote filenames).

Comment 11 Matthias Clasen 2004-05-19 01:43:07 UTC

remote filenames are beyond the scope of g_filename_from/to_utf8.

Comment 12 Eungkyu Song 2004-05-19 05:36:35 UTC

OK. Remote filename encoding is not a good issue to discuss here.

However, my first comment was about local filenames. I'm in the ko_KR locale and
can't see EUC-KR and UTF-8 filenames together, even though this bug is marked
resolved.

There is no guessing routine at all.

Comment 13 Owen Taylor 2004-05-19 16:57:34 UTC

g_filename_to/from_utf8() have to allow exact round trips. 
No guessing is possible. 

For guessing, g_filename_get_display_name() is proposed in 
bug 96531.
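
One way to see the round-trip problem is the sketch below. It uses POSIX iconv() rather than GLib; convert_charset() and guessing_to_utf8() are hypothetical names, and ISO-8859-2 is assumed to be the configured filename encoding. An ISO-8859-2 name whose bytes happen to be valid UTF-8 is guessed as UTF-8, and converting the result back to ISO-8859-2 produces different bytes, so the original file can no longer be opened under the converted name.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from charset `from` to
   charset `to`. Returns 0 on success, -1 on any failure. */
int convert_charset(const char *from, const char *to,
                    const char *in, char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;          /* iconv does not modify the input */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left = out_size - 1;  /* keep room for the terminator */

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}

/* A guessing decoder: accept the name as UTF-8 if it validates,
   otherwise decode it as ISO-8859-2. */
int guessing_to_utf8(const char *name, char *out, size_t out_size)
{
    if (convert_charset("UTF-8", "UTF-8", name, out, out_size) == 0)
        return 0;
    return convert_charset("ISO-8859-2", "UTF-8", name, out, out_size);
}
```

Take the on-disk name with bytes C3 A9, which is "ĂŠ" in ISO-8859-2. guessing_to_utf8() accepts it as the UTF-8 encoding of "é", and convert_charset("UTF-8", "ISO-8859-2", ...) then yields the single byte E9, not the original C3 A9.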