Bug 169591 – Non-UTF-8 encodings support



Summary:	Non-UTF-8 encodings support


Status:	RESOLVED FIXED

Product:	sound-juicer
Classification:	Applications
Component:	metadata
Version:	2.14.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Sound Juicer Maintainers
QA Contact:	Sound Juicer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-03-08 11:39 UTC by vladislav safronov
Modified:	2007-09-01 04:48 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
MUSICBRAINZ_DEBUG=1 sound-juicer >log (32.81 KB, text/plain) 2005-03-08 14:59 UTC, vladislav safronov
patch to support 8bit encodings other than ISO88591 (3.13 KB, patch) 2005-03-17 18:29 UTC, George Fufutos	rejected
encoding patch (1.93 KB, patch) 2007-01-06 04:23 UTC, Ka-Hing Cheung	none
another patch that does freedb detection (2.48 KB, patch) 2007-01-06 19:51 UTC, Ka-Hing Cheung	committed
screenshot (34.31 KB, image/png) 2007-05-15 20:19 UTC, Jean-François Fortin Tam
sample cd cover with tracks listing (271.00 KB, image/jpeg) 2007-07-16 19:00 UTC, Jean-François Fortin Tam

Description vladislav safronov 2005-03-08 11:39:02 UTC

MusicBrainz and freedb/CDDB send information about Cyrillic CDs
in "CP1251" encoding, not in UTF-8. That breaks all the results of ripping:
tags and filenames. Could you add a feature, via an environment variable or similar, to set a
custom encoding for the MusicBrainz database and make sound-juicer convert all
incoming text strings from MusicBrainz from "CP1251" (or SJ_CDDB_BROKEN_ENCODING)
to UTF-8 before any later use for writing tags, filenames, etc.?

Comment 1 vladislav safronov 2005-03-08 11:42:12 UTC

MusicBrainz and freedb/CDDB send information about Cyrillic CDs
in "CP1251" encoding, not in UTF-8. That breaks all the results of ripping:
tags and filenames. Could you add a feature (an environment variable or similar) that allows
setting a custom encoding for the MusicBrainz database and makes sound-juicer
convert all incoming text strings from MusicBrainz from "CP1251" (or
SJ_CDDB_BROKEN_ENCODING) to UTF-8 before any later use for writing
tags, filenames, etc.?

Comment 2 Ross Burton 2005-03-08 12:13:01 UTC

(note: MusicBrainz and FreeDB/CDDB have nothing to do with each other)

Sound-Juicer makes this call when creating the MusicBrainz object:

  mb_UseUTF8 (self->priv->mb, TRUE);

So all strings from MusicBrainz should be in UTF-8.  If this is not the case,
this is a bug in libmusicbrainz.

Please run SJ like this:

 MUSICBRAINZ_DEBUG=1 sound-juicer >log

Then access a CD with Cyrillic song names, close SJ, and attach the contents of
"log" to this bug.

Comment 3 vladislav safronov 2005-03-08 14:59:48 UTC

Created attachment 38415
MUSICBRAINZ_DEBUG=1 sound-juicer >log

MUSICBRAINZ_DEBUG=1 sound-juicer >log

with CD that has cyrillic titles.

Comment 4 vladislav safronov 2005-03-08 15:13:47 UTC

To make this log readable, i.e. correct UTF-8, I have to do the following:

$ iconv -f UTF8 -t LATIN1 sj.log > sj.latin1.log
$ iconv -f CP1251 -t UTF8 sj.latin1.log > sj.utf8.log

Now all the Cyrillic characters are correct.

/vlad

Comment 5 James Henstridge 2005-03-16 15:55:43 UTC

Looks like musicbrainz is proxying the data from freedb, but getting the
encoding wrong.  The problem would either be in what freedb is sending, or how
the musicbrainz server is interpreting it.

The best fix would be to get the problem albums imported into the musicbrainz
database, with the disc ID associated correctly.

Comment 6 vladislav safronov 2005-03-16 21:22:06 UTC

I checked freedb; it sends the same broken encoding.
I know UTF-8 will solve all these problems some day, but today non-Latin discs
need this patch or some workaround for the problem.

Freedb simply converts from a 1-byte encoding to UTF-8 without
knowing which encoding it actually was.

Comment 7 George Fufutos 2005-03-16 22:57:50 UTC

I would like to confirm the bug for ISO8859-7 (Greek) encoded tags and
support vladislav's request for a fix.

I would also like to report that my combination of 
sound-juicer (0.5.10.1) and rhythmbox (0.8.3) does not handle UTF-8 correctly:
In sound-juicer i write: Αργύπνια (insomnia) and
in rhythmbox i get: ÎÏÎ³ÏÏÎ½Î¹Î±

I know that my program versions are not the latest, so I would just
like to ask for closer coordination of sound-juicer, rhythmbox and 
tagtool (have you heard of it?) with regard to tags.

thanks

Comment 8 vladislav safronov 2005-03-17 09:24:25 UTC

I am also just trying to say that "snow is white".
We only need a few lines that convert the incoming strings from MusicBrainz,
as in this example:
--
$ iconv -f UTF8 -t LATIN1 sj.log > sj.latin1.log
$ iconv -f <WHAT_EVER_ORIGINAL_1_BYTE_ENCODING> -t UTF8 sj.latin1.log > sj.utf8.log
--

Comment 9 Christophe Fergeau 2005-03-17 09:28:29 UTC

Applications sending wrongly encoded entries to freedb should also be kicked
with a clue stick (or rather, their users should notify the app writer that
there's a bug in their app), I spent 5 minutes reading the FAQ on freedb, and it
seems pretty obvious from reading it that data sent to freedb should either be
ISO8859-1 or UTF-8...

Comment 10 vladislav safronov 2005-03-17 09:55:10 UTC

I know this is not an SJ bug; it is a Windows-world legacy.
Sound Juicer is trying to be #1 in GNOME, 
but for non-LATIN1 people SJ is useless with this problem.

OK, don't fix SJ, it's up to you anyway, but please add some comments in the C code 
showing where we could insert iconv code to fix the encoding problem.

Comment 11 Christophe Fergeau 2005-03-17 10:04:29 UTC

Don't get me wrong, I don't have *any* power to decide what sound-juicer will do
and won't do, that's up to Ross.
I was just commenting about that issue, because such bugs are always reported
against the app *reading* tags (it happened with rhythmbox/gstreamer too for
example), and the only way to fix that in the tag consumer is to add ugly
workarounds to support broken (as in 'programs that don't follow the spec') tag writers.
So I was pointing out that even if workarounds on the tag consumer side may
make sense, what *really* must be fixed is the app writing the wrong tags. So I
wanted to make sure people are aware they shouldn't only report bugs against the
tag reading apps, but also (and it's more important imo) against the app writing
the tags.

Comment 12 vladislav safronov 2005-03-17 10:33:11 UTC

OK, I understand.
I am starting to ping MusicBrainz. It seems the problem is in their conversion
routines.

Comment 13 George Fufutos 2005-03-17 14:20:21 UTC

I would like to add that all of the following:
libmusicbrainz (used by sj)
libid3tag (used by gstreamer)
id3lib (used by tagtool)
assume that the tags are in either ISO88591 or UTF8.

vladislav: can you send the URL for whatever you submitted to libmusicbrainz?

Comment 14 vladislav safronov 2005-03-17 15:19:29 UTC

from http://www.musicbrainz.org/faq.html

Q: Can MusicBrainz handle international metadata?
A: Yes.  MusicBrainz uses UTF-8 for all its data, which means that all the data
is stored in UNICODE and supports lots of different languages.

Well, in their web search the CD title looks OK (UTF-8), but libmusicbrainz returns
broken UTF-8.

=== so I sent them:

Hi,

I read in your ToDo list
"Test UNICODE conversion support"

I would like to be a volunteer for the task.
Currently I have a huge problem with Sound Juicer, which fetches
CD info from MusicBrainz.

-- fetch from musicbrainz
result: <?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc  = "http://purl.org/dc/elements/1.1/"
xmlns:mq  = "http://musicbrainz.org/mm/mq-1.1#"
xmlns:mm  = "http://musicbrainz.org/mm/mm-2.1#"
xmlns:az  = "http://www.amazon.com/gp/aws/landing.html#">
<mq:Result>
<mq:status>OK</mq:status>
<mm:albumList>
<rdf:Bag>
<rdf:li rdf:resource="freedb:genid1"/>
</rdf:Bag>
</mm:albumList>
</mq:Result>
<mm:Artist rdf:about="freedb:genid2">
<dc:title>Âàëåðèÿ</dc:title>
^^^^^^^^^^^^^^^^^ wrong
--
It claims to be UTF-8, but it's not actually UTF-8!!
It is broken LATIN1, which can only be fixed this way:
$ iconv -f UTF8 -t LATIN1 mb.log > mb.latin1.log
$ iconv -f <WHAT_EVER_ORIGINAL_1_BYTE_ENCODING> -t UTF8 mb.latin1.log > mb.utf8.log

// my skills: Perl, mod_perl, Apache, SQL, Mason, Linux
// very basic C++, basic Java.

Comment 15 vladislav safronov 2005-03-17 15:19:58 UTC

to helpwanted@musicbrainz.org

Comment 16 George Fufutos 2005-03-17 18:29:15 UTC

Created attachment 38859
patch to support 8bit encodings other than ISO88591

Hi, I made a quick patch for sound-juicer-0.5.15.

Back up the original file:

cp sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c.bak

and then apply the patch.

It uses the environment variable SJ_ENC, e.g. for Greek:
setenv SJ_ENC ISO88597

Comment 17 George Fufutos 2005-03-17 19:15:02 UTC

Comment on attachment 38859
patch to support 8bit encodings other than ISO88591

>--- sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c.bak      2005-03-17 17:18:29.743689072 +0200
>+++ sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c  2005-03-17 20:28:44.966311456 +0200
>@@ -38,6 +38,13 @@
> #include "sj-genres.h"
> #include "cd-drive.h"
> 
>+/** Use hack for titles encoded as ISO8859? */
>+#define ENC_LOCAL
>+
>+#ifdef ENC_LOCAL
>+#include <iconv.h>
>+#endif
>+
> struct SjMetadataMusicbrainzPrivate {
>   GError *construct_error;
>   musicbrainz_t mb;
>@@ -283,6 +290,54 @@
>   mb_SetProxy (priv->mb, priv->http_proxy, priv->http_proxy_port);
> }
> 
>+#ifdef ENC_LOCAL
>+static char *enc;
>+static iconv_t iconv_la;
>+static iconv_t iconv_gr;
>+
>+static void
>+mb_iconv_open()
>+{
>+  enc = getenv("SJ_ENC");
>+  if (enc == NULL) {
>+    enc = strdup("ISO88591");
>+  }
>+  iconv_la = iconv_open("ISO88591", "UTF8");
>+  iconv_gr = iconv_open("UTF8", enc);
>+}
>+
>+static void
>+mb_iconv_close()
>+{
>+  iconv_close(iconv_la);
>+  iconv_close(iconv_gr);
>+}
>+
>+static void
>+mb_iconv_convert(char * data) {
>+  /** Same as in lookup_cd */
>+#define MB_BUFFER_SIZE 256
>+
>+  size_t bytes_in;
>+  size_t bytes_out;
>+  char *ptr_in;
>+  char *ptr_out;
>+  char data_iconv[MB_BUFFER_SIZE];
>+
>+  bytes_in  = MB_BUFFER_SIZE;
>+  bytes_out = MB_BUFFER_SIZE;
>+  ptr_in  = data;
>+  ptr_out = data_iconv;
>+  iconv(iconv_la, &ptr_in, &bytes_in, &ptr_out, &bytes_out);
>+
>+  bytes_in  = MB_BUFFER_SIZE; 
>+  bytes_out = MB_BUFFER_SIZE;
>+  ptr_in  = data_iconv;
>+  ptr_out = data;
>+  iconv(iconv_gr, &ptr_in, &bytes_in, &ptr_out, &bytes_out);
>+}
>+#endif
>+
> static gpointer
> lookup_cd (SjMetadata *metadata)
> {
>@@ -338,6 +393,10 @@
>     return priv->albums;
>   }
> 
>+#ifdef ENC_LOCAL
>+  mb_iconv_open();
>+#endif
>+
>   for (i = 1; i <= num_albums; i++) {
>     int num_tracks;
>     AlbumDetails *album;
>@@ -346,6 +405,9 @@
>     album = g_new0 (AlbumDetails, 1);
> 
>     if (mb_GetResultData(priv->mb, MBE_AlbumGetAlbumName, data, MB_BUFFER_SIZE)) {
>+#ifdef ENC_LOCAL
>+      mb_iconv_convert(data);
>+#endif
>       album->title = g_strdup (data);
>     } else {
>       album->title = g_strdup (_("Unknown Title"));
>@@ -358,6 +420,9 @@
>       album->artist = g_strdup (_("Various"));
>     } else {
>       if (data && mb_GetResultData1(priv->mb, MBE_AlbumGetArtistName, data, MB_BUFFER_SIZE, 1)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         album->artist = g_strdup (data);
>       } else {
>         album->artist = g_strdup (_("Unknown Artist"));
>@@ -384,10 +449,16 @@
>       track->number = j; /* replace with number lookup? */
> 
>       if (mb_GetResultData1(priv->mb, MBE_AlbumGetTrackName, data, MB_BUFFER_SIZE, j)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         track->title = g_strdup (data);
>       }
> 
>       if (mb_GetResultData1(priv->mb, MBE_AlbumGetArtistName, data, MB_BUFFER_SIZE, j)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         track->artist = g_strdup (data);
>       }
> 
>@@ -404,6 +475,10 @@
>     albums = g_list_append (albums, album);
>   }
> 
>+#ifdef ENC_LOCAL
>+  mb_iconv_close();
>+#endif
>+
>   /* For each album, we need to insert the duration data if necessary
>    * We need to query this here because otherwise we would flush the
>    * data queried from the server */

Comment 18 vladislav safronov 2005-03-17 20:05:30 UTC

Thanks for the patch! I will check it tomorrow.

I also found another way to fix the problem!
I registered at www.musicbrainz.org and imported the CD info from freedb.org
with the proper encoding through

http://www.musicbrainz.org/freedb/freedb.html

it worked for me!

Comment 19 James Henstridge 2005-03-18 03:35:07 UTC

Vladislav/George: if the musicbrainz server is sending badly transcoded data
back, then the musicbrainz server is what needs to be fixed.

When you do a musicbrainz CD lookup, the server does the following:
 1. check if the CD disc-id exists in the musicbrainz database, and return
    that info if so.
 2. perform a lookup in FreeDB, and return that data to the client if there
    are any results.

The data in the musicbrainz database is sent as valid UTF-8.  If the data from
FreeDB is correctly tagged with the encoding, then it should also come through
okay.  However if the FreeDB data is not tagged with an encoding, then it will
be transcoded from Latin-1 to UTF-8.

If you go through transcoding all the results from the musicbrainz query UTF-8
-> Latin 1 -> $ENCODING, then you'll fix up some results from freedb, but mangle
results from the musicbrainz database, or correctly tagged freedb entries.
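
One necessary condition for the "undo Latin-1" step can be checked directly. The following is a minimal sketch in C, not code from sound-juicer or libmusicbrainz; the function name utf8_fits_latin1 is hypothetical. Text that went through a Latin-1 => UTF-8 conversion can only contain code points up to U+00FF, so a string with real Cyrillic or Chinese characters can be recognised as genuine and left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when every code point in the
 * NUL-terminated UTF-8 string `s` lies in U+0000..U+00FF, i.e. the string
 * could be the output of a Latin-1 => UTF-8 conversion.
 * Returns false for higher code points and for malformed sequences. */
bool utf8_fits_latin1(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        if (*p < 0x80) {
            p += 1;                      /* plain ASCII */
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            p += 2;                      /* U+0080..U+00FF: C2/C3 + continuation */
        } else {
            return false;                /* above U+00FF, or not valid UTF-8 */
        }
    }
    return true;
}
```

The check cannot tell a mis-transcoded CP1251 title from a correct Western European one such as "Café": both pass, and the second is exactly the kind of result a blanket conversion would mangle.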

Comment 20 Ross Burton 2005-09-28 10:41:04 UTC

I've been thinking about this and have a plan.  SJ should validate every string
which comes from the server and verify it is valid UTF-8.  If it isn't, then try
converting it to UTF-8 from the current locale (with g_locale_to_utf8).

Comment 21 Christophe Fergeau 2005-09-28 12:09:13 UTC

Most distributions use a UTF-8 locale by default these days, so
g_locale_to_utf8 wouldn't help at all on those systems. Additionally supporting
an environment variable to specify the encoding to use would be nice. Fwiw,
GStreamer already has a GST_TAG_ENCODING environment variable, maybe it could be
reused there.

Comment 22 Ross Burton 2005-09-28 12:51:15 UTC

That is true.  I can't use GST_TAG_ENCODING as the strings are also used for
filenames and display.

Comment 23 James Henstridge 2005-09-28 14:41:17 UTC

Ross: with this bug, the musicbrainz server _is_ sending us valid UTF-8 -- it's
just the wrong characters :)

The problem is that when it is providing a result that it looked up in FreeDB
which hasn't been correctly tagged with an encoding, then it assumes that it is
latin1, and does a latin1=>utf-8 conversion.

So if FreeDB contains a record encoded in koi8r for instance, but without
specifying the encoding, sound-juicer would receive that text passed through a
latin1=>utf-8 conversion.  I don't think there is much you could actually do in
this case.
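
To illustrate why validation alone cannot catch this, here is a small sketch in C of the conversion the server is described as applying. The function name latin1_to_utf8 is made up for this example, and the code is not taken from the MusicBrainz server. Every input byte becomes exactly one code point, so the output is always well-formed UTF-8, whatever the bytes originally meant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: treat the NUL-terminated bytes in `in` as Latin-1
 * and re-encode them as UTF-8. Returns a malloc'd string the caller must
 * free, or NULL if allocation fails. */
char *latin1_to_utf8(const char *in)
{
    assert(in != NULL);

    /* Each Latin-1 byte needs at most two UTF-8 bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    if (out == NULL)
        return NULL;

    char *q = out;
    for (const unsigned char *p = (const unsigned char *)in; *p != 0; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;                      /* ASCII is unchanged */
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));      /* lead byte: C2 or C3 */
            *q++ = (char)(0x80 | (*p & 0x3F));    /* continuation byte */
        }
    }
    *q = '\0';
    return out;
}
```

Feeding it the CP1251 bytes C2 E0 (the first two letters of the artist name in comment 14) yields the UTF-8 for "Âà", matching the garbage seen in the log.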

Comment 24 Sergey V. Udaltsov 2006-09-03 16:55:01 UTC

Could SJ have some option (not necessarily in GUI, in gconf) which would allow specifying _explicitly_ the encoding conversion? People could set it temporarily for ripping problematic discs.
PS Just today I encountered the situation when someone put strings in cp866 :((

Comment 25 Jean-François Fortin Tam 2006-09-15 00:37:03 UTC

Is there no current workaround for this? I simply have a pile of CDs I borrowed from my Chinese cousin, and I cannot rip them because they will be entirely messed up (OK, just for the record, it is legal in Canada to make copies for yourself, even if you do not own the originals ;).

Could this be looked into for GNOME 2.18? Pretty please? I have been looking for a ripper that "just works" for a long time, and this is the only flaw I see in SJ. Other than that, it really is a wonderful piece of software and I don't want to dump Sound Juicer for such a silly problem.

Comment 26 Christophe Fergeau 2006-09-15 07:46:50 UTC

Adding those CDs to MusicBrainz would let you rip the CDs correctly, and would benefit other MusicBrainz users with the same CDs ;)

Comment 27 Jean-François Fortin Tam 2006-09-15 20:14:40 UTC

Well, IF I could do that, I would not be desperately hanging on to this bug report! :)

Because I don't read/write Chinese, and I have no idea what these albums are. Pathetic huh? :)

Comment 28 Ross Burton 2006-09-25 09:25:12 UTC

Jeff: load the CD in SJ, look at the garbage, press CD->Submit and follow the Musicbrainz import process for the album.  Musicbrainz, if I recall correctly, will start the import for you.

Comment 29 Jean-François Fortin Tam 2006-09-26 00:09:23 UTC

Ross: I did that, and I end up on a musicbrainz page that tells me that the TOC is not recognized; hence, I have to enter it manually into the musicbrainz database. Am I wrong?

If that is the case, the problem is as I said above: I cannot enter the disc information myself. I have no idea what this album is. I cannot read and/or write Chinese characters.

What I see: http://img223.imageshack.us/img223/2746/manualsubmittalvz3.png

Comment 30 Ka-Hing Cheung 2007-01-05 00:26:03 UTC

Ross, I just tried to import a CD to musicbrainz, it says that the DiscID is not found, so I proceeded to add the release, and chose the "FreeDB Lookup" option. But the result is the same, the imported data is garbage. I _can_ manually convert all the fields by hand, but that's a pain, and there's no option I can see that would automatically do the conversion.

On the other hand, I recommend that SJ use g_locale_to_utf8, since that function reads LC_CTYPE and is more "standard" than any other env var. People who are on a UTF-8 locale desktop can still set LC_CTYPE without affecting the display language.

Ross, if no one is working on it right now, I can even rip out a patch this weekend following your idea in comment #20. I have 8 CDs sitting on my desk right now that I want to import :-)

Comment 31 Ross Burton 2007-01-05 09:46:53 UTC

Ka-Hing Cheung: that would be great.  I'm still not convinced that would actually work, as the text has been through too many layers to be useful (see comment #21).  However, if you manage to come up with a way of detecting and fixing this automatically, I'd love a patch.

Comment 32 Ka-Hing Cheung 2007-01-06 04:23:33 UTC

Created attachment 79510
encoding patch

Here you go. This patch uses LC_CTYPE to attempt to convert data from MusicBrainz. If you want to look up a CD that's not in your current locale's encoding, you can do:

LC_CTYPE=zh_TW sound-juicer

In my case my locale is en_US.UTF-8, but I want to rip a CD that's in BIG5. If your locale is UTF-8 (such as zh_TW.UTF-8), you would still need to manually set LC_CTYPE to a non-UTF-8 locale.

Comment 33 Ka-Hing Cheung 2007-01-06 04:39:34 UTC

Forgot to say that this patch is created against 2.16.2, since I was not able to get through autogen.sh (it complains about shifting too much...). Also, when building sound-juicer with a different PREFIX, scrollkeeper still tries to write to /usr.

Building SJ actually took more time than writing the patch :-)

Comment 34 Ross Burton 2007-01-06 11:49:37 UTC

Doesn't that patch attempt to convert the Musicbrainz data from ISO-8859-1 to UTF-8, despite the fact that most of the incoming data is UTF-8 already?

As this hack is only required for data proxied from freedb, it should only run when the data is from freedb.  #353181 contains details on how to detect this.

Yes, all of sj-metadata-musicbrainz needs to be refactored. :(

Comment 35 Ka-Hing Cheung 2007-01-06 19:17:54 UTC

Actually it's converting from UTF-8 to ISO-8859-1 and then from LC_CTYPE to UTF-8. This is done because even though the incoming data is always UTF-8, all MusicBrainz does to ensure that is assume any non-UTF-8 data is ISO-8859-1 and convert it, so I am first undoing their hack.

If the data is already UTF-8, then the second step (LC_CTYPE to UTF-8) would fail, so it should not cause a regression in any way.
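
The two steps, with a fallback to the original string, might look roughly like the sketch below. It is an illustration written with POSIX iconv, not the attached patch: the helper names convert and repair_freedb_string are hypothetical, and the real charset is passed in explicitly instead of being derived from LC_CTYPE. Any failed or lossy conversion is treated as "repair not applicable".

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert NUL-terminated `in` from charset `from` to charset `to`.
 * Returns a malloc'd string, or NULL if the conversion fails or is lossy. */
static char *convert(const char *in, const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    size_t out_left = 4 * in_left;       /* generous bound for these charsets */
    char *out = malloc(out_left + 1);
    char *src = (char *)in;              /* iconv wants char **, input is not modified */
    char *dst = out;
    size_t rc = (size_t)-1;

    if (out != NULL)
        rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    /* Non-zero means an error or a non-reversible substitution. */
    if (rc != 0) {
        free(out);
        return NULL;
    }
    *dst = '\0';
    return out;
}

/* Step 1: undo the server's Latin-1 => UTF-8 conversion.
 * Step 2: reinterpret the recovered bytes as `real_charset`.
 * On any failure, return a copy of the input unchanged. Caller frees. */
char *repair_freedb_string(const char *utf8, const char *real_charset)
{
    assert(utf8 != NULL && real_charset != NULL);

    char *bytes = convert(utf8, "UTF-8", "ISO-8859-1");
    char *fixed = bytes ? convert(bytes, real_charset, "UTF-8") : NULL;
    free(bytes);
    if (fixed != NULL)
        return fixed;

    size_t n = strlen(utf8) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, utf8, n);
    return copy;
}
```

Called with the garbled "Âà" and "CP1251", this returns "Ва". Called with text that already contains Cyrillic code points, step 1 cannot map them to ISO-8859-1 and the input comes back unchanged. A correctly encoded title made only of Latin-1-range characters would still pass both steps and be altered, which is why restricting the hack to freedb-sourced data matters.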

I will create another patch that does freedb detection in a few minutes.

Comment 36 Ka-Hing Cheung 2007-01-06 19:51:49 UTC

Created attachment 79558
another patch that does freedb detection

One of the freedb CDs that I have actually returns the correct encoding. That's very weird, since it's disc 2, and disc 1 from the same album has the bogus encoding.

It would have taken less time if my CD-ROM drive weren't failing (it would not recognize any CD at all for a while...)

Comment 37 Ka-Hing Cheung 2007-04-06 06:26:04 UTC

Hi Ross, any updates/comments for this patch?

Comment 38 Ross Burton 2007-05-13 11:10:56 UTC

Applied to svn, thanks!

Comment 39 Jean-François Fortin Tam 2007-05-15 20:19:29 UTC

Created attachment 88243
screenshot

Still not fixed on my side, sadly. I tried with the 2.19.1 tarball. How can I determine if the problem is on my side, on the online DB side, or in sound-juicer?

Comment 40 Ka-Hing Cheung 2007-05-15 22:30:55 UTC

Is LC_CTYPE set to the expected encoding? Can you copy and paste some of the strings that are shown, and what you expect them to be?

Comment 41 Jean-François Fortin Tam 2007-07-16 19:00:37 UTC

Created attachment 91868
sample cd cover with tracks listing

Sorry to dig up this old bug, but Ka-Hing did not respond to my email and I let the issue lie around for months.

How do I actually know what the value of LC_CTYPE is? (I just did a ./configure && make && ./src/sound-juicer). About the disc: I scanned it and attached the picture (because I cannot read it :)

Comment 42 Ka-Hing Cheung 2007-07-17 01:56:47 UTC

Did you ever mail me? If so gmail must have eaten it, I am sorry about that.

LC_CTYPE needs to be set to the expected encoding, you can do something like:
$ LC_CTYPE=zh_TW.BIG5 sound-juicer

(or zh_CN.GB2312, that CD is published from China but the song names are in Traditional Chinese, with the publisher at the bottom in Simplified Chinese, weird!)

It would also help if you copy and paste what sound-juicer displayed if that command still doesn't work. You may need to install that locale if it's not available on your system; on Debian it would be `dpkg-reconfigure locales'.

It's funny that you decided to update this bug on my birthday :-)

Comment 43 Jean-François Fortin Tam 2007-07-19 22:58:49 UTC

Hi Ka-Hing, sorry for the late reply. Yeah I did send you 2 emails after the first exchange, I guess they were caught in your spam filter :(

I tried with LC_CTYPE, with various ways:
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_TW.BIG5
jeff@khloe:~/trunks/sound-juicer$ ./src/sound-juicer 
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_CN.GB2312
jeff@khloe:~/trunks/sound-juicer$ ./src/sound-juicer 
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_TW.BIG5 ./src/sound-juicer
(sound-juicer:24370): Gtk-WARNING **: Locale not supported by C library.
        Using the fallback 'C' locale.
(sound-juicer:24370): Gdk-WARNING **: locale not supported by C library

jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_CN.GB2312 ./src/sound-juicer
(sound-juicer:24385): Gtk-WARNING **: Locale not supported by C library.
        Using the fallback 'C' locale.
(sound-juicer:24385): Gdk-WARNING **: locale not supported by C library


In any case, no matter what I tried, the display remained exactly the same. I don't know if this is possible, but maybe I could make an ISO out of that disc and provide it to you (in private, for testing purposes) so you can analyze it? I don't know if that would help.

Comment 44 Ka-Hing Cheung 2007-07-20 01:48:40 UTC

You need to prepend "export" to your LC_CTYPE= command if you want it to stick, or you can specify LC_CTYPE=... src/sound-... on the same line, like you did at the end.

Like I suggested in the last comment, it seems like you don't have the locale you need configured. You can use my suggested command if you are on a debian based system, for other distros I am not sure what the command would be, but it would probably involve installing language support for Chinese.

Comment 45 Jean-François Fortin Tam 2007-07-20 18:37:14 UTC

Well, I already have Chinese language support installed (in Ubuntu), but I know that Chinese, Japanese, Korean and other Asian characters show up fine in Ubuntu even if you don't install the language support yourself (I have lots of filenames that use those and Unicode ID3 tags).

Just in case it can help, I'll be emailing you a link for a cd image I made with K3B from that disc. Maybe it is the culprit.

Comment 46 Ka-Hing Cheung 2007-07-20 21:48:12 UTC

Ahh, ubuntu changed the way locales are generated. You need to modify /var/lib/locales/supported.d/local, mine looks like:

$ cat /var/lib/locales/supported.d/local
en_US ISO-8859-1

en_US.UTF-8 UTF-8

zh_TW BIG5

zh_TW.UTF-8 UTF-8

zh_CN GB2312

$ 

Then, you need to run:

$ sudo dpkg-reconfigure locales

Now try again:

$ LC_CTYPE=zh_TW.BIG5 ./src/sound-juicer
$ LC_CTYPE=zh_CN.GB2312 ./src/sound-juicer

You shouldn't get error messages about locales at this point.

Come to think of it, my original idea of using LC_CTYPE, while correct, seems to be a bit inconvenient, as distributions omit non-UTF-8 locales and give no user-visible way to install them.

Comment 47 Jean-François Fortin Tam 2007-08-31 19:46:34 UTC

Ka-Hing, sorry for letting this issue on the backburner for so long again.
I just followed your instructions and... they work!

After modifying /var/lib/locales/supported.d/local and generating locales,
using either "LC_CTYPE=zh_TW.BIG5 sound-juicer" or "LC_CTYPE=zh_CN.GB2312 sound-juicer" works, but without that, it doesn't work.

Now that I have proof that it can work, my question is, how can this be fixed without the users having to figure out things like that by themselves?

Comment 48 Ka-Hing Cheung 2007-09-01 01:20:01 UTC

There are a couple of ways to go about this:

1) make distros install the non-UTF8 locale as well when you install the language support
2) make GTK not complain and unset LC_CTYPE when it sees one that it doesn't recognize
3) make sound-juicer use another variable
4) make a nice encoding chooser menu like what you would see in a browser

I don't particularly like 3, because that introduces another non-standard variable. I don't particularly like 4 either, because that's a slippery slope to having an encoding menu in, oh, just about every single application.

Comment 49 Jean-François Fortin Tam 2007-09-01 02:42:10 UTC

Option 2 sounds good to me, as it would fix the problem instead of having lots of distributions repeat the error and having people file weird bugs on sound juicer, is that correct?

I guess a bug needs to be filed against gtk about this, but I think I am really not up to it (because I don't have enough technical knowledge). Unless a bug already exists for it?

Comment 50 Ka-Hing Cheung 2007-09-01 04:48:16 UTC

I don't know if GTK's behavior is a _bug_ or not. Anyway, it's up to Ross to decide.