GNOME Bugzilla – Bug 169591
Non-UTF-8 encodings support
Last modified: 2007-09-01 04:48:16 UTC
musicbrainz and freedb/cddb send information about Cyrillic CDs in "CP1251" encoding, not in UTF-8. That breaks all the results of ripping: tags and filenames. Could you add a feature (an environment variable or similar) that sets a custom encoding for the musicbrainz database and makes sound-juicer convert all incoming text strings from musicbrainz from "CP1251" (or SJ_CDDB_BROKEN_ENCODING) to UTF-8 before they are used for writing tags, filenames, etc.?
(note: MusicBrainz and FreeDB/CDDB have nothing to do with each other) Sound-Juicer makes this call when creating the MusicBrainz object: mb_UseUTF8 (self->priv->mb, TRUE); So all strings from MusicBrainz should be in UTF-8. If this is not the case, this is a bug in libmusicbrainz. Please run SJ like this: MUSICBRAINZ_DEBUG=1 sound-juicer >log Then access a CD with Cyrillic song names, close SJ, and attach the contents of "log" to this bug.
Created attachment 38415 [details] MUSICBRAINZ_DEBUG=1 sound-juicer >log MUSICBRAINZ_DEBUG=1 sound-juicer >log with CD that has cyrillic titles.
To make this log readable, i.e. correct UTF-8, I have to do the following:
$ iconv -f UTF8 -t LATIN1 sj.log > sj.latin1.log
$ iconv -f CP1251 -t UTF8 sj.latin1.log > sj.utf8.log
Now all the Cyrillic characters are correct. /vlad
Looks like musicbrainz is proxying the data from freedb, but getting the encoding wrong. The problem would either be in what freedb is sending, or how the musicbrainz server is interpreting it. The best fix would be to get the problem albums imported into the musicbrainz database, with the disc ID associated correctly.
I checked freedb; it sends the same broken encoding. I know UTF8 will solve all these problems some day, but today non-Latin discs need this patch or some workaround for the problem. Freedb simply converts from a 1-byte encoding to UTF8 without knowing what the original encoding was.
I would like to confirm the bug for ISO8859-7 (greek) encoded tags and support the request of vladislav for a fix. I would also like to report that my combination of sound-juicer (0.5.10.1) and rhythmbox (0.8.3) do not handle UTF8 correctly: In sound-juicer i write: Αργύπνια (insomnia) and in rhythmbox i get: ÎÏγÏÏνια I know that my program versions are not the latest, so I would just like to ask for closer coordination of sound-juicer, rhythmbox and tagtool (have you heard this ?) with regard to tags. thanks
I am also just trying to say that "snow is white". We only need a few lines that convert the incoming strings from musicbrainz, as in this example:
--
$ iconv -f UTF8 -t LATIN1 sj.log > sj.latin1.log
$ iconv -f <WHAT_EVER_ORIGINAL_1_BYTE_ENCODING> -t UTF8 sj.latin1.log > sj.utf8.log
--
Applications sending wrongly encoded entries to freedb should also be kicked with a clue stick (or rather, their users should notify the app writer that there's a bug in their app), I spent 5 minutes reading the FAQ on freedb, and it seems pretty obvious from reading it that data sent to freedb should either be ISO8859-1 or UTF-8...
I know this is not an SJ bug; it is a Windows-world legacy. Sound Juicer is trying to be #1 in Gnome, but for non-LATIN1 people SJ is useless with this problem. OK, don't fix SJ, it's up to you anyway, but please add some comments in the C code showing where we could insert iconv code to fix the encoding problem.
Don't get me wrong, I don't have *any* power to decide what sound-juicer will do and won't do, that's up to Ross. I was just commenting about that issue, because such bugs are always reported against the app *reading* tags (it happened with rhythmbox/gstreamer too for example), and the only way to fix that in tag consumer is to add ugly work around to support broken (as in 'programs that don't follow the spec') tag writers. So I was pointing out that even if work arounds on the tag consumer side may make sense, what *really* must be fixed is the app writing the wrong tags. So I wanted to make sure people are aware they shouldn't only report bugs against the tag reading apps, but also (and it's more important imo) against the app writing the tags.
Ok, I understand. Starting pinging musicbrainz. Seems like the problem is in their conversion routines.
I would like to add that all of the following assume that the tags are in either ISO88591 or UTF8:
libmusicbrainz (used by sj)
libid3tag (used by gstreamer)
id3lib (used by tagtool)
vladislav: can you send the url for whatever you submitted at libmusicbrainz?
from http://www.musicbrainz.org/faq.html
Q: Can MusicBrainz handle international metadata?
A: Yes. MusicBrainz uses UTF-8 for all its data, which means that all the data is stored in UNICODE and supports lots of different languages.
Well, in their Web search the CD title looks OK (UTF8), but libmusicbrainz returns broken UTF8.
===
so I sent them:
Hi, I read in your ToDo list "Test UNICODE conversion support". I would like to be a volunteer for the task. Currently I have a huge problem with Sound Juicer, which fetches CD info from musicbrainz.
-- fetch from musicbrainz result:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc = "http://purl.org/dc/elements/1.1/"
         xmlns:mq = "http://musicbrainz.org/mm/mq-1.1#"
         xmlns:mm = "http://musicbrainz.org/mm/mm-2.1#"
         xmlns:az = "http://www.amazon.com/gp/aws/landing.html#">
<mq:Result>
  <mq:status>OK</mq:status>
  <mm:albumList>
    <rdf:Bag>
      <rdf:li rdf:resource="freedb:genid1"/>
    </rdf:Bag>
  </mm:albumList>
</mq:Result>
<mm:Artist rdf:about="freedb:genid2">
  <dc:title>Âàëåðèÿ</dc:title>
  ^^^^^^^^^^^^^^^^^ wrong
--
It claims to be UTF8, but it's not actually UTF8!! It is broken LATIN1, which can be fixed only this way:
$ iconv -f UTF8 -t LATIN1 mb.log > mb.latin1.log
$ iconv -f <WHAT_EVER_ORIGINAL_1_BYTE_ENCODING> -t UTF8 mb.latin1.log > mb.utf8.log
// my Skills: perl, mod_perl, apache, sql, Mason, Linux
// very basic C++, basic Java.
to helpwanted@musicbrainz.org
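The garbled artist name in the RDF quoted above can be decoded by reversing the two steps by hand. Below is a minimal Python sketch of the same round trip as the iconv commands, assuming the submitter's original bytes were CP1251; the function name `undo_latin1_mojibake` is made up for illustration and is not part of Sound Juicer or libmusicbrainz.

```python
def undo_latin1_mojibake(text: str, original_encoding: str) -> str:
    """Reverse a wrong Latin-1 -> UTF-8 transcoding.

    Step 1 re-encodes the text as Latin-1 to recover the raw bytes;
    step 2 decodes those bytes with the encoding they were really in.
    original_encoding is an assumption the caller has to supply.
    """
    raw_bytes = text.encode("latin-1")
    return raw_bytes.decode(original_encoding)


if __name__ == "__main__":
    # The <dc:title> value from the MusicBrainz response quoted above.
    print(undo_latin1_mojibake("Âàëåðèÿ", "cp1251"))
```

With CP1251 as the assumed encoding, the title comes out as a readable Cyrillic name. With the wrong guess it just turns into different garbage, which is why the encoding has to come from the user.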
Created attachment 38859 [details] [review] patch to support 8bit encodings other than ISO88591
Hi, I made a quick patch for sound-juicer-0.5.15. First back up the original file:
cp sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c.bak
and then apply the patch. It uses the environment variable SJ_ENC, e.g. for Greek:
setenv SJ_ENC ISO88597
Comment on attachment 38859 [details] [review] patch to support 8bit encodings other than ISO88591
>--- sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c.bak 2005-03-17 17:18:29.743689072 +0200
>+++ sound-juicer-0.5.15/src/sj-metadata-musicbrainz.c 2005-03-17 20:28:44.966311456 +0200
>@@ -38,6 +38,13 @@
> #include "sj-genres.h"
> #include "cd-drive.h"
>
>+/** Use hack for titles encoded as ISO8859? */
>+#define ENC_LOCAL
>+
>+#ifdef ENC_LOCAL
>+#include <iconv.h>
>+#endif
>+
> struct SjMetadataMusicbrainzPrivate {
>   GError *construct_error;
>   musicbrainz_t mb;
>@@ -283,6 +290,54 @@
>   mb_SetProxy (priv->mb, priv->http_proxy, priv->http_proxy_port);
> }
>
>+#ifdef ENC_LOCAL
>+static char *enc;
>+static iconv_t iconv_la;
>+static iconv_t iconv_gr;
>+
>+static void
>+mb_iconv_open()
>+{
>+  enc = getenv("SJ_ENC");
>+  if (enc == NULL) {
>+    enc = strdup("ISO88591");
>+  }
>+  iconv_la = iconv_open("ISO88591", "UTF8");
>+  iconv_gr = iconv_open("UTF8", enc);
>+}
>+
>+static void
>+mb_iconv_close()
>+{
>+  iconv_close(iconv_la);
>+  iconv_close(iconv_gr);
>+}
>+
>+static void
>+mb_iconv_convert(char * data) {
>+  /** Same as in lookup_cd */
>+#define MB_BUFFER_SIZE 256
>+
>+  size_t bytes_in;
>+  size_t bytes_out;
>+  char *ptr_in;
>+  char *ptr_out;
>+  char data_iconv[MB_BUFFER_SIZE];
>+
>+  bytes_in = MB_BUFFER_SIZE;
>+  bytes_out = MB_BUFFER_SIZE;
>+  ptr_in = data;
>+  ptr_out = data_iconv;
>+  iconv(iconv_la, &ptr_in, &bytes_in, &ptr_out, &bytes_out);
>+
>+  bytes_in = MB_BUFFER_SIZE;
>+  bytes_out = MB_BUFFER_SIZE;
>+  ptr_in = data_iconv;
>+  ptr_out = data;
>+  iconv(iconv_gr, &ptr_in, &bytes_in, &ptr_out, &bytes_out);
>+}
>+#endif
>+
> static gpointer
> lookup_cd (SjMetadata *metadata)
> {
>@@ -338,6 +393,10 @@
>     return priv->albums;
>   }
>
>+#ifdef ENC_LOCAL
>+  mb_iconv_open();
>+#endif
>+
>   for (i = 1; i <= num_albums; i++) {
>     int num_tracks;
>     AlbumDetails *album;
>@@ -346,6 +405,9 @@
>     album = g_new0 (AlbumDetails, 1);
>
>     if (mb_GetResultData(priv->mb, MBE_AlbumGetAlbumName, data, MB_BUFFER_SIZE)) {
>+#ifdef ENC_LOCAL
>+      mb_iconv_convert(data);
>+#endif
>       album->title = g_strdup (data);
>     } else {
>       album->title = g_strdup (_("Unknown Title"));
>@@ -358,6 +420,9 @@
>       album->artist = g_strdup (_("Various"));
>     } else {
>       if (data && mb_GetResultData1(priv->mb, MBE_AlbumGetArtistName, data, MB_BUFFER_SIZE, 1)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         album->artist = g_strdup (data);
>       } else {
>         album->artist = g_strdup (_("Unknown Artist"));
>@@ -384,10 +449,16 @@
>       track->number = j; /* replace with number lookup? */
>
>       if (mb_GetResultData1(priv->mb, MBE_AlbumGetTrackName, data, MB_BUFFER_SIZE, j)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         track->title = g_strdup (data);
>       }
>
>       if (mb_GetResultData1(priv->mb, MBE_AlbumGetArtistName, data, MB_BUFFER_SIZE, j)) {
>+#ifdef ENC_LOCAL
>+        mb_iconv_convert(data);
>+#endif
>         track->artist = g_strdup (data);
>       }
>
>@@ -404,6 +475,10 @@
>     albums = g_list_append (albums, album);
>   }
>
>+#ifdef ENC_LOCAL
>+  mb_iconv_close();
>+#endif
>+
>   /* For each album, we need to insert the duration data if necessary
>    * We need to query this here because otherwise we would flush the
>    * data queried from the server */
Thanks for the patch! I will check it tomorrow. I also found one more way to fix the problem! I registered at www.musicbrainz.org and imported the CD info from freedb.org with the proper encoding through http://www.musicbrainz.org/freedb/freedb.html and it worked for me!
Vladislav/George: if the musicbrainz server is sending badly transcoded data back, then the musicbrainz server is what needs to be fixed. When you do a musicbrainz CD lookup, the server does the following:
1. check if the CD disc-id exists in the musicbrainz database, and return that info if so.
2. perform a lookup in FreeDB, and return that data to the client if there are any results.
The data in the musicbrainz database is sent as valid UTF-8. If the data from FreeDB is correctly tagged with the encoding, then it should also come through okay. However, if the FreeDB data is not tagged with an encoding, then it will be transcoded from Latin-1 to UTF-8. If you go through transcoding all the results from the musicbrainz query UTF-8 -> Latin-1 -> $ENCODING, then you'll fix up some results from freedb, but mangle results from the musicbrainz database, or correctly tagged freedb entries.
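The mangling risk described above can be shown in a short Python sketch. It is an illustration only: `blanket_transcode` is a hypothetical helper that applies the UTF-8 -> Latin-1 -> $ENCODING conversion to every string without checking where it came from, and CP1251 is just an assumed target encoding.

```python
def blanket_transcode(text: str, assumed_encoding: str) -> str:
    """Unconditionally apply UTF-8 -> Latin-1 -> assumed_encoding."""
    return text.encode("latin-1").decode(assumed_encoding)


# An untagged FreeDB entry that was really CP1251 gets repaired...
broken = "Валерия".encode("cp1251").decode("latin-1")
repaired = blanket_transcode(broken, "cp1251")

# ...but a correct title within the Latin-1 range ("Café") is mangled:
# the e-acute byte 0xE9 is reinterpreted as a Cyrillic letter.
mangled = blanket_transcode("Caf\u00e9", "cp1251")

# A correct title outside Latin-1 cannot even complete the first step.
try:
    blanket_transcode("Валерия", "cp1251")
    step_one_failed = False
except UnicodeEncodeError:
    step_one_failed = True

print(repaired, mangled, step_one_failed)
```

So a blanket conversion is only safe if the client can tell which results were proxied from untagged FreeDB entries.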
I've been thinking about this and have a plan. SJ should validate every string which comes from the server and verify it is valid UTF-8. If it isn't, then try converting it to UTF-8 from the current locale (with g_locale_to_utf8).
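The plan above can be sketched roughly as follows in Python. This is not Sound Juicer's code: `decode_server_string` and its `fallback_encoding` parameter are hypothetical, and `locale.getpreferredencoding` is used as a rough stand-in for what g_locale_to_utf8 consults.

```python
import locale
from typing import Optional


def decode_server_string(raw: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Return raw as text: UTF-8 if it validates, otherwise the fallback encoding."""
    try:
        # Strict decoding doubles as UTF-8 validation.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if fallback_encoding is None:
        # Assumption: use the charset of the current locale.
        fallback_encoding = locale.getpreferredencoding(False)
    # errors="replace" keeps this sketch from raising on undecodable bytes.
    return raw.decode(fallback_encoding, errors="replace")


if __name__ == "__main__":
    # ISO-8859-7 bytes for three Greek letters; not valid UTF-8.
    print(decode_server_string(b"\xc1\xf1\xe3", "iso8859-7"))
```

The fallback only triggers when the bytes are not valid UTF-8, so it helps with raw legacy bytes but not with text that has already been transcoded on the server side.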
Most distributions use a UTF-8 locale by default these days, so g_locale_to_utf8 wouldn't help at all on those systems. Additionally supporting an environment variable to specify the encoding to use would be nice. Fwiw, GStreamer already has a GST_TAG_ENCODING environment variable; maybe it could be reused there.
That is true. I can't use GST_TAG_ENCODING as the strings are also used for filenames and display.
Ross: with this bug, the musicbrainz server _is_ sending us valid UTF-8 -- it's just the wrong characters :) The problem is that when it is providing a result that it looked up in FreeDB which hasn't been correctly tagged with an encoding, then it assumes that it is latin1, and does a latin1=>utf-8 conversion. So if FreeDB contains a record encoded in koi8r for instance, but without specifying the encoding, sound-juicer would receive that text passed through a latin1=>utf-8 conversion. I don't think there is much you could actually do in this case.
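A small Python simulation of the path described above shows why a UTF-8 validity check cannot catch this case. The function names are made up for illustration, and the server behaviour is modelled only as far as this comment describes it.

```python
def simulate_untagged_freedb_proxy(title: str, real_encoding: str) -> bytes:
    """Mimic the described server path for an untagged FreeDB entry."""
    stored_bytes = title.encode(real_encoding)     # what the submitter uploaded
    assumed_text = stored_bytes.decode("latin-1")  # server assumes Latin-1
    return assumed_text.encode("utf-8")            # and sends it out as UTF-8


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


wire_bytes = simulate_untagged_freedb_proxy("Привет", "koi8-r")
looks_valid = is_valid_utf8(wire_bytes)  # validation passes
received = wire_bytes.decode("utf-8")    # yet the text is not the original title

print(looks_valid, received == "Привет")
```

The bytes on the wire are well-formed UTF-8, so the only way back to the original title is to know (or guess) the encoding the submitter used.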
Could SJ have some option (not necessarily in GUI, in gconf) which would allow specifying _explicitly_ the encoding conversion? People could set it temporarily for ripping problematic discs. PS Just today I encountered the situation when someone put strings in cp866 :((
Is there no current workaround to this? I simply have a pile of CDs I borrowed from my chinese cousin, and I cannot rip them because they will be entirely messed up (ok just for the record, it is legal in Canada to make copies for yourself, even if you do not own the originals ;). Could this be looked into for gnome 2.18? Pretty please? I have been looking for a ripper that "just works" for a long time, and this is the only flaw I see in SJ. Other than that, it really is a wonderful piece of software and I don't want to dump soundjuicer for such a silly problem.
Adding those CDs to musicbrainz would let you rip correctly the CDs, and would benefit other users of musicbrainz with the same CDs ;)
Well IF I could do that, I would not be desperately hanging to this bug report! :) Because I don't read/write Chinese, and I have no idea what these albums are. Pathetic huh? :)
Jeff: load the CD in SJ, look at the garbage, press CD->Submit and follow the Musicbrainz import process for the album. Musicbrainz, if I recall correctly, will start the import for you.
Ross: I did that, and I end up on a musicbrainz page that tells me that the TOC is not recognized; hence, I have to enter it manually into the musicbrainz database. Am I wrong? If that is the case, the problem is as I said above: I cannot enter the disc information myself. I have no idea what this album is. I cannot read and/or write Chinese characters. What I see: http://img223.imageshack.us/img223/2746/manualsubmittalvz3.png
Ross, I just tried to import a CD to musicbrainz. It says that the DiscID is not found, so I proceeded to add the release, and chose the "FreeDB Lookup" option. But the result is the same: the imported data is garbage. I _can_ manually convert all the fields by hand, but that's a pain, and there's no option I can see that would automatically do the conversion. On the other hand, I recommend that SJ use g_locale_to_utf8, since that function reads LC_CTYPE and is more "standard" than any other env var. People who are on a UTF8 locale desktop can still set LC_CTYPE without affecting the display language. Ross, if no one is working on it right now, I can even rip out a patch this weekend following your idea in comment #20. I have 8 CDs sitting on my desk right now that I want to import :-)
Ka-Hing Cheung: that would be great. I'm still not convinced that would actually work, as the text has been through too many layers to be useful (see comment #21). However, if you manage to come up with a way of detecting and fixing this automatically, I'd love a patch.
Created attachment 79510 [details] [review] encoding patch
Here you go, this patch uses LC_CTYPE to attempt to convert data from musicbrainz. If you want to look up a CD that's not in your current locale, you can do:
LC_CTYPE=zh_TW sound-juicer
In my case my locale is en_US.UTF-8, but I want to rip a CD that's in BIG5. If your locale is UTF8, you would still need to manually set LC_CTYPE to a non-UTF8 locale (such as zh_TW.BIG5).
Forgot to say that this patch is created against 2.16.2, since I was not able to get through autogen.sh (it complains about shifting too much...). Also, when building sound-juicer with a different PREFIX, scrollkeeper still tries to write to /usr. Building SJ actually took more time than writing the patch :-)
Doesn't that patch attempt to convert the Musicbrainz data from ISO-8859-1 to UTF-8, despite the fact that most of the incoming data is UTF-8 already? As this hack is only required for data proxied from freedb, it should only run when the data is from freedb. #353181 contains details on how to detect this. Yes, all of sj-metadata-musicbrainz needs to be refactored. :(
Actually it's converting from UTF-8 to ISO-8859-1 and then from LC_CTYPE to UTF-8. This is done because even though the incoming data is UTF-8 only, all musicbrainz does to ensure that is by assuming any non-UTF8 data to be ISO-8859-1, so I am first undoing their hack. If the data is already UTF-8, then the second step (LC_CTYPE to UTF-8) would fail, so it should not cause a regression in any way. I will create another patch that does freedb detection in a few minutes.
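The two-step conversion with its "fail means leave it alone" guard can be sketched in Python as below. This is an illustration of the idea, not the attached patch: `undo_server_latin1_guess` is a hypothetical name, and the encoding is passed in explicitly instead of being read from LC_CTYPE.

```python
def undo_server_latin1_guess(text: str, locale_encoding: str) -> str:
    """Try UTF-8 -> ISO-8859-1 -> locale_encoding; keep text if either step fails."""
    try:
        raw = text.encode("iso-8859-1")     # undo the server's Latin-1 assumption
        return raw.decode(locale_encoding)  # reinterpret in the user's encoding
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text                         # already fine, or not convertible


# A Big5 title that the server treated as Latin-1 is recovered.
garbled = "中文".encode("big5").decode("latin-1")
print(undo_server_latin1_guess(garbled, "big5"))

# Proper UTF-8 Chinese text fails step 1 and passes through unchanged.
print(undo_server_latin1_guess("中文", "big5"))

# A correct Latin-1-range title fails step 2 here (a lone Big5 lead byte).
print(undo_server_latin1_guess("Caf\u00e9", "big5"))
```

One caveat worth keeping in mind: with a multi-byte encoding such as Big5 the second step often fails on text that was correct to begin with, but single-byte encodings accept almost any byte sequence, so the guard is much weaker there and detecting freedb-sourced data remains useful.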
Created attachment 79558 [details] [review] another patch that does freedb detection
One of the freedb CDs that I have actually returns the correct encoding, which is very weird since it's disc 2, and disc 1 from the same album has the bogus encoding. It would have taken less time if my CD-ROM drive weren't failing (it would not recognize any CD at all for a while...)
Hi Ross, any updates/comments for this patch?
Applied to svn, thanks!
Created attachment 88243 [details] screenshot still not fixed on my side, sadly. Tried with the 2.19.1 tarball. How can I determine if the problem is on my side, on the online DB side, or sound-juicer?
Is LC_CTYPE set to the expected encoding? Can you copy and paste some of the strings that are shown, and what you expect them to be?
Created attachment 91868 [details] sample cd cover with tracks listing Sorry to dig up this old bug, but Ka-Hing did not respond to my email and I let the issue lie around for months. how do I actually know what the value of LC_CTYPE is? (I just did a ./configure && make && ./src/sound-juicer). About the disk: I scanned it and attached the picture (because I cannot read :)
Did you ever mail me? If so gmail must have eaten it, I am sorry about that. LC_CTYPE needs to be set to the expected encoding, you can do something like: $ LC_CTYPE=zh_TW.BIG5 sound-juicer (or zh_CN.GB2312, that CD is published from China but the song names are in Traditional Chinese, with the publisher at the bottom in Simplified Chinese, weird!) It would also help if you copy and paste what sound-juicer displayed if that command still doesn't work. You may need to install that locale if it's not available on your system, on debian it would be `dpkg-reconfigure locales'. It's funny that you decided to update this bug on my birthday :-)
Hi Ka-Hing, sorry for the late reply. Yeah I did send you 2 emails after the first exchange, I guess they were caught in your spam filter :( I tried with LC_CTYPE, in various ways:
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_TW.BIG5
jeff@khloe:~/trunks/sound-juicer$ ./src/sound-juicer
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_CN.GB2312
jeff@khloe:~/trunks/sound-juicer$ ./src/sound-juicer
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_TW.BIG5 ./src/sound-juicer
(sound-juicer:24370): Gtk-WARNING **: Locale not supported by C library. Using the fallback 'C' locale.
(sound-juicer:24370): Gdk-WARNING **: locale not supported by C library
jeff@khloe:~/trunks/sound-juicer$ LC_CTYPE=zh_CN.GB2312 ./src/sound-juicer
(sound-juicer:24385): Gtk-WARNING **: Locale not supported by C library. Using the fallback 'C' locale.
(sound-juicer:24385): Gdk-WARNING **: locale not supported by C library
In any case, no matter what I tried, the display remained exactly the same. I don't know if this is possible, but maybe I could make an ISO out of that disc and provide it to you (in private, for testing purposes) so you can analyze it? I don't know if that would help.
You need to prepend "export" to your LC_CTYPE= command if you want it to stick, or you can specify LC_CTYPE=... src/sound-... on the same line, like you did at the end. Like I suggested in the last comment, it seems like you don't have the locale you need configured. You can use my suggested command if you are on a debian based system, for other distros I am not sure what the command would be, but it would probably involve installing language support for Chinese.
Well I already have chinese language support installed (in ubuntu), but I know that chinese, japanese, korean & other asian characters show up fine in ubuntu even if you don't install the language support yourself (I have lots of filenames that use those and unicode id3 tags). Just in case it can help, I'll be emailing you a link for a cd image I made with K3B from that disc. Maybe it is the culprit.
Ahh, ubuntu changed the way locales are generated. You need to modify /var/lib/locales/supported.d/local, mine looks like:
$ cat /var/lib/locales/supported.d/local
en_US ISO-8859-1
en_US.UTF-8 UTF-8
zh_TW BIG5
zh_TW.UTF-8 UTF-8
zh_CN GB2312
$
Then, you need to run:
$ sudo dpkg-reconfigure locales
Now try again:
$ LC_CTYPE=zh_TW.BIG5 ./src/sound-juicer
$ LC_CTYPE=zh_CN.GB2312 ./src/sound-juicer
You shouldn't get error messages about locales at this point. Come to think of it, my original idea of using LC_CTYPE, while correct, seems to be a bit inconvenient, as distributions omit to install non-UTF8 locales and give no user-visible way to install them.
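To check from a script whether a given locale has actually been generated, something like the following Python sketch may help. `locale_is_available` is a hypothetical helper; it simply asks the C library whether it accepts the name for LC_CTYPE, which is the same check that produces the "Locale not supported by C library" warning.

```python
import locale


def locale_is_available(name: str) -> bool:
    """Return True if the C library accepts name for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the previous setting


if __name__ == "__main__":
    for candidate in ("zh_TW.BIG5", "zh_CN.GB2312"):
        print(candidate, locale_is_available(candidate))
```

If either line prints False, the locale still needs to be generated before LC_CTYPE=... sound-juicer can have any effect.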
Ka-Hing, sorry for leaving this issue on the backburner for so long again. I just followed your instructions and... they work! After modifying /var/lib/locales/supported.d/local and generating locales, using either "LC_CTYPE=zh_TW.BIG5 sound-juicer" or "LC_CTYPE=zh_CN.GB2312 sound-juicer" works, but without that, it doesn't work. Now that I have proof that it can work, my question is, how can this be fixed without the users having to figure out things like that by themselves?
There are a couple of ways to go about this:
1) make distros install the non-UTF8 locale as well when you install the language support
2) make GTK not complain and unset LC_CTYPE when it sees one that it doesn't recognize
3) make sound-juicer use another variable
4) make a nice encoding chooser menu like what you would see in a browser
I don't particularly like 3, because that introduces another non-standard variable. I don't particularly like 4 either, because that's a slippery slope towards having an encoding menu in, oh, just about every single application.
Option 2 sounds good to me, as it would fix the problem instead of having lots of distributions repeat the error and having people file weird bugs on sound juicer, is that correct? I guess a bug needs to be filed against gtk about this, but I think I am really not up to it (because I don't have enough technical knowledge). Unless a bug already exists for it?
I don't know if GTK's behavior is a _bug_ or not. Anyway, it's up to Ross to decide.