GNOME Bugzilla – Bug 97856
need to support more legacy charsets
Last modified: 2005-08-31 15:43:38 UTC
Many ID3 tags still use a local charset. (Ogg uses UTF-8, so those tags are displayed properly.) Some workaround is needed. See http://bugzilla.gnome.org/show_bug.cgi?id=80037
I built and installed the current CVS tree, but MP3 ID3 tags are still displayed broken (both local-charset and UTF-8 tags).
Created attachment 18887: the first row is an EUC-KR ID3 tag and the second row is a UTF-8 ID3 tag; both are ID3v1.
This is fairly difficult to fix. It's best to convert all your tags to id3v2, which supports Unicode (UTF-16) text, iirc. We'll try to come up with some workaround eventually though.
Hmm. I heard id3_ucs4_utf8duplicate has some problems. liteamp 0.2.3.2 (http://kldp.net/project/showfiles.php?group_id=109, a small mp3/ogg player for GNOME 2) first duplicates the ID3 tag with id3_ucs4_latin1duplicate(), then tries g_utf8_validate(latin1), g_locale_convert(latin1), or g_convert(latin1) with a codeset defined by the user.
liteamp project page: http://kldp.net/projects/liteamp (Korean page)
Download URL: http://download.kldp.net/liteamp/liteamp-0.2.3.2.tar.gz
A Gentoo Portage package and a FreeBSD port are available. ;)
The latest Rhythmbox CVS now attempts to convert from the user's locale encoding if the tag appears to be invalid.
Created attachment 19382: local id3 charset list from gconf
But g_locale_to_utf8 has trouble on a system with a UTF-8 locale. There, nl_langinfo(CODESET) always returns 'UTF-8', so the conversion of an ID3 tag written in a local charset comes out broken. So I made a workaround: with this patch, ID3 tags are converted using a user-defined charset stored in a GConf key.
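To illustrate the point about UTF-8 locales, here is a minimal C sketch, assuming a POSIX system that provides nl_langinfo(). The helper names codeset_is_utf8, locale_codeset and locale_fallback_is_useless are hypothetical and do not come from Rhythmbox or GLib. The sketch only reports whether the locale codeset is UTF-8, in which case a locale => UTF-8 conversion cannot repair a tag written in a legacy charset.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: returns 1 if a codeset name means UTF-8.
 * Case and hyphens are ignored, so "UTF-8" and "utf8" both match. */
int codeset_is_utf8(const char *name)
{
    const char *want = "utf8";
    assert(name != NULL);
    for (; *name != '\0'; name++) {
        if (*name == '-')
            continue;                          /* skip hyphens */
        if (*want == '\0' || tolower((unsigned char)*name) != *want)
            return 0;                          /* extra or different character */
        want++;
    }
    return *want == '\0';                      /* matched all of "utf8" */
}

/* Codeset of the user's locale, as reported by POSIX nl_langinfo(). */
const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");                   /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}

/* Returns 1 when converting "from the locale" would be a no-op for
 * legacy tags, because the locale itself is already UTF-8. */
int locale_fallback_is_useless(void)
{
    return codeset_is_utf8(locale_codeset());
}
```

When locale_fallback_is_useless() returns 1, some other source for the legacy charset is needed, which is the role the user-defined GConf key plays in the patch described above.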
Ug. This patch kind of breaks the internal Rhythmbox dependencies, where monkey-media is at the bottom. I think the "right" way to fix this is to use some language/encoding autodetection framework, and possibly prompt the user. Of course the *really* right way is to take the people who write programs which put the locale charset in mp3s and repeatedly hit them over the head with the Unicode 4.0 code charts until they fix their programs.
*** Bug 120939 has been marked as a duplicate of this bug. ***
I think that prompting the user for the correct encoding of his tags is probably the easiest, yet most effective, way to solve the issue. As for the suggestion to "take the people who write programs which put the locale charset in mp3s and repeatedly hit them over the head with the Unicode 4.0 code charts", I would say it is not the best solution. Many international users do not use Unicode (UTF in particular) because quite a few apps still don't handle it properly.
Please fix this bug quickly. I'm rebuilding rhythmbox every time with Cha's patch. This bug is fairly important for people who use non-Latin scripts. We can't see the title, album or artist of any ID3v1-tagged song.
First of all, rhythmbox doesn't depend on eel, so you'll have to replace the eel convenience functions.
This change:
- utf8 = id3_ucs4_utf8duplicate (ucs4);
+ utf8 = id3_ucs4_latin1duplicate (ucs4);
breaks properly tagged files in UTF-8. Couldn't the legacy codeset be detected depending on the locale?
Could you also provide:
- the locale used
- the legacy codeset used
- an example file reproducing the problem
A screenshot showing both the broken and right displays would be appreciated.
I just stole the code for legacy charset support from libgnome-desktop, hacked it up a bit, and used it for monkey-media. It's committed to the latest arch/cvs. Let me know if it works for you!
Is it possible to allow the user to select the encoding to be used for tags? E.g., for Russian there are at least 3 commonly used encodings (windows-1251, koi8-r, utf8), and most mp3s around use windows-1251 for tags, while I use utf8 as the locale encoding on my machine. Of course, I can convert tags, but if I just want to take a quick look at a song I found, it'd be so much easier to just select the encoding in a preferences dialog than to fire up a separate tag conversion program.
Why not let the user select the proper Encodings in Preferences and then use UTF-8 as fallback?
Alexander: My patch could be extended to have a GList of legacy charsets associated with each locale. I don't personally have enough expertise about the character sets involved and which would be best to try first to do the modifications myself. But the most recent changes should fix a majority of cases. This is an issue that can never be fixed entirely. If you don't know what charset some data is in, you can never be sure what you've retrieved from it is valid, even if it doesn't violate the character set definition. The real solution here is to use ID3V2 which specifies the charset. I'll leave this bug open for now as a reminder.
There is some trouble in the ID3v1 legacy charset support. First, id3_ucs4_utf8duplicate cannot be used as a validity check (as I said before): it always returns a valid UTF-8 string, which is not what we want, because it builds the UTF-8 string one byte at a time, even if the original tag is not valid UTF-8. Second, in get_encoding_from_locale the hash table is set to NULL, so no encoding can be found in it. Third, when getting the encoding from the locale, the "UTF-8" locale must be skipped, because the string has already been checked for UTF-8 validity in MP3_stream_info_impl_id3_tag_get_utf8. I'll submit a patch.
Created attachment 21757: legacy id3v1 support fix.
Created attachment 21762: some cleanup.
Ah, I was mistaken. This problem happens only when id3_field_textencoding == ID3_FIELD_TEXTENCODING_ISO_8859_1 (because many ID3 editors always set the text encoding field to ISO-8859-1). So I think the problem could be solved with:
    if (id3_field_textencoding == ID3_FIELD_TEXTENCODING_ISO_8859_1 || id3_field_textencoding == NULL)
        call id3_ucs4_latin1duplicate
    else
        call id3_ucs4_utf8duplicate
But there is no id3_field_gettextencoding function. How can we implement that?
I made a new patch as a workaround. It first tries converting with the legacy charset, and finally falls back to id3_ucs4_utf8duplicate. It works for most MP3 files (I tested files tagged with ID3v2, iTunes, Winamp, etc.).
Created attachment 21787: new workaround for legacy id3 tag.
For anyone else following this bug, here's an abbreviated IRC log:
<walters> ganadist: * commited walters@rhythmbox.org--2003b/rhythmbox--mainline--0.6--patch-343
<walters> ganadist: that has some of those fixes, but not the latin1 thing yet.
<ganadist> thanks :)
<ganadist> i tryed get textencoding information from id3taglib, but i can't.
<walters> ganadist: ok.
<walters> ganadist: i think what's happening is that libid3 thinks the file is in latin1
<walters> ganadist: so it converts it to utf8 for us
<walters> ganadist: which succeeds, since you can interpret EUC-KR as latin1.
<walters> ganadist: but obviously that's the wrong thing
<ganadist> hmm.
<walters> ganadist: i'm not sure how exactly to solve this.
<walters> ganadist: since the returned string *is* valid UTF-8
<ganadist> yes.
<walters> ganadist: probably the only real way to fix it would be to do some language-specific analysis
<walters> ganadist: if the song name is "¿µ¿ø", well then you figure you got the encoding wrong.
<walters> the reason your solution works is that it does no conversion on the string
<ganadist> most legacy id3tags are some :(
<walters> which is wrong too
<ganadist> yes. i know.
<walters> ok
<walters> maybe in the short term what we could do is look for lots of non-letters in a row
<walters> if that's true, then we try the _latin1_duplicate
<walters> ganadist: what do you think?
<ganadist> i think that is good :)
<walters> ganadist: do you want to try implementing that?
<ganadist> i'll try.
<walters> ganadist: btw, if you can figure out how to peek directly at the buffers without id3 doing any conversion, that would be helpful here i think
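The point in the log that the conversion "succeeds" can be sketched in a few lines of C. This is only an illustration of an ISO-8859-1 => UTF-8 conversion, not the libid3tag code; the function name latin1_bytes_to_utf8 is hypothetical. Every input byte maps to a code point below U+0100, and each of those has a well-formed one- or two-byte UTF-8 encoding, so the output is always valid UTF-8 whatever charset the bytes were really written in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: treat each input byte as the Unicode code point
 * of the same value (the ISO-8859-1 reading) and write it as UTF-8.
 * dst must have room for 2 * len bytes. Returns the bytes written. */
size_t latin1_bytes_to_utf8(const unsigned char *src, size_t len,
                            unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[out++] = src[i];                              /* ASCII: copied as-is */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (src[i] >> 6));   /* lead byte */
            dst[out++] = (unsigned char)(0x80 | (src[i] & 0x3F)); /* continuation */
        }
    }
    return out;
}
```

For the four bytes BF B5 BF F8, the result is C2 BF C2 B5 C2 BF C3 B8, which is the UTF-8 encoding of the "¿µ¿ø" string quoted in the log. A UTF-8 validity check on that output passes, so it cannot reveal that the encoding guess was wrong.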
I tried, but there is no way to validate a UCS-4 string. Unavoidably, we just check against a charset that can be validated, such as the legacy charset or UTF-8, and then try converting from UCS-4 afterwards. PS: id3_ucs4_latin1duplicate() just duplicates the ID3 tag frame (no decoding takes place), like this, so you can think of it as working like an id3_ucs_rawduplicate():
    void id3_latin1_copy(id3_latin1_t *dest, id3_latin1_t const *src)
    {
      while ((*dest++ = *src++))
        ;
    }
Sorry, my misunderstanding.
    id3_length_t id3_latin1_encodechar(id3_latin1_t *latin1, id3_ucs4_t ucs4)
    {
      *latin1 = ucs4;
      if (ucs4 > 0x000000ffL)
        *latin1 = ID3_UCS4_REPLACEMENTCHAR;
      return 1;
    }
So basically libid3 gives us no way to access the raw tag data?
No, we can access it through id3_ucs4_latin1duplicate() for legacy charsets. id3_ucs4_utf8duplicate returns a usable string only when the ID3 tag is really encoded as UCS-4, but unfortunately tagging programs write tags in all kinds of (legacy) charsets. And id3_ucs4_latin1duplicate fails to return a usable string only when the ID3 tag is encoded as UCS-4. That is why I suggested the decoding order in my patch. PS: UTF-8 validity is checked twice, once before rb_unicodify and once inside rb_unicodify.
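Why the Latin-1 duplicate gives back the raw tag bytes can be shown with a small round trip. This is a sketch under assumptions: ucs4_t, latin1_decode, latin1_encode and REPLACEMENT_BYTE are hypothetical stand-ins written for this example, not libid3tag's API, and the replacement byte is chosen arbitrarily. Decoding as ISO-8859-1 maps each byte to the code point of the same value, and encoding back maps each code point up to 0xFF to the byte of the same value, so the two steps cancel out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ucs4_t;        /* stand-in for a UCS-4 code point type */
#define REPLACEMENT_BYTE '?'         /* arbitrary placeholder for this sketch */

/* Decode as ISO-8859-1: every byte becomes the code point of equal value. */
void latin1_decode(const unsigned char *src, size_t len, ucs4_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Encode to ISO-8859-1: code points above 0xFF cannot be represented
 * and are replaced, mirroring the range check in the snippet above. */
void latin1_encode(const ucs4_t *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = (src[i] > 0xFF) ? (unsigned char)REPLACEMENT_BYTE
                                 : (unsigned char)src[i];
}
```

If the library decoded the tag as Latin-1, encoding the UCS-4 string back to Latin-1 returns the original bytes unchanged, and those bytes can then be handed to a converter for the real legacy charset. If the tag really held Unicode text with code points above 0xFF, the round trip loses them, which matches the case described above where the Latin-1 duplicate is not usable.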
*** Bug 129298 has been marked as a duplicate of this bug. ***
*** Bug 131386 has been marked as a duplicate of this bug. ***
*** Bug 157270 has been marked as a duplicate of this bug. ***
Quick update on this bug: rhythmbox no longer uses monkey-media, so most of the patches discussed here are no longer relevant. Rhythmbox now uses GStreamer for tag reading, and GStreamer behaves in the following way:
* For id3v2 tags, the charset is encoded in the tag, so GStreamer uses that encoding (I think it can only be ISO-8859-1, UTF-8 or UTF-16). If you have id3v2 tags using a different encoding, then you'd better fix those tags.
* For id3v1 tags, the encoding is unspecified. GStreamer tries to read them as UTF-8; if that doesn't work, it tries a current locale => UTF-8 conversion; and if that still doesn't work, it does an ISO-8859-1 => UTF-8 conversion.
So if you have "badly" encoded id3v1 tags, you can set your locale to something like fr_FR.ISO-8859-15, import your files and save your library; the encoding should be correct then. This probably doesn't handle all the corner cases mentioned in this bug, but it is still a good first step. Can people hit by these tag encoding issues comment on whether it's enough for them or not?
No, not enough. The problem is, even though for id3v2 tags the charset *should be* encoded in the tag, most taggers (EasyTag, grip, soundjuicer, not to mention a whole slew of Windows encoders) do not set the encoding correctly, so the tag encoding bit says that it is ISO-8859-1 even though it is actually utf8 or, worse, koi8-r. It is not a rhythmbox problem, but it would be good if rhythmbox provided a workaround, allowing the user to manually override the encoding, much as evolution allows you to manually select the encoding of an email you have received. Yes, it is ugly -- but better than having unreadable tags. If you know of a good program that can fix the tags by setting the encoding bit correctly, please let me know (this is a FAQ!)
Rhythmbox can't do much about that since this is all handled by gstreamer. I'd rather have a separate "fix my tags" program than hack around that in rhythmbox imo. And obviously, the writers of tagging apps need to be hit with a clue stick...
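The id3v1 fallback order described above can be sketched in C as follows. This is not GStreamer's code; it is a minimal illustration under stated assumptions. The names id3v1_field_to_utf8, utf8_is_valid, latin1_to_utf8_dup and the locale_to_utf8_fn callback are hypothetical, and the locale step is left to a caller-supplied hook (for example a wrapper around iconv or g_locale_to_utf8) so that the block stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook for the locale step: returns a malloc'd, NUL-terminated
 * UTF-8 string, or NULL if the bytes are not valid in the locale codeset. */
typedef char *(*locale_to_utf8_fn)(const char *raw, size_t len);

/* Returns 1 if s[0..len) is well-formed UTF-8 (no overlongs, no surrogates). */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;                      /* continuation bytes expected */
        unsigned long cp;                 /* decoded code point */
        if (c < 0x80) { i++; continue; }
        if (c >= 0xC2 && c <= 0xDF)      { need = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; }
        else return 0;                    /* stray continuation or bad lead byte */
        if (len - i <= need) return 0;    /* sequence runs past the buffer */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000)) return 0;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
        i += need + 1;
    }
    return 1;
}

/* ISO-8859-1 => UTF-8. Never rejects input: every byte value maps to a
 * code point. Returns a malloc'd string, or NULL if allocation fails. */
char *latin1_to_utf8_dup(const char *raw, size_t len)
{
    char *out = malloc(2 * len + 1);
    size_t n = 0;
    if (out == NULL) return NULL;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)raw[i];
        if (c < 0x80) {
            out[n++] = (char)c;
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return out;
}

/* Try UTF-8, then the locale codeset (if a hook is given), then ISO-8859-1.
 * The caller frees the result. */
char *id3v1_field_to_utf8(const char *raw, size_t len, locale_to_utf8_fn from_locale)
{
    assert(raw != NULL);
    if (utf8_is_valid((const unsigned char *)raw, len)) {
        char *copy = malloc(len + 1);     /* step 1: already UTF-8, copy it */
        if (copy == NULL) return NULL;
        memcpy(copy, raw, len);
        copy[len] = '\0';
        return copy;
    }
    if (from_locale != NULL) {            /* step 2: locale => UTF-8 */
        char *converted = from_locale(raw, len);
        if (converted != NULL) return converted;
    }
    return latin1_to_utf8_dup(raw, len);  /* step 3: last resort */
}
```

Because the last step never fails, the chain always produces some string. A tag in a legacy charset that is neither valid UTF-8 nor valid in the current locale will still come out as Latin-1 mojibake, which is why the locale has to match the tags for this approach to help.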
This problem has been such a headache for me that I wrote a command line tool (in Java) to convert native encoded tags (v1 or v2) to UTF8 (v2). You can grab it at: http://www.cs.berkeley.edu/~zf/id3iconv/
With the GStreamer backend, this is solved by bug #149274, but with the xine backend it needs more work.
The xine backend has more serious issues than that one ;) I'm closing this bug then, since the xine backend isn't really supported these days.