GNOME Bugzilla – Bug 149274
[PATCH] gst-plug mad gets id3v2 text tags in wrong encoding
Last modified: 2005-11-15 18:59:38 UTC
1.run Rhythmbox with gstreamer backend 2.load mp3s which are tagged in a locale-depending encoding 3.these tags get wrong display (All in ISO-8859-1 symbols) When a text tags of id3 is marked as "ISO-8859-1", it desn't really mean ISO-8859-1, but "locale-depending encoding". So a locale-to-utf8 conversion is needed. Id3v1 tags gets converted correctly in gst/tags/gstid3tag.c, but id3v2 are not. They just get a "ISO8859-1-to-utf8" conversion (by libid3tag), so that won't work correctly in non-ASCII locales.
Created attachment 30200 [details] [review] do "locale-to-utf8" conversion for id3v2 tags in mp3s The little patch will do the "locale-to-utf8" conversion for gst-plugin-mad when an ISO-8859-1 text tag are read
Thank you. It works for me. I hope it could be included in offical release.
According to the ID3 standard located at http://www.id3.org/id3v2.4.0-structure.txt a tag marked as "ISO-8859-1" really means "ISO-8859-1" and not "current locale". In the case of ID3v1 the locale was never specified, so GStreamer uses the hackish approach of looking at the locale. Note that this often fails however, because modern Linux distributions use UTF-8 by default. Which taggers do produce these broken files?
Benjamin, is this patch commited? If not should it be commited and this bug closed? If it is not commited and shouldn't be commited either should this bug be closed as wontfix?
Can any of the topicstarters please provide such a sample file as Benjamin requested? We'd love to apply this patch or a similar one, but we need to be sure that A) UTF-8 support doesn't break, B) the final patch fixes your bug and C) check for general sanity. Please provide a file.
Here is an mp3 for you to test. id3tag is on russian cp1251 (windows-1251) encoding. http://people.panama.by/pi/f0000873.mp3
There is a spec for ID3v2 and always has been, and files like the above are clearly broken IMHO. But even _if_ one was to accept the approach you are taking, then the logic should at least be: if (encoding == ID3V2_SPEC_ISO88591_REALLY_MEANING_LOCALE) { if (CURRENT_LOCALE == UTF-8) /* or any unicode really? */ { interpret_string_as_iso_8859_1(); } else { if (interpret_string_as_current_locale() == FAILED) interpret_string_as_iso_8859_1(); } } I am not convinced, however, that this is the right approach. This approach has at least two problems: * users tend to have files from multiple sources (friends, p2p networks, or files from the days they still used Windows etc.), all of which can be with strings encoded in whatever locale the source used. * users might switch locales (e.g. consciously, or when upgrading their distro, or when changing distros/OS etc.) I think the only semi-sane way to handle this problem is to have an environment variable like GSTREAMER_ID3_TAG_ENCODING that takes one or multiple ':'-separated encodings that are tried for ID3v2 strings in 'iso-8859-1' encoding. Or someone should just write a small and easy-to-use app that converts those broken tags into unicode tags ;) Cheers -Tim
Created attachment 36496 [details] [review] id3v2 recoding with libenca
Created attachment 36497 [details] [review] id3v1 recoding with libenca
this problem is very urgent for the languages of those using cyrillic symbols. So let us say for the Russian language there exists the entire "bouquet" of encodings. And not always encoding in the file coincides with locale encoding. I examined the published here patches and the initial code. And I create new patches. In them they are used the specialized library (libenca) for determining the initial encoding. For this altogether it is only necessary to know the language of tags (first two letters from locale name). This approach justified itself, and now I don't have problems with codings of tags id3v2. But problem with the tags id3v1 remained. Despite recoding in gst_tag_extract() function , tags show erroneously. I don't understand why. If you please, you will examine my patches, and, if it is possible, help with the solution of this problem. P.S. Sorry my bad english.
If anyone is asking me for an opinion (seems so, since I was added to the cc :)): These files are broken. Broken files should be hard but possible to support so there exists a way for people to get them to work, but it involves some time on their part. Just so they are annoyed enough that they fix their files. I also don't want a solution that makes stuff much more complicated just because of broken files. Adding a libenca dependency makes stuff much more complicated, especially for packagers. The best solution I can see is what __tim suggested: use an env var GST_ID3_TAG_ENCODING. If set (to for example "cp1251"), ID3v1 tags and ID3v2 in 8859-1 tags are tried to be interpreted in that encoding first. This is only a little more code, makes it possible to read tags and involves work to get the broken files going.
2 Benjamin, This problem is really important for those users with *broken* files (I think any user have some files like that). Users think: "If my files work in Windows why they don't work in Linux??". And we can't describe them that it is their falce and they must change id3tags in mp3 files. Some of them even don't know what id3tag is. Such "Look&Feel" problems casts shadow on whole Linux. Part 2. I'm as a packanger think that it is OK to use enca. But if you don't think so... OK. Even if we will use env var the methods will be the same (we should add recoding to the same functions). But our second patch (for id3v1) don't work. And we need some help with it. You can forget about enca and just look if we have added a recoding in necessary place?
I'll have something based on environment variables ready for you to test by tomorrow. Can you provide one or two files that have ID3v2 tags and exhibit this problem? (The file above only has an ID3v1 tag as far as I can see) Cheers -Tim
2 Benjamin: err... "teaching people in a hard way" may work if it isn't really lot of people and an awful lot of a hard way ahead -- I've exchanged several emails with grip author and maintainers of freedb.org on a similar topic. It all started as quite similar patches for grip to handle potentially different encodings of freedb data, ID3s and filesystem where the filenames get stored. The problem was they broke with grip being modified to comply with freedb's policy. The latter was fine in theory but in practice there's an awful lot of "latin1" information there which is really in e.g. windows-1251 charset. In fact, any cyrillic sample I could find was exactly that, filed in by some non-compliant Windows software. So please postpone your pedagogical talent till the time when it can be fruitful -- this one isn't. We have an ugly ignorant legacy problem, and it's not going to fade away anytime soon by itself. You're basically saying "hey use LaTeX" to a secretary which doesn't know what Windows is -- all she knows is "Word". Prudent people at Star Division invented another way around this problem. :) Sorry for the lengthy rant but the issue (and accompanying bugs and troubleshooting) have annoyed me quite a bit to maintain Packages That Work and not bother to update them if they break or even look at another e.g. players. People tend to follow the advice, too :) --- Recapping: I can attach current patch in our grip package, and here's link to another one applied to our xmms package (I happen to maintain both): http://prdownloads.sourceforge.net/rusxmms/xmms-1.2.10-recode-csa27.4.tar.bz2?download (project site: http://rusxmms.sourceforge.net) -- WBR, Mike
Really, what is wrong with teaching people about that stuff, ie giving them the env var, and writing a "smart-tag-fixer" program which will guess the encoding of the mp3s and fix them accordingly? If we add workarounds everywhere, the spec will never be enforced, imo there is a point in time where all the legacy cruft needs to be cleaned up, and now is probably a good time to start with the new tag libs available (I'm mainly thinking to gstreamer and taglib).
2 Tim-Philipp, Here is a file with id3v2 in cp1251 encoding: http://people.panama.by/pi/16-7b_-_poslednii_geroi_128_lame_cbr.mp3
Try again with current CVS. Depending on what your locale encoding is, you might need to set the appropriate environment variable, e.g. like this (if you use bash): % export GST_ID3_TAG_ENCODING=CP1251 % rhythmbox or % GST_ID3_TAG_ENCODING=CP1251 rhythmbox With the two test files provided I get (let's see how bugzilla handles unicode): % GST_ID3_TAG_ENCODING=CP1251 gst-launch-0.8 -t filesrc location=f0000873.mp3 ! decodebin ! audio/x-raw-int ! fakesink Running pipeline ... Found tag: found by element »id3demux0« Title: Баллада о борьбе Artist: Высоцкий Владимир Album: http://mp3.tula.ru Comment: Средь оплывших свечей и вечерн % GST_ID3_TAG_ENCODING=CP1251 gst-launch-0.8 -t filesrc location=16-7b_-_poslednii_geroi_128_lame_cbr.mp3 ! decodebin ! audio/x-raw-int ! fakesink Running pipeline ... Found tag: found by element »id3demux0« Title: 7Б - Последний герой Artist: Сборники Рок-музыки Album: Последний герой - Песни для г Genre: Other Date: 730851 Cheers -Tim
Let me clarify my above statement a bit. I'm all for adding workarounds if the workarounds make more files work. I'm against workarounds if they - break correct applications - work in an unexpected way. Guessing the tag encoding based on locale setting does both of this. 1) If an application correctly encodes "Die Ärzte" as ISO-8859-1 and you play it back in Russia, you definitely want to get "Die Ärzte", no matter what locale you chose. None of the patches I've seen so far can ensure this. 2) If I were an ordinary user I would never see the connection between changing the language of my desktop and Rhythmbox failing to import my files. If anyone can think of a patch that never fails for files encoded to spec, we'll use it instantly, but as long as we need to break the spec, don't do it by default.
*** Bug 160816 has been marked as a duplicate of this bug. ***
Fixed by Tim in CVS. Closing.
*** Bug 321383 has been marked as a duplicate of this bug. ***