GNOME Bugzilla – Bug 342364
[id3demux] Chinese id3 tags sometimes read incorrectly
Last modified: 2006-05-24 10:46:20 UTC
Please describe the problem:
Some Unicode id3 tags with Chinese characters can be read by other id3 programs but not by GStreamer apps. For example, a file by the artist Faye Wong (王菲) shows up as the gibberish çè² in GStreamer. See the output below for more info:

$ id3v2 -l "王菲 - Live - 01.mp3"
id3v1 tag info for 王菲 - Live - 01.mp3:
Title  :                                 Artist: 王菲
Album  : Live                            Year: , Genre: Pop (13)
Comment: 00000C79 00000B32 0000F780 0    Track: 1
id3v2 tag info for 王菲 - Live - 01.mp3:
TALB (Album/Movie/Show title): Live
TRCK (Track number/Position in set): 01
TCON (Content type): Pop (13)
COMM (Comments): ()[]: 00000C79 00000B32 0000F780 0
TLEN (Length): 316000
TPE1 (Lead performer(s)/Soloist(s)): 王菲

$ gst-launch-0.10 filesrc location="王菲 - Live - 01.mp3" ! id3demux ! fakesink -t
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
FOUND TAG : found by element "id3demux0".
artist: çè²
album: Live
comment: 00000C79 00000B32 0000F780 0
track number: 1
genre: Pop
duration: 316000000000
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 182947000 ns.
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
FREEING pipeline ...

Steps to reproduce:
The problem does not seem to depend on the Chinese characters themselves (other Faye Wong tags in Chinese work fine), so I am not sure what triggers it. But on the problematic tags, the problem is seen every time in a GStreamer app.

Actual results:

Expected results:

Does this happen every time? Yes

Other information: If you want the actual mp3 file, email me at monochromatic_rainbow@yahoo.com
Created attachment 65875: First 512kb of the problematic file given in the example above. Includes the id3v2 tag.
Created attachment 65876: Last 512kb of the problematic file given in the example above. Includes the id3v1 tag.
Which other applications read this ID3v2 tag at the start correctly? The artist frame in the tag is simply broken/incorrect. It claims to be of encoding type 0 (=ISO-8859-1) while it really is encoding 3 (UTF-8). As far as I can tell there is no way to know that this is not correct, because ISO-8859-1 covers the entire range from 0x00-0xff so we can't even say "ooh, it doesn't look like valid ISO-8859-1, let's check whether it's UTF-8".
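To make the point about encoding 0 concrete, here is a minimal Python sketch (an illustration only, not GStreamer code). It assumes the frame holds the UTF-8 bytes for 王菲, as described above, and shows that decoding them as ISO-8859-1 succeeds without any error, which is why a reader cannot notice the mislabelling from the bytes alone.

```python
# UTF-8 bytes for the artist name, as they would sit in the frame payload.
raw = "王菲".encode("utf-8")

# What a spec-following reader produces when the frame claims encoding 0.
as_latin1 = raw.decode("iso-8859-1")

# What the tag writer actually meant.
as_utf8 = raw.decode("utf-8")

# ISO-8859-1 maps every one of the 256 byte values to a character,
# so decoding never fails and gives no hint that the label is wrong.
every_byte = bytes(range(256))
decoded_all = every_byte.decode("iso-8859-1")

print(raw.hex(" "))
print(repr(as_latin1))
print(as_utf8)
```

The visible characters in the ISO-8859-1 result are ç, è and ²; the remaining four code points fall in the C1 control range, which terminals generally do not display. That matches the gibberish in the gst-launch output.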
Sorry, I missed the reference to the 'id3v2' tool above. Indeed, that tool displays the tag fine here as well (so it doesn't take into account locales), will need to see how it does that.
Yeah, several command line id3 tools read the tag as Unicode (id3v2 being one). Playing around with this has also made me think that GStreamer apps (like Rhythmbox and Nautilus) are using the id3v1 tag for display. And according to some notes I was reading in Easytag (the app I use to edit tags), id3v1 tags are always saved as single-byte. So doesn't that mean basically the id3v1 tags will never fully support Chinese (which requires double-byte)? Thus, the problem is not so much about reading it wrong (I guess GStreamer reads id3v1 as single-byte as it technically "should") but rather that it would be better to use the id3v2 tag for display? At least, it seems that it should first look for id3v2 and, if that doesn't exist, then resort to id3v1? Feel free to correct me if I have muddled up everything here... I'm just a user, not a developer.
> And according to some notes I was reading in Easytag (the app I use to
> edit tags), it says that id3v1 tags are always saved as single-byte.
> So doesn't that mean basically the id3v1 tags will never fully
> support Chinese (which requires double-byte)?

That is correct. ID3v1 was only really meant to hold Western European strings (ISO-8859-1/ASCII) and nothing else. How apps/readers/writers deal with that problem differs. It's pretty much a mess and pretty much unsolvable. (GStreamer falls back on the encoding specified in GST_ID3_TAG_ENCODING for ID3v1 tags if the text is not valid UTF-8. This is a hack, though, needed because there are so many ID3v1 tags with other charsets out there.) In short: just don't use ID3v1 tags.

> Yeah, several command line id3 tools read the tag as Unicode (id3v2 being
> one). Playing around with this has also made me think that GStreamer apps
> (like Rhythmbox and Nautilus) are using the id3v1 tag for display.

GStreamer shouldn't prefer the ID3v1 tag, at least not by default. By default the 'id3demux' element should use the tags from the ID3v2 tag if it finds both an ID3v2 tag and an ID3v1 tag. If it doesn't do that, that's a bug :)

However, GStreamer in fact does read the ID3v2 tag "wrongly" as well (where "wrongly" = correctly according to spec). You can see that from the debug log if you use gst-launch like this:

$ GST_DEBUG=id3demux:5 ....

(Also, I'm only working with the beginning of the file, which doesn't have the ID3v1 tag.) Needs more looking into ...
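The ID3v1 fallback described above can be sketched roughly as follows. This is a simplified Python illustration, not GStreamer's actual code: the helper name `decode_id3v1_field` is hypothetical, treating the environment variable as a single encoding name is an assumption, and the final ISO-8859-1 step is simply what the ID3v1 format nominally implies.

```python
import os


def decode_id3v1_field(raw: bytes, env_var: str = "GST_ID3_TAG_ENCODING") -> str:
    """Hypothetical helper: decode one fixed-width ID3v1 text field."""
    # ID3v1 fields are fixed width, padded with NUL bytes or spaces.
    raw = raw.split(b"\x00", 1)[0].rstrip(b" ")

    # Step 1: accept the bytes as UTF-8 if they happen to be valid UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: try the user-supplied fallback encoding, if one is set.
    # (Assumption: the variable holds a single encoding name.)
    fallback = os.environ.get(env_var, "")
    if fallback:
        try:
            return raw.decode(fallback)
        except (UnicodeDecodeError, LookupError):
            pass

    # Step 3: ISO-8859-1 always succeeds, but may yield gibberish.
    return raw.decode("iso-8859-1")
```

For example, with the variable set to `gbk`, a 30-byte artist field holding GBK-encoded 王菲 is not valid UTF-8, so the sketch falls through to step 2 and decodes it correctly. Without the variable, the same bytes would come out as Latin-1 gibberish.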
No, it's definitely taking the ID3v2 tag as it should, and that tag is improperly put together. As far as I can tell, the only reason the id3v2 program gets this tag right is that it ignores the indicated text encoding in the field entirely. Also, if this tag was written by Easytag, it was a broken version, because version 1.99.11 can't read the field correctly either. In short, I don't think there's much we can do to 'handle' this file - it's just broken.
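The difference between honouring and ignoring the encoding byte can be sketched in Python as below. The encoding byte values (0 to 3) are the ones defined for ID3v2 text frames. Both function names are hypothetical, and the second function is only a guess at how a lenient tool might behave, not a description of the id3v2 program's source.

```python
# Text encoding byte values defined for ID3v2 text frames.
ID3V2_ENCODINGS = {
    0: "iso-8859-1",
    1: "utf-16",      # UTF-16 with byte order mark
    2: "utf-16-be",   # UTF-16 big endian without BOM (ID3v2.4 only)
    3: "utf-8",       # ID3v2.4 only
}


def decode_text_frame(payload: bytes) -> str:
    """Decode a text frame payload the way the spec says to."""
    encoding = ID3V2_ENCODINGS[payload[0]]
    return payload[1:].decode(encoding).rstrip("\x00")


def decode_ignoring_declared_encoding(payload: bytes) -> str:
    """Hypothetical lenient reader: skip the encoding byte and treat
    the rest as UTF-8, which is what a UTF-8 terminal effectively does
    when raw bytes are printed to it."""
    return payload[1:].decode("utf-8", errors="replace").rstrip("\x00")


# The broken TPE1 payload from this report: encoding byte 0, UTF-8 data.
broken_tpe1 = b"\x00" + "王菲".encode("utf-8")

print(repr(decode_text_frame(broken_tpe1)))
print(decode_ignoring_declared_encoding(broken_tpe1))
```

The spec-following decoder returns the Latin-1 gibberish, while the lenient one returns 王菲. The lenient approach only works by accident here and would mangle a tag that really is ISO-8859-1 with non-ASCII characters.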
Yeah, I have started to realize that the source of my problems is Easytag. It seems that it doesn't set the encoding type properly. But I would note that I used v1.99.11 to make the tag. Anyway, I guess that just means 1.99.11 has a bug. In the meantime, I figured out I can save Unicode tags in Easytag with ISO-8859-1, then use id3iconv (downloaded off the web) to convert them properly to Unicode. Rhythmbox and Nautilus could read those converted tags fine. So, clearly, it's not a GStreamer bug.
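For readers wondering what such a conversion amounts to at the byte level, here is a rough Python sketch. It is not id3iconv's implementation; `reencode_text_frame` is a hypothetical helper, and it assumes the caller already knows the real encoding of the mislabelled bytes. It rewrites the payload with encoding byte 1 (UTF-16 with BOM), which both ID3v2.3 and ID3v2.4 define.

```python
import codecs


def reencode_text_frame(payload: bytes, actual_encoding: str) -> bytes:
    """Hypothetical helper: fix a text frame payload that is marked as
    encoding 0 but really holds bytes in `actual_encoding`."""
    if payload[0] != 0:
        # Already declares a Unicode encoding; leave it untouched.
        return payload
    text = payload[1:].decode(actual_encoding)
    # Encoding byte 1 means UTF-16 with a byte order mark.
    return b"\x01" + codecs.BOM_UTF16_LE + text.encode("utf-16-le")


broken = b"\x00" + "王菲".encode("utf-8")
fixed = reencode_text_frame(broken, "utf-8")
print(fixed.hex(" "))
```

A real tool would also have to update the frame size in the frame header and the overall tag size, which this sketch leaves out.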