GNOME Bugzilla – Bug 351794
[id3demux] try harder to extract wrongly marked strings
Last modified: 2006-08-22 15:55:14 UTC
With the xine-lib backend:

$ ./metadata-test "/home/data/Documents/Movie Samples/orgasme.mp3" | grep Title
Title: Répondeur orgasme

With the GStreamer backend:

Title: RÃ©pondeur orgasme
Created attachment 71094 [details]
orgasme.mp3
The ID3v2 title text frame claims the text is encoded as ISO-8859-1, and that's how we interpret it. Whatever wrote the tag should have marked the frame as containing a UTF-8 string if it writes UTF-8 strings. AFAIK there aren't really any generally reliable mechanisms for guessing that a string wrongly labelled ISO-8859-1 is actually UTF-8. We might be able to put in some hacks to get this one right, but then we will almost certainly get other, correctly encoded tags wrong. There are limits to how much you can hack around broken tags ...
how come xine gets it right?
IIRC xine gets it right because it completely ignores the specified character encoding, which means it gets every other case wrong.
Huh, not true. If the string is already valid UTF-8 but the encoding given by the file is wrong, we pass it through as UTF-8:

  if (enc && strcmp (enc, "UTF-8")) {
    /* Don't bother converting if it's already in UTF-8, but the encoding
     * is badly reported */
    if (meta_info_validate_utf8 (value)) {
      meta_info_set_unlocked_utf8 (stream, info, value);
      return;
    }

That seems to work in most cases. If it's not correct UTF-8, we then perform a conversion using the encoding reported in the file.
Sorry, I was thinking of VLC. Xine's approach might work well for us too, although it runs the risk of wrongly converting tags that are marked ISO8859-1 but happen to contain a string that validates as UTF-8 - not that I think I've ever seen one.
Created attachment 71147 [details] [review]
patch

This patch seems to fix it by using xine's technique for strings marked 'ISO8859-1'. The question now is whether to apply it. Opinions?
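For reference, the pass-through technique can be sketched in plain C. This is an illustrative standalone version, not the actual patch: the real code lives in gst/id3demux/id3v2frames.c and would use GLib helpers such as g_utf8_validate(), and the function names below (utf8_validate, latin1_to_utf8, extract_id3_text) are made up for the sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal UTF-8 well-formedness check. A sketch: it rejects bad lead
 * and continuation bytes and the worst overlong forms (0xC0/0xC1),
 * but skips some of the stricter checks a real validator performs. */
static bool utf8_validate (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  while (*s) {
    int cont;
    if (*s < 0x80) {
      cont = 0;
    } else if ((*s & 0xE0) == 0xC0) {
      if (*s < 0xC2) return false;   /* overlong 2-byte sequence */
      cont = 1;
    } else if ((*s & 0xF0) == 0xE0) {
      cont = 2;
    } else if ((*s & 0xF8) == 0xF0) {
      if (*s > 0xF4) return false;   /* beyond U+10FFFF */
      cont = 3;
    } else {
      return false;                  /* stray continuation or invalid byte */
    }
    s++;
    while (cont-- > 0) {
      if ((*s & 0xC0) != 0x80) return false;
      s++;
    }
  }
  return true;
}

/* Convert ISO-8859-1 to UTF-8: every Latin-1 byte maps 1:1 to the
 * Unicode code point with the same value, so each byte >= 0x80
 * becomes a two-byte UTF-8 sequence. */
static char *latin1_to_utf8 (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  char *out = malloc (2 * strlen (str) + 1);  /* worst case: 2 bytes per char */
  char *p = out;
  for (; *s; s++) {
    if (*s < 0x80) {
      *p++ = (char) *s;
    } else {
      *p++ = (char) (0xC0 | (*s >> 6));
      *p++ = (char) (0x80 | (*s & 0x3F));
    }
  }
  *p = '\0';
  return out;
}

/* The heuristic: a field marked ISO-8859-1 whose bytes already
 * validate as UTF-8 is passed through untouched; otherwise it is
 * converted from ISO-8859-1. */
static char *extract_id3_text (const char *field)
{
  if (utf8_validate (field)) {
    char *copy = malloc (strlen (field) + 1);  /* mislabelled UTF-8: keep as-is */
    strcpy (copy, field);
    return copy;
  }
  return latin1_to_utf8 (field);               /* genuine Latin-1: convert */
}
```

With this, the sample's title bytes ("R\xC3\xA9pondeur ...", which validate as UTF-8) come through as "Répondeur", while a genuinely Latin-1 "R\xE9pondeur" still gets converted.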
> This patch seems to fix it by using xine's technique for strings marked
> 'ISO8859-1'.
>
> The question now is whether to apply it. Opinions?

Heh, I'm all for it. I suggested this months ago when a similar bug came up, and it was you who rejected the idea back then on the grounds that we'd be sure to get some correctly encoded strings wrong ;) (I was just arguing what I thought we had agreed on.)

The type of strings we'll get wrong is rather unlikely: basically a 'special character' (umlaut/accent type) followed immediately by 'special punctuation'. I don't think we'll find that combination very often in normal tags (where special quotes for quoted speech etc. aren't used). The only halfway plausible string I can come up with is something like Blablé² or Blablé³ for album titles (I've very rarely seen superscripts used in place of 'Volume 2' etc.).

I say let's apply it and see if we get any reports about wrongly extracted strings.
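To make the false-positive risk concrete, here is a small illustrative check (the helper name is made up). A Latin-1 lead byte in the à-ï range (0xE0-0xEF) followed by two bytes in 0x80-0xBF (Latin-1 punctuation and superscripts such as ² and ³) happens to form a well-formed 3-byte UTF-8 sequence, so Latin-1 "é²³" (bytes E9 B2 B3) would be mistaken for UTF-8 and passed through unconverted:

```c
#include <stdbool.h>

/* True if the three bytes at s form a well-formed 3-byte UTF-8
 * sequence: lead byte 1110xxxx followed by two 10xxxxxx bytes.
 * (Illustrative helper; ignores overlong-sequence checks.) */
static bool looks_like_utf8_3byte (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  return (s[0] & 0xF0) == 0xE0 &&
         (s[1] & 0xC0) == 0x80 &&
         (s[2] & 0xC0) == 0x80;
}
```

Note that Latin-1 "é" followed by ordinary ASCII (as in "Répondeur") does not validate, which is why the heuristic works for normal accented text: the 0xE9 lead byte demands two continuation bytes that ASCII never supplies.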
ooh, you never! Damn revisionists! ;)

OK, applied:

        * gst/id3demux/id3v2frames.c: (parse_text_identification_frame),
        (parse_insert_string_field):
          If strings in text fields are marked ISO8859-1, but contain
          valid UTF-8 already, then handle them as UTF-8 and ignore the
          encoding. (#351794)