Bug 351794 - [id3demux] try harder to extract wrongly marked strings
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 0.10.5
Assigned To: GStreamer Maintainers
Depends on:
Reported: 2006-08-17 17:11 UTC by Bastien Nocera
Modified: 2006-08-22 15:55 UTC
See Also:
GNOME target: ---
GNOME version: ---

orgasme.mp3 (59.24 KB, application/octet-stream)
2006-08-17 17:12 UTC, Bastien Nocera
patch (1.59 KB, patch), committed
2006-08-18 13:58 UTC, Jan Schmidt

Description Bastien Nocera 2006-08-17 17:11:58 UTC
With the xine-lib backend:
$ ./metadata-test "/home/data/Documents/Movie Samples/orgasme.mp3" | grep Title
Title: Répondeur orgasme

With the GStreamer backend:
Title: RÃ©pondeur orgasme
Comment 1 Bastien Nocera 2006-08-17 17:12:35 UTC
Created attachment 71094
Comment 2 Tim-Philipp Müller 2006-08-17 17:39:58 UTC
The ID3v2 title text frame claims the text is encoded as ISO-8859-1 and that's how we interpret it. Whatever wrote the tag should have marked the frame as containing a UTF-8 string if it writes UTF-8 strings.

AFAIK there aren't really any good and generally reliable mechanisms for guessing that this is a UTF-8 string wrongly labelled as an ISO-8859-1 string. We might be able to put in some hacks to get this one right, but then we will almost certainly get other correctly encoded tags wrong. There are limits to how much you can hack around broken tags ...
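
To illustrate what happens when a UTF-8 string is read under an ISO-8859-1 label, here is a minimal sketch in plain C. The function name latin1_to_utf8 is hypothetical and is not taken from id3demux; it only shows the byte-level effect: each byte of the UTF-8 sequence for "é" (0xC3 0xA9) is treated as its own Latin-1 character and re-encoded, which yields "Ã©".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
 * Every Latin-1 byte maps to the Unicode code point of the same value,
 * so bytes >= 0x80 become a two-byte UTF-8 sequence.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or 0 if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_size)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        size_t need = (c < 0x80) ? 1 : 2;

        /* Keep one byte free for the trailing NUL. */
        if (o + need + 1 > out_size)
            return 0;

        if (c < 0x80) {
            out[o++] = c;
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

Feeding it the three bytes 'R' 0xC3 0xA9 (UTF-8 for "Ré") produces the five bytes 'R' 0xC3 0x83 0xC2 0xA9, which a UTF-8 terminal displays as "RÃ©".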
Comment 3 Wim Taymans 2006-08-18 09:50:28 UTC
how come xine gets it right?
Comment 4 Jan Schmidt 2006-08-18 10:20:35 UTC
IIRC xine gets it right because it completely ignores the specified character encoding, which means it gets every other case wrong.
Comment 5 Bastien Nocera 2006-08-18 10:30:45 UTC
Huh, not true.

If the string is already valid UTF-8, but the encoding given by the file is wrong, we try to pass it through as UTF-8:
    if (enc && strcmp(enc, "UTF-8")) {
      /* Don't bother converting if it's already in UTF-8, but the encoding
       * is badly reported */
      if (meta_info_validate_utf8(value)) {
        meta_info_set_unlocked_utf8(stream, info, value);

That seems to work in most cases. If it's not correct UTF-8, we then perform a conversion using the encoding reported in the file.
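
The decision described above can be sketched in self-contained C as follows. This is not xine's code nor the id3demux patch: is_valid_utf8, choose_interpretation and the text_decision enum are hypothetical names, and the validator is a plain hand-written one (rejecting overlong forms, surrogates and code points above U+10FFFF) standing in for whatever validation routine a real player would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical validator: true if s[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* decoded code point */

        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return false;     /* stray continuation or invalid lead byte */

        if (len - i <= extra)
            return false;      /* sequence truncated at end of string */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (extra == 2 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
            return false;
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

typedef enum { TEXT_AS_UTF8, TEXT_AS_DECLARED } text_decision;

/* If the declared encoding is not UTF-8 but the bytes validate as UTF-8,
 * trust the bytes; otherwise convert from the declared encoding. */
text_decision choose_interpretation(const char *declared_enc,
                                    const unsigned char *s, size_t len)
{
    if (declared_enc != NULL && strcmp(declared_enc, "UTF-8") != 0 &&
        is_valid_utf8(s, len))
        return TEXT_AS_UTF8;
    return TEXT_AS_DECLARED;
}
```

Pure ASCII strings also take the TEXT_AS_UTF8 branch, which is harmless because ASCII bytes mean the same thing in both encodings.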
Comment 6 Jan Schmidt 2006-08-18 11:30:33 UTC
Sorry, I was thinking of VLC.

Xine's approach might work well for us too, although it runs the risk of wrongly converting tags that are marked ISO8859-1 but happen to contain a string that validates as UTF-8 - although I don't think I've ever seen one. 
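
As a rough illustration of that risk, the sketch below checks whether two adjacent ISO-8859-1 bytes also happen to form a valid two-byte UTF-8 sequence. The function name decode_two_byte_utf8 is hypothetical. Only a byte in the range 0xC2..0xDF followed by a byte in 0x80..0xBF qualifies, so a genuine Latin-1 "Ã©" (0xC3 0xA9) would be taken for UTF-8 "é", while an accented letter followed by an ASCII character would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: if bytes (a, b) form a valid two-byte UTF-8
 * sequence, store the decoded code point in *cp and return true. */
bool decode_two_byte_utf8(unsigned char a, unsigned char b, unsigned long *cp)
{
    assert(cp != NULL);

    /* 0xC0 and 0xC1 would be overlong encodings, so the lead byte
     * range starts at 0xC2. */
    if (a < 0xC2 || a > 0xDF)
        return false;
    /* The second byte must be a continuation byte (10xxxxxx). */
    if ((b & 0xC0) != 0x80)
        return false;

    *cp = ((unsigned long)(a & 0x1F) << 6) | (unsigned long)(b & 0x3F);
    return true;
}
```

For example, the pair 0xC3 0xA9 decodes to U+00E9 ("é"), whereas 0xE9 followed by 'p' (0x70) is rejected and would fall back to the declared encoding.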
Comment 7 Jan Schmidt 2006-08-18 13:58:35 UTC
Created attachment 71147

This patch seems to fix it by using xine's technique for strings marked 'ISO8859-1'. 

The question now is whether to apply it. Opinions?
Comment 8 Tim-Philipp Müller 2006-08-22 12:47:15 UTC
> This patch seems to fix it by using xine's technique for strings marked
> 'ISO8859-1'. 
> The question now is whether to apply it. Opinions?

Heh, I'm all for it. I suggested that months ago when a similar bug came up, and it was you who rejected the idea back then on the grounds that we would get some correctly encoded strings wrong for sure ;)  (I was just arguing what I thought we had agreed on).

The type of strings we'll get wrong are rather unlikely, it's basically a 'special character' (umlaut/accent type) plus a 'special punctuation' in a row. I don't think we'll find this combination very often in normal tags (where special quotes for quoted speech etc. aren't used).

The only half-way plausible string I can come up with is something like

  Blablé² or Blablé³

for album titles (I've very rarely seen superscripts being used in place of 'Volume 2' etc.).

I say let's apply it and see if we get any reports about wrongly extracted strings.

Comment 9 Jan Schmidt 2006-08-22 13:54:02 UTC
ooh, you never! Damn revisionists! ;)

OK, applied:

        * gst/id3demux/id3v2frames.c: (parse_text_identification_frame),
          If strings in text fields are marked ISO8859-1, but contain
          valid UTF-8 already, then handle them as UTF-8 and ignore
          the encoding. (#351794)