GNOME Bugzilla – Bug 351794
[id3demux] try harder to extract wrongly marked strings
Last modified: 2006-08-22 15:55:14 UTC
With the xine-lib backend:

$ ./metadata-test "/home/data/Documents/Movie Samples/orgasme.mp3" | grep Title
Title: Répondeur orgasme

With the GStreamer backend:

Title: RÃ©pondeur orgasme
Created attachment 71094 [details]
orgasme.mp3
The ID3v2 title text frame claims the text is encoded as ISO-8859-1, and that's how we interpret it. Whatever wrote the tag should have marked the frame as containing a UTF-8 string if it writes UTF-8 strings. AFAIK there aren't really any generally reliable mechanisms for guessing that a string wrongly labelled ISO-8859-1 is actually UTF-8. We might be able to put in some hacks to get this one right, but then we will almost certainly get other, correctly encoded tags wrong. There are limits to how much you can hack around broken tags ...
how come xine gets it right?
IIRC xine gets it right because it completely ignores the specified character encoding, which means it gets every other case wrong.
Huh, not true. If the string is already valid UTF-8 but the encoding given by the file is wrong, we pass it through as UTF-8:

  if (enc && strcmp (enc, "UTF-8")) {
    /* Don't bother converting if it's already in UTF-8, but the encoding
     * is badly reported */
    if (meta_info_validate_utf8 (value)) {
      meta_info_set_unlocked_utf8 (stream, info, value);
      return;
    }

That seems to work in most cases. If it's not correct UTF-8, we then perform a conversion using the encoding reported in the file.
Sorry, I was thinking of VLC. Xine's approach might work well for us too, although it runs the risk of wrongly converting tags that are marked ISO8859-1 but happen to contain a string that validates as UTF-8 - not that I think I've ever seen one.
Created attachment 71147 [details] [review]
patch

This patch seems to fix it by using xine's technique for strings marked 'ISO8859-1'. The question now is whether to apply it. Opinions?
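For reference, the pass-through technique can be sketched in plain C. This is an illustrative standalone version, not the actual patch: the real code lives in gst/id3demux/id3v2frames.c and would use GLib helpers such as g_utf8_validate(), and the function names below (utf8_validate, latin1_to_utf8, extract_id3_text) are made up for the sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal UTF-8 well-formedness check. A sketch: it rejects bad lead
 * and continuation bytes and the worst overlong forms (0xC0/0xC1),
 * but skips some of the stricter checks a real validator performs. */
static bool utf8_validate (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  while (*s) {
    int cont;
    if (*s < 0x80) {
      cont = 0;
    } else if ((*s & 0xE0) == 0xC0) {
      if (*s < 0xC2) return false;   /* overlong 2-byte sequence */
      cont = 1;
    } else if ((*s & 0xF0) == 0xE0) {
      cont = 2;
    } else if ((*s & 0xF8) == 0xF0) {
      if (*s > 0xF4) return false;   /* beyond U+10FFFF */
      cont = 3;
    } else {
      return false;                  /* stray continuation or invalid byte */
    }
    s++;
    while (cont-- > 0) {
      if ((*s & 0xC0) != 0x80) return false;
      s++;
    }
  }
  return true;
}

/* Convert ISO-8859-1 to UTF-8: every Latin-1 byte maps 1:1 to the
 * Unicode code point with the same value, so each byte >= 0x80
 * becomes a two-byte UTF-8 sequence. */
static char *latin1_to_utf8 (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  char *out = malloc (2 * strlen (str) + 1);  /* worst case: 2 bytes per char */
  char *p = out;
  for (; *s; s++) {
    if (*s < 0x80) {
      *p++ = (char) *s;
    } else {
      *p++ = (char) (0xC0 | (*s >> 6));
      *p++ = (char) (0x80 | (*s & 0x3F));
    }
  }
  *p = '\0';
  return out;
}

/* The heuristic: a field marked ISO-8859-1 whose bytes already
 * validate as UTF-8 is passed through untouched; otherwise it is
 * converted from ISO-8859-1. */
static char *extract_id3_text (const char *field)
{
  if (utf8_validate (field)) {
    char *copy = malloc (strlen (field) + 1);  /* mislabelled UTF-8: keep as-is */
    strcpy (copy, field);
    return copy;
  }
  return latin1_to_utf8 (field);               /* genuine Latin-1: convert */
}
```

With this, the sample's title bytes ("R\xC3\xA9pondeur ...", which validate as UTF-8) come through as "Répondeur", while a genuinely Latin-1 "R\xE9pondeur" still gets converted.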
> This patch seems to fix it by using xine's technique for strings marked
> 'ISO8859-1'.
>
> The question now is whether to apply it. Opinions?

Heh, I'm all for it. I suggested this months ago when a similar bug came up, and it was you who rejected the idea back then on the grounds that we'd be sure to get some correctly encoded strings wrong ;) (I was just arguing what I thought we had agreed on.)

The type of strings we'll get wrong is rather unlikely: basically a 'special character' (umlaut/accent type) followed immediately by 'special punctuation'. I don't think we'll find that combination very often in normal tags (where special quotes for quoted speech etc. aren't used). The only halfway plausible string I can come up with is something like Blablé² or Blablé³ for album titles (I've very rarely seen superscripts used in place of 'Volume 2' etc.).

I say let's apply it and see if we get any reports about wrongly extracted strings.
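To make the false-positive risk concrete, here is a small illustrative check (the helper name is made up). A Latin-1 lead byte in the à-ï range (0xE0-0xEF) followed by two bytes in 0x80-0xBF (Latin-1 punctuation and superscripts such as ² and ³) happens to form a well-formed 3-byte UTF-8 sequence, so Latin-1 "é²³" (bytes E9 B2 B3) would be mistaken for UTF-8 and passed through unconverted:

```c
#include <stdbool.h>

/* True if the three bytes at s form a well-formed 3-byte UTF-8
 * sequence: lead byte 1110xxxx followed by two 10xxxxxx bytes.
 * (Illustrative helper; ignores overlong-sequence checks.) */
static bool looks_like_utf8_3byte (const char *str)
{
  const unsigned char *s = (const unsigned char *) str;
  return (s[0] & 0xF0) == 0xE0 &&
         (s[1] & 0xC0) == 0x80 &&
         (s[2] & 0xC0) == 0x80;
}
```

Note that Latin-1 "é" followed by ordinary ASCII (as in "Répondeur") does not validate, which is why the heuristic works for normal accented text: the 0xE9 lead byte demands two continuation bytes that ASCII never supplies.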
ooh, you never! Damn revisionists! ;)

OK, applied:

        * gst/id3demux/id3v2frames.c: (parse_text_identification_frame),
        (parse_insert_string_field):
          If strings in text fields are marked ISO8859-1, but contain
          valid UTF-8 already, then handle them as UTF-8 and ignore the
          encoding. (#351794)