Bug 607619 – [typefind] utf-16 text file mistakenly identified as layer 1 mpeg audio

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 607619 - [typefind] utf-16 text file mistakenly identified as layer 1 mpeg audio


Summary:	[typefind] utf-16 text file mistakenly identified as layer 1 mpeg audio


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.10.36
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-01-21 01:44 UTC by cupp
Modified:	2011-11-28 16:01 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Text file mistakenly reported as media file of unknown MIME type; codec is fruitlessly sought to play this file as media. (5.29 KB, text/plain) 2010-01-21 01:44 UTC, cupp		Details
typefind: typefind UTF-16 and UTF-32 (10.07 KB, patch) 2011-09-19 17:36 UTC, Vincent Penquerc'h	reviewed	Details \| Review
typefind: typefind UTF-16 and UTF-32 (4.04 KB, patch) 2011-09-30 19:32 UTC, Vincent Penquerc'h	none	Details \| Review
typefind: typefind UTF-16 and UTF-32 (6.82 KB, patch) 2011-10-03 11:15 UTC, Vincent Penquerc'h	committed	Details \| Review

Description cupp 2010-01-21 01:44:06 UTC

Created attachment 151907 [details]
Text file mistakenly reported as media file of unknown MIME type; codec is fruitlessly sought to play this file as media.

On launch, Rhythmbox scans the music library and reports import errors.  Five data configuration files are labeled "The MIME type of this file cannot bed determined"[1], one MIDI file is labeled "additional plugins are required to play this file: audio/riff midi-decoder"[2], and one text file is labeled "The MIME type of this file cannot be determined."[3]

Notes: this music library is maintained in common on various platforms, including multiple Linx / Fedora 11 machines and at least one Windows XP machine.  The user desires to maintain a common library on all systems.
[1] The data files are from the Nero[tm] media editing suite (on Windows XP). Rhythmbox should identify them as non-media files and ignore.  (There are dozens of other non-media files, such as Windows *.URL Internet shortcuts and Linux *.desktop launchers, and various other data files, which are all ignored.)
[2] This one MIDI file is reported repeatedly.  Yet, nineteen other MIDI files in the music library are not reported, and do play.
[3] There are tens of dozens of text files, all of which are ignored except this one.  This one text file is repeatedly discovered every time Rhythmbox is launched.   This text file is attached for analysis.

Comment 1 Jonathan Matthew 2010-01-21 01:57:02 UTC

Please report one problem per bug.  You've got three separate issues here that should be tracked separately.

To fix the first problem, we'd need to add a GStreamer typefind function to identify Nero data files.  Please open a new bug against GStreamer and attach a sample file, and if possible provide a link to a format description.

I'm not sure what's happening with the MIDI file you mention, but it sounds like it's a slightly different file format.  Please open a new bug against GStreamer, providing output from 'gst-launch -tv filesrc location=file.midi ! decodebin ! fakesink silent=true'.

The utf-16 text file you attached seems to look enough like a layer 1 MPEG audio file to fool the MPEG audio typefinder, but mpegaudioparse knows better.  Since this is the problem we actually have some information on, I'm moving this to GStreamer and retitling appropriately.

Comment 2 Sebastian Dröge (slomo) 2010-01-21 13:20:40 UTC

That file is still detected as MP3 with latest GIT.

Comment 3 Vincent Penquerc'h 2011-09-19 17:36:15 UTC

Created attachment 196978 [details] [review]
typefind: typefind UTF-16 and UTF-32

This avoids the runaway MP3 typefinder from globbing the world
every time it thinks there's something it might possibly be
tempted to parse.

Comment 4 Tim-Philipp Müller 2011-09-20 09:56:00 UTC

Review of attachment 196978 [details] [review]:

::: gst/typefind/gsttypefindfunctions.c
@@ +212,3 @@
+   barely recognizing valid MP3 as MP3. If someone wants to decipher
+   the MP3 typefinder and improve it, it is over there |
+                                                       V */

What's the point of this mp3 typefinder rant?

Yes, the mp3 typefinder does go for audio/mpeg caps with a low probability (I hope) if there's as little as a single sync code at the start. That's intentional.

Do you have any evidence that it is "barely recognizing valid MP3 as MP3" in general?

@@ +233,3 @@
+    {4, "\xff\xfe\x00\x00", "utf32le", 10},
+    {4, "\x00\x00\xfe\xff", "utf32be", 20}
+  };

No need for a typedef here, since it's not used anywhere else. Also, can avoid relocations by doing:

static const struct {
  size_t bomlen;
  const char bom[4];
  const char encoding[8];
  ...
} tester[] = {
  ...
};

@@ +250,3 @@
+
+  /* find a large enough size that works */
+  while (len < (1 << 18)) {

Make a #define for that please, and write it in terms of kB or so, e.g. 8*1024 or whatever.

@@ +268,3 @@
+    converted =
+        g_convert ((const gchar *) data, len, "utf8", tester[n].encoding, NULL,
+        NULL, &error);

Hrm, I wonder if there's something better we can do here that avoids allocs.. (let's keep in mind that we're going to be running this four times for every blob of data that's not immediately identified with a 100% probability).

For UTF-32 that's probably not too hard, might need some more work for UTF-16 though given the API provided by glib (basically want something like g_utf8_get_char_validated() but for UTF-16), I think.

@@ +279,3 @@
+  }
+
+  if (prob >= 0)

greater or *equal* 0 ?

@@ +4653,3 @@
+      vgm_exts, "Vgm\x20", 4, GST_TYPE_FIND_MAXIMUM);
+  TYPE_FIND_REGISTER_START_WITH (plugin, "audio/x-sap", GST_RANK_SECONDARY,
+      sap_exts, "SAP\x0d\x0a" "AUTHOR\x20", 12, GST_TYPE_FIND_MAXIMUM);

Can we do the gst-indent updates in a separate patch?

Comment 5 Vincent Penquerc'h 2011-09-30 19:32:47 UTC

Created attachment 197902 [details] [review]
typefind: typefind UTF-16 and UTF-32

This avoids the MP3 typefinder from getting the highest score
every time it thinks there's something it might possibly be
able to parse.

Comment 6 Vincent Penquerc'h 2011-09-30 19:39:10 UTC

> What's the point of this mp3 typefinder rant?

No real point, removed.

> Do you have any evidence that it is "barely recognizing valid MP3 as MP3" in
> general?

I've had a couple bugs where it was failing to recognize MP3 as such (one of those was due to an ID3 tag with padding, can't recall the other). I've a couple others where it was recognizing other stuff as MP3. I don't know if it's really the MP3 format itself being the reason, or the typefinder though.

> Hrm, I wonder if there's something better we can do here that avoids allocs..
> (let's keep in mind that we're going to be running this four times for every
> blob of data that's not immediately identified with a 100% probability).

Fair point. g_convert internally ends up calling iconv, and the manpage suggests the output buffer cannot be NULL, so we'd have to reimplement UTF-{16,32} parsing manually. A medium way would be to call iconv ourselves, with a buffer allocated once only, instead of 4 times. Or realloc a static buffer that would "leak", but only one alloc per process.

> greater or *equal* 0 ?

It was initialized to -1, so >= 0 means it had been modified.
it doesn't make much sense though, you're right, so now > 0.

> Can we do the gst-indent updates in a separate patch?

Gah, didn't realize gst-indent had done that. Removed.

Points not replied to here are done.

Comment 7 Sebastian Dröge (slomo) 2011-10-03 09:07:18 UTC

We should probably implement our own, minimal UTF16/32 parsing here to check if all the bytes we have are in the valid ranges instead of doing actual conversion. (Also I'd vote for text/utf16 and text/utf32 with an endianness field ;) )

For the BOM table you should use fixed-size arrays instead of char* pointers to reduce relocations.

Comment 8 Vincent Penquerc'h 2011-10-03 11:15:20 UTC

Created attachment 198071 [details] [review]
typefind: typefind UTF-16 and UTF-32

This avoids the MP3 typefinder from getting the highest score
every time it thinks there's something it might possibly be
able to parse.

Comment 9 Vincent Penquerc'h 2011-11-28 16:01:22 UTC

commit e67aa28de9df1deae94697dcc32ef8c3e7c35fa4
Author: Vincent Penquerc'h <vincent.penquerch@collabora.co.uk>
Date:   Fri Sep 30 20:00:50 2011 +0100

    typefind: typefind UTF-16 and UTF-32
    
    This avoids the MP3 typefinder from getting the highest score
    every time it thinks there's something it might possibly be
    able to parse.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=607619