Bug 770355 – id3v2: Fix parsing extended header and string lists in UTF-16

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 770355 - id3v2: Fix parsing extended header and string lists in UTF-16


Summary:	id3v2: Fix parsing extended header and string lists in UTF-16


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	1.11.1
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-08-24 22:11 UTC by GstBlub
Modified:	2018-01-28 14:47 UTC

See Also:	792983
GNOME target:	---
GNOME version:	---

Attachments
id3v2: Fix handling IDv3 tags with extended headers (1.42 KB, patch) 2016-08-24 22:12 UTC, GstBlub	committed	Details \| Review
id3v2: Fix splitting strings (2.91 KB, patch) 2016-08-24 22:12 UTC, GstBlub	none	Details \| Review
id3v2: Fix splitting strings (2.90 KB, patch) 2016-08-24 22:29 UTC, GstBlub	none	Details \| Review
id3v2: Fix splitting strings (3.24 KB, patch) 2016-08-24 22:45 UTC, GstBlub	committed	Details \| Review
Failing ID3v2 tag (97.66 KB, application/octet-stream) 2016-08-26 21:10 UTC, GstBlub		Details
tests: tag: add unit test for ID3v2 string list parsing for UTF-16 strings (2.46 KB, patch) 2016-12-24 10:17 UTC, Tim-Philipp Müller	committed	Details \| Review

Description GstBlub 2016-08-24 22:11:09 UTC

I found two bugs trying to play a MP3 file with a IDv3 header:

1. First it would not create a tag list at all even though the file definitely contained IDv3 tags.  It turned out that it also had an extended header.  According to the IDv3 spec, the extended header size field excludes itself, meaning it specifies the size of the extended header after this field.  Adding 4 bytes to this value fixed the offset to the first tag as expected.

2. Once #1 was fixed, the file failed parsing a COMM tag that was encoded using UTF16.  The parse_split_strings() function normally splits the data at \0 characters, but when adding the strings to the array with parse_insert_string_field(), it would include that null character as part of the length.  In the case of UTF-16 this completely broke when encountering an empty string (with or without BOM(s)), because we called g_utf16_to_utf8() with a size of 1 (which was the splitting null character) and it returned n_read of 0, causing it to fail parsing the entire ID3v2 header.

Comment 1 GstBlub 2016-08-24 22:12:09 UTC

Created attachment 334103 [details] [review]
id3v2: Fix handling IDv3 tags with extended headers

Comment 2 GstBlub 2016-08-24 22:12:31 UTC

Created attachment 334104 [details] [review]
id3v2: Fix splitting strings

Comment 3 GstBlub 2016-08-24 22:29:50 UTC

Created attachment 334106 [details] [review]
id3v2: Fix splitting strings

Comment 4 GstBlub 2016-08-24 22:45:47 UTC

Created attachment 334107 [details] [review]
id3v2: Fix splitting strings

Comment 5 Tim-Philipp Müller 2016-08-26 19:34:54 UTC

Thanks for the patches. Any chance you could add unit tests for this to -base/tests/check/libs/tag.c or -good/test/check/elements/id3demux.c ?

Comment 6 GstBlub 2016-08-26 20:02:04 UTC

(In reply to Tim-Philipp Müller from comment #5)
> Thanks for the patches. Any chance you could add unit tests for this to
> -base/tests/check/libs/tag.c or -good/test/check/elements/id3demux.c ?

Unfortunately, I don't really have the time to work on that anytime soon :-(

Comment 7 Tim-Philipp Müller 2016-08-26 20:12:40 UTC

Ok, could you make some of the tags available then that caused these problems?

Comment 8 GstBlub 2016-08-26 21:10:22 UTC

Created attachment 334245 [details]
Failing ID3v2 tag

This is the ID3v2 tag only, you should be able to just append any MP3/AAC to it.

It decodes fine with id3v2 --list id3.bin or exiftool -v3 -l id3.bin

  | - Tag 'COMM' (54 bytes):
  |     9fa6: 01 65 6e 67 ff fe 00 00 ff fe 4e 00 61 00 69 00 [.eng......N.a.i.]
  |     9fb6: 6d 00 20 00 4d 00 50 00 33 00 20 00 4d 00 75 00 [m. .M.P.3. .M.u.]
  |     9fc6: 73 00 69 00 63 00 20 00 4c 00 69 00 62 00 72 00 [s.i.c. .L.i.b.r.]
  |     9fd6: 61 00 72 00 79 00                               [a.r.y.]

Note how the COMM tag string list starts with a UTF16 BOM (ff fe), directly followed by a null character (00 00), which indicates the string list separator.  UTF8 strings happened to work because the data gets g_strndup() first and then the g_utf8_validate() call uses -1 as second argument.  However, in the case of ISO8859 the g_utf8_validate() call would always fail (due to the null character being included in the data size) and kick it into string_utf8_dup().

It also has the extended header that wasn't handled correctly.

Comment 9 Tim-Philipp Müller 2016-12-24 10:06:17 UTC

Comment on attachment 334107 [details] [review]
id3v2: Fix splitting strings

Thanks for the patch!

It mostly looks right, but there are some things I don't understand yet, and the outcome is also not entirely correct yet.

>When parsing null-terminated strings, do not include the
>terminating null byte.  Depending on the encoding used, either
>g_utf8_validate() failed due to this, or worse the call to
>g_utf16_to_utf8() would return 0 items read on an empty (0 length)
>string

This makes sense to me.

> causing it to fail parsing certain tags

Frames you mean here?

> rendering the entire file unplayable.

OOC, why would the file be unplayable if the tag can't be parsed? I tried, and GStreamer just skips the tag and continues playing the file with your sample tag.


>--- a/gst-libs/gst/tag/id3v2frames.c
>+++ b/gst-libs/gst/tag/id3v2frames.c
>@@ -1103,9 +1104,20 @@ parse_insert_string_field (guint8 encoding, gchar * data, gint data_size,
>       /* Sometimes we see strings with multiple BOM markers at the start.
>        * In that case, we assume the innermost one is correct. If that fails
>        * to produce valid UTF-8, we try the other endianness anyway */
>-      while (data_size > 2 && find_utf16_bom (data, &data_endianness)) {
>+      while (data_size >= 2 && find_utf16_bom (data, &data_endianness)) {
>         data += 2;              /* skip BOM */
>         data_size -= 2;
>+        have_bom = TRUE;
>+      }
>+
>+      if (data_size < 2) {
>+        field = g_strdup ("");
>+        break;
>+      }
>+
>+      if (have_bom) {
>+        data -= 2;
>+        data_size += 2;
>       }

This last if (have_bom) +/- 2 bytes bit I don't understand. The BOM is already skipped in the while() loop above, why are we skipping it here again?

Also, with this what happens with your sample tag is that we extract the string, but we get a BOM converted to UTF-8 at the beginning of the extracted/converted UTF-8 string, which isn't right.

I think this just skips the terminator after the first BOM and then includes the BOM for the second string in the conversion so it mostly works by accident.

Without the if (have_bom) block everything works correctly.

Comment 10 Tim-Philipp Müller 2016-12-24 10:17:39 UTC

Created attachment 342449 [details] [review]
tests: tag: add unit test for ID3v2 string list parsing for UTF-16 strings

Comment 11 Tim-Philipp Müller 2016-12-25 10:55:26 UTC

Thanks, pushed this now (without the have_bom code I mentioned above):

commit 4a8640b4c93f3c23163de039bf847a5bd23fe878
Author: Tim-Philipp Müller <tim@centricular.com>
Date:   Sat Dec 24 10:15:24 2016 +0000

    tests: tag: add unit test for ID3v2 UTF-16 string list parsing
    
    https://bugzilla.gnome.org/show_bug.cgi?id=770355

commit fae81e57c77e295235e739dcb944fd2dd9a27eab
Author: Tim-Philipp Müller <tim@centricular.com>
Date:   Sat Dec 24 14:32:34 2016 +0000

    tests: tag: add test for ID3v2 extended header parsing
    
    https://bugzilla.gnome.org/show_bug.cgi?id=770355

commit cd101ce8f02b2aaee40dcdfb63a9b8794b897006
Author: Thomas Bluemel <tbluemel@control4.com>
Date:   Wed Aug 24 11:39:39 2016 -0600

    id3v2: fix splitting strings in ISO-8859-1 and UTF-16 formats
    
    When parsing NUL-terminated strings, do not include the terminating
    NUL byte(s).  Depending on the encoding used, either g_utf8_validate()
    failed due to this, or worse the call to g_utf16_to_utf8() would
    return 0 items read on an empty string, causing it to fail parsing
    certain frames.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=770355

commit da6070054f471216adac2c5879031fcebfa592d2
Author: Thomas Bluemel <tbluemel@control4.com>
Date:   Wed Aug 24 10:33:14 2016 -0600

    id3v2: fix handling of tags with extended headers
    
    The extended header size value does not include itself.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=770355


Please re-open if you believe that was needed, thanks!

Comment 12 Tim-Philipp Müller 2016-12-25 10:56:57 UTC

Comment on attachment 334107 [details] [review]
id3v2: Fix splitting strings

Pushed without the have_bom stuff with the extra 2-byte skip.