GNOME Bugzilla – Bug 770355
id3v2: Fix parsing extended header and string lists in UTF-16
Last modified: 2018-01-28 14:47:21 UTC
I found two bugs trying to play a MP3 file with a IDv3 header: 1. First it would not create a tag list at all even though the file definitely contained IDv3 tags. It turned out that it also had an extended header. According to the IDv3 spec, the extended header size field excludes itself, meaning it specifies the size of the extended header after this field. Adding 4 bytes to this value fixed the offset to the first tag as expected. 2. Once #1 was fixed, the file failed parsing a COMM tag that was encoded using UTF16. The parse_split_strings() function normally splits the data at \0 characters, but when adding the strings to the array with parse_insert_string_field(), it would include that null character as part of the length. In the case of UTF-16 this completely broke when encountering an empty string (with or without BOM(s)), because we called g_utf16_to_utf8() with a size of 1 (which was the splitting null character) and it returned n_read of 0, causing it to fail parsing the entire ID3v2 header.
Created attachment 334103 [details] [review] id3v2: Fix handling IDv3 tags with extended headers
Created attachment 334104 [details] [review] id3v2: Fix splitting strings
Created attachment 334106 [details] [review] id3v2: Fix splitting strings
Created attachment 334107 [details] [review] id3v2: Fix splitting strings
Thanks for the patches. Any chance you could add unit tests for this to -base/tests/check/libs/tag.c or -good/test/check/elements/id3demux.c ?
(In reply to Tim-Philipp Müller from comment #5) > Thanks for the patches. Any chance you could add unit tests for this to > -base/tests/check/libs/tag.c or -good/test/check/elements/id3demux.c ? Unfortunately, I don't really have the time to work on that anytime soon :-(
Ok, could you make some of the tags available then that caused these problems?
Created attachment 334245 [details] Failing ID3v2 tag This is the ID3v2 tag only, you should be able to just append any MP3/AAC to it. It decodes fine with id3v2 --list id3.bin or exiftool -v3 -l id3.bin | - Tag 'COMM' (54 bytes): | 9fa6: 01 65 6e 67 ff fe 00 00 ff fe 4e 00 61 00 69 00 [.eng......N.a.i.] | 9fb6: 6d 00 20 00 4d 00 50 00 33 00 20 00 4d 00 75 00 [m. .M.P.3. .M.u.] | 9fc6: 73 00 69 00 63 00 20 00 4c 00 69 00 62 00 72 00 [s.i.c. .L.i.b.r.] | 9fd6: 61 00 72 00 79 00 [a.r.y.] Note how the COMM tag string list starts with a UTF16 BOM (ff fe), directly followed by a null character (00 00), which indicates the string list separator. UTF8 strings happened to work because the data gets g_strndup() first and then the g_utf8_validate() call uses -1 as second argument. However, in the case of ISO8859 the g_utf8_validate() call would always fail (due to the null character being included in the data size) and kick it into string_utf8_dup(). It also has the extended header that wasn't handled correctly.
Comment on attachment 334107 [details] [review] id3v2: Fix splitting strings Thanks for the patch! It mostly looks right, but there are some things I don't understand yet, and the outcome is also not entirely correct yet. >When parsing null-terminated strings, do not include the >terminating null byte. Depending on the encoding used, either >g_utf8_validate() failed due to this, or worse the call to >g_utf16_to_utf8() would return 0 items read on an empty (0 length) >string This makes sense to me. > causing it to fail parsing certain tags Frames you mean here? > rendering the entire file unplayable. OOC, why would the file be unplayable if the tag can't be parsed? I tried, and GStreamer just skips the tag and continues playing the file with your sample tag. >--- a/gst-libs/gst/tag/id3v2frames.c >+++ b/gst-libs/gst/tag/id3v2frames.c >@@ -1103,9 +1104,20 @@ parse_insert_string_field (guint8 encoding, gchar * data, gint data_size, > /* Sometimes we see strings with multiple BOM markers at the start. > * In that case, we assume the innermost one is correct. If that fails > * to produce valid UTF-8, we try the other endianness anyway */ >- while (data_size > 2 && find_utf16_bom (data, &data_endianness)) { >+ while (data_size >= 2 && find_utf16_bom (data, &data_endianness)) { > data += 2; /* skip BOM */ > data_size -= 2; >+ have_bom = TRUE; >+ } >+ >+ if (data_size < 2) { >+ field = g_strdup (""); >+ break; >+ } >+ >+ if (have_bom) { >+ data -= 2; >+ data_size += 2; > } This last if (have_bom) +/- 2 bytes bit I don't understand. The BOM is already skipped in the while() loop above, why are we skipping it here again? Also, with this what happens with your sample tag is that we extract the string, but we get a BOM converted to UTF-8 at the beginning of the extracted/converted UTF-8 string, which isn't right. I think this just skips the terminator after the first BOM and then includes the BOM for the second string in the conversion so it mostly works by accident. Without the if (have_bom) block everything works correctly.
Created attachment 342449 [details] [review] tests: tag: add unit test for ID3v2 string list parsing for UTF-16 strings
Thanks, pushed this now (without the have_bom code I mentioned above): commit 4a8640b4c93f3c23163de039bf847a5bd23fe878 Author: Tim-Philipp Müller <tim@centricular.com> Date: Sat Dec 24 10:15:24 2016 +0000 tests: tag: add unit test for ID3v2 UTF-16 string list parsing https://bugzilla.gnome.org/show_bug.cgi?id=770355 commit fae81e57c77e295235e739dcb944fd2dd9a27eab Author: Tim-Philipp Müller <tim@centricular.com> Date: Sat Dec 24 14:32:34 2016 +0000 tests: tag: add test for ID3v2 extended header parsing https://bugzilla.gnome.org/show_bug.cgi?id=770355 commit cd101ce8f02b2aaee40dcdfb63a9b8794b897006 Author: Thomas Bluemel <tbluemel@control4.com> Date: Wed Aug 24 11:39:39 2016 -0600 id3v2: fix splitting strings in ISO-8859-1 and UTF-16 formats When parsing NUL-terminated strings, do not include the terminating NUL byte(s). Depending on the encoding used, either g_utf8_validate() failed due to this, or worse the call to g_utf16_to_utf8() would return 0 items read on an empty string, causing it to fail parsing certain frames. https://bugzilla.gnome.org/show_bug.cgi?id=770355 commit da6070054f471216adac2c5879031fcebfa592d2 Author: Thomas Bluemel <tbluemel@control4.com> Date: Wed Aug 24 10:33:14 2016 -0600 id3v2: fix handling of tags with extended headers The extended header size value does not include itself. https://bugzilla.gnome.org/show_bug.cgi?id=770355 Please re-open if you believe that was needed, thanks!
Comment on attachment 334107 [details] [review] id3v2: Fix splitting strings Pushed without the have_bom stuff with the extra 2-byte skip.