After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 707933 - matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding
matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
1.0.10
Other Linux
: Normal blocker
: 1.2.1
Assigned To: GStreamer Maintainers
GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2013-09-11 18:30 UTC by Seán de Búrca
Modified: 2013-10-03 15:51 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Screenshot of subtitles in wrong encoding; correct text is "Mairéad" (339.00 KB, image/png)
2013-09-11 18:30 UTC, Seán de Búrca
  Details
Fixes an off-by-one error in UTF-8 validation (894 bytes, patch)
2013-09-11 19:14 UTC, Seán de Búrca
committed Details | Review
matroska-demux: Make sure that subtitle buffers are \0-terminated (1.40 KB, patch)
2013-09-20 08:19 UTC, Sebastian Dröge (slomo)
none Details | Review
matroska-demux: Make sure that subtitle buffers are \0-terminated (1.45 KB, patch)
2013-09-20 08:23 UTC, Sebastian Dröge (slomo)
committed Details | Review
Patch to move the null termination check before UTF-8 validation (2.25 KB, patch)
2013-09-27 10:55 UTC, Matej Knopp
committed Details | Review

Description Seán de Búrca 2013-09-11 18:30:37 UTC
Created attachment 254728 [details]
Screenshot of subtitles in wrong encoding; correct text is "Mairéad"

I recently encoded a matroska video with hand-edited subtitles muxed in. The subtitles appear as S_TEXT/UTF8 in the list of tracks in the mkv, and both the SRT used in muxing and an extracted SRT stream are valid UTF-8. However, any non-ASCII character displayed is garbled. The attached screenshot is from totem, but the bug is reproducible with gst-launch-1.0 playbin.
Comment 1 Seán de Búrca 2013-09-11 18:56:46 UTC
The problem appears to be in gst-plugins-good, related to gst/matroska/matroska-demux.c:gst_matroska_demux_check_subtitle_buffer(). Setting GST_SUBTITLE_ENCODING="UTF-8" causes the subtitles to be correctly displayed.
Comment 2 Seán de Búrca 2013-09-11 19:14:14 UTC
Created attachment 254734 [details] [review]
Fixes an off-by-one error in UTF-8 validation

The issue appears to be that map.size includes the terminating NUL for the string, but g_utf8_validate() will return FALSE if it hits NUL before reading max_len bytes. The attached patch fixes the issue for me.
Comment 3 Sebastian Dröge (slomo) 2013-09-12 07:21:05 UTC
commit 9d3dbd6581f86e824627d5fe51391a099f853b7a
Author: Seán de Búrca <leftmostcat@gmail.com>
Date:   Wed Sep 11 13:11:58 2013 -0600

    matroskademux: Fix off-by-one in validation of UTF-8
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933
Comment 4 Tim-Philipp Müller 2013-09-19 20:11:50 UTC
I'm not 100% sure if this is correct in all cases.

If I'm not mistaken, this could fail in cases where there is no 0-terminator in the data and the last character is a multi-byte character.

I don't see any code in matroskademux that ensures that these buffers contain a 0-terminator, nor do I see anything in the matroska specs saying whether there should be a terminator or not. I would have expected no terminator to be the normal case.
Comment 5 Seán de Búrca 2013-09-19 21:26:21 UTC
If we can't guarantee that it's null-terminated, should it then be something like the following?

sub_size = map.data[map.size - 1] != '\0' ? map.size : map.size - 1;

Then just use sub_size instead of map.size for validation.
Comment 6 Sebastian Dröge (slomo) 2013-09-20 08:09:29 UTC
g_utf8_validate() will only read max_len bytes and does not care about the \0-terminator if max_len!=-1. If max_len==-1 it will read until the \0-terminator.

Which problem do you see here exactly? That after validation we might miss the \0-terminator in the end, and other string functions could fail because of that?
Comment 7 Sebastian Dröge (slomo) 2013-09-20 08:11:11 UTC
We are using g_convert_with_fallback() on that string after validation and call it with the size in bytes too. So the return value of that (which is then used only) should be \0-terminated and everything should be fine :)
Comment 8 Sebastian Dröge (slomo) 2013-09-20 08:11:59 UTC
Actually ignore that, if it is valid UTF8 we directly use it... so we need to make sure that there's a \0-terminator at the end.
Comment 9 Sebastian Dröge (slomo) 2013-09-20 08:19:55 UTC
Created attachment 255368 [details] [review]
matroska-demux: Make sure that subtitle buffers are \0-terminated
Comment 10 Sebastian Dröge (slomo) 2013-09-20 08:23:03 UTC
Created attachment 255369 [details] [review]
matroska-demux: Make sure that subtitle buffers are \0-terminated
Comment 11 Sebastian Dröge (slomo) 2013-09-20 09:20:13 UTC
commit d8841b4832b4f6481d12c270ce44581e403d31b0
Author: Sebastian Dröge <slomo@circular-chaos.org>
Date:   Fri Sep 20 10:19:22 2013 +0200

    matroska-demux: Make sure that subtitle buffers are \0-terminated
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933
Comment 12 Matej Knopp 2013-09-27 10:55:48 UTC
Created attachment 255916 [details] [review]
Patch to move the null termination check before UTF-8 validation

The subtitle buffer is currently validated for UTF-8 before the NULL terminator is appended to it.
Comment 13 Sebastian Dröge (slomo) 2013-09-27 12:38:39 UTC
commit 40c0586c174c6b53f4170ad697832ca2e189d3b7
Author: Matej Knopp <matej.knopp@gmail.com>
Date:   Fri Sep 27 12:53:06 2013 +0200

    matroskademux: move the check for subtitle buffer being null terminated before validating UTF-8
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933