Bug 707933 – matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 707933 - matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding


Summary:	matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-good
Version:	1.0.10
Hardware:	Other Linux

Importance:	Normal blocker
Target Milestone:	1.2.1
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-09-11 18:30 UTC by Seán de Búrca
Modified:	2013-10-03 15:51 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Screenshot of subtitles in wrong encoding; correct text is "Mairéad" (339.00 KB, image/png) 2013-09-11 18:30 UTC, Seán de Búrca		Details
Fixes an off-by-one error in UTF-8 validation (894 bytes, patch) 2013-09-11 19:14 UTC, Seán de Búrca	committed	Details \| Review
matroska-demux: Make sure that subtitle buffers are \0-terminated (1.40 KB, patch) 2013-09-20 08:19 UTC, Sebastian Dröge (slomo)	none	Details \| Review
matroska-demux: Make sure that subtitle buffers are \0-terminated (1.45 KB, patch) 2013-09-20 08:23 UTC, Sebastian Dröge (slomo)	committed	Details \| Review
Patch to move the null termination check before UTF-8 validation (2.25 KB, patch) 2013-09-27 10:55 UTC, Matej Knopp	committed	Details \| Review

Description Seán de Búrca 2013-09-11 18:30:37 UTC

Created attachment 254728 [details]
Screenshot of subtitles in wrong encoding; correct text is "Mairéad"

I recently encoded a matroska video with hand-edited subtitles muxed in. The subtitles appear as S_TEXT/UTF8 in the list of tracks in the mkv, and both the SRT used in muxing and an extracted SRT stream are valid UTF-8. However, any non-ASCII character displayed is garbled. The attached screenshot is from totem, but the bug is reproducible with gst-launch-1.0 playbin.

Comment 1 Seán de Búrca 2013-09-11 18:56:46 UTC

The problem appears to be in gst-plugins-good, related to gst/matroska/matroska-demux.c:gst_matroska_demux_check_subtitle_buffer(). Setting GST_SUBTITLE_ENCODING="UTF-8" causes the subtitles to be correctly displayed.

Comment 2 Seán de Búrca 2013-09-11 19:14:14 UTC

Created attachment 254734 [details] [review]
Fixes an off-by-one error in UTF-8 validation

The issue appears to be that map.size includes the terminating NUL for the string, but g_utf8_validate() will return FALSE if it hits NUL before reading max_len bytes. The attached patch fixes the issue for me.

Comment 3 Sebastian Dröge (slomo) 2013-09-12 07:21:05 UTC

commit 9d3dbd6581f86e824627d5fe51391a099f853b7a
Author: Seán de Búrca <leftmostcat@gmail.com>
Date:   Wed Sep 11 13:11:58 2013 -0600

    matroskademux: Fix off-by-one in validation of UTF-8
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933

Comment 4 Tim-Philipp Müller 2013-09-19 20:11:50 UTC

I'm not 100% sure if this is correct in all cases.

If I'm not mistaken, this could fail in cases where there is no 0-terminator in the data and the last character is a multi-byte character.

I don't see any code in matroskademux that ensures that these buffers contain a 0-terminator, nor do I see anything in the matroska specs saying whether there should be a terminator or not. I would have expected no terminator to be the normal case.

Comment 5 Seán de Búrca 2013-09-19 21:26:21 UTC

If we can't guarantee that it's null-terminated, should it then be something like the following?

sub_size = map.data[map.size - 1] != '\0' ? map.size : map.size - 1;

Then just use sub_size instead of map.size for validation.

Comment 6 Sebastian Dröge (slomo) 2013-09-20 08:09:29 UTC

g_utf8_validate() will only read max_len bytes and does not care about the \0-terminator if max_len!=-1. If max_len==-1 it will read until the \0-terminator.

Which problem do you see here exactly? That after validation we might miss the \0-terminator in the end, and other string functions could fail because of that?

Comment 7 Sebastian Dröge (slomo) 2013-09-20 08:11:11 UTC

We are using g_convert_with_fallback() on that string after validation and call it with the size in bytes too. So the return value of that (which is then used only) should be \0-terminated and everything should be fine :)

Comment 8 Sebastian Dröge (slomo) 2013-09-20 08:11:59 UTC

Actually ignore that, if it is valid UTF8 we directly use it... so we need to make sure that there's a \0-terminator at the end.

Comment 9 Sebastian Dröge (slomo) 2013-09-20 08:19:55 UTC

Created attachment 255368 [details] [review]
matroska-demux: Make sure that subtitle buffers are \0-terminated

Comment 10 Sebastian Dröge (slomo) 2013-09-20 08:23:03 UTC

Created attachment 255369 [details] [review]
matroska-demux: Make sure that subtitle buffers are \0-terminated

Comment 11 Sebastian Dröge (slomo) 2013-09-20 09:20:13 UTC

commit d8841b4832b4f6481d12c270ce44581e403d31b0
Author: Sebastian Dröge <slomo@circular-chaos.org>
Date:   Fri Sep 20 10:19:22 2013 +0200

    matroska-demux: Make sure that subtitle buffers are \0-terminated
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933

Comment 12 Matej Knopp 2013-09-27 10:55:48 UTC

Created attachment 255916 [details] [review]
Patch to move the null termination check before UTF-8 validation

The subtitle buffer is currently validated for UTF-8 before the NULL terminator is appended to it.

Comment 13 Sebastian Dröge (slomo) 2013-09-27 12:38:39 UTC

commit 40c0586c174c6b53f4170ad697832ca2e189d3b7
Author: Matej Knopp <matej.knopp@gmail.com>
Date:   Fri Sep 27 12:53:06 2013 +0200

    matroskademux: move the check for subtitle buffer being null terminated before validating UTF-8
    
    https://bugzilla.gnome.org/show_bug.cgi?id=707933