GNOME Bugzilla – Bug 707933
matroskademux: Wrong UTF8 detection causes wrong detection of subtitle encoding
Last modified: 2013-10-03 15:51:03 UTC
Created attachment 254728 [details] Screenshot of subtitles in wrong encoding; correct text is "Mairéad" I recently encoded a matroska video with hand-edited subtitles muxed in. The subtitles appear as S_TEXT/UTF8 in the list of tracks in the mkv, and both the SRT used in muxing and an extracted SRT stream are valid UTF-8. However, any non-ASCII character displayed is garbled. The attached screenshot is from totem, but the bug is reproducible with gst-launch-1.0 playbin.
The problem appears to be in gst-plugins-good, related to gst/matroska/matroska-demux.c:gst_matroska_demux_check_subtitle_buffer(). Setting GST_SUBTITLE_ENCODING="UTF-8" causes the subtitles to be correctly displayed.
Created attachment 254734 [details] [review] Fixes an off-by-one error in UTF-8 validation The issue appears to be that map.size includes the terminating NUL for the string, but g_utf8_validate() will return FALSE if it hits NUL before reading max_len bytes. The attached patch fixes the issue for me.
commit 9d3dbd6581f86e824627d5fe51391a099f853b7a Author: Seán de Búrca <leftmostcat@gmail.com> Date: Wed Sep 11 13:11:58 2013 -0600 matroskademux: Fix off-by-one in validation of UTF-8 https://bugzilla.gnome.org/show_bug.cgi?id=707933
I'm not 100% sure if this is correct in all cases. If I'm not mistaken, this could fail in cases where there is no 0-terminator in the data and the last character is a multi-byte character. I don't see any code in matroskademux that ensures that these buffers contain a 0-terminator, nor do I see anything in the matroska specs saying whether there should be a terminator or not. I would have expected no terminator to be the normal case.
If we can't guarantee that it's null-terminated, should it then be something like the following? sub_size = map.data[map.size - 1] != '\0' ? map.size : map.size - 1; Then just use sub_size instead of map.size for validation.
g_utf8_validate() will only read max_len bytes and does not care about the \0-terminator if max_len!=-1. If max_len==-1 it will read until the \0-terminator. Which problem do you see here exactly? That after validation we might miss the \0-terminator in the end, and other string functions could fail because of that?
We are using g_convert_with_fallback() on that string after validation and call it with the size in bytes too. So the return value of that (which is then used only) should be \0-terminated and everything should be fine :)
Actually ignore that, if it is valid UTF8 we directly use it... so we need to make sure that there's a \0-terminator at the end.
Created attachment 255368 [details] [review] matroska-demux: Make sure that subtitle buffers are \0-terminated
Created attachment 255369 [details] [review] matroska-demux: Make sure that subtitle buffers are \0-terminated
commit d8841b4832b4f6481d12c270ce44581e403d31b0 Author: Sebastian Dröge <slomo@circular-chaos.org> Date: Fri Sep 20 10:19:22 2013 +0200 matroska-demux: Make sure that subtitle buffers are \0-terminated https://bugzilla.gnome.org/show_bug.cgi?id=707933
Created attachment 255916 [details] [review] Patch to move the null termination check before UTF-8 validation The subtitle buffer is currently validated for UTF-8 before the NULL terminator is appended to it.
commit 40c0586c174c6b53f4170ad697832ca2e189d3b7 Author: Matej Knopp <matej.knopp@gmail.com> Date: Fri Sep 27 12:53:06 2013 +0200 matroskademux: move the check for subtitle buffer being null terminated before validating UTF-8 https://bugzilla.gnome.org/show_bug.cgi?id=707933