GNOME Bugzilla – Bug 753480
matroskademux: text/x-raw subtitle track output is escaped
Last modified: 2018-11-03 15:03:01 UTC
Hi, when demuxing a track encoded as S_TEXT/UTF8 from a Matroska file, the stream comes out XML-escaped (< becomes &lt; etc.). Below are sample pipelines that reproduce the problem.

% gst-launch-1.0 videotestsrc is-live=true do-timestamp=true ! buffertotstxt ! text/x-raw ! identity dump=true ! matroskamux streamable=true ! filesink location=lol.mkv
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
00000000 (0x1cb66f0): 3c 30 3a 30 30 3a 30 30 2e 30 30 35 37 38 37 35  <0:00:00.0057875
00000010 (0x1cb6700): 33 39 3e                                         39>
00000000 (0x7f4aa00322c0): 3c 30 3a 30 30 3a 30 30 2e 30 33 39 31 32 30 38  <0:00:00.0391208
00000010 (0x7f4aa00322d0): 37 32 3e                                         72>
00000000 (0x1cb66f0): 3c 30 3a 30 30 3a 30 30 2e 30 37 32 34 35 34 32  <0:00:00.0724542
00000010 (0x1cb6700): 30 35 3e                                         05>

% gst-launch-1.0 filesrc location=lol.mkv ! matroskaparse ! matroskademux ! text/x-raw ! fakesink dump=true
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
00000000 (0x7f6fac008510): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 30 35 37  &lt;0:00:00.0057
New clock: GstSystemClock
00000010 (0x7f6fac008520): 38 37 35 33 39 26 67 74 3b                       87539&gt;
00000000 (0x7f6fac005fd0): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 33 39 31  &lt;0:00:00.0391
00000010 (0x7f6fac005fe0): 32 30 38 37 32 26 67 74 3b                       20872&gt;
00000000 (0x7f6fac008510): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 37 32 34  &lt;0:00:00.0724
00000010 (0x7f6fac008520): 35 34 32 30 35 26 67 74 3b                       54205&gt;

Extracting the track with mkvextract gives the correct encoding, so I think the problem is on the demuxer side.

% tracks lol.mkv 0:lol.srt
Extracting track 0 with the CodecID 'S_TEXT/UTF8' to the file 'lol.srt'. Container format: SRT text subtitles
Progress: 100%
% cat lol.srt
1
00:00:00,005 --> 00:00:00,038
<0:00:00.005787539>
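To make the difference between the two dumps explicit, here is a minimal sketch that decodes the first buffer of each pipeline. The byte values are copied from the hex dumps above; `html.unescape` from the Python standard library is used only as a convenient way to undo the entity escaping and is not what GStreamer uses internally.

```python
import html

# First buffer seen by "identity dump=true" before matroskamux.
muxer_input = bytes.fromhex(
    "3c 30 3a 30 30 3a 30 30 2e 30 30 35 37 38 37 35 "
    "33 39 3e"
)

# First buffer seen by "fakesink dump=true" after matroskademux.
demuxer_output = bytes.fromhex(
    "26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 30 35 37 "
    "38 37 35 33 39 26 67 74 3b"
)

sent = muxer_input.decode("utf-8")
received = demuxer_output.decode("utf-8")

print(sent)      # text handed to the muxer
print(received)  # text produced by the demuxer

# The demuxed text differs from the muxed text only by entity escaping.
print(received != sent)
print(html.unescape(received) == sent)
```

The payload itself is intact; only "<" and ">" have been replaced by their entities, which matches the observation that mkvextract returns the unescaped text.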
So when the subtitle codec is subtitle-utf8, the Matroska demuxer assumes the text is encoded as pango markup [1], and therefore decides to escape it [2]. I think the demuxer should not treat the stream as pango markup by default; moreover, the matroskamux caps are text/x-raw,format=utf8. [1] http://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/matroska/matroska-demux.c?id=1.5.2#n5694 [2] http://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/matroska/matroska-demux.c?id=1.5.2#n3061
Created attachment 317344: patch for matroskademux: use utf8 format instead of pango-markup

matroskademux forces the pango-markup format on its text/x-raw subtitle output and modifies the content of the raw text (originally in utf8 format). This may cause problems when trying to use the output as non-pango text. It seems more logical for the demuxer to output the same format as the one stored in the file, and so to provide the same format as matroskamux.

Modifying matroskademux to output the utf8 format instead of pango-markup does not require many changes and will not break compatibility with other plugins, since all plugins that take pango-markup as input (good/bad/ugly) also accept utf8 and convert it internally when necessary (textrender, textoverlay, subparse, srtenc, kateenc, webvttenc).

We studied the possibility of supporting both formats in the demuxer, but matroskademux uses static caps on its output, determined by the format of the source streams. Making them dynamic and using negotiation to determine the output format only for text/x-raw may require deep changes to the way the demuxer works and seems like overkill. This is why using "text/x-raw, format=utf8" instead of "text/x-raw, format=pango-markup" seems to be the more advantageous option.
From memory, the problem is that we don't know from the beginning whether there will be markup in the text or not, so that's why we always output pango-markup and escape, unless we detect that it's already escaped. What exactly is the problem with outputting pango-markup? Is the fact that it outputs pango-markup caps an issue for you, or are you saying there is something wrong with how the demuxer does escaping internally?
The problem is that we would like to get exactly the same text stream on the demuxer output as the one sent to the muxer input. Escaping the text modifies it. Moreover, if the text is already escaped, it will be escaped a second time:

In [4]: GLib.markup_escape_text("<foo>")
Out[4]: '&lt;foo&gt;'
In [5]: GLib.markup_escape_text("&lt;foo&gt;")
Out[5]: '&amp;lt;foo&amp;gt;'
When the subtitle codec is S_TEXT/UTF8, the demuxer has no way to guess the format, so matroskademux assumes it is pango. When a line contains a markup tag (<b>,<i> .. <span), I can see different cases:

- The text is actual pango markup. Once the first markup tag has been encountered, the text is returned as is (which is the expected behavior). Before that, the text is escaped, which might cause issues with already-escaped characters: for instance "&amp;" will be re-encoded as "&amp;amp;".
- The text is srt (which is stored as S_TEXT/UTF8 [1]). srt markup is close to pango; the main differences are that <font> is used instead of <span> and that characters do not need to be escaped (but player support may vary), so you can have things like an actual "&", an escaped character "&amp;", or even a single "<" or ">". The srt format does not seem to be properly defined, though, so I guess that decoding is kind of "best effort".
- The text is actual plain text. As our caps are set to pango, we have to escape it, since it will be interpreted as pango.

So I guess the current solution is the best solution to provide *pango* markup. However, I think that ideally we could provide either raw (utf-8) or pango format using caps negotiation. But the current caps "selection" uses fixed-format caps, so this case might require some workaround.

[1] http://www.matroska.org/technical/specs/subtitles/srt.html
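The behavior described in this comment can be sketched as a small toy model. This is not the matroskademux implementation: the `SubtitleEscaper` class, the `MARKUP_TAG` pattern and the choice of `xml.sax.saxutils.escape` as a stand-in for `g_markup_escape_text` are all assumptions made for illustration. The model only captures the idea of "escape every buffer until a markup tag has been seen, then pass text through unchanged".

```python
import re
from xml.sax.saxutils import escape

# Hypothetical tag detector; the real demuxer's detection logic may differ.
MARKUP_TAG = re.compile(r"</?(b|i|u|span|font)\b[^>]*>", re.IGNORECASE)


class SubtitleEscaper:
    """Toy model: escape buffers until one containing a markup tag is seen."""

    def __init__(self):
        self.seen_markup = False

    def process(self, text):
        # Once a tag has been seen, the stream is assumed to be markup.
        if not self.seen_markup and MARKUP_TAG.search(text):
            self.seen_markup = True
        if self.seen_markup:
            return text
        # Stand-in for g_markup_escape_text: escapes "&", "<" and ">".
        return escape(text)


escaper = SubtitleEscaper()
for line in ["<0:00:00.005787539>", "R&amp;D", "<i>hi</i>", "a < b"]:
    print(repr(line), "->", repr(escaper.process(line)))
```

In this model the plain-text timestamp is escaped, the already-escaped "R&amp;D" is escaped a second time, and after the first tag a bare "<" is passed through unescaped. The same input text therefore gets different treatment depending on where it appears in the stream, which is why a fixed utf8 output or proper caps negotiation was suggested above.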
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/210.