Bug 753480 - matroskademux: text/x-raw subtitle tracks ouputs are escaped
Status: RESOLVED OBSOLETE
Product: GStreamer
Classification: Platform
Component: gst-plugins-good
Version: 1.4.5
Hardware: Other Linux
Importance: Normal normal
Target Milestone: git master
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2015-08-10 16:45 UTC by Pierre Lamot
Modified: 2018-11-03 15:03 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
patch for matroskademux : use utf8 format instead of pango-markup (2.53 KB, patch)
2015-12-14 11:09 UTC, Emmauel Bouillot

Description Pierre Lamot 2015-08-10 16:45:34 UTC
Hi

When demuxing a track encoded as S_TEXT/UTF8 from a Matroska file, the stream comes out XML-escaped (< becomes &lt;, etc.).
The sample pipelines below reproduce the problem.


% gst-launch-1.0 videotestsrc is-live=true do-timestamp=true ! buffertotstxt ! text/x-raw ! identity dump=true ! matroskamux streamable=true ! filesink location=lol.mkv
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
00000000 (0x1cb66f0): 3c 30 3a 30 30 3a 30 30 2e 30 30 35 37 38 37 35  <0:00:00.0057875
00000010 (0x1cb6700): 33 39 3e                                         39>             
00000000 (0x7f4aa00322c0): 3c 30 3a 30 30 3a 30 30 2e 30 33 39 31 32 30 38  <0:00:00.0391208
00000010 (0x7f4aa00322d0): 37 32 3e                                         72>             
00000000 (0x1cb66f0): 3c 30 3a 30 30 3a 30 30 2e 30 37 32 34 35 34 32  <0:00:00.0724542
00000010 (0x1cb6700): 30 35 3e                                         05>            


% gst-launch-1.0 filesrc location=lol.mkv ! matroskaparse ! matroskademux ! text/x-raw ! fakesink dump=true                                                             
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
00000000 (0x7f6fac008510): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 30 35 37  &lt;0:00:00.0057
New clock: GstSystemClock
00000010 (0x7f6fac008520): 38 37 35 33 39 26 67 74 3b                       87539&gt;       
00000000 (0x7f6fac005fd0): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 33 39 31  &lt;0:00:00.0391
00000010 (0x7f6fac005fe0): 32 30 38 37 32 26 67 74 3b                       20872&gt;       
00000000 (0x7f6fac008510): 26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 37 32 34  &lt;0:00:00.0724
00000010 (0x7f6fac008520): 35 34 32 30 35 26 67 74 3b                       54205&gt;       
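
To make the difference between the two dumps explicit, here is a minimal Python sketch that decodes the first buffer from each dump above and compares them. The helper name `decode_dump` is made up for illustration, and `xml.sax.saxutils.unescape` is only a stand-in for whatever unescaping a downstream consumer would do.

```python
from xml.sax.saxutils import unescape

# Hex bytes copied from the first buffer of each dump above.
muxer_input_hex = "3c 30 3a 30 30 3a 30 30 2e 30 30 35 37 38 37 35 33 39 3e"
demuxer_output_hex = (
    "26 6c 74 3b 30 3a 30 30 3a 30 30 2e 30 30 35 37 "
    "38 37 35 33 39 26 67 74 3b"
)


def decode_dump(hex_bytes: str) -> str:
    """Turn a space-separated hex dump into the UTF-8 text it encodes."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


sent = decode_dump(muxer_input_hex)         # text given to matroskamux
received = decode_dump(demuxer_output_hex)  # text produced by matroskademux

print(sent)
print(received)
# The payloads differ only by the XML escaping of "<" and ">".
print(unescape(received) == sent)
```

So the demuxed buffer carries the same timestamp text, but with the angle brackets replaced by entities.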


Extracting the tracks with mkvextract gives the correct encoding, so I think the problem is on the demuxer side.

% mkvextract tracks lol.mkv 0:lol.srt
Extracting track 0 with the CodecID 'S_TEXT/UTF8' to the file 'lol.srt'. Container format: SRT text subtitles
Progress: 100%
% cat lol.srt             
1
00:00:00,005 --> 00:00:00,038
<0:00:00.005787539>
Comment 1 Pierre Lamot 2015-08-11 09:05:55 UTC
So when the subtitle codec is subtitle-utf8, the Matroska demuxer assumes the text is encoded as Pango markup [1], and hence decides to escape it [2].

I think the demuxer should not treat the stream as Pango markup by default; moreover, the matroskamux caps are text/x-raw,format=utf8.


[1] http://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/matroska/matroska-demux.c?id=1.5.2#n5694
[2] http://cgit.freedesktop.org/gstreamer/gst-plugins-good/tree/gst/matroska/matroska-demux.c?id=1.5.2#n3061
Comment 2 Emmauel Bouillot 2015-12-14 11:09:54 UTC
Created attachment 317344
patch for matroskademux : use utf8 format instead of pango-markup

matroskademux forces the pango-markup format on the output of text/x-raw subtitles and modifies the content of the raw text (originally in utf8 format). This may cause problems when trying to use it as non-Pango text. It seems more logical for the demuxer to output the text in the same format as the stream it contains, and so to provide the same format as matroskamux.

Modifying matroskademux to output utf8 format instead of pango-markup does not require many modifications and will not break compatibility with other plugins, since all plugins that take pango-markup as input (good/bad/ugly) also accept utf8 and convert it internally when necessary (textrender, textoverlay, subparse, srtenc, kateenc, webvttenc).

We studied the possibility of supporting both formats in the demuxer, but matroskademux uses static caps for the sinks, determined by the source stream's format. Making them dynamic and using negotiation to determine the sink format only for text/x-raw may require deep changes to the way the demuxer works and seems like overkill.

This is why using "text/x-raw, format=utf8" instead of "text/x-raw, format=pango-markup" seems more advantageous.
Comment 3 Tim-Philipp Müller 2015-12-14 11:20:00 UTC
From memory, the problem is that we don't know from the beginning whether there will be markup in the text or not, so that's why we always output pango-markup and escape, unless we detect that it's already escaped. What exactly is the problem with outputting pango-markup? Is the fact that it outputs pango-markup caps an issue for you, or are you saying there is something wrong with how the demuxer does escaping internally?
Comment 4 Emmauel Bouillot 2015-12-14 13:29:14 UTC
The problem is that we would like to get exactly the same text stream on the demuxer output as the one sent to the muxer input. Escaping the text modifies it.

Moreover, if the text is already escaped, it will be escaped a second time:

In [4]: GLib.markup_escape_text("<foo>")
Out[4]: '&lt;foo&gt;'
In [5]: GLib.markup_escape_text("&lt;foo&gt;")
Out[5]: '&amp;lt;foo&amp;gt;'
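
As a rough illustration of what this means for a consumer, the sketch below uses only the Python standard library. `markup_escape` and `markup_unescape` are hypothetical helpers that approximate `GLib.markup_escape_text` for the five basic entities; they are not the GLib implementation. It shows that escaping is never a no-op on text containing markup characters, so the receiver has to unescape exactly once to get the muxer input back.

```python
from xml.sax.saxutils import escape, unescape

# Assumption: approximate GLib.markup_escape_text with the five basic entities.
EXTRA_ESCAPES = {"'": "&apos;", '"': "&quot;"}
EXTRA_UNESCAPES = {"&apos;": "'", "&quot;": '"'}


def markup_escape(text: str) -> str:
    """Escape &, <, >, ' and " (stand-in for GLib.markup_escape_text)."""
    return escape(text, EXTRA_ESCAPES)


def markup_unescape(text: str) -> str:
    """Undo exactly one level of markup_escape."""
    return unescape(text, EXTRA_UNESCAPES)


for original in ("<foo>", "&lt;foo&gt;", 'a & "b"'):
    once = markup_escape(original)
    twice = markup_escape(once)
    print(repr(original), "->", repr(once), "->", repr(twice))
    # One unescape recovers the original; the escaped form never equals it.
    assert markup_unescape(once) == original
    assert once != original
```

In other words, a downstream element that treats the demuxer output as plain utf8 sees modified text, while one that treats it as markup must apply exactly one unescaping pass.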
Comment 5 Pierre Lamot 2015-12-18 16:55:29 UTC
When the subtitle codec is S_TEXT/UTF8, the demuxer has no way to guess the
format. So matroskademux assumes it is Pango markup, and escapes the text until
a line contains a markup tag (<b>, <i> .. <span).

I can see different cases:

- The text is actual Pango markup. After the first markup tag is encountered,
  the text is returned as is (which is the expected behavior). Before that, the
  text is escaped, which might cause issues with escaped characters: for
  instance, "&amp;" will be re-encoded as "&amp;amp;".

- The text is SRT (which is stored as S_TEXT/UTF8 [1]). SRT markup is close to
  Pango; the main differences are that <font> is used instead of <span> and that
  characters don't need to be escaped (but player support may vary), so you
  can have things like a literal "&", an escaped character "&amp;", or even a
  single "<" or ">".

  However, the SRT format doesn't seem to be properly defined, so I guess that
  decoding is a kind of "best effort".

- The text is actual plain text. As our caps are set to pango-markup, we have to
  escape it, since it will be interpreted as Pango markup.


So I guess the current solution is the best way to provide *pango*
markup. However, I think that ideally we could provide either raw (utf-8) or pango
format using caps negotiation. But the current caps "selection" uses fixed-format
caps, so this might require some workaround for this case.

[1] http://www.matroska.org/technical/specs/subtitles/srt.html
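
The behavior described in the cases above can be sketched in Python as follows. This is only a model of the heuristic as described in this comment, not the actual matroskademux code: the class name `Utf8SubtitleTrack`, the tag regex, and the use of `xml.sax.saxutils.escape` are all assumptions made for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical tag detector; the real demuxer's check may differ.
MARKUP_TAG = re.compile(r"<(b|i|u|span)\b[^>]*>")


class Utf8SubtitleTrack:
    """Model of the heuristic: escape until a markup tag is seen, then pass through."""

    def __init__(self) -> None:
        self.seen_markup = False

    def process(self, line: str) -> str:
        # Once any line contains a markup tag, the track is treated as markup.
        if not self.seen_markup and MARKUP_TAG.search(line):
            self.seen_markup = True
        return line if self.seen_markup else escape(line)


# Case 1: real markup, but an entity appears before the first tag.
markup_track = Utf8SubtitleTrack()
markup_out = [
    markup_track.process(line)
    for line in ["Tom &amp; Jerry", "<i>hello</i>", "a &amp; b"]
]
print(markup_out)

# Case 3: plain text is always escaped, so the output differs from the input.
plain_track = Utf8SubtitleTrack()
plain_out = plain_track.process("<0:00:00.005787539>")
print(plain_out)
```

In this model the first line of the markup track is double-escaped while later lines pass through untouched, and the plain-text timestamp from the original report comes out escaped.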
Comment 6 GStreamer system administrator 2018-11-03 15:03:01 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/issues/210.