GNOME Bugzilla – Bug 169936
[subparse] support for SAMI subtitles
Last modified: 2006-05-15 15:32:07 UTC
Please describe the problem: gst-plugins' subparse plugin is not support SAMI subtitle type(these file named xxx.smi) Steps to reproduce: 1. open some movie file with sami subtitle using totem. 2. 3. Actual results: print this error message. ERROR (0x8688910 - 308484:15:53.418517000) subparse(28884) gstsubparse.c(653):gst_subparse_format_autodetect:<subparse0> The input is not a valid/supported subtitle file Expected results: render movie file with subtitle Does this happen every time? yes Other information: SAMI subtitle parser can get in sub_read_line_sami() at http://www1.mplayerhq.hu/cgi-bin/cvsweb.cgi/main/subreader.c?rev=1.141
Just as a note, we cannot copy the mplayer implementation directly. We need our own implementation, from scratch, for copyright reason (GPL vs. LGPL).
Well it's a html format so it shouldn't be hard to parse, but I havn't actually been able to find any .smi files with anything in them that gstreamer would actually be interested in. Can you attach your .smi file for us?
Created attachment 38694 [details] regular SAMI subtitle ragular SAMI subtitle type. It contains an english subtitle.
Created attachment 38695 [details] complex SAMI subtitle. it has two subtitles stream in one file, english and korean. SAMI subtitle has no charset encoding information. so it uses legacy charset like id3tag. it'll need some workaround like gst_tag_extract_id3v1_string() from gst-plugins/gst/tags/gstid3tag.c or need new function described bug #169943
*** Bug 307871 has been marked as a duplicate of this bug. ***
Created attachment 52515 [details] [review] simple sami parse patch i made a patch that parsing SAMI subtitle simply. currently, it just parse subtitle string and timestamps.
Created attachment 52519 [details] [review] simple sami parse patch (updated) updated patch (process properly.)
Created attachment 52520 [details] [review] simple sami parse patch (updated) missed header diff, and fix some mistake in escape_amp()
Doesn't glib have some escaping functions?
AFAIK it only has escaping functions (g_markup_escape_text() etc.). If you use GMarkupParser it will un-escape those entities for you, but I don't think the unescaping code is actually exported. Dunno if we want to support &#NNN; and &#xNN; type of escapes as well (unicode code point in decimal/hex was it?) ... ;) Cheers -Tim
Created attachment 52537 [details] [review] simple sami parse patch (updated) avode g_string_append() when get_next_line() returns NULL, and parse <font color=> tag. (but, can't show colored subtitle, is it pango-gsttextoverlay plugin issue?)
Created attachment 52583 [details] [review] simple sami parse patch (updated) change to use g_markup_xxx functions()
Wouldn't it be better to use libxml's HTML parser? I would think you could use htmlCreatePushParserCtxt from http://xmlsoft.org/html/libxml-HTMLparser.html#htmlCreatePushParserCtxt to make a SAX reader. Then you would use htmlParseChunk on sucessive blocks of the input stream to push data into the parser. A SAX handler would be installed (the first arg of htmlCreatePushParserCtxt), that would push out buffers. Given the complexity of this format I'd be very suspicious of a homemade parser.
Created attachment 64092 [details] [review] simple sami parse patch (updated) update for gstreamer 0.10 and parse with libxml2 by andy's advice.
Created attachment 64117 [details] [review] simple sami parse patch (updated) add <ruby> tag support. (it can work after apply patch in #339405) add <br> tag support. (it can work after apply patch in #339405)
Created attachment 64128 [details] [review] simple sami parse patch (updated) skip text in <title> tag. layout ruby text properly. (but multiline ruby text can't handle yet.)
if (!g_strncasecmp("title", aname, 5)) { [and many other places] g_strncasecmp is deprecated and doesn't do what you want here. You should use libxml's xmlStrncasecmp instead.
Created attachment 64141 [details] [review] simple sami parse patch (updated) g_strncasecmp -> xmlStrncmp when dispose, clean samiContext
0) You are awesome! 1) Your patch should use gstreamer coding style; you can guarantee this by running the gst-indent script, from gstreamer/tools/, on your file before making the diff. 2) You should not be using statically-linked data, it prevents two instances of the plugin from being run at once; you should put that data into the ParserState->user_data. 3) Consider using GString for accumulating data, it makes the code clearer 4) You don't do any error handling; what happens when you feed it a malformed subtitle? Thanks!
Created attachment 64213 [details] [review] simple sami parse patch (updated) indent with gst-indent. dynamic allocated samiContext. change to using GString. and I tested some malformed subtitles. and It worked well too. (malformed tag pair, invalid color value, etc)
OK at this point just a few nitpicks: 1) Avoid declaring variables in the middle of a scope, e.g. has_italic in pop_state(), value in start_sync -- we still compile on sun's forte 2) The idiom of iterating over attributes would be more clearly written as: for (i = 0; (atts[i] != NULL); i+=2) { const xmlChar *key, *value; key = atts[i]; value = atts[i+1]; 3) According to http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#characters, the string passed to characters_sami is not guaranteed to be null-terminated; you need to use the g_string_append_len. 4) When configuring GStreamer without XML support, the sami parser shouldn't be built because there is no libxml. Please surround the sami code in #ifndef GST_DISABLE_LOADSAVE_REGISTRY. Other than that, looking very good. This will be nice to have :-)
Created attachment 64270 [details] [review] simple sami parse patch (updated) applied andy's suggestions.
OK, looks good to me. Thanks, Young-Ho! I've marked it for committing after the -base release tomorrow.
Created attachment 64332 [details] [review] 64270: simple sami parse patch (updated) I found a bug which can't detect some sami file because that is starting with whitespace characters. so I replace strncmp() to xmlStrcasestr().
Created attachment 64607 [details] [review] simple sami parse patch (updated) avoid append next node text in result buf
Nice work, will finish reviewing it tomorrow if possible. Could you enlighten me on two things: (a) there are several /* FIXME */ comments in the code - could you expand on those - what exactly needs fixing there? (like /* FIXME: does not work on Wednesdays */) (b) could you add some comments for the various members of the samiContext structure, for posterity, so it's clearer how this is all supposed to work? :)
this is answers for above questions. (a) there is two FIXME for ruby markup rendering. they will be reimplemented after close bug 322391 (b) each member means.. buf : buffer for collect content rubybuf : buffer for collect ruby content resultbuf : when open next 'sync' tag, move from buf for avoid to append next content has_result : flag for ready to push state: in many sami files, many tags are not closed. so when open tag, parser will append tag flag in state member, and when open next "sync" tag, close opened tag force. see sami_context_push_state() and sami_context_pop_state() htmlctxt: html parser context in_title : flag for avoid to append title content to buf time1: previous start attribute in sync tag time2: current start attribute in sync tag
Created attachment 65401 [details] [review] simple sami parse patch (updated) expand FIXME comment that depend bug 322391. many sami subtitle assign silver color, but X rgb db has no such color. so replace to #c0c0c0 value.
Committed to CVS, thanks for the patch and sorry everything took so long: 2006-05-15 Tim-Philipp Müller <tim at centricular dot net> Patch by: Young-Ho Cha <ganadist at chollian net> * gst/subparse/samiparse.c: (handle_start_font): Need to map "silver" colour explicitly. 2006-05-15 Tim-Philipp Müller <tim at centricular dot net> Patch by: Young-Ho Cha <ganadist at chollian net> * gst/subparse/Makefile.am: * gst/subparse/gstsubparse.c: (gst_sub_parse_dispose), (parser_state_dispose), (gst_sub_parse_data_format_autodetect), (gst_sub_parse_format_autodetect), (feed_textbuf), (gst_subparse_type_find), (plugin_init): * gst/subparse/gstsubparse.h: * gst/subparse/samiparse.c: * gst/subparse/samiparse.h: Add support for SAMI subtitles (#169936). I've taken the liberty to split out the SAMI parsing code into a separate file (I think in the long run it's more maintainable like that). Also, I've changed the code to use a new media type application/x-subtitle-sami for SAMI subtitles (this gives us more flexibility and among other things keeps open the option to split the code into a separate plugin in case anyone ever thinks that that should be done because of the libxml2 dependency etc.).
thank you very much for accept patch. but I found you missed one thing. As I noticed comment #24, some sami files can start with whitespace characters. so I changed to use xmlStrcasestr() instead of strncmp(). should you fix it please?
Created attachment 65497 [details] [review] quick fix for detecting sami format. some sami files can start with whitespace. so if use strncmp(), it can't detect. so change to xmlStrcasestr() instead of strncmp()
Created attachment 65505 [details] [review] fix for detecting sami format move check function to samiparse.[ch]
The type detection should still work even if we don't actually compile with SAMI support enabled. However, the typefinder can't use libxml2 functions either (like with the previous patch), so: 2006-05-15 Tim-Philipp Müller <tim at centricular dot net> * gst/subparse/gstsubparse.c: (gst_sub_parse_data_format_autodetect): Don't use libxml functions for typefinding. Thanks for noticing.