GNOME Bugzilla – Bug 172848
[subparse] subtitles with special chars are displayed as "???????"
Last modified: 2006-03-24 17:58:44 UTC
In Dutch and many other languages, characters like é, è, ë are very common. When they occur in a subtitle, they are displayed as ??????. The line in the srt-file is:

151
00:16:28,333 --> 00:16:31,627
De officiële verklaring was een faling van de nieren maar

The screenshot is attached. The font I am using is sans bold, 14px.
Created attachment 39766
screenshot
So apparently, encoding is not fixed in SRT files, which sucks. Users will need to specify encoding, and we need to convert it...
Isn't it possible to autodetect the encoding? Like file does:

michael@mayco:/mnt/extra/Films - TV/Stargate SG-1/Season 8$ file Stargate\ SG-1\ -\ 8x01-02\ -\ New\ Order.srt
Stargate SG-1 - 8x01-02 - New Order.srt: ISO-8859 text, with CRLF line terminators
michael@mayco:/mnt/extra/Films - TV/Stargate SG-1/Season 8$
You can autodetect encoding, but it's merely an approximation, afaik.
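To illustrate why detection is only an approximation, here is a minimal Python sketch (not code from the plugin). The sample word and the helper name `is_valid_utf8` are made up for the example. The same bytes decode without error under two different 8-bit encodings, while a UTF-8 validity check gives a clear yes or no.

```python
# The byte 0xE8 is a legal character in both ISO-8859-15 and ISO-8859-2,
# so the data itself cannot tell us which encoding the author used.
raw = b"probl\xe8me"

as_latin9 = raw.decode("iso8859-15")  # 0xE8 is e-grave here
as_latin2 = raw.decode("iso8859-2")   # 0xE8 is c-caron here


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(as_latin9, as_latin2, is_valid_utf8(raw))
```

Both decodes succeed and produce different text, so a guess between them needs statistics or a dictionary, which is what tools like file approximate.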
Hmm, but I guess even an approximation is better than displaying "?????". Doesn't gedit or some other text editor have such code?
Yes, sure, I'll have a look. I'm just saying there's more involved than a simple function call in glib. :)
I also have this issue if the srt file is encoded in ISO-8859-15, no problem if it's UTF-8.
It's not really feasible to detect the character encoding in .srt files, at least not without a LOT of effort. Basically we can only detect 'valid UTF-8' or not. If it's not valid UTF-8, it can be about anything else, but we don't know what. The problem is that almost all other common character encodings use the entire 8-bit range, so we can't know whether a text is, say, ISO-8859-15 or ISO-8859-2 or whatever. Also, we get fed text only in very small chunks, which makes detection even harder.

I suppose what we can do is similar to what we do with character encodings in ID3v1 tags:

 - check if it's UTF-8
 - if it's not UTF-8, check
    - whether a certain environment variable is set to force an encoding
    - if no encoding is forced on us, check what the current locale's charset is:
       - if it's non-UTF-8, assume it's that encoding
       - if it's UTF-8, assume ISO-8859-15
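The fallback chain above can be sketched roughly as follows. This is an illustrative Python sketch, not the C code in gstsubparse.c. The function names are hypothetical, the environment variable name is the one given in the commit message further down this report, and the `environ` and `locale_charset` parameters exist only to make the sketch easy to try out.

```python
import locale
import os

ENV_VAR = "GST_SUBTITLE_ENCODING"  # name taken from the commit message


def pick_fallback_encoding(environ=None, locale_charset=None) -> str:
    """Choose the encoding to assume when the text is not valid UTF-8."""
    environ = os.environ if environ is None else environ
    forced = environ.get(ENV_VAR)
    if forced:
        return forced  # the user forced an encoding
    if locale_charset is None:
        locale_charset = locale.getpreferredencoding(False)
    # Crude name normalisation so "UTF-8", "utf8" and "utf_8" all match.
    normalised = locale_charset.lower().replace("-", "").replace("_", "")
    if normalised != "utf8":
        return locale_charset  # non-UTF-8 locale: assume its charset
    return "ISO-8859-15"  # UTF-8 locale: last-resort assumption


def decode_subtitle(data: bytes, environ=None, locale_charset=None) -> str:
    """Decode as UTF-8 if valid, otherwise use the fallback chain."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        encoding = pick_fallback_encoding(environ, locale_charset)
        # errors="replace" is a choice made for this sketch only.
        return data.decode(encoding, errors="replace")
```

For example, `decode_subtitle(raw, environ={}, locale_charset="UTF-8")` decodes an ISO-8859-15 line correctly because the last branch applies, whereas the same bytes under an ISO-8859-2 locale would be decoded with that charset instead.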
Created attachment 61900
possible patch

Possible patch, got to think about this some more and test it a bit.
I tested it with the 0.10.5 version (Ubuntu Dapper) and it seems to work right! Great job, thanks a lot. This bug was *very* annoying for me.
Thanks for testing, committed with minor/cosmetic changes:

2006-03-24  Tim-Philipp Müller  <tim at centricular dot net>

	* gst/subparse/gstsubparse.c: (convert_encoding),
	(gst_sub_parse_change_state):
	* gst/subparse/gstsubparse.h:
	  Text subtitle files may or may not be UTF-8. If it's not, we don't
	  really want to see '?' characters in place of non-ASCII characters
	  like accented characters. So let's assume the input is UTF-8 until
	  we come across text that is clearly not. If it's not UTF-8, we don't
	  really know what it is, so try the following: (a) see whether the
	  GST_SUBTITLE_ENCODING environment variable is set; if not, check
	  (b) if the current locale encoding is non-UTF-8, and use that if
	  it is, or (c) assume ISO-8859-15 if the current locale encoding is
	  UTF-8 and the environment variable was not set to any particular
	  encoding. Not perfect, but better than nothing (and better than
	  before, I think) (fixes #172848).
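One way to read "assume the input is UTF-8 until we come across text that is clearly not" is as a small piece of per-stream state. The Python sketch below illustrates that reading only; it is not the committed C code. The class name `SubtitleDecoder` is hypothetical, the fallback encoding is passed in rather than looked up, and each chunk is assumed to be a complete line, so no multi-byte character is split across chunks.

```python
class SubtitleDecoder:
    """Hypothetical per-stream decoder: UTF-8 until proven otherwise."""

    def __init__(self, fallback: str = "ISO-8859-15"):
        self.fallback = fallback
        self.encoding = "utf-8"  # optimistic initial assumption

    def feed(self, chunk: bytes) -> str:
        if self.encoding == "utf-8":
            try:
                return chunk.decode("utf-8")
            except UnicodeDecodeError:
                # First clearly non-UTF-8 chunk: switch for the rest
                # of the stream (an assumption of this sketch).
                self.encoding = self.fallback
        return chunk.decode(self.encoding, errors="replace")


decoder = SubtitleDecoder()
for line in (b"plain ascii", "offici\u00eble".encode("iso8859-15")):
    print(decoder.feed(line))
```

Pure-ASCII chunks never trigger the switch, since ASCII is valid UTF-8, which matters because the text arrives in very small pieces and many of them contain no accented characters at all.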