Bug 172848 - [subparse] subtitles with special chars are displayed as "???????"
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-base
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 0.10.6
Assigned To: GStreamer Maintainers
Reported: 2005-04-06 19:18 UTC by Michaël Arnauts
Modified: 2006-03-24 17:58 UTC


Attachments
screenshot (249.38 KB, image/png)
2005-04-06 19:19 UTC, Michaël Arnauts
possible patch (3.13 KB, patch)
2006-03-24 10:25 UTC, Tim-Philipp Müller
committed

Description Michaël Arnauts 2005-04-06 19:18:35 UTC
In Dutch and many other languages, characters like é, è, ë are very common. When
they occur in a subtitle, they are displayed as ??????.

The entry in the .srt file is:
151
00:16:28,333 --> 00:16:31,627
De officiële verklaring was een faling van de nieren maar

The screenshot is attached.

The font I am using is sans bold, 14px
Comment 1 Michaël Arnauts 2005-04-06 19:19:34 UTC
Created attachment 39766
screenshot
Comment 2 Ronald Bultje 2005-04-14 11:05:34 UTC
So apparently, encoding is not fixed in SRT files, which sucks. Users will need
to specify encoding, and we need to convert it...
Comment 3 Michaël Arnauts 2005-06-05 18:35:30 UTC
Isn't it possible to autodetect the encoding, like the file command does:
michael@mayco:/mnt/extra/Films - TV/Stargate SG-1/Season 8$ file Stargate\ SG-1\ -\ 8x01-02\ -\ New\ Order.srt
Stargate SG-1 - 8x01-02 - New Order.srt: ISO-8859 text, with CRLF line terminators
michael@mayco:/mnt/extra/Films - TV/Stargate SG-1/Season 8$
Comment 4 Ronald Bultje 2005-06-05 19:36:55 UTC
You can autodetect encoding, but it's merely an approximation, afaik.
Comment 5 Michaël Arnauts 2005-06-05 19:42:22 UTC
Hmm, but I guess even an approximation is better than displaying "?????". Doesn't
gedit or some other text editor have such code?
Comment 6 Ronald Bultje 2005-06-05 20:27:51 UTC
Yes, sure, I'll have a look. I'm just saying there's more involved than a simple
function call in glib. :)
Comment 7 Guillaume Desmottes 2006-02-18 17:04:31 UTC
I also have this issue if the srt file is encoded in ISO-8859-15; no problem if it's UTF-8.
Comment 8 Tim-Philipp Müller 2006-02-20 10:57:33 UTC
It's not really feasible to detect the character encoding in .srt files, at least not without a LOT of effort. Basically we can only detect 'valid UTF-8' or not. If it's not valid UTF-8, it can be just about anything else, but we don't know what. The problem is that almost all other common character encodings use the entire 8-bit range, so we can't know whether a text is, say, ISO-8859-15 or ISO-8859-2 or whatever.
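
To illustrate the ambiguity described above, here is a small Python sketch (not the GStreamer code; the function names are made up for this example). It shows that a strict UTF-8 check can reject a byte string, while two different 8-bit encodings both accept it and simply produce different text:

```python
# Illustration only: the helper names below are hypothetical and Python
# stands in for the C code that subparse actually uses.

# "très" stored as ISO-8859-15 contains the single byte 0xE8 for "è".
SAMPLE = "très".encode("iso8859_15")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_candidates(data: bytes, encodings=("iso8859_15", "iso8859_2")):
    """Decode the same bytes with several 8-bit encodings.

    Every byte value is assigned in these encodings, so none of them fails;
    the results only differ in which character each high byte maps to.
    """
    return {enc: data.decode(enc) for enc in encodings}


if __name__ == "__main__":
    print(is_valid_utf8(SAMPLE))
    print(decode_candidates(SAMPLE))
```

Both candidate decodings succeed, which is why the bytes alone cannot tell us which one the author of the file intended.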

Also, we get fed text only in very small chunks, which makes detection even harder.

I suppose what we can do is similar to what we do with character encodings in ID3v1 tags:

  - check if it's UTF-8
  - if it's not UTF-8, check
      - whether a certain environment variable is set to force an encoding
      - if no encoding is forced on us, check what the current locale's
        charset is:
          - if it's non-UTF-8, assume it's that encoding
          - if it's UTF-8, assume ISO-8859-15
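
The decision steps listed above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the patch itself: the function names are hypothetical, the variable name GST_SUBTITLE_ENCODING is taken from the commit message further down in this report, and the environment and locale charset are passed in as parameters so the logic can be tried without changing the real locale.

```python
import codecs
import locale
import os

ENV_VAR = "GST_SUBTITLE_ENCODING"   # name as given in the commit message
DEFAULT_FALLBACK = "ISO-8859-15"    # used when the locale itself is UTF-8


def is_utf8_name(name: str) -> bool:
    """Return True if name is any spelling of UTF-8 that Python knows."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def pick_fallback_encoding(environ=None, locale_charset=None) -> str:
    """Choose the encoding to assume for text that is not valid UTF-8."""
    environ = os.environ if environ is None else environ
    forced = environ.get(ENV_VAR)
    if forced:
        return forced                      # an encoding was forced on us
    if locale_charset is None:
        locale_charset = locale.getpreferredencoding(False)
    if locale_charset and not is_utf8_name(locale_charset):
        return locale_charset              # non-UTF-8 locale: assume it
    return DEFAULT_FALLBACK                # UTF-8 locale: assume ISO-8859-15


def to_text(data: bytes, environ=None, locale_charset=None) -> str:
    """Decode as UTF-8 if valid, otherwise with the chosen fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        encoding = pick_fallback_encoding(environ, locale_charset)
        return data.decode(encoding, errors="replace")
```

For example, `to_text("officiële".encode("iso8859_15"), {}, "UTF-8")` takes the last branch and decodes the bytes as ISO-8859-15. The sketch does not handle a forced or locale charset name that Python does not recognise; in that case the final decode raises LookupError.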

Comment 9 Tim-Philipp Müller 2006-03-24 10:25:40 UTC
Created attachment 61900
possible patch


Possible patch, got to think about this some more and test it a bit.
Comment 10 Guillaume Desmottes 2006-03-24 12:30:39 UTC
I tested it with the 0.10.5 version (Ubuntu Dapper) and it seems to work right!

Great job, thanks a lot. This bug was *very* annoying for me.
Comment 11 Tim-Philipp Müller 2006-03-24 17:58:44 UTC
Thanks for testing, committed with minor/cosmetic changes:

 2006-03-24  Tim-Philipp Müller  <tim at centricular dot net>

        * gst/subparse/gstsubparse.c: (convert_encoding),
        (gst_sub_parse_change_state):
        * gst/subparse/gstsubparse.h:
          Text subtitle files may or may not be UTF-8. If it's not, we
          don't really want to see '?' characters in place of non-ASCII
          characters like accented characters. So let's assume the input
          is UTF-8 until we come across text that is clearly not. If it's
          not UTF-8, we don't really know what it is, so try the following:
          (a) see whether the GST_SUBTITLE_ENCODING environment variable
          is set; if not, check (b) if the current locale encoding is
          non-UTF-8, and use that if it is, or (c) assume ISO-8859-15 if
          the current locale encoding is UTF-8 and the environment variable
          was not set to any particular encoding. Not perfect, but better
          than nothing (and better than before, I think) (fixes #172848).
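
Since the text arrives in small chunks, "assume UTF-8 until we come across text that is clearly not" suggests some per-stream state. A minimal Python sketch of that idea follows. The class and attribute names are hypothetical, and it is an assumption of this sketch (not confirmed by the commit message) that the converter keeps using the fallback for all later chunks once it has seen invalid UTF-8.

```python
class SubtitleDecoder:
    """Decode subtitle chunks, assuming UTF-8 until proven otherwise.

    Hypothetical illustration; the fallback encoding would normally come
    from the environment variable or locale as described above.
    """

    def __init__(self, fallback: str = "iso8859_15"):
        self.fallback = fallback
        self.detected_non_utf8 = False

    def feed(self, chunk: bytes) -> str:
        if not self.detected_non_utf8:
            try:
                # Pure ASCII chunks always pass, so they tell us nothing
                # about the real encoding of the file.
                return chunk.decode("utf-8")
            except UnicodeDecodeError:
                # First clearly non-UTF-8 chunk: switch for good.
                self.detected_non_utf8 = True
        return chunk.decode(self.fallback, errors="replace")


if __name__ == "__main__":
    decoder = SubtitleDecoder()
    for chunk in (b"De ", "officiële".encode("iso8859_15"), b" verklaring"):
        print(decoder.feed(chunk))
```

Making the switch sticky avoids decoding one file with two different encodings, at the cost of mis-decoding a file that mixes valid UTF-8 with stray bytes.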