Bug 172848 – [subparse] subtitles with special chars are displayed as "???????"



Summary:	[subparse] subtitles with special chars are displayed as "???????"


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.10.6
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers


Reported:	2005-04-06 19:18 UTC by Michaël Arnauts
Modified:	2006-03-24 17:58 UTC


Attachments
screenshot (249.38 KB, image/png) 2005-04-06 19:19 UTC, Michaël Arnauts
possible patch (3.13 KB, patch) 2006-03-24 10:25 UTC, Tim-Philipp Müller, committed

Description Michaël Arnauts 2005-04-06 19:18:35 UTC

In Dutch and many other languages, characters like é, è, ë are very common. When
they occur in a subtitle, they are displayed as ??????.

The entry in the srt file is:
151
00:16:28,333 --> 00:16:31,627
De officiële verklaring was een faling van de nieren maar

The screenshot is attached.

The font I am using is sans bold, 14px

Comment 1 Michaël Arnauts 2005-04-06 19:19:34 UTC

Created attachment 39766
screenshot

Comment 2 Ronald Bultje 2005-04-14 11:05:34 UTC

So apparently, encoding is not fixed in SRT files, which sucks. Users will need
to specify encoding, and we need to convert it...

Comment 3 Michaël Arnauts 2005-06-05 18:35:30 UTC

Isn't it possible to autodetect the encoding, like file(1) does:
michael@mayco:/mnt/extra/Films - TV/Stargate SG-1/Season 8$ file Stargate\ SG-1\ -\ 8x01-02\ -\ New\ Order.srt
Stargate SG-1 - 8x01-02 - New Order.srt: ISO-8859 text, with CRLF line terminators

Comment 4 Ronald Bultje 2005-06-05 19:36:55 UTC

You can autodetect encoding, but it's merely an approximation, afaik.

Comment 5 Michaël Arnauts 2005-06-05 19:42:22 UTC

Hmm, but I guess even an approximation is better than displaying "?????". Doesn't
gedit or some other text editor have such code?

Comment 6 Ronald Bultje 2005-06-05 20:27:51 UTC

Yes, sure, I'll have a look. I'm just saying there's more involved than a simple
function call in glib. :)

Comment 7 Guillaume Desmottes 2006-02-18 17:04:31 UTC

I also have this issue if the srt file is encoded in ISO-8859-15; there is no problem if it's UTF-8.

Comment 8 Tim-Philipp Müller 2006-02-20 10:57:33 UTC

It's not really feasible to detect the character encoding in .srt files, at least not without a LOT of effort. Basically we can only detect 'valid UTF-8' or not. If it's not valid UTF-8, it can be about anything else, but we don't know what. The problem is that almost all other common character encodings use the entire 8-bit range, so we can't know whether a text is, say, ISO-8859-15 or ISO-8859-2 or whatever.

Also, we get fed text only in very small chunks, which makes detection even harder.
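To illustrate the point above, here is a minimal Python sketch (not GStreamer code; the function names is_valid_utf8 and decode_candidates are made up for this example). It shows that "valid UTF-8 or not" is a yes/no check, while the same non-UTF-8 bytes decode without error under more than one 8-bit encoding, so nothing in the bytes says which reading is the intended one.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (pure ASCII counts too)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_candidates(raw: bytes, encodings=("ISO-8859-15", "ISO-8859-2")):
    """Decode raw under each 8-bit encoding; none of these decodes can fail."""
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    # The word from the bug report, as an ISO-8859-15 file would store it.
    latin_bytes = "officiële".encode("iso-8859-15")
    print(is_valid_utf8(latin_bytes))                   # not valid UTF-8
    print(is_valid_utf8("officiële".encode("utf-8")))   # valid UTF-8

    # One byte, two equally "valid" readings.
    for enc, text in decode_candidates(b"\xe8").items():
        print(enc, repr(text))
```

A single subtitle line often contains only one or two non-ASCII bytes, which is why statistical guessing has very little to work with when the text arrives in small chunks.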

I suppose what we can do is similar to what we do with character encodings in ID3v1 tags:

  - check if it's UTF-8
  - if it's not UTF-8, check
      - whether a certain environment variable is set to force an encoding
      - if no encoding is forced on us, check what the current locale's
        charset is:
          - if it's non-UTF-8, assume it's that encoding
          - if it's UTF-8, assume ISO-8859-15
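The decision list above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code from the patch: the function names are hypothetical, the environment variable name GST_SUBTITLE_ENCODING is the one given in the commit message in comment 11, and treating an unrecognised locale charset as unusable is a choice made only for this sketch.

```python
import codecs
import locale
import os

# Variable name as given in the commit message in comment 11.
ENV_VAR = "GST_SUBTITLE_ENCODING"


def pick_fallback_encoding(environ=None, locale_charset=None):
    """Choose an encoding for text that is NOT valid UTF-8.

    environ and locale_charset can be passed in for testing; by default
    the real process environment and locale are consulted.
    """
    environ = os.environ if environ is None else environ

    # 1. An encoding forced through the environment variable wins.
    forced = environ.get(ENV_VAR)
    if forced:
        return forced

    # 2. Otherwise look at the charset of the current locale.
    if locale_charset is None:
        locale_charset = locale.getpreferredencoding(False)
    try:
        canonical = codecs.lookup(locale_charset).name
    except LookupError:
        # Assumption for this sketch: a charset Python does not know
        # is treated like "no usable locale charset".
        canonical = "utf-8"

    # 2a. A non-UTF-8 locale charset is assumed to be the file's encoding.
    if canonical != "utf-8":
        return locale_charset

    # 2b. A UTF-8 locale tells us nothing, so assume ISO-8859-15.
    return "ISO-8859-15"


def decode_subtitle(raw, environ=None, locale_charset=None):
    """Decode as UTF-8 if possible, else with the fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        encoding = pick_fallback_encoding(environ, locale_charset)
        return raw.decode(encoding, errors="replace")
```

For example, decode_subtitle(b"offici\xeble", {}, "UTF-8") takes branch 2b and decodes the bytes as ISO-8859-15, while setting the variable to ISO-8859-2 would override both the locale and the default.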

Comment 9 Tim-Philipp Müller 2006-03-24 10:25:40 UTC

Created attachment 61900
possible patch


Possible patch; I've got to think about this some more and test it a bit.

Comment 10 Guillaume Desmottes 2006-03-24 12:30:39 UTC

I tested it with the 0.10.5 version (Ubuntu Dapper) and it seems to work right!

Great job, thanks a lot. This bug was *very* annoying for me.

Comment 11 Tim-Philipp Müller 2006-03-24 17:58:44 UTC

Thanks for testing, committed with minor/cosmetic changes:

 2006-03-24  Tim-Philipp Müller  <tim at centricular dot net>

        * gst/subparse/gstsubparse.c: (convert_encoding),
        (gst_sub_parse_change_state):
        * gst/subparse/gstsubparse.h:
          Text subtitle files may or may not be UTF-8. If it's not, we
          don't really want to see '?' characters in place of non-ASCII
          characters like accented characters. So let's assume the input
          is UTF-8 until we come across text that is clearly not. If it's
          not UTF-8, we don't really know what it is, so try the following:
          (a) see whether the GST_SUBTITLE_ENCODING environment variable
          is set; if not, check (b) if the current locale encoding is
          non-UTF-8, and use that if it is, or (c) assume ISO-8859-15 if
          the current locale encoding is UTF-8 and the environment variable
          was not set to any particular encoding. Not perfect, but better
          than nothing (and better than before, I think) (fixes #172848).
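The "assume UTF-8 until we come across text that is clearly not" part of the entry above can be pictured with a small stateful converter. The following Python sketch is only an illustration: the class name ChunkConverter is hypothetical, the fallback encoding is passed in rather than looked up, and it assumes each chunk is a whole line so that no multi-byte character is split across chunks. Whether the committed C code behaves exactly like this is not stated in this report.

```python
class ChunkConverter:
    """Convert subtitle chunks to text, one chunk at a time."""

    def __init__(self, fallback="ISO-8859-15"):
        self.fallback = fallback
        # Start optimistic: treat the stream as UTF-8.
        self.assume_utf8 = True

    def convert(self, chunk: bytes) -> str:
        if self.assume_utf8:
            try:
                return chunk.decode("utf-8")
            except UnicodeDecodeError:
                # The first invalid chunk switches the whole stream to
                # the fallback encoding (assumption of this sketch).
                self.assume_utf8 = False
        # Unmappable bytes become U+FFFD instead of raising an error.
        return chunk.decode(self.fallback, errors="replace")


if __name__ == "__main__":
    conv = ChunkConverter()
    for chunk in (b"De ", b"offici\xeble", b" verklaring"):
        print(repr(conv.convert(chunk)), conv.assume_utf8)
```

Plain ASCII chunks are valid UTF-8, so the switch only happens at the first chunk that actually contains a non-ASCII byte in a legacy encoding.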