Bug 740784 – subparse: fails to detect UTF-8 encoding

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 740784 - subparse: fails to detect UTF-8 encoding


Summary:	subparse: fails to detect UTF-8 encoding


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	git master
Assigned To:	Reynaldo H. Verdejo Pinochet
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-11-26 21:11 UTC by Athanasios Oikonomou
Modified:	2018-11-03 11:33 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
sample srt with bom and without bom (40.66 KB, application/zip) 2014-11-26 21:11 UTC, Athanasios Oikonomou		Details
subparse: avoid false negative with g_utf8_validate (1.05 KB, patch) 2014-11-28 16:35 UTC, Reynaldo H. Verdejo Pinochet	reviewed	Details \| Review
subparse: avoid false negative with g_utf8_validate (1.05 KB, patch) 2014-11-29 00:38 UTC, Reynaldo H. Verdejo Pinochet	none	Details \| Review
sample srt without bom that fails in the middle of the file (37.26 KB, application/zip) 2014-11-29 20:10 UTC, Athanasios Oikonomou		Details
subparse: avoid false negative with g_utf8_validate (1.35 KB, patch) 2014-12-01 15:12 UTC, Reynaldo H. Verdejo Pinochet	none	Details \| Review
avoid false negative with g_utf8_validate and set detected encoding (1.14 KB, patch) 2014-12-01 21:54 UTC, Athanasios Oikonomou	needs-work	Details \| Review
subparse: avoid false negatives dealing with UTF-8 (2.18 KB, patch) 2014-12-04 04:44 UTC, Reynaldo H. Verdejo Pinochet	reviewed	Details \| Review
subparse: avoid false negatives dealing with UTF-8 (2.18 KB, patch) 2015-04-19 21:10 UTC, Reynaldo H. Verdejo Pinochet	reviewed	Details \| Review

Description Athanasios Oikonomou 2014-11-26 21:11:44 UTC

Created attachment 291594 [details]
sample srt with bom and without bom

Subparse fails to detect UTF-8 encoding when file does not contain BOM.

INFO subparse gstsubparse.c:465:convert_encoding:<subparse0> invalid UTF-8!

But when we set GST_SUBTITLE_ENCODING to UTF-8 subtitles displayed correctly.

INFO subparse gstsubparse.c:465:convert_encoding:<subparse0> invalid UTF-8!
LOG subparse gstsubparse.c:495:convert_encoding:<subparse0> successfully converted 4096 characters from UTF-8 to UTF-8

The exacly same subtitle with BOM (only BOM header is the difference between those files) has no issues.

Comment 1 Reynaldo H. Verdejo Pinochet 2014-11-28 16:35:33 UTC

Created attachment 291730 [details] [review]
subparse: avoid false negative with g_utf8_validate

Problem is at g_utf8_validate() when passing over
a null terminated string without telling it not
to include it in the test. This patch fixes the
issue for me.

Comment 2 Reynaldo H. Verdejo Pinochet 2014-11-28 16:42:32 UTC

/include it/include the NULL/

Comment 3 Tim-Philipp Müller 2014-11-28 17:32:54 UTC

Comment on attachment 291730 [details] [review]
subparse: avoid false negative with g_utf8_validate

Aha, as I suspected on IRC then :)

Patch looks generally fine, but I think we should trim off terminating zeroes elsewhere and in general, so maybe early in convert_encoding() or before calling convert_encoding()?

It's called a NUL in this case btw, NULL is for pointers.

Comment 4 Tim-Philipp Müller 2014-11-28 17:36:39 UTC

And bonus points for a unit test addition to tests/check/elements/subparse.c :)

Comment 5 Reynaldo H. Verdejo Pinochet 2014-11-29 00:22:36 UTC

+1 for the test addition but I don't think we should trim
NUL terminators elsewhere or at all TBH. Purpose of the
patch is just to pass g_utf8_validate() without having to
conditionally pass -1 for len to avoid the false negative.

My opinion is that changing behavior like trimming all NUL
terminators would, falls outside the scope of the small fix
and I'm actually rather unsure as to whether it wouldn't end
up breaking things elsewhere.

Comment 6 Tim-Philipp Müller 2014-11-29 00:27:12 UTC

By 'elsewhere' I don't mean everywhere in GStreamer, I mean "a couple of lines earlier".

Comment 7 Reynaldo H. Verdejo Pinochet 2014-11-29 00:38:47 UTC

Created attachment 291758 [details] [review]
subparse: avoid false negative with g_utf8_validate

Fix commit message on NULL/NUL

Comment 8 Reynaldo H. Verdejo Pinochet 2014-11-29 00:43:40 UTC

(In reply to comment #6)
> By 'elsewhere' I don't mean everywhere in GStreamer, I mean "a couple of lines
> earlier".

I understand, but the current patch is not trimming any \0,
just fixing the length passed to g_utf8_validate() If you
insist I can do it though. Just wanted to state my opinion on it.

Comment 9 Tim-Philipp Müller 2014-11-29 01:04:36 UTC

The question is "what is the correct data here, is it supposed to include a NUL terminator or not here?". The answer to this is "no", that's why I suggested you fix up the data in this location if needed. How sure are you g_convert*() will handle it fine in all circumstances and not do the same thing in some?

Comment 10 Athanasios Oikonomou 2014-11-29 20:10:19 UTC

Created attachment 291805 [details]
sample srt without bom that fails in the middle of the file

Hello, 


Applying the proposed patched doen't fix the problem with the attached sample.

subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line '137'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line '00:13:22,771 --> 00:13:24,764'
subparse gstsubparse.c:843:parse_subrip_time: parsing timestamp '00:13:22,771'
subparse gstsubparse.c:843:parse_subrip_time: parsing timestamp '00:13:24,764'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line 'Ήδη τον ζορίζεις να'
subparse gstsubparse.c:465:convert_encoding:<subparse0> invalid UTF-8!
subparse gstsubparse.c:486:convert_encoding:<subparse0> could not convert string from 'ANSI_X3.4-1968' to UTF-8: Invalid byte sequence in conversion input
subparse gstsubparse.c:495:convert_encoding:<subparse0> successfully converted 4096 characters from ANSI_X3.4-1968 to UTF-8 , using ISO-8859-15 as fallback
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line 'τα κάνει όλα μόνος I?I?I?;'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line ''
subparse gstsubparse.c:1566:handle_buffer:<subparse0> Sending text 'Ήδη τον ζορίζεις να
τα κάνει όλα μόνος I&#x84;I?I?;', 0:13:22.771000000 + 0:00:01.993000000
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line '138'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line '00:13:25,295 --> 00:13:28,077'
subparse gstsubparse.c:843:parse_subrip_time: parsing timestamp '00:13:25,295'
subparse gstsubparse.c:843:parse_subrip_time: parsing timestamp '00:13:28,077'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line 'I€I? I»IµI?; I I?I?I±I?I?I¬I? IZIµ I¶I?II?I¶IµI? I?I?I?I?I?I±.'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line 'I?I?I¬I»IµIYIµ I?I? I±I?I?I?I?I?I?I·I?I?!'
subparse gstsubparse.c:1535:handle_buffer:<subparse0> Parsing line ''
subparse gstsubparse.c:1566:handle_buffer:<subparse0> Sending text 'I€I? I»IµI&#x82;; I&#x9f; I?I&#x80;I±I?I&#x80;I¬I&#x82; IZIµ I¶I?I&#x81;I?I¶IµI? I&#x84;I?I&#x80;I?I&#x84;I±.
I&#x94;I?I¬I»IµIYIµ I&#x84;I? I±I?I&#x84;I?I?I?I?I·I&#x84;I?!', 0:13:25.295000000 + 0:00:02.782000000


It works up to the 137 subtitle without problem.

Comment 11 Tim-Philipp Müller 2014-11-30 15:23:27 UTC

Two more thoughts:

a) the incoming data might be UTF-16 or UTF-32. In this case trailing NULs are 2 bytes or 4 bytes, and removing a single NUL byte at the end would might a character in half. This is an argument for doing the trimming of NULs just before the gst_utf8_validate() and not earlier.

b) there could be multiple trailing zeroes

So how about a while (len > 0 && str[len-1] == '\0') --len just before the _validate()?

Comment 12 Athanasios Oikonomou 2014-11-30 15:46:17 UTC

What happens when the buffer contains multiple NULs? 

Maybe it's better to check up to the first occurence of NUL? 

gsize check_len = strnlen(str, len);
len = check_len < len ? check_len : len;

Comment 13 Reynaldo H. Verdejo Pinochet 2014-12-01 15:12:48 UTC

Created attachment 291917 [details] [review]
subparse: avoid false negative with g_utf8_validate

Something like this then? I understand "consumed" should still
match the size of the original run, NULs included.

Also tested this patch with a section of the newly attached srt
FILE starting around sub #135. It works.

Comment 14 Athanasios Oikonomou 2014-12-01 21:54:43 UTC

Created attachment 291949 [details] [review]
avoid false negative with g_utf8_validate and set detected encoding

With the attached patch I have no issues with attached subtitles.

There difference is that we set detected encoding, so that next buffer to handle with so that next buffer to checked directly against gst_convert_to_utf8, instead of validating every buffer.

Comment 15 Tim-Philipp Müller 2014-12-03 14:23:54 UTC

Comment on attachment 291949 [details] [review]
avoid false negative with g_utf8_validate and set detected encoding

Thanks for the patch! I don't think it's really right though.

>   /* Otherwise check if it's UTF8 */
>   if (self->valid_utf8) {
>+    while (len > 0 && str[len] == '\0') --len;

This is an off-by-one, no?

>     if (g_utf8_validate (str, len, NULL)) {
>       GST_LOG_OBJECT (self, "valid UTF-8, no conversion needed");
>+      self->detected_encoding = g_strdup ("UTF-8");

I don't think this is correct. Example:

First string: "Hello!"
Second: "My name is $iso_8859_15_characters".

Comment 16 Reynaldo H. Verdejo Pinochet 2014-12-03 14:32:50 UTC

Athanasios, latest patch I attached should fix the issue.
Please double check. I already did and it works here.

Comment 17 Athanasios Oikonomou 2014-12-03 17:00:45 UTC

Tim, the idea of setting detected_encoding was to use gst_convert_to_utf8 since we have valid UTF-8 data. Usually one srt subtitle is encoded using only one encoding, so why to validate every buffer? In case gst_convert_to_uf8 fail will go again on validate (and most probably validate will fail too!).

  /* First try any detected encoding */
  if (self->detected_encoding) {
    ret =
        gst_convert_to_utf8 (str, len, self->detected_encoding, consumed, &err);

    if (!err)
      return ret;


Reynaldo, You are right! I just tested and both samples work great! 

So please push Reynaldo's patch.

Comment 18 Athanasios Oikonomou 2014-12-03 18:55:40 UTC

OOOpppsss! Reynaldo, it still hapens! Sorry...

subparse1>[00m Parsing line '137'
subparse1>[00m Parsing line '00:13:22,771 --> 00:13:24,764'
subparse.c:848:parse_subrip_time:[00m parsing timestamp '00:13:22,771'
subparse.c:848:parse_subrip_time:[00m parsing timestamp '00:13:24,764'
subparse1>[00m Parsing line 'Ήδη τον ζορίζεις να'
subparse1>[00m invalid UTF-8!
subparse1>[00m successfully converted 4096 characters from ISO-8859-15 to UTF-8
subparse1>[00m Parsing line 'τα κάνει όλα μόνος I?I?I?;'
subparse1>[00m Parsing line ''
subparse1>[00m Sending text 'Ήδη τον ζορίζεις να
τα κάνει όλα μόνος I&#x84;I?I?;', 0:13:22.771000000 + 0:00:01.993000000
subparse.c:199:gst_sub_parse_src_query:[00m Handling duration query
subparse.c:199:gst_sub_parse_src_query:[00m Handling duration query
subparse.c:199:gst_sub_parse_src_query:[00m Handling duration query
subparse.c:199:gst_sub_parse_src_query:[00m Handling duration query
subparse.c:199:gst_sub_parse_src_query:[00m Handling duration query
subparse1>[00m Parsing line '138'
subparse1>[00m Parsing line '00:13:25,295 --> 00:13:28,077'
subparse.c:848:parse_subrip_time:[00m parsing timestamp '00:13:25,295'
subparse.c:848:parse_subrip_time:[00m parsing timestamp '00:13:28,077'
subparse1>[00m Parsing line 'I€I? I»IµI?; I I?I?I±I?I?I¬I? IZIµ I¶I?II?I¶IµI? I?I?I?I?I?I±.'
subparse1>[00m Parsing line 'I?I?I¬I»IµIYIµ I?I? I±I?I?I?I?I?I?I·I?I?!'
subparse1>[00m Parsing line ''
subparse1>[00m Sending text 'I€I? I»IµI&#x82;; I&#x9f; I?I&#x80;I±I?I&#x80;I¬I&#x82; IZIµ I¶I?I&#x81;I?I¶IµI? I&#x84;I?I&#x80;I?I&#x84;I±.
I&#x94;I?I¬I»IµIYIµ I&#x84;I? I±I?I&#x84;I?I?I?I?I·I&#x84;I?!', 0:13:25.295000000 + 0:00:02.782000000

Can we further debug it?

Comment 19 Reynaldo H. Verdejo Pinochet 2014-12-03 22:16:05 UTC

(In reply to comment #18)
> [..]
> 
> Can we further debug it?

Sure. Can you attach an small (in duration) sub file that shows
the problem with the patch applied? say a 2-3 minutes file. I'm
failing to reproduce your failure using a section of the last
file you attached.

Comment 20 Reynaldo H. Verdejo Pinochet 2014-12-04 04:44:04 UTC

Created attachment 292114 [details] [review]
subparse: avoid false negatives dealing with UTF-8

OK. Turns out we were only looking at part of the problem.

Termination NULs are an issue, Granted. And these were already
been handled by my previous patch. But there's another condition
that wasn't been considered:

It might be that only part of the available data is be valid
UTF-8. For example a byte at the end might be the start of a
valid UTF-8 run (ie: d0 / 11010000) but not be a valid UTF-8
character by itself. In this case, we should consume only the
valid portion of the run.

This new patch addresses the whole issue as I see it. Quickly
drafted solution and I'm likely fine-tuning it tomorrow but
comments & testing welcome all the same. I tried out with the
samples I have and it works.

Comment 21 Reynaldo H. Verdejo Pinochet 2014-12-04 04:48:13 UTC

Comment on attachment 291949 [details] [review]
avoid false negative with g_utf8_validate and set detected encoding

This patch reads past the end of the sub data and
does not handle the full issue.

Comment 22 Athanasios Oikonomou 2014-12-04 16:49:29 UTC

Review of attachment 292114 [details] [review]:

Now we can handle both schenarios. Great Reynaldo!

subparse gstsubparse.c:199:gst_sub_parse_src_query: Handling duration query
subparse gstsubparse.c:199:gst_sub_parse_src_query: Handling duration query
subparse gstsubparse.c:1554:handle_buffer:<subparse0> Parsing line '137'
subparse gstsubparse.c:1554:handle_buffer:<subparse0> Parsing line '00:13:22,771 --> 00:13:24,764'
subparse gstsubparse.c:862:parse_subrip_time: parsing timestamp '00:13:22,771'
subparse gstsubparse.c:862:parse_subrip_time: parsing timestamp '00:13:24,764'
subparse gstsubparse.c:1554:handle_buffer:<subparse0> Parsing line 'Ήδη τον ζορίζεις να'
subparse gstsubparse.c:479:convert_encoding:<subparse0> At least some of the data was invalid UTF-8
subparse gstsubparse.c:1554:handle_buffer:<subparse0> Parsing line 'τα κάνει όλα μόνος του;'
subparse gstsubparse.c:1554:handle_buffer:<subparse0> Parsing line ''
subparse gstsubparse.c:1585:handle_buffer:<subparse0> Sending text 'Ήδη τον ζορίζεις να
τα κάνει όλα μόνος του;', 0:13:22.771000000 + 0:00:01.993000000

Comment 23 Reynaldo H. Verdejo Pinochet 2014-12-04 19:59:10 UTC

Cool Athanasios. Thanks for your help testing it!

Comment 24 Sebastian Dröge (slomo) 2014-12-04 21:06:55 UTC

Also looks good to me, but I'd like to wait for Tim's opinion before merging this :)

Comment 25 Reynaldo H. Verdejo Pinochet 2014-12-14 16:22:42 UTC

Tim?

Comment 26 Tim-Philipp Müller 2014-12-16 01:22:58 UTC

I'm not sure if this is ok as-is. I'm fine with cutting off the text after the first invalid character for text that is definitely UTF-8, i.e. where we had a BOM, but if I'm reading the current patch correctly it would basically just cut off non-UTF8 encodings at the first non-ASCII character and treat it as UTF-8 instead of falling through to the fallbacks (current locale, environment variables, etc.).

Comment 27 Athanasios Oikonomou 2015-03-20 17:42:20 UTC

Is there any other ideas how to solve the pending issues from that bug?

Comment 28 Reynaldo H. Verdejo Pinochet 2015-04-19 21:10:35 UTC

Created attachment 301945 [details] [review]
subparse: avoid false negatives dealing with UTF-8

Patch rebased on top of latest master.

Had completely forgot about this issue. Will take another
look and comment. My take though, was that as long as we
rely on finding one valid code point the patch should be
rather safe. I might not be remembering this clearly but
from what I know, finding a valid code point in otherwise
non UTF-8 text it's quite unlikely.

Relying on the BOM being present is not a good idea in
general, BOM presence is not required and even frowned
upon.

Comment 29 Tim-Philipp Müller 2015-04-19 21:33:54 UTC

My comment was about the equally valid non-UTF-8 case. Of course we should not rely on a BOM, but if a BOM is present we can make stronger assumptions about the encoding of the text.

Comment 30 Athanasios Oikonomou 2015-08-07 12:41:24 UTC

Review of attachment 301945 [details] [review]:

Above patch works with both srt provided.

Comment 31 Sebastian Dröge (slomo) 2015-08-07 13:50:56 UTC

Comment on attachment 301945 [details] [review]
subparse: avoid false negatives dealing with UTF-8

Please don't mark patches as accept-commit-now, that's only meant to be used by the maintainers.

Comment 32 Athanasios Oikonomou 2017-06-28 20:41:10 UTC

1.12.1 and patch still applies and works for me.

Comment 33 GStreamer system administrator 2018-11-03 11:33:27 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/148.