Bug 664257 – [mpegtsparse] Support UTF-16BE text encoding


Summary:	[mpegtsparse] Support UTF-16BE text encoding


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-bad
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.10.23
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

Reported:	2011-11-17 10:40 UTC by Sebastian Pölsterl
Modified:	2011-11-22 11:53 UTC

Attachments
Support UTF-16BE encoding (1.62 KB, patch) 2011-11-17 10:40 UTC, Sebastian Pölsterl	needs-work
Support additional encodings (2.26 KB, patch) 2011-11-21 17:11 UTC, Sebastian Pölsterl	committed

Description Sebastian Pölsterl 2011-11-17 10:40:53 UTC

Created attachment 201583
Support UTF-16BE encoding

The patch makes it possible to correctly retrieve text encoded as UTF-16BE from the EIT, as used in Taiwan.

The changes are based on http://ghostsinthelab.org/?p=892

A sample file is available at http://file.kidwm.net/dvb.tar.gz (courtesy of wandererm {AT} gmail.com).

Comment 1 Tim-Philipp Müller 2011-11-21 15:35:09 UTC

Comment on attachment 201583
Support UTF-16BE encoding

Not entirely convinced this is right as it is.

For one, ETSI EN 300 468 says for charset ID 0x14 "Big5 subset of ISO/IEC 10646" - we should probably express it like that.

Secondly, the newline "fixes" look dubious to me. I mean, clearly it's right for UTF-16BE, but I presume they were originally added for UTF-16LE or something similar? Not sure where the endianness of e.g. ID 0x11 is specified. I'm *assuming* someone tested 0x11 on a little-endian machine, which would go against the assumption of BE encoding. Don't know, maybe you can find a reference.

(This whole "is_multibyte" thing isn't really expressive enough and should probably be changed).

Comment 2 Sebastian Pölsterl 2011-11-21 17:11:48 UTC

Created attachment 201830
Support additional encodings

Apparently, I used an outdated version of the spec. I updated the patch to support additional encodings based on V1.11.1 of the spec.

I wrote the whole encoding/decoding code myself a couple of years back. The is_multibyte part is only responsible for removing control codes from the text. I did not have any samples with multi-byte encoding back then, so it might have been wrong all along. In addition, the spec doesn't mention anything about the byte order. All I know is that it works with the sample file.

Regarding "Big5 subset of ISO/IEC 10646": I tried to find out what exactly this means, but could not find any information that links Big5 and ISO/IEC 10646.

I suggest accepting this patch, since the spec is very vague about the details and the patch works correctly with the sample file. If someone else comes along later with a sample that does not work, we can look into it again.
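
For readers following along, here is a minimal Python sketch of how the first byte of a DVB text field could select a decoder. This is not the code from the patch, which is C inside mpegtsparse. The table name, the function name, and the choice of Python codecs are assumptions for illustration. The selector values 0x11 to 0x15 follow my reading of ETSI EN 300 468 Annex A, and treating both 0x11 and 0x14 as big-endian UTF-16 mirrors what this thread settles on rather than anything the spec states about byte order.

```python
# Hypothetical lookup table: DVB selector byte -> Python codec name.
# Big-endian for 0x11 and 0x14 is an assumption taken from the discussion above;
# the spec itself is reported to be silent on byte order.
DVB_SELECTOR_CODECS = {
    0x11: "utf-16-be",  # ISO/IEC 10646 Basic Multilingual Plane
    0x12: "euc_kr",     # KS X 1001 (Korean)
    0x13: "gb2312",     # GB 2312 (Simplified Chinese)
    0x14: "utf-16-be",  # "Big5 subset of ISO/IEC 10646"
    0x15: "utf-8",      # UTF-8 encoding of ISO/IEC 10646
}


def decode_dvb_text(data: bytes) -> str:
    """Decode a DVB text field whose first byte is one of the selectors above.

    Control codes and the single-byte ISO 8859 tables are deliberately
    out of scope for this sketch.
    """
    if not data:
        return ""
    codec = DVB_SELECTOR_CODECS.get(data[0])
    if codec is None:
        raise ValueError(f"selector byte 0x{data[0]:02x} is not handled in this sketch")
    return data[1:].decode(codec)
```

For example, `decode_dvb_text(b"\x14" + "台灣".encode("utf-16-be"))` gives back the original two-character string, which is the situation the sample file from Taiwan exercises.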

Comment 3 Tim-Philipp Müller 2011-11-22 11:53:09 UTC

Ok, fair enough. On second thought it seems unlikely that the newline bit in the multibyte code path was ever triggered for little-endian, because the characters are read as big-endian and the control code value has the 0xA in the lower bits, as would be expected I guess.

 commit 9759d66407f2be8ec29975b0eff3230bb1dae0ef
 Author: Sebastian Pölsterl <sebp@k-d-w.org>
 Date:   Thu Nov 17 11:33:56 2011 +0100

    mpegtsparse: support more character set encodings
    
    Support UTF-16BE, EUC-KR (KSX1001), GB2312 and ISO-10646/UTF8 text
    encoding and fixed new line for multibyte encoding
    
    https://bugzilla.gnome.org/show_bug.cgi?id=664257
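
The newline point from comment 3 can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the committed C code: the function name is hypothetical, and the two-byte control code values (0xE080 to 0xE09F, with 0xE08A as the line break) come from my reading of ETSI EN 300 468 Annex A rather than from this thread. The sketch reads each 16-bit unit as big-endian, so the line-break code only matches when its low byte 0x8A comes second in the stream.

```python
import struct

DVB_CRLF_2BYTE = 0xE08A          # assumed two-byte line-break control code
DVB_CTRL_2BYTE_FIRST = 0xE080    # assumed start of the two-byte control range
DVB_CTRL_2BYTE_LAST = 0xE09F     # assumed end of the two-byte control range


def clean_utf16be_payload(payload: bytes) -> str:
    """Replace the DVB line-break code with '\n', drop other control codes,
    then decode the remaining 16-bit units as UTF-16BE."""
    if len(payload) % 2:
        raise ValueError("two-byte text must have an even length")
    cleaned = bytearray()
    for (unit,) in struct.iter_unpack(">H", payload):  # big-endian read
        if unit == DVB_CRLF_2BYTE:
            cleaned += b"\x00\x0a"
        elif DVB_CTRL_2BYTE_FIRST <= unit <= DVB_CTRL_2BYTE_LAST:
            continue  # e.g. emphasis on/off markers
        else:
            cleaned += struct.pack(">H", unit)
    return cleaned.decode("utf-16-be")
```

Reading the same two bytes `E0 8A` as little-endian would yield 0x8AE0, which matches none of the control codes, consistent with the observation that this path could only ever have worked for big-endian data.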