Bug 664257 - [mpegtsparse] Support UTF-16BE text encoding
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-bad
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 0.10.23
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2011-11-17 10:40 UTC by Sebastian Pölsterl
Modified: 2011-11-22 11:53 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Support UTF-16BE encoding (1.62 KB, patch)
2011-11-17 10:40 UTC, Sebastian Pölsterl
needs-work
Support additional encodings (2.26 KB, patch)
2011-11-21 17:11 UTC, Sebastian Pölsterl
committed

Description Sebastian Pölsterl 2011-11-17 10:40:53 UTC
Created attachment 201583
Support UTF-16BE encoding

The patch makes it possible to correctly retrieve text encoded as UTF-16BE from the EIT, as used in Taiwan.

The changes are based on http://ghostsinthelab.org/?p=892

A sample file is available at http://file.kidwm.net/dvb.tar.gz (courtesy of wandererm {AT} gmail.com).
Comment 1 Tim-Philipp Müller 2011-11-21 15:35:09 UTC
Comment on attachment 201583
Support UTF-16BE encoding

Not entirely convinced this is right as it is.

For one, ETSI EN 300 468 says for charset ID 0x14 "Big5 subset of ISO/IEC 10646" - we should probably express it like that.

Secondly, the newline "fixes" look dubious to me. They are clearly right for UTF-16BE, but I presume they were originally added for UTF-16LE or some such? I'm not sure where the endianness of e.g. ID 0x11 is specified. I'm *assuming* someone tested 0x11 on a little-endian machine, which would go against the assumption of BE encoding. I don't know; maybe you can find a reference.

(This whole "is_multibyte" thing isn't really expressive enough and should probably be changed).
Comment 2 Sebastian Pölsterl 2011-11-21 17:11:48 UTC
Created attachment 201830
Support additional encodings

Apparently, I used an outdated version of the spec. I updated support for additional encodings based on V1.11.1 of this spec.

I wrote the whole encoding/decoding code myself a couple of years back. The is_multibyte part is only responsible for removing control codes from the text. I did not have any samples with multi-byte encoding back then, so it might have been wrong all along. In addition, the spec doesn't mention anything about the byte order. All I know is that it works with the sample file.

Regarding "Big5 subset of ISO/IEC 10646": I tried to find out what exactly this means, but could not find any information that links Big5 and ISO/IEC 10646.

I suggest accepting this patch, since the spec is very vague about the details and the patch works correctly with the sample file. If someone else comes along later with a sample that does not work, we can look into it again.
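
For illustration only, the charset selection discussed above could be sketched in Python roughly as follows. This is not the code from the patch (which is C inside mpegtsparse). The names `CHARSET_BY_ID` and `decode_dvb_text` are hypothetical. IDs 0x11 and 0x14 are taken from this thread; the IDs for EUC-KR (0x12), GB2312 (0x13) and UTF-8 (0x15) are assumed from ETSI EN 300 468 Annex A. Treating 0x11 and 0x14 as big-endian is the working assumption of this bug, not something the spec states.

```python
# Hypothetical mapping from the first byte of a DVB text field to a Python
# codec name. 0x11 and 0x14 are from this thread; 0x12, 0x13 and 0x15 are
# assumed from ETSI EN 300 468 Annex A.
CHARSET_BY_ID = {
    0x11: "utf-16-be",  # ISO/IEC 10646 BMP, byte order assumed big-endian
    0x12: "euc_kr",     # KSX1001
    0x13: "gb2312",     # GB-2312
    0x14: "utf-16-be",  # "Big5 subset of ISO/IEC 10646", read as UTF-16BE
    0x15: "utf-8",      # ISO/IEC 10646 in UTF-8
}


def decode_dvb_text(data: bytes) -> str:
    """Decode a text field whose first byte selects the character table.

    Only the IDs listed above are handled; fields without a selector byte
    (the default table) and control codes are out of scope for this sketch.
    """
    if not data:
        return ""
    codec = CHARSET_BY_ID.get(data[0])
    if codec is None:
        raise ValueError(f"unsupported charset ID 0x{data[0]:02x}")
    # Skip the selector byte and decode the remaining payload.
    return data[1:].decode(codec)
```

For example, `decode_dvb_text(b"\x14" + "台灣".encode("utf-16-be"))` gives back the original two-character string under these assumptions.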
Comment 3 Tim-Philipp Müller 2011-11-22 11:53:09 UTC
Ok, fair enough. On second thought it seems unlikely that the newline bit in the multibyte code path was ever triggered for little-endian, because the characters are read as big-endian and the control code value has the 0xA in the lower bits, as would be expected I guess.

 commit 9759d66407f2be8ec29975b0eff3230bb1dae0ef
 Author: Sebastian Pölsterl <sebp@k-d-w.org>
 Date:   Thu Nov 17 11:33:56 2011 +0100

    mpegtsparse: support more character set encodings
    
    Support UTF-16BE, EUC-KR (KSX1001), GB2312 and ISO-10646/UTF8 text
    encoding and fixed new line for multibyte encoding
    
    https://bugzilla.gnome.org/show_bug.cgi?id=664257
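
As a rough illustration of the newline handling discussed in comments 1 to 3, the sketch below reads the payload as big-endian 16-bit units, turns the line-break control code into U+000A and drops the other control codes. It is a Python sketch under stated assumptions, not the committed C code. The function name is hypothetical, and the control code range 0xE080 to 0xE09F with 0xE08A as the line break is assumed from ETSI EN 300 468 Annex A.

```python
import struct

# Assumed from ETSI EN 300 468 Annex A: two-byte control codes occupy
# 0xE080..0xE09F, with 0xE08A meaning CR/LF.
CONTROL_FIRST = 0xE080
CONTROL_LAST = 0xE09F
CONTROL_NEWLINE = 0xE08A


def strip_control_codes_utf16be(payload: bytes) -> bytes:
    """Return the payload with 0xE08A replaced by U+000A and all other
    two-byte control codes removed. A trailing odd byte is ignored."""
    out = bytearray()
    for offset in range(0, len(payload) - 1, 2):
        # ">H" reads one unsigned 16-bit unit in big-endian byte order, so
        # the 0x8A part of the control code ends up in the low byte.
        (unit,) = struct.unpack_from(">H", payload, offset)
        if unit == CONTROL_NEWLINE:
            out += b"\x00\x0a"
        elif CONTROL_FIRST <= unit <= CONTROL_LAST:
            continue
        else:
            out += payload[offset:offset + 2]
    return bytes(out)
```

Because each unit is read as big-endian, a little-endian payload would present the same code as 0x8AE0 and would not match, which is in line with the observation in comment 3.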