GNOME Bugzilla – Bug 664257
[mpegtsparse] Support UTF-16BE text encoding
Last modified: 2011-11-22 11:53:22 UTC
Created attachment 201583
Support UTF-16BE encoding

The patch makes it possible to correctly retrieve text encoded as UTF-16BE from the EIT, as used in Taiwan. The changes are based on http://ghostsinthelab.org/?p=892

A sample file is available at http://file.kidwm.net/dvb.tar.gz (courtesy of wandererm {AT} gmail.com).
Comment on attachment 201583
Support UTF-16BE encoding

Not entirely convinced this is right as it is.

For one, ETSI EN 300 468 says for charset ID 0x14 "Big5 subset of ISO/IEC 10646" - we should probably express it like that.

Secondly, the newline "fixes" look dubious to me. I mean, clearly it's right for UTF-16BE, but I presume the original code was added for UTF-16LE or some such? Not sure where the endianness of e.g. ID 0x11 is specified. I'm *assuming* someone tested 0x11 and that it was on a little-endian machine, which would go against the assumption of BE encoding. Don't know, maybe you can find a reference.

(This whole "is_multibyte" thing isn't really expressive enough and should probably be changed.)
Created attachment 201830
Support additional encodings

Apparently, I used an outdated version of the spec. I updated support for additional encodings based on V1.11.1 of this spec.

I wrote the whole encoding/decoding code myself a couple of years back. The is_multibyte part is only responsible for removing control codes from the text. I did not have any samples with multi-byte encoding back then, so it may have been wrong all along. In addition, the spec doesn't mention anything about the byte order. All I know is that it works with the sample file.

Regarding "Big5 subset of ISO/IEC 10646": I tried to find out what exactly this means, but could not find any information that links Big5 and ISO/IEC 10646.

I suggest accepting this patch, since the spec is very vague about the details and the patch works correctly with the sample file. If someone else comes along later with a sample that does not work, we can look into it again.
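For readers following along, the selector-byte handling discussed here can be sketched roughly as below. This is a minimal Python illustration, not the patch itself: the names `CHARSET_BY_SELECTOR` and `decode_dvb_text` are made up for this example, only the five charset IDs mentioned in this bug are covered, and treating 0x11 and 0x14 as UTF-16BE is an assumption that matches the sample file (the spec does not state a byte order).

```python
# Hypothetical table: first byte of a DVB text field -> Python codec name.
# 0x11 and 0x14 are ASSUMED to be big-endian; the spec does not say.
CHARSET_BY_SELECTOR = {
    0x11: "utf-16-be",  # ISO/IEC 10646 Basic Multilingual Plane
    0x12: "euc_kr",     # KSX1001 (Korean)
    0x13: "gb2312",     # GB-2312 (Simplified Chinese)
    0x14: "utf-16-be",  # "Big5 subset of ISO/IEC 10646" (Taiwan sample)
    0x15: "utf-8",      # UTF-8 encoding of ISO/IEC 10646
}


def decode_dvb_text(data: bytes) -> str:
    """Decode a text field whose first byte selects the character set.

    Only the selectors listed above are handled; anything else raises
    ValueError instead of guessing at a default table.
    """
    if not data:
        return ""
    codec = CHARSET_BY_SELECTOR.get(data[0])
    if codec is None:
        raise ValueError(f"unsupported charset selector 0x{data[0]:02x}")
    # The selector byte itself is not part of the text.
    return data[1:].decode(codec)
```

Control codes are deliberately left alone here; stripping them is a separate step that has to happen before the final decode.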
Ok, fair enough. On second thought it seems unlikely that the newline bit in the multibyte code path was ever triggered for little-endian, because the characters are read as big-endian and the control code value has the 0xA in the lower bits, as would be expected I guess.

commit 9759d66407f2be8ec29975b0eff3230bb1dae0ef
Author: Sebastian Pölsterl <sebp@k-d-w.org>
Date:   Thu Nov 17 11:33:56 2011 +0100

    mpegtsparse: support more character set encodings

    Support UTF-16BE, EUC-KR (KSX1001), GB2312 and ISO-10646/UTF8 text
    encoding and fixed new line for multibyte encoding

    https://bugzilla.gnome.org/show_bug.cgi?id=664257
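To make the endianness point concrete, here is a small Python sketch of stripping control codes from two-byte text read as big-endian. It is not the committed C code. The function name `strip_control_codes_be` is hypothetical, and the code values (0xE080 to 0xE09F as control codes, 0xE08A as the line break) are my reading of EN 300 468, so treat them as assumptions. Replacing the line break with U+000A is likewise just a choice made for this example.

```python
import struct


def strip_control_codes_be(payload: bytes) -> bytes:
    """Filter 16-bit units read as big-endian (selector byte already removed).

    ASSUMED values: 0xE08A is the line-break control code and the rest of
    0xE080..0xE09F are other control codes, which are simply dropped.
    """
    out = bytearray()
    even_length = len(payload) & ~1  # ignore a trailing odd byte
    for (unit,) in struct.iter_unpack(">H", payload[:even_length]):
        if unit == 0xE08A:
            out += struct.pack(">H", 0x000A)  # emit U+000A as the newline
        elif 0xE080 <= unit <= 0xE09F:
            continue  # drop emphasis on/off and other control codes
        else:
            out += struct.pack(">H", unit)
    return bytes(out)


# Usage: "a", line break, "b", then an emphasis-on code that gets dropped.
raw = "a".encode("utf-16-be") + b"\xe0\x8a" + "b".encode("utf-16-be") + b"\xe0\x86"
text = strip_control_codes_be(raw).decode("utf-16-be")
```

Read as little-endian, the same two bytes `E0 8A` would give 0x8AE0, which is outside the control-code range, so a comparison against 0xE08A only ever matches when the units are read as big-endian.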