GNOME Bugzilla – Bug 597786
[tag] enhance gst_tag_freeform_string_to_utf8 to handle 16-bit Unicode
Last modified: 2009-10-09 14:23:57 UTC
Created attachment 145024: Check for and use UTF-16 byte-order-mark
The attached patch checks the passed string for a byte-order mark (BOM) at the start, which is a standard way to indicate 16-bit Unicode data (big- or little-endian). On the one hand, it makes sense to check for a standard encoding after checking for another standard one (UTF-8) and before more or less guessing the encoding from environment variables. In particular, some tag formats explicitly allow such an encoding (e.g. 3GPP tags). On the other hand, one could argue that it 'breaks' ABI/API, since conversions that previously failed (or produced garbled output) now succeed.
Sounds like a good idea to me. The chances of things breaking should be rather small: we need not only a BOM at the beginning (made up of byte values that are not really used in 8-bit charsets) but also valid UTF-16 through to the end of the string. Even if the BOM is a false positive, what are the chances that the rest of the string happens to be valid UTF-16 as well?
> + case 0xFEF:
> + c = "UTF-16BE";
Shouldn't this be 0xFEFF instead?
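To make the "valid UTF-16 to the end of the string" condition concrete, here is a minimal sketch of a well-formedness check on the bytes that follow the BOM. It is not code from the attached patch; the function name `utf16_is_well_formed` and its parameters are hypothetical, and real code would more likely leave this check to whatever conversion routine it calls.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: returns 1 if the `size`
 * bytes at `data` (BOM already skipped) are well-formed UTF-16 in the given
 * byte order, 0 otherwise. */
static int
utf16_is_well_formed (const unsigned char *data, size_t size, int big_endian)
{
  size_t i;

  /* UTF-16 consists of 16-bit code units, so the length must be even. */
  if (size % 2 != 0)
    return 0;

  for (i = 0; i < size; i += 2) {
    unsigned int unit = big_endian
        ? ((unsigned int) data[i] << 8) | data[i + 1]
        : ((unsigned int) data[i + 1] << 8) | data[i];

    if (unit >= 0xD800 && unit <= 0xDBFF) {
      /* A high surrogate must be followed by a low surrogate. */
      unsigned int low;

      if (i + 4 > size)
        return 0;
      low = big_endian
          ? ((unsigned int) data[i + 2] << 8) | data[i + 3]
          : ((unsigned int) data[i + 3] << 8) | data[i + 2];
      if (low < 0xDC00 || low > 0xDFFF)
        return 0;
      i += 2;                   /* skip the low surrogate as well */
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      /* A low surrogate without a preceding high surrogate is invalid. */
      return 0;
    }
  }
  return 1;
}
```

A string that passes BOM detection but fails a check like this would then fall through to the remaining fallback encodings instead of being converted as UTF-16.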
Created attachment 145029: Check for and use UTF-16 byte-order-mark
Indeed, it should be 0xFEFF, so here is an updated patch.
While you're at it, could you also do the same for UTF-32? :) Its BOMs are:
LE: 0xFF 0xFE 0x00 0x00
BE: 0x00 0x00 0xFE 0xFF
(Note that the first is also a valid UTF-16 string of length 0, but that shouldn't be a problem, I guess.)
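The byte sequences above can be turned into a small lookup, sketched below. This is an illustration under stated assumptions, not the code from the attached patches: the function name `detect_bom_encoding` and the `bom_len` output parameter are hypothetical. The sketch tests the 4-byte UTF-32 marks before the 2-byte UTF-16 ones, because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM; whether the committed patch orders its checks this way is not stated in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the patch: returns the name of the
 * encoding indicated by a byte-order mark at the start of `data`, or NULL
 * if no known BOM is present. On success, *bom_len is set to the number of
 * bytes the BOM occupies. */
static const char *
detect_bom_encoding (const unsigned char *data, size_t size, size_t *bom_len)
{
  /* Longer marks come first so that FF FE 00 00 is seen as UTF-32LE rather
   * than as a UTF-16LE BOM followed by a NUL character. */
  static const struct
  {
    const char *name;
    size_t len;
    unsigned char bytes[4];
  } boms[] = {
    { "UTF-32LE", 4, { 0xFF, 0xFE, 0x00, 0x00 } },
    { "UTF-32BE", 4, { 0x00, 0x00, 0xFE, 0xFF } },
    { "UTF-16LE", 2, { 0xFF, 0xFE } },
    { "UTF-16BE", 2, { 0xFE, 0xFF } },
  };
  size_t i;

  for (i = 0; i < sizeof (boms) / sizeof (boms[0]); i++) {
    if (size >= boms[i].len
        && memcmp (data, boms[i].bytes, boms[i].len) == 0) {
      *bom_len = boms[i].len;
      return boms[i].name;
    }
  }
  return NULL;
}
```

With this ordering, the empty UTF-16LE string mentioned in the note (BOM plus a NUL terminator) is reported as UTF-32LE; since it converts to an empty string either way, that matches the expectation that the overlap is harmless.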
Created attachment 145038: Check for and use UTF-16/32 byte-order-mark
By popular request, this version also considers the UTF-32 BOM.
commit e18b42c0b631941c7dc9da9ad7475fb2ad8e95a8
Author: Mark Nauwelaerts <mark.nauwelaerts@collabora.co.uk>
Date: Thu Oct 8 14:16:44 2009 +0200

    tag: use BOM to recognize UTF-16/32 encoding and convert accordingly