Bug 597786 - [tag] enhance gst_tag_freeform_string_to_utf8 to handle 16-bit Unicode
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-base
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 0.10.26
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
 
 
Reported: 2009-10-08 09:47 UTC by Mark Nauwelaerts
Modified: 2009-10-09 14:23 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Check for and use UTF-16 byte-order-mark (1.36 KB, patch)
2009-10-08 09:47 UTC, Mark Nauwelaerts
Status: none
Check for and use UTF-16 byte-order-mark (1.36 KB, patch)
2009-10-08 10:47 UTC, Mark Nauwelaerts
Status: none
Check for and use UTF-16/32 byte-order-mark (1.80 KB, patch)
2009-10-08 12:25 UTC, Mark Nauwelaerts
Status: committed

Description Mark Nauwelaerts 2009-10-08 09:47:47 UTC
Created attachment 145024
Check for and use UTF-16 byte-order-mark

The attached patch checks the passed string for a byte-order mark at the start,
which is a standard way to indicate 16-bit Unicode data (in BE or LE).

On the one hand, it seems to make sense to check for a standard encoding after
checking for another standard one (utf8) and before sort-of guessing encoding
based upon environment variables.  In particular, some tag formats explicitly
define such possible encoding (e.g. 3GPP tags).

On the other hand, it could be argued that it 'breaks' ABI/API, as
conversions that previously failed (or produced garbled output) now succeed.
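
For illustration only, the check described above could look roughly like the following. This is a minimal sketch and not the attached patch: the function name sketch_utf16_bom_charset is hypothetical, and the real change lives inside gst_tag_freeform_string_to_utf8. It reads the first two bytes as a big-endian 16-bit value and maps the two possible byte-order marks to a charset name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the GStreamer code: returns the charset name
 * implied by a UTF-16 byte-order mark at the start of data, or NULL if
 * no such mark is present. */
const char *
sketch_utf16_bom_charset (const unsigned char *data, size_t len)
{
  unsigned int prefix;

  assert (data != NULL || len == 0);
  if (len < 2)
    return NULL;

  /* Read the first two bytes as a big-endian 16-bit value. */
  prefix = ((unsigned int) data[0] << 8) | data[1];

  switch (prefix) {
    case 0xFEFF:               /* bytes FE FF */
      return "UTF-16BE";
    case 0xFFFE:               /* bytes FF FE */
      return "UTF-16LE";
    default:
      return NULL;             /* no UTF-16 BOM: fall back to other guesses */
  }
}
```

A caller would presumably try this after the UTF-8 validity check and before the environment-variable based guessing, skipping the two BOM bytes before converting.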
Comment 1 Tim-Philipp Müller 2009-10-08 10:04:57 UTC
Sounds like a good idea to me. Chances of things breaking should be rather small - after all we not only need a BOM at the beginning (containing values which are not really used in 8-bit charsets) but also valid UTF-16 to the end of the string. Even if the BOM is a false positive, what are the chances that the rest of the string happens to be valid UTF-16 as well?


> +      case 0xFEF:
> +        c = "UTF-16BE";

Shouldn't this be 0xFEFF instead?
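
As a rough illustration of what "valid UTF-16 to the end of the string" means at the code-unit level, here is a minimal sketch. The helper name sketch_utf16_is_valid is hypothetical and this is not how the patch does it; a real conversion routine such as GLib's g_convert would presumably reject malformed input during the conversion itself. The sketch only checks for an even byte count and correctly paired surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if data (with any BOM already skipped)
 * is well-formed UTF-16 in the given byte order, 0 otherwise. */
int
sketch_utf16_is_valid (const unsigned char *data, size_t len, int big_endian)
{
  size_t i;
  int expect_low = 0;           /* set after a high surrogate */

  assert (data != NULL || len == 0);
  if (len % 2 != 0)
    return 0;                   /* truncated code unit */

  for (i = 0; i < len; i += 2) {
    unsigned int unit = big_endian
        ? ((unsigned int) data[i] << 8) | data[i + 1]
        : ((unsigned int) data[i + 1] << 8) | data[i];
    int is_high = unit >= 0xD800 && unit <= 0xDBFF;
    int is_low = unit >= 0xDC00 && unit <= 0xDFFF;

    if (expect_low) {
      if (!is_low)
        return 0;               /* high surrogate not followed by a low one */
      expect_low = 0;
    } else if (is_low) {
      return 0;                 /* low surrogate without a preceding high one */
    } else if (is_high) {
      expect_low = 1;
    }
  }

  return !expect_low;           /* must not end in the middle of a pair */
}
```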
Comment 2 Mark Nauwelaerts 2009-10-08 10:47:10 UTC
Created attachment 145029
Check for and use UTF-16 byte-order-mark

Indeed, it should be 0xFEFF, so updated patch.
Comment 3 Sebastian Dröge (slomo) 2009-10-08 11:03:02 UTC
While you're at it, could you also do the same for UTF32? :) Its BOMs are
LE: 0xFF 0xFE 0x00 0x00
BE: 0x00 0x00 0xFE 0xFF

(Note, the first is also a valid UTF16 string of length 0 but that shouldn't be a problem I guess)
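
Because the UTF-32LE mark starts with the same two bytes as the UTF-16LE mark, the order of the checks matters. The following is a minimal sketch of one possible ordering, testing the 4-byte marks first. The function name sketch_bom_charset and the bom_len output parameter are hypothetical, and whether the committed patch orders its checks this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the charset name implied by a UTF-16 or
 * UTF-32 byte-order mark at the start of data and stores the mark's
 * length in *bom_len. Returns NULL (and sets *bom_len to 0) if no mark
 * is found. */
const char *
sketch_bom_charset (const unsigned char *data, size_t len, size_t *bom_len)
{
  static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
  static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };

  assert (bom_len != NULL);
  assert (data != NULL || len == 0);

  /* 4-byte marks first: FF FE 00 00 would otherwise match UTF-16LE. */
  if (len >= 4) {
    if (memcmp (data, utf32le, 4) == 0) {
      *bom_len = 4;
      return "UTF-32LE";
    }
    if (memcmp (data, utf32be, 4) == 0) {
      *bom_len = 4;
      return "UTF-32BE";
    }
  }

  if (len >= 2) {
    if (data[0] == 0xFE && data[1] == 0xFF) {
      *bom_len = 2;
      return "UTF-16BE";
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
      *bom_len = 2;
      return "UTF-16LE";
    }
  }

  *bom_len = 0;
  return NULL;
}
```

With this ordering, the bytes FF FE 00 00 are reported as UTF-32LE rather than as an empty UTF-16LE string, which is the ambiguity mentioned in the note above.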
Comment 4 Mark Nauwelaerts 2009-10-08 12:25:20 UTC
Created attachment 145038
Check for and use UTF-16/32 byte-order-mark

Upon popular request, also consider UTF-32 BOM
Comment 5 Mark Nauwelaerts 2009-10-09 14:23:38 UTC
commit e18b42c0b631941c7dc9da9ad7475fb2ad8e95a8
Author: Mark Nauwelaerts <mark.nauwelaerts@collabora.co.uk>
Date:   Thu Oct 8 14:16:44 2009 +0200

    tag: use BOM to recognize UTF-16/32 encoding and convert accordingly