Bug 597786 – [tag] enhance gst_tag_freeform_string_to_utf8 to handle 16-bit Unicode




Summary:	[tag] enhance gst_tag_freeform_string_to_utf8 to handle 16-bit Unicode


Status:	RESOLVED FIXED

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.10.26
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-10-08 09:47 UTC by Mark Nauwelaerts
Modified:	2009-10-09 14:23 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Check for and use UTF-16 byte-order-mark (1.36 KB, patch) 2009-10-08 09:47 UTC, Mark Nauwelaerts	none
Check for and use UTF-16 byte-order-mark (1.36 KB, patch) 2009-10-08 10:47 UTC, Mark Nauwelaerts	none
Check for and use UTF-16/32 byte-order-mark (1.80 KB, patch) 2009-10-08 12:25 UTC, Mark Nauwelaerts	committed

Description Mark Nauwelaerts 2009-10-08 09:47:47 UTC

Created attachment 145024
Check for and use UTF-16 byte-order-mark

The attached patch checks the passed string for a byte-order mark (BOM) at the
start, which is a standard way to indicate 16-bit Unicode data (BE or LE).

On the one hand, it seems to make sense to check for a standard encoding after
checking for another standard one (UTF-8) and before more or less guessing the
encoding based on environment variables.  In particular, some tag formats
explicitly allow such an encoding (e.g. 3GPP tags).

On the other hand, it could be argued that this 'breaks' ABI/API, since
conversions that previously failed (or produced garbled output) now succeed.
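
For illustration, the BOM check described above could look roughly like the sketch below. It is not the attached patch: the helper name detect_utf16_bom and its signature are made up here, and it only names the charset; the conversion to UTF-8 is left out.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GStreamer: returns the charset name
 * implied by a UTF-16 byte-order mark at the start of the buffer, or NULL
 * if the buffer does not start with one. */
const char *
detect_utf16_bom (const unsigned char *data, size_t size)
{
  unsigned int prefix;

  if (data == NULL || size < 2)
    return NULL;

  /* Read the first two bytes as a big-endian 16-bit value. */
  prefix = ((unsigned int) data[0] << 8) | data[1];

  switch (prefix) {
    case 0xFEFF:               /* bytes FE FF */
      return "UTF-16BE";
    case 0xFFFE:               /* bytes FF FE */
      return "UTF-16LE";
    default:
      return NULL;
  }
}
```

A caller would try this after the UTF-8 check and before any environment-based guessing, and fall through to the existing behaviour when NULL is returned.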

Comment 1 Tim-Philipp Müller 2009-10-08 10:04:57 UTC

Sounds like a good idea to me. Chances of things breaking should be rather small. After all, we need not only a BOM at the beginning (containing byte values which are not really used in 8-bit charsets) but also valid UTF-16 through to the end of the string. Even if the BOM is a false positive, what are the chances that the rest of the string happens to be valid UTF-16 as well?
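
To make "valid UTF-16 to the end of the string" concrete, here is a minimal sketch of a well-formedness check at the code-unit level: an even byte count and correctly paired surrogates. The function names are hypothetical, and it is only an assumption that the real conversion delegates this validation to the charset converter rather than doing it by hand.

```c
#include <stddef.h>

/* Hypothetical helper: read one 16-bit code unit in the given byte order. */
unsigned int
read_utf16_unit (const unsigned char *p, int big_endian)
{
  return big_endian ? (((unsigned int) p[0] << 8) | p[1])
      : (((unsigned int) p[1] << 8) | p[0]);
}

/* Hypothetical helper: returns 1 if the buffer (BOM already stripped) is
 * well-formed UTF-16, i.e. it has an even length and every high surrogate
 * is immediately followed by a low surrogate; returns 0 otherwise. */
int
utf16_is_well_formed (const unsigned char *data, size_t size, int big_endian)
{
  size_t i;

  if (size % 2 != 0)
    return 0;

  for (i = 0; i < size; i += 2) {
    unsigned int unit = read_utf16_unit (data + i, big_endian);

    if (unit >= 0xD800 && unit <= 0xDBFF) {
      unsigned int next;

      /* High surrogate: a low surrogate must follow. */
      if (i + 4 > size)
        return 0;
      next = read_utf16_unit (data + i + 2, big_endian);
      if (next < 0xDC00 || next > 0xDFFF)
        return 0;
      i += 2;                   /* skip the low surrogate as well */
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      /* Low surrogate without a preceding high surrogate. */
      return 0;
    }
  }

  return 1;
}
```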


> +      case 0xFEF:
> +        c = "UTF-16BE";

Shouldn't this be 0xFEFF instead?

Comment 2 Mark Nauwelaerts 2009-10-08 10:47:10 UTC

Created attachment 145029
Check for and use UTF-16 byte-order-mark

Indeed, it should be 0xFEFF; patch updated.

Comment 3 Sebastian Dröge (slomo) 2009-10-08 11:03:02 UTC

While you're at it, could you also do the same for UTF-32? :) Its BOMs are:
LE: 0xFF 0xFE 0x00 0x00
BE: 0x00 0x00 0xFE 0xFF

(Note: the first is also a valid UTF-16 string of length 0, but that shouldn't be a problem, I guess.)
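
The overlap noted above matters for the order of the checks: because the UTF-32LE BOM starts with the UTF-16LE BOM, a combined detector has to test the 4-byte marks first. The sketch below shows one way to do that; detect_unicode_bom and its bom_len out-parameter are hypothetical names, not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical helper: returns the charset name implied by a UTF-16 or
 * UTF-32 byte-order mark and stores the BOM length in *bom_len.
 * Returns NULL (and sets *bom_len to 0) if no BOM is found. */
const char *
detect_unicode_bom (const unsigned char *data, size_t size, size_t *bom_len)
{
  if (data == NULL || bom_len == NULL)
    return NULL;

  *bom_len = 0;

  /* UTF-32 first: FF FE 00 00 begins with the UTF-16LE BOM FF FE. */
  if (size >= 4) {
    if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00
        && data[3] == 0x00) {
      *bom_len = 4;
      return "UTF-32LE";
    }
    if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE
        && data[3] == 0xFF) {
      *bom_len = 4;
      return "UTF-32BE";
    }
  }

  if (size >= 2) {
    if (data[0] == 0xFE && data[1] == 0xFF) {
      *bom_len = 2;
      return "UTF-16BE";
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
      *bom_len = 2;
      return "UTF-16LE";
    }
  }

  return NULL;
}
```

With this ordering, the bytes FF FE 00 00 are reported as UTF-32LE. Read as UTF-16LE they would be a BOM followed by a NUL terminator, so either interpretation yields an empty string, which is presumably why the ambiguity is harmless.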

Comment 4 Mark Nauwelaerts 2009-10-08 12:25:20 UTC

Created attachment 145038
Check for and use UTF-16/32 byte-order-mark

By popular request, also consider the UTF-32 BOM.

Comment 5 Mark Nauwelaerts 2009-10-09 14:23:38 UTC

commit e18b42c0b631941c7dc9da9ad7475fb2ad8e95a8
Author: Mark Nauwelaerts <mark.nauwelaerts@collabora.co.uk>
Date:   Thu Oct 8 14:16:44 2009 +0200

    tag: use BOM to recognize UTF-16/32 encoding and convert accordingly