GNOME Bugzilla – Bug 405072
[API] add gst_tag_freeform_string_to_utf8()
Last modified: 2007-04-12 12:38:30 UTC
I think it might be useful to add a function that takes a "freeform" string (= ASCII, UTF-8, or an unknown 1-byte character encoding such as ISO-8859-*) and tries to convert it to UTF-8. A possible use would then look like this:

  static void
  gst_tag_extract_id3v1_string (GstTagList * list, const gchar * tag,
      const gchar * start, const guint size)
  {
    const gchar *env_vars[] = { "GST_ID3V1_TAG_ENCODING",
      "GST_ID3_TAG_ENCODING", "GST_TAG_ENCODING", NULL
    };
    gchar *utf8;

    utf8 = gst_tag_freeform_string_to_utf8 (start, size, env_vars);

    if (utf8 && *utf8 != '\0') {
      gst_tag_list_add (list, GST_TAG_MERGE_REPLACE, tag, utf8, NULL);
    }

    g_free (utf8);
  }

We have a few plugins where this could be useful, e.g. icydemux (where you can't yet override the encoding via environment variables).
Created attachment 82034: proposed API addition
It's possible (though I'm not sure how common it is in the widespread encodings we're likely to encounter) for data to validate as UTF-8, but not actually be intended as UTF-8. If the user has specified an encoding to use via a Magic Environment Variable, we should obey that first and foremost. There's an argument that for some of these broken protocols/fileformats, we should use the locale after that, and only THEN attempt utf-8, but I don't think that's ideal usually.
> It's possible (though I'm not sure how common it is in the widespread encodings
> we're likely to encounter) for data to validate as UTF-8, but not actually be
> intended as UTF-8.

Yes, that's theoretically possible, but the chances of this happening with a non-garbage string are close to zero IIRC. I've actually looked into this once for a bug and found that it could only happen with extremely unlikely character combinations (as in: completely constructed, since one of them would almost always need to be an uncommon non-letter character like a special quotation mark or a copyright sign). Back then I concluded that the chances of those occurring in any kind of text (as in: book/paper manuscripts) are very, very small, and the chances of them occurring in tag chunks are almost zero.

It's really mostly about adding a convenience function for this to avoid code duplication.
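To make the argument above concrete, here is a minimal sketch in plain C of a strict UTF-8 well-formedness check. It is not the GLib or GStreamer implementation, and the names is_valid_utf8() and demo_validation() are hypothetical, used only for illustration. The point it shows: in UTF-8, every byte >= 0x80 must be a lead byte followed by exactly the right number of continuation bytes in the range 0x80-0xBF, so ordinary ISO-8859-1 text (an accented letter followed by an ASCII letter, or at the end of the string) is rejected. Only constructed sequences such as "Ã" followed by "©" happen to pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n bytes at s are well-formed UTF-8.
 * Hypothetical helper for illustration, not a GLib/GStreamer function. */
static bool
is_valid_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n) {
    unsigned char c = s[i];
    size_t extra;               /* continuation bytes expected */
    unsigned long cp, min;      /* code point, smallest legal value */

    if (c < 0x80) {             /* plain ASCII */
      i++;
      continue;
    } else if ((c & 0xE0) == 0xC0) {
      extra = 1; cp = c & 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      extra = 2; cp = c & 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      extra = 3; cp = c & 0x07; min = 0x10000;
    } else {
      return false;             /* stray continuation or invalid lead byte */
    }

    if (extra > n - i - 1)
      return false;             /* sequence truncated at end of input */

    for (size_t k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80)
        return false;           /* not a continuation byte */
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;

    i += extra + 1;
  }
  return true;
}

/* Typical ISO-8859-1 input fails; a constructed pair passes. */
static void
demo_validation (void)
{
  /* "café" in ISO-8859-1: 0xE9 announces two more bytes, none follow. */
  assert (!is_valid_utf8 ((const unsigned char *) "caf\xE9", 4));
  /* "été" in ISO-8859-1: 0xE9 is followed by ASCII 't'. */
  assert (!is_valid_utf8 ((const unsigned char *) "\xE9t\xE9", 3));
  /* ISO-8859-1 "Ã©" (0xC3 0xA9) is also valid UTF-8 for "é". */
  assert (is_valid_utf8 ((const unsigned char *) "\xC3\xA9", 2));
}
```

This only covers ISO-8859-1 style input. It says nothing about multi-byte East Asian encodings, where the odds of a false positive may differ.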
Yeah, I agree with the principle of the API - it sounds useful, and sensible. I was only arguing with the details of what it should do. If it's really not going to occur with any likely-to-be-found-in-the-wild charsets, then I suppose it doesn't matter.
> If the user has specified an encoding to use via a Magic Environment Variable,
> we should obey that first and foremost.

Not sure about this, because:

 - conversion from almost any (non-UTF-8) 8-bit character encoding will be successful, since those encodings tend to make use of the whole 8-bit range (minus one or two disallowed values, but those tend to be the same). This means we can't just "test if this works": it will almost always work and then produce bad output.

 - we have environment variables that are very general, like GST_TAG_ENCODING, and those that are very specific, like GST_ID3V1_TAG_ENCODING. Your suggestion would make most sense for the very specific ones IMHO.

We could provide API so the caller can tell us which ones are the specific ones to check first and which ones are the general ones to check later, but I don't really think it's worth it given how small the chances are of falsely identifying a non-UTF-8 string as UTF-8. But then this assertion of mine could just be wrong, since admittedly I only checked ISO-8859-* etc. and not, for example, Korean/Japanese/Chinese/other Asian locales.

> There's an argument that for some of these broken protocols/fileformats, we
> should use the locale after that, and only THEN attempt utf-8, but I don't
> think that's ideal usually.

I think locale should always come last, for the reason mentioned above that conversion from ISO-8859-* will almost always succeed whatever the input.
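The point that a "successful" conversion proves nothing can be sketched in a few lines of plain C. This is an illustration only: latin1_to_utf8() and demo_mojibake() are hypothetical names, not GLib or GStreamer API, and the sketch hard-codes ISO-8859-1 rather than using a real converter. Since every byte value maps to a character in ISO-8859-1, the conversion cannot fail on any input, so feeding it data that was already UTF-8 quietly yields double-encoded output instead of an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Converts n bytes of ISO-8859-1 into a newly allocated, NUL-terminated
 * UTF-8 string. Hypothetical helper for illustration. Returns NULL only if
 * allocation fails; there is no input it can reject. */
static char *
latin1_to_utf8 (const unsigned char *in, size_t n)
{
  char *out = malloc (2 * n + 1);       /* each byte yields at most 2 bytes */
  size_t o = 0;

  if (out == NULL)
    return NULL;

  for (size_t i = 0; i < n; i++) {
    if (in[i] < 0x80) {
      out[o++] = (char) in[i];          /* ASCII passes through unchanged */
    } else {
      /* U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx */
      out[o++] = (char) (0xC0 | (in[i] >> 6));
      out[o++] = (char) (0x80 | (in[i] & 0x3F));
    }
  }
  out[o] = '\0';
  return out;
}

/* UTF-8 "é" (0xC3 0xA9) misread as ISO-8859-1 "converts" without error
 * into the UTF-8 encoding of "Ã©". */
static void
demo_mojibake (void)
{
  const unsigned char already_utf8[] = { 0xC3, 0xA9 };
  char *out = latin1_to_utf8 (already_utf8, sizeof already_utf8);

  assert (out != NULL);
  assert (memcmp (out, "\xC3\x83\xC2\xA9", 5) == 0);
  free (out);
}
```

This is why the order matters: a strict UTF-8 validity check can reject input, while an 8-bit conversion like this one cannot, so only the former is usable as a test.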
Well, I'm convinced: you say that UTF-8 misidentification is really unlikely in practice, so I'm happy with your API proposal.
Committed to CVS, with updated Since: tag of course:

  2007-04-12  Tim-Philipp Müller  <tim at centricular dot net>

        * docs/libs/gst-plugins-base-libs-sections.txt:
        * gst-libs/gst/tag/tag.h:
        * gst-libs/gst/tag/tags.c: (gst_tag_freeform_string_to_utf8):
          API: add gst_tag_freeform_string_to_utf8() (#405072).

        * gst-libs/gst/tag/gstid3tag.c: (gst_tag_extract_id3v1_string):
          Use gst_tag_freeform_string_to_utf8() here.