Bug 405072 - [API] add gst_tag_freeform_string_to_utf8()
Status: RESOLVED FIXED
Product: GStreamer
Classification: Platform
Component: gst-plugins-base
Version: git master
OS: Other Linux
Importance: Normal enhancement
Target Milestone: 0.10.13
Assigned To: GStreamer Maintainers
QA Contact: GStreamer Maintainers
Depends on:
Blocks:
Reported: 2007-02-06 18:23 UTC by Tim-Philipp Müller
Modified: 2007-04-12 12:38 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
proposed API addition (3.78 KB, patch)
2007-02-06 18:24 UTC, Tim-Philipp Müller
committed

Description Tim-Philipp Müller 2007-02-06 18:23:54 UTC
I think it might be useful to add a function that takes a "freeform" string (= ASCII, UTF-8 or of unknown 1-byte character encoding like ISO-8859-*) and tries to convert it to UTF-8.

A possible use would then look like this:

static void
gst_tag_extract_id3v1_string (GstTagList * list, const gchar * tag,
    const gchar * start, const guint size)
{
  const gchar *env_vars[] = { "GST_ID3V1_TAG_ENCODING",
      "GST_ID3_TAG_ENCODING", "GST_TAG_ENCODING", NULL };
  gchar *utf8;

  utf8 = gst_tag_freeform_string_to_utf8 (start, size, env_vars);

  if (utf8 && *utf8 != '\0') {
    gst_tag_list_add (list, GST_TAG_MERGE_REPLACE, tag, utf8, NULL);
  }

  g_free (utf8);
}

We have a few plugins where this could be useful, e.g. icydemux (where you can't yet override the encoding used via environment variables).
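To make the description above concrete, here is a rough, stdlib-only sketch of the encoding-selection decision such a helper has to make. The function name `pick_encoding` and the returned strings are purely illustrative (the real helper would perform the conversion itself, presumably via GLib's g_convert()); the ordering shown — caller-supplied environment variables first, then UTF-8 validation, then the locale — is one plausible reading of the thread, not the committed behavior.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: return the name of the encoding that would be
 * tried for a "freeform" input string.  env_vars is a NULL-terminated
 * list of environment variable names, most specific first, exactly like
 * the array passed in the usage example above. */
const char *
pick_encoding (const char **env_vars, int input_is_valid_utf8)
{
  int i;

  /* a user override via environment variable wins */
  for (i = 0; env_vars != NULL && env_vars[i] != NULL; i++) {
    const char *enc = getenv (env_vars[i]);
    if (enc != NULL && *enc != '\0')
      return enc;
  }

  /* input that already validates as UTF-8 needs no conversion */
  if (input_is_valid_utf8)
    return "UTF-8";

  /* otherwise fall back to the locale's encoding */
  return "locale";
}
```

The real function's argument order and fallback chain may differ; the point is only that the caller hands over the environment-variable names and the helper encapsulates the rest.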
Comment 1 Tim-Philipp Müller 2007-02-06 18:24:32 UTC
Created attachment 82034 [details] [review]
proposed API addition
Comment 2 Michael Smith 2007-02-06 18:49:46 UTC
It's possible (though I'm not sure how common it is with the widespread encodings we're likely to encounter) for data to validate as UTF-8 but not actually be intended as UTF-8.

If the user has specified an encoding to use via a Magic Environment Variable, we should obey that first and foremost.

There's an argument that for some of these broken protocols/file formats we should use the locale after that, and only THEN attempt UTF-8, but I don't think that's usually ideal.

Comment 3 Tim-Philipp Müller 2007-02-06 19:03:34 UTC
> It's possible (though I'm not sure how common it is in the widespread encodings
> we're likely to encounter) for data to validate as UTF-8, but not actually be 
> intended as UTF-8.

Yes, that's theoretically possible, but the chances of this happening with a non-garbage string are close to zero IIRC. I actually looked into this once for a bug and found that it could only happen with absolutely unlikely character combinations (as in: completely constructed, since one of the characters would almost always need to be an uncommon non-letter character like special quotes or a copyright sign etc.). I concluded back then that the chances of those occurring in any kind of text (as in: book/paper manuscripts) are very, very small, and the chances of them occurring in tag chunks are almost zero.

It's really mostly about adding a convenience function for this to avoid code duplication.
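The claim in comment 3 — that non-UTF-8 text almost never validates as UTF-8 — follows from the shape of UTF-8 sequences: a non-ASCII character requires a lead byte in 0xC2–0xF4 followed by the right number of continuation bytes in 0x80–0xBF, which typical ISO-8859-* text with isolated accented characters does not provide. A minimal, hand-rolled validity check (a simplified stand-in for GLib's g_utf8_validate; it ignores overlong and surrogate corner cases) makes this easy to see:

```c
#include <stddef.h>

/* Minimal UTF-8 validity check, for illustration only.  Returns 1 if
 * the first len bytes of s form structurally valid UTF-8, else 0. */
int
is_valid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len) {
    unsigned char c = s[i];
    size_t cont, j;

    if (c < 0x80) {             /* plain ASCII */
      i++;
      continue;
    } else if (c >= 0xC2 && c <= 0xDF) {
      cont = 1;                 /* 2-byte sequence */
    } else if (c >= 0xE0 && c <= 0xEF) {
      cont = 2;                 /* 3-byte sequence */
    } else if (c >= 0xF0 && c <= 0xF4) {
      cont = 3;                 /* 4-byte sequence */
    } else {
      return 0;                 /* 0x80-0xC1 and 0xF5-0xFF never start a char */
    }

    for (j = 1; j <= cont; j++) {
      if (i + j >= len || s[i + j] < 0x80 || s[i + j] > 0xBF)
        return 0;               /* missing or bad continuation byte */
    }
    i += cont + 1;
  }
  return 1;
}
```

For example, ISO-8859-1 "café" is the bytes 63 61 66 E9: 0xE9 announces a 3-byte sequence, but the following continuation bytes are missing, so validation fails immediately, while the genuine UTF-8 spelling (63 61 66 C3 A9) passes.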
Comment 4 Michael Smith 2007-02-06 19:58:10 UTC
Yeah, I agree with the principle of the API - it sounds useful, and sensible.

I was only arguing with the details of what it should do. If it's really not going to occur with any likely-to-be-found-in-the-wild charsets, then I suppose it doesn't matter.


Comment 5 Tim-Philipp Müller 2007-02-08 10:22:55 UTC
> If the user has specified an encoding to use via a Magic Environment Variable,
> we should obey that first and foremost.

Not sure about this, because:

 - conversion from almost any (non-UTF8) 8-bit character
   encoding will be successful, since those encodings tend
   to make use of the whole 8-bit range (minus one or two
   not allowed values, but those tend to be the same). This
   means we can't just "test if this works". It will almost
   always work and then produce bad output.

 - we have environment variables that are very general, like
   GST_TAG_ENCODING and those that are very specific, like
   GST_ID3V1_TAG_ENCODING. Your suggestion would make most
   sense for the very specific ones IMHO.

We could provide API so the caller can tell us which ones are the specific ones to check first and which ones are the general ones to check later, but I don't really think it's worth it, given how small the chances are of falsely identifying a non-UTF-8 string as UTF-8.

Then again, this assertion of mine could just be wrong, since admittedly I only checked ISO-8859-* etc. and not, for example, Korean/Japanese/Chinese/other Asian locales.

 
> There's an argument that for some of these broken protocols/fileformats, we
> should use the locale after that, and only THEN attempt utf-8, but I don't
> think that's ideal usually.

I think locale should always come last, for the reason mentioned above that conversion from ISO-8859-* will almost always succeed whatever the input.
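The first bullet point in this comment — "we can't just test if this works" — is worth seeing in code. Because every byte value 0x00–0xFF is a defined ISO-8859-1 code point, conversion from ISO-8859-1 to UTF-8 succeeds on any input whatsoever; there is no failure mode to detect a wrong guess. The hand-rolled `latin1_to_utf8` below is an illustrative stand-in (GLib's g_convert behaves the same way for such source encodings):

```c
#include <stdlib.h>
#include <string.h>

/* Convert ISO-8859-1 bytes to a NUL-terminated UTF-8 string.  In
 * ISO-8859-1 every byte maps directly to the Unicode code point of the
 * same value, so bytes < 0x80 pass through and bytes >= 0x80 become a
 * two-byte UTF-8 sequence.  Note there is no input for which this
 * returns an error (only malloc can fail) -- which is exactly why
 * "does the conversion succeed?" tells us nothing about whether
 * ISO-8859-1 was the right guess. */
char *
latin1_to_utf8 (const unsigned char *s, size_t len)
{
  /* worst case, every byte becomes two UTF-8 bytes */
  char *out = malloc (2 * len + 1);
  size_t i, o = 0;

  if (out == NULL)
    return NULL;

  for (i = 0; i < len; i++) {
    if (s[i] < 0x80) {
      out[o++] = (char) s[i];
    } else {
      out[o++] = (char) (0xC0 | (s[i] >> 6));     /* lead byte */
      out[o++] = (char) (0x80 | (s[i] & 0x3F));   /* continuation byte */
    }
  }
  out[o] = '\0';
  return out;
}
```

Feeding it bytes that were actually already UTF-8 also "succeeds" — it just produces double-encoded mojibake — whereas the reverse direction (validating as UTF-8) really does reject most non-UTF-8 input, which is why the locale guess belongs last.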

Comment 6 Michael Smith 2007-02-08 10:35:25 UTC
Well, I'm convinced: you say that UTF-8 misidentification is really unlikely in practice, so I'm happy with your API proposal.
Comment 7 Tim-Philipp Müller 2007-04-12 12:38:30 UTC
Committed to CVS, with an updated Since: tag of course:

 2007-04-12  Tim-Philipp Müller  <tim at centricular dot net>

        * docs/libs/gst-plugins-base-libs-sections.txt:
        * gst-libs/gst/tag/tag.h:
        * gst-libs/gst/tag/tags.c: (gst_tag_freeform_string_to_utf8):
          API: add gst_tag_freeform_string_to_utf8() (#405072).

        * gst-libs/gst/tag/gstid3tag.c: (gst_tag_extract_id3v1_string):
          Use gst_tag_freeform_string_to_utf8() here.