Bug 610969 – Nice to have g_utf8_make_valid as public

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 610969 - Nice to have g_utf8_make_valid as public


Summary:	Nice to have g_utf8_make_valid as public


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	2.33.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Simon McVittie
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	591603 (view as bug list)
Depends on:
Blocks:	610797

Reported:	2010-02-24 16:29 UTC by Xabier Rodríguez Calvar
Modified:	2016-10-13 20:54 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
1/3 Patch to add public g_utf8_make_valid (3.43 KB, patch) 2010-02-24 16:29 UTC, Xabier Rodríguez Calvar	needs-work	Details \| Review
2/3 Added doc for g_utf8_make_valid (3.36 KB, patch) 2010-02-24 16:30 UTC, Xabier Rodríguez Calvar	needs-work	Details \| Review
3/3 Added tests for g_utf8_make_valid (21.37 KB, patch) 2010-02-24 16:30 UTC, Xabier Rodríguez Calvar	none	Details \| Review
1/3 Adds g_utf8_make_valid as public function (3.48 KB, patch) 2012-05-23 10:28 UTC, Xabier Rodríguez Calvar	none	Details \| Review
2/3 Added doc for g_utf8_make_valid (2.52 KB, application/octet-stream) 2012-05-23 10:29 UTC, Xabier Rodríguez Calvar		Details
2/3 Added doc for g_utf8_make_valid (2.52 KB, patch) 2012-05-23 10:30 UTC, Xabier Rodríguez Calvar	none	Details \| Review
3/3 Added test for g_utf8_make_valid (21.55 KB, patch) 2012-05-23 10:30 UTC, Xabier Rodríguez Calvar	none	Details \| Review
Make g_utf8_make_valid public (7.27 KB, patch) 2016-10-13 12:10 UTC, Simon McVittie	committed	Details \| Review
Bump version to 2.51.0 (1.05 KB, patch) 2016-10-13 12:10 UTC, Simon McVittie	committed	Details \| Review

Description Xabier Rodríguez Calvar 2010-02-24 16:29:29 UTC

Created attachment 154605 [details] [review]
1/3 Patch to add public g_utf8_make_valid

GLib already has a _g_utf8_make_valid to purify a utf8 string containing some invalid chars. I think it would be nice to have this function as public, so I wrote these patches.

I wrote _g_utf8_make_valid as a wrapper to g_utf8_make_valid not to change the code that was already using this, though changes would be trivial.

Comment 1 Xabier Rodríguez Calvar 2010-02-24 16:30:01 UTC

Created attachment 154606 [details] [review]
2/3 Added doc for g_utf8_make_valid

Comment 2 Xabier Rodríguez Calvar 2010-02-24 16:30:34 UTC

Created attachment 154607 [details] [review]
3/3 Added tests for g_utf8_make_valid

Comment 3 Dan Winship 2010-02-24 16:38:21 UTC

*** Bug 591603 has been marked as a duplicate of this bug. ***

Comment 4 Javier Jardón (IRC: jjardon) 2010-02-25 04:08:11 UTC

Comment on attachment 154606 [details] [review]
2/3 Added doc for g_utf8_make_valid

Please, use inline comments instead separate tmpl files

See http://live.gnome.org/GTK%2B/TaskAPIDocMigration for more details

Comment 5 Javier Jardón (IRC: jjardon) 2010-02-25 04:13:33 UTC

Actually you already use inline comments, sorry for the bugzilla SPAM

Comment 6 Christian Dywan 2010-06-10 10:17:04 UTC

I had a related use case, which was to escape for GMarkup and replace invalid characters at the same time. Albeit that might be a corner case, just figured I would mention it.

What is the use case for append_remainer, out of curiosity?

Comment 7 Xabier Rodríguez Calvar 2010-06-10 10:38:15 UTC

(In reply to comment #6)
> I had a related use case, which was to escape for GMarkup and replace invalid
> characters at the same time. Albeit that might be a corner case, just figured I
> would mention it.

You mean replacing or scaping the special characters, right? The issue is that in this case the characters set is a complete different one.

> What is the use case for append_remainer, out of curiosity?

It allows selecting it you just want to append the rest of the unprocessed string or not. I guess in some cases, you'd like to do it, but not in others.

Comment 8 Christian Dywan 2010-06-10 11:04:14 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > I had a related use case, which was to escape for GMarkup and replace invalid
> > characters at the same time. Albeit that might be a corner case, just figured I
> > would mention it.
> 
> You mean replacing or scaping the special characters, right? The issue is that
> in this case the characters set is a complete different one.

More concretely, I'm thinking of a string that is in the wrong encoding because it came from a network and the encoding was incorrectly declared. So I have to replace invalid characters so I can at least display something to the user, provided part of it is ASCII or UTF-8.

Comment 9 Xabier Rodríguez Calvar 2010-06-10 11:12:35 UTC

(In reply to comment #8)
> More concretely, I'm thinking of a string that is in the wrong encoding because
> it came from a network and the encoding was incorrectly declared. So I have to
> replace invalid characters so I can at least display something to the user,
> provided part of it is ASCII or UTF-8.

I think for internal things there was already a _g_utf8_make_valid, which I used as a base. I guess you can use it already without using my patch.

Comment 10 Simon McVittie 2011-01-28 12:07:16 UTC

Review of attachment 154605 [details] [review]:

I'm not a GLib committer, but I'd like this functionality; it's useful whenever you receive "alleged UTF-8" from a network, and want to pass it to something that requires, and strictly validates, UTF-8.

It'd be particularly useful when dealing with D-Bus, where the failure mode for sending invalid UTF-8 is really bad (you're silently disconnected, and libdbus defaults to calling _exit in response).

::: glib/glib.symbols
@@ +1509,3 @@
 g_utf8_to_utf16 G_GNUC_MALLOC
 g_utf8_validate
+g_utf8_make_valid

Nitpicking: this isn't in alphabetical order, should it be?

::: glib/gunicode.h
@@ +208,3 @@
 } GUnicodeScript;
 
+#define G_UTF8_INVALID_FALLBACK_SEQUENCE "\357\277\275"

This deserves a commment that says it's U+FFFD REPLACEMENT CHARACTER (the official Unicode name of this character). Indeed, I'd be inclined to call it G_UTF8_REPLACEMENT_CHARACTER.

::: glib/gutf8.c
@@ +1854,3 @@
+                   gssize len,
+                   const gchar *fallback,
+                   gboolean append_remainder)

The addition of @len seems potentially nice to have, but only if the implementation is changed to never read beyond @len (i.e. cope with strings that aren't 0-terminated).

I'd personally be perfectly happy with just taking the existing _g_utf8_make_valid and making it public:

    gchar *g_utf8_make_valid (const gchar *);

I'm not sure I see why you'd want to have a different fallback? U+FFFD seems like it should always be appropriate.

The semantics of append_remainder = FALSE seem to be "if the string is entirely valid, return a copy of it, but if there's anything invalid in the string, don't append the part after the last invalid piece". I don't see why that would ever be useful? Maybe it's an attempt to support @len, but if so, it doesn't work.

@@ +1864,3 @@
   string = NULL;
+  remainder = str;
+  remaining_bytes = strlen (str);

Doesn't this defeat the purpose of having a length argument, which is support for using non-0-terminated strings as input?

Comment 11 Simon McVittie 2011-01-28 12:10:47 UTC

Review of attachment 154606 [details] [review]:

::: glib/gutf8.c
@@ +1853,3 @@
+ * @str: text string to sanitize
+ * @len: length of text to sanitize, in bytes. The rest of the string
+ * will be appended as it is.

Oh, so *that's* the intended semantic. Why would you ever want that?

@@ +1857,3 @@
+ * utf8 ones with. If you pass %NULL,
+ * %G_UTF8_INVALID_FALLBACK_SEQUENCE will be used. To replace it with
+ * nothing, use the empty string "".

Replacing with the empty string does seem a reasonable use-case for a different replacement, although if it was implemented as a separate function, it could take a non-const argument and perform the transformation in-place, saving a malloc.

Comment 12 Simon McVittie 2011-01-28 12:14:31 UTC

I'd be happy to work on improving those patches for GLib inclusion, if a GLib committer is willing to merge them (and can advise how elaborate we should make the API).

Comment 13 Simon McVittie 2011-07-12 12:07:48 UTC

I notice that GConf also contains a copy/pasted version of g_utf8_make_valid.

Comment 14 Xabier Rodríguez Calvar 2012-05-22 16:57:56 UTC

(In reply to comment #11)
> ::: glib/gutf8.c
> @@ +1853,3 @@
> + * @str: text string to sanitize
> + * @len: length of text to sanitize, in bytes. The rest of the string
> + * will be appended as it is.
> 
> Oh, so *that's* the intended semantic. Why would you ever want that?

Why not? The append_remainder in that case appends the rest of the string that is unprocessed if you stopped before because of the len.

> @@ +1857,3 @@
> + * utf8 ones with. If you pass %NULL,
> + * %G_UTF8_INVALID_FALLBACK_SEQUENCE will be used. To replace it with
> + * nothing, use the empty string "".
> 
> Replacing with the empty string does seem a reasonable use-case for a different
> replacement, although if it was implemented as a separate function, it could
> take a non-const argument and perform the transformation in-place, saving a
> malloc.

Well, I guess anybody can replace the way they want :)

There is doc for that definition too, though we can improve it, yes. I basically agree with the rest of comments.

Comment 15 Xabier Rodríguez Calvar 2012-05-23 10:28:06 UTC

Created attachment 214742 [details] [review]
1/3 Adds g_utf8_make_valid as public function

Comment 16 Xabier Rodríguez Calvar 2012-05-23 10:29:29 UTC

Created attachment 214743 [details]
2/3 Added doc for g_utf8_make_valid

Comment 17 Xabier Rodríguez Calvar 2012-05-23 10:30:05 UTC

Created attachment 214744 [details] [review]
2/3 Added doc for g_utf8_make_valid

Comment 18 Xabier Rodríguez Calvar 2012-05-23 10:30:53 UTC

Created attachment 214745 [details] [review]
3/3 Added test for g_utf8_make_valid

Comment 19 Xabier Rodríguez Calvar 2012-05-23 10:32:07 UTC

I reworked the patches so that they are updated to the be released in 2.34 if anybody allows it.

Comment 20 Xabier Rodríguez Calvar 2012-05-23 10:32:47 UTC

Updated version because it is still missing in 2.33.x

Comment 21 Simon McVittie 2012-10-11 13:38:56 UTC

(In reply to comment #14)
> (In reply to comment #11)
> > ::: glib/gutf8.c
> > @@ +1853,3 @@
> > + * @str: text string to sanitize
> > + * @len: length of text to sanitize, in bytes. The rest of the string
> > + * will be appended as it is.
> > 
> > Oh, so *that's* the intended semantic. Why would you ever want that?
> 
> Why not? The append_remainder in that case appends the rest of the string that
> is unprocessed if you stopped before because of the len.

I find "why not?" is rarely a great motivation for APIs...

The situation in which the current _g_utf8_make_valid() is used (which is equivalent to your len < 0, append_remainder = TRUE) is something like this:

"""
I have a NUL-terminated string that came from a network service or a filename or something. It claims to be UTF-8, but it might not really be valid. I want to display it in some vaguely useful way (perhaps a log message), but the way in which I'm displaying it has a strict UTF-8 requirement. If the string is not UTF-8, I don't particularly mind what I get, as long as it's somewhat sensible.
"""

With len >= 0 and append_remainder = TRUE, what you're doing seems to be:

"""
I have a NUL-terminated string that came from [somewhere] which claims to be valid UTF-8. I know that the part after the first [n] bytes *is* valid, and I trust that implicitly and am willing to have my app crash if that isn't true; what I don't know is whether the first [n] bytes are *also* valid.
"""

and I don't know how you'd get into that situation "in real life"?

The semantic that I initially thought your version of the function had was more like g_strndup - "if len >= 0, we can cope with non-NUL-terminated input, and we'll NUL-terminate the output" - which would let you do things like this:

"""
I have a length-marked string (Pascal-style string) in a buffer that came from some network protocol or file format, which claims to be valid UTF-8 but might not be really. I want to copy out the string according to its stated length, while transforming it into UTF-8.
"""

which is a bit weird, but might perhaps be useful for applying Postel's principle (specifically, the "be liberal in what you accept" part) to a binary format or protocol, without excessive copying? (If you don't mind an extra malloc/free cycle, you can always just g_strndup() + _g_utf8_make_valid() + free the output of g_strndup().)

With len < 0 or len < strlen (str) and append_remainder = FALSE, as far as I can see you get results like:

    g_utf8_make_valid ("hello\xffworld!", -1, FALSE) -> "hello" + U+FFFD

with the valid part that follows the last invalid/replaced part omitted: it's not clear to me why this would ever be useful? Is this intentional or a bug?

With len >= strlen (str) and append_remainder = FALSE, perhaps it has more g_strdup-like behaviour? I'm not really sure - it's not obvious.

My counter-proposal is still "make g_utf8_make_valid() public, with precisely the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a patch for this if there's any chance that it might be accepted.

Comment 22 Simon McVittie 2012-10-11 13:41:18 UTC

(In reply to comment #21)
> My counter-proposal is still "make g_utf8_make_valid() public, with precisely
> the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a
> patch for this if there's any chance that it might be accepted.

In fact, if people like this idea, it should be done by updating Simon van der Linden's patch from Attachment #140587 [details], and perhaps using the doc-comment from telepathy-glib's tp_utf8_make_valid().

Comment 23 Dan Winship 2014-11-11 18:26:44 UTC

(In reply to comment #22)
> (In reply to comment #21)
> > My counter-proposal is still "make g_utf8_make_valid() public, with precisely
> > the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a
> > patch for this if there's any chance that it might be accepted.
> 
> In fact, if people like this idea, it should be done by updating Simon van der
> Linden's patch from Attachment #140587 [details], and perhaps using the doc-comment from
> telepathy-glib's tp_utf8_make_valid().

I'd support this. I just wrote some code where I wanted that.

Comment 24 Simon McVittie 2016-10-13 11:49:30 UTC

(In reply to Simon McVittie from comment #22)
> In fact, if people like this idea, it should be done by updating Simon van
> der Linden's patch from Attachment #140587 [details], and perhaps using the
> doc-comment from telepathy-glib's tp_utf8_make_valid().

I now find myself needing this again, so I'm doing as I suggested.

Comment 25 Simon McVittie 2016-10-13 12:10:20 UTC

Created attachment 337593 [details] [review]
Make g_utf8_make_valid public

Based on a patch by Simon van der Linden and rebased onto current GLib,
with improved documentation loosely based on Telepathy's
tp_utf8_make_valid().

Comment 26 Simon McVittie 2016-10-13 12:10:36 UTC

Created attachment 337594 [details] [review]
Bump version to 2.51.0

So that we don't get "deprecation" (availability) warnings for
new functionality of GLib 2.51/2.52.

Comment 27 Colin Walters 2016-10-13 20:30:59 UTC

Review of attachment 337593 [details] [review]:

Kind of bikeshedding here, but I bet some callers
could gain a bit of performance with
an API that only did the strdup if it needed
to - in Rust we'd use
https://doc.rust-lang.org/std/borrow/enum.Cow.html

GLib is very malloc-happy, but maybe such users
exist?  Eh, probably not worth changing, nevermind.

Comment 28 Colin Walters 2016-10-13 20:31:32 UTC

Review of attachment 337594 [details] [review]:

Seems sane.

Comment 29 Simon McVittie 2016-10-13 20:49:02 UTC

(In reply to Colin Walters from comment #27)
> Kind of bikeshedding here, but I bet some callers
> could gain a bit of performance with
> an API that only did the strdup if it needed
> to

I'm sure they could, but this function is already used in assorted places in GLib, and is also copy-pasted into evolution-data-server (with a second implementation in libcamel that isn't as good), gconf, gjs, gsmartcontrol and telepathy-glib. I suspect that's more API users than some functions that GLib *does* export :-)

Comment 30 Simon McVittie 2016-10-13 20:54:06 UTC

Fixed in git for 2.51.0, thanks.

Comment 31 Simon McVittie 2016-10-13 20:54:35 UTC

Comment on attachment 337593 [details] [review]
Make g_utf8_make_valid public

c46dbd4

Comment 32 Simon McVittie 2016-10-13 20:54:48 UTC

Comment on attachment 337594 [details] [review]
Bump version to 2.51.0

6ab68e9