GNOME Bugzilla – Bug 610969
Nice to have g_utf8_make_valid as public
Last modified: 2016-10-13 20:54:48 UTC
Created attachment 154605 [details] [review] 1/3 Patch to add public g_utf8_make_valid GLib already has a _g_utf8_make_valid to purify a utf8 string containing some invalid chars. I think it would be nice to have this function as public, so I wrote these patches. I wrote _g_utf8_make_valid as a wrapper to g_utf8_make_valid not to change the code that was already using this, though changes would be trivial.
Created attachment 154606 [details] [review] 2/3 Added doc for g_utf8_make_valid
Created attachment 154607 [details] [review] 3/3 Added tests for g_utf8_make_valid
*** Bug 591603 has been marked as a duplicate of this bug. ***
Comment on attachment 154606 [details] [review] 2/3 Added doc for g_utf8_make_valid Please, use inline comments instead separate tmpl files See http://live.gnome.org/GTK%2B/TaskAPIDocMigration for more details
Actually you already use inline comments, sorry for the bugzilla SPAM
I had a related use case, which was to escape for GMarkup and replace invalid characters at the same time. Albeit that might be a corner case, just figured I would mention it. What is the use case for append_remainer, out of curiosity?
(In reply to comment #6) > I had a related use case, which was to escape for GMarkup and replace invalid > characters at the same time. Albeit that might be a corner case, just figured I > would mention it. You mean replacing or scaping the special characters, right? The issue is that in this case the characters set is a complete different one. > What is the use case for append_remainer, out of curiosity? It allows selecting it you just want to append the rest of the unprocessed string or not. I guess in some cases, you'd like to do it, but not in others.
(In reply to comment #7) > (In reply to comment #6) > > I had a related use case, which was to escape for GMarkup and replace invalid > > characters at the same time. Albeit that might be a corner case, just figured I > > would mention it. > > You mean replacing or scaping the special characters, right? The issue is that > in this case the characters set is a complete different one. More concretely, I'm thinking of a string that is in the wrong encoding because it came from a network and the encoding was incorrectly declared. So I have to replace invalid characters so I can at least display something to the user, provided part of it is ASCII or UTF-8.
(In reply to comment #8) > More concretely, I'm thinking of a string that is in the wrong encoding because > it came from a network and the encoding was incorrectly declared. So I have to > replace invalid characters so I can at least display something to the user, > provided part of it is ASCII or UTF-8. I think for internal things there was already a _g_utf8_make_valid, which I used as a base. I guess you can use it already without using my patch.
Review of attachment 154605 [details] [review]: I'm not a GLib committer, but I'd like this functionality; it's useful whenever you receive "alleged UTF-8" from a network, and want to pass it to something that requires, and strictly validates, UTF-8. It'd be particularly useful when dealing with D-Bus, where the failure mode for sending invalid UTF-8 is really bad (you're silently disconnected, and libdbus defaults to calling _exit in response). ::: glib/glib.symbols @@ +1509,3 @@ g_utf8_to_utf16 G_GNUC_MALLOC g_utf8_validate +g_utf8_make_valid Nitpicking: this isn't in alphabetical order, should it be? ::: glib/gunicode.h @@ +208,3 @@ } GUnicodeScript; +#define G_UTF8_INVALID_FALLBACK_SEQUENCE "\357\277\275" This deserves a commment that says it's U+FFFD REPLACEMENT CHARACTER (the official Unicode name of this character). Indeed, I'd be inclined to call it G_UTF8_REPLACEMENT_CHARACTER. ::: glib/gutf8.c @@ +1854,3 @@ + gssize len, + const gchar *fallback, + gboolean append_remainder) The addition of @len seems potentially nice to have, but only if the implementation is changed to never read beyond @len (i.e. cope with strings that aren't 0-terminated). I'd personally be perfectly happy with just taking the existing _g_utf8_make_valid and making it public: gchar *g_utf8_make_valid (const gchar *); I'm not sure I see why you'd want to have a different fallback? U+FFFD seems like it should always be appropriate. The semantics of append_remainder = FALSE seem to be "if the string is entirely valid, return a copy of it, but if there's anything invalid in the string, don't append the part after the last invalid piece". I don't see why that would ever be useful? Maybe it's an attempt to support @len, but if so, it doesn't work. @@ +1864,3 @@ string = NULL; + remainder = str; + remaining_bytes = strlen (str); Doesn't this defeat the purpose of having a length argument, which is support for using non-0-terminated strings as input?
Review of attachment 154606 [details] [review]: ::: glib/gutf8.c @@ +1853,3 @@ + * @str: text string to sanitize + * @len: length of text to sanitize, in bytes. The rest of the string + * will be appended as it is. Oh, so *that's* the intended semantic. Why would you ever want that? @@ +1857,3 @@ + * utf8 ones with. If you pass %NULL, + * %G_UTF8_INVALID_FALLBACK_SEQUENCE will be used. To replace it with + * nothing, use the empty string "". Replacing with the empty string does seem a reasonable use-case for a different replacement, although if it was implemented as a separate function, it could take a non-const argument and perform the transformation in-place, saving a malloc.
I'd be happy to work on improving those patches for GLib inclusion, if a GLib committer is willing to merge them (and can advise how elaborate we should make the API).
I notice that GConf also contains a copy/pasted version of g_utf8_make_valid.
(In reply to comment #11) > ::: glib/gutf8.c > @@ +1853,3 @@ > + * @str: text string to sanitize > + * @len: length of text to sanitize, in bytes. The rest of the string > + * will be appended as it is. > > Oh, so *that's* the intended semantic. Why would you ever want that? Why not? The append_remainder in that case appends the rest of the string that is unprocessed if you stopped before because of the len. > @@ +1857,3 @@ > + * utf8 ones with. If you pass %NULL, > + * %G_UTF8_INVALID_FALLBACK_SEQUENCE will be used. To replace it with > + * nothing, use the empty string "". > > Replacing with the empty string does seem a reasonable use-case for a different > replacement, although if it was implemented as a separate function, it could > take a non-const argument and perform the transformation in-place, saving a > malloc. Well, I guess anybody can replace the way they want :) There is doc for that definition too, though we can improve it, yes. I basically agree with the rest of comments.
Created attachment 214742 [details] [review] 1/3 Adds g_utf8_make_valid as public function
Created attachment 214743 [details] 2/3 Added doc for g_utf8_make_valid
Created attachment 214744 [details] [review] 2/3 Added doc for g_utf8_make_valid
Created attachment 214745 [details] [review] 3/3 Added test for g_utf8_make_valid
I reworked the patches so that they are updated to the be released in 2.34 if anybody allows it.
Updated version because it is still missing in 2.33.x
(In reply to comment #14) > (In reply to comment #11) > > ::: glib/gutf8.c > > @@ +1853,3 @@ > > + * @str: text string to sanitize > > + * @len: length of text to sanitize, in bytes. The rest of the string > > + * will be appended as it is. > > > > Oh, so *that's* the intended semantic. Why would you ever want that? > > Why not? The append_remainder in that case appends the rest of the string that > is unprocessed if you stopped before because of the len. I find "why not?" is rarely a great motivation for APIs... The situation in which the current _g_utf8_make_valid() is used (which is equivalent to your len < 0, append_remainder = TRUE) is something like this: """ I have a NUL-terminated string that came from a network service or a filename or something. It claims to be UTF-8, but it might not really be valid. I want to display it in some vaguely useful way (perhaps a log message), but the way in which I'm displaying it has a strict UTF-8 requirement. If the string is not UTF-8, I don't particularly mind what I get, as long as it's somewhat sensible. """ With len >= 0 and append_remainder = TRUE, what you're doing seems to be: """ I have a NUL-terminated string that came from [somewhere] which claims to be valid UTF-8. I know that the part after the first [n] bytes *is* valid, and I trust that implicitly and am willing to have my app crash if that isn't true; what I don't know is whether the first [n] bytes are *also* valid. """ and I don't know how you'd get into that situation "in real life"? The semantic that I initially thought your version of the function had was more like g_strndup - "if len >= 0, we can cope with non-NUL-terminated input, and we'll NUL-terminate the output" - which would let you do things like this: """ I have a length-marked string (Pascal-style string) in a buffer that came from some network protocol or file format, which claims to be valid UTF-8 but might not be really. I want to copy out the string according to its stated length, while transforming it into UTF-8. """ which is a bit weird, but might perhaps be useful for applying Postel's principle (specifically, the "be liberal in what you accept" part) to a binary format or protocol, without excessive copying? (If you don't mind an extra malloc/free cycle, you can always just g_strndup() + _g_utf8_make_valid() + free the output of g_strndup().) With len < 0 or len < strlen (str) and append_remainder = FALSE, as far as I can see you get results like: g_utf8_make_valid ("hello\xffworld!", -1, FALSE) -> "hello" + U+FFFD with the valid part that follows the last invalid/replaced part omitted: it's not clear to me why this would ever be useful? Is this intentional or a bug? With len >= strlen (str) and append_remainder = FALSE, perhaps it has more g_strdup-like behaviour? I'm not really sure - it's not obvious. My counter-proposal is still "make g_utf8_make_valid() public, with precisely the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a patch for this if there's any chance that it might be accepted.
(In reply to comment #21) > My counter-proposal is still "make g_utf8_make_valid() public, with precisely > the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a > patch for this if there's any chance that it might be accepted. In fact, if people like this idea, it should be done by updating Simon van der Linden's patch from Attachment #140587 [details], and perhaps using the doc-comment from telepathy-glib's tp_utf8_make_valid().
(In reply to comment #22) > (In reply to comment #21) > > My counter-proposal is still "make g_utf8_make_valid() public, with precisely > > the same semantics as the current _g_utf8_make_valid()". I'm happy to prepare a > > patch for this if there's any chance that it might be accepted. > > In fact, if people like this idea, it should be done by updating Simon van der > Linden's patch from Attachment #140587 [details], and perhaps using the doc-comment from > telepathy-glib's tp_utf8_make_valid(). I'd support this. I just wrote some code where I wanted that.
(In reply to Simon McVittie from comment #22) > In fact, if people like this idea, it should be done by updating Simon van > der Linden's patch from Attachment #140587 [details], and perhaps using the > doc-comment from telepathy-glib's tp_utf8_make_valid(). I now find myself needing this again, so I'm doing as I suggested.
Created attachment 337593 [details] [review] Make g_utf8_make_valid public Based on a patch by Simon van der Linden and rebased onto current GLib, with improved documentation loosely based on Telepathy's tp_utf8_make_valid().
Created attachment 337594 [details] [review] Bump version to 2.51.0 So that we don't get "deprecation" (availability) warnings for new functionality of GLib 2.51/2.52.
Review of attachment 337593 [details] [review]: Kind of bikeshedding here, but I bet some callers could gain a bit of performance with an API that only did the strdup if it needed to - in Rust we'd use https://doc.rust-lang.org/std/borrow/enum.Cow.html GLib is very malloc-happy, but maybe such users exist? Eh, probably not worth changing, nevermind.
Review of attachment 337594 [details] [review]: Seems sane.
(In reply to Colin Walters from comment #27) > Kind of bikeshedding here, but I bet some callers > could gain a bit of performance with > an API that only did the strdup if it needed > to I'm sure they could, but this function is already used in assorted places in GLib, and is also copy-pasted into evolution-data-server (with a second implementation in libcamel that isn't as good), gconf, gjs, gsmartcontrol and telepathy-glib. I suspect that's more API users than some functions that GLib *does* export :-)
Fixed in git for 2.51.0, thanks.
Comment on attachment 337593 [details] [review] Make g_utf8_make_valid public c46dbd4
Comment on attachment 337594 [details] [review] Bump version to 2.51.0 6ab68e9