GNOME Bugzilla – Bug 748620
g_regex_* utf-8 validity requirements are not stated clearly
Last modified: 2018-05-08 11:29:18 UTC
a crash in midnight commander turned out to be caused by invalid utf-8 being passed to to g_regex_match(). that's obviously a bug in mc, but the glib documentation doesn't exactly help identifying it. https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html says only this: "Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8." it's about expectations. i'd expect invalid utf-8 sequences to just not match, or throw some error messages, but not crash. such unexpected behavior with far-reaching consequences should be rather clearly documented if it's intentional. additionally, a link to a suggested validation function (g_utf8_validate()) should be provided. pcre does it right: "If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the result is undefined and your program may crash." - that means that it both clearly documents the issue, *and* provides a solution (at some cost, obviously).
This isn't specific to g_regex though; *all* GLib functions (and Gtk+ functions, etc) that take UTF-8 strings have undefined behavior if you pass invalid UTF-8 (except for functions like g_utf8_validate() that explicitly document otherwise). We don't really want to specify this on every single function. (It's documented somewhere... I don't know where though. Maybe we need a better "general GLib conventions" section in the docs.)
The reason why I'd consider making an exception and making it a bit more explicit at g_regex_*() is because its argument's semantics depend on a flag. If G_REGEX_RAW is specified, it's perfectly valid for the argument to be invalid UTF-8. I guess this doesn't quite really happen with other methods, right? (In a more strongly typed language there'd probably be a different type for bytestream and for UTF-8 stream, and then you'd have to have two methods, or overload by the argument's type, or something like this... Yeah it's plain C with char*, but still, the expectation gets different based on the flag.) (Purely for future reference: the mentioned mc bug is https://www.midnight-commander.org/ticket/3449. It has a very poor signal:noise ratio and doesn't add anything to this discussion right now.)
i'd argue from a different perspective: people sometimes use regexes to validate untrusted input. so a small reminder that you really need to utf8-validate the input first sounds like a very good idea. also note that most people don't read the whole docs from scratch, so "it's somewhere" doesn't really cut it if you don't link to it. lastly, it's a matter of impact. most functions fed with invalid utf-8 will misbehave only mildly, if at all. i'd definitely mark every function that can crash (or worse).
(In reply to Oswald Buddenhagen from comment #3) > also note that most people don't read the whole docs from scratch, so "it's > somewhere" doesn't really cut it if you don't link to it. We don't specify the rules for GErrors, GCancellables, and GAsyncReadyCallbacks everywhere either. (Let alone "you can't pass a NULL pointer unless it explicitly says you can" and stuff like that.) The docs would become unwieldy if we specified every rule in every place it was appropriate. I agree that we need to make sure that the behavior is known, but I don't think that mentioning it on every single API is the right way to do that. > lastly, it's a matter of impact. most functions fed with invalid utf-8 will > misbehave only mildly, if at all. i'd definitely mark every function that > can crash (or worse). It's undefined behavior. There may be functions that mildly misbehave in 2.42 but crash in 2.44 due to a small change to the code. Or that mildly misbehave on Linux but crash on Windows, etc.
Created attachment 371266 [details] [review] gregex: Clarify units in documentation Make it a bit clearer that all lengths passed to GRegex methods are in bytes (not characters). This is mentioned in the section overview, but who reads that? Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 371267 [details] [review] gregex: Highlight some argument names in the documentation Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 371268 [details] [review] gregex: Highlight in the docs that input must be in UTF-8 Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 371266 [details] [review]: Okay
Review of attachment 371267 [details] [review]: Okay
Review of attachment 371268 [details] [review]: Okay
Attachment 371266 [details] pushed as fe35f57 - gregex: Clarify units in documentation Attachment 371267 [details] pushed as f75624f - gregex: Highlight some argument names in the documentation Attachment 371268 [details] pushed as 0adbeac - gregex: Highlight in the docs that input must be in UTF-8