After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 748620 - g_regex_* utf-8 validity requirements are not stated clearly
g_regex_* utf-8 validity requirements are not stated clearly
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: docs
unspecified
Other All
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2015-04-29 08:28 UTC by Oswald Buddenhagen
Modified: 2018-05-08 11:29 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gregex: Clarify units in documentation (4.63 KB, patch)
2018-04-23 11:09 UTC, Philip Withnall
committed Details | Review
gregex: Highlight some argument names in the documentation (1.99 KB, patch)
2018-04-23 11:09 UTC, Philip Withnall
committed Details | Review
gregex: Highlight in the docs that input must be in UTF-8 (1.97 KB, patch)
2018-04-23 11:09 UTC, Philip Withnall
committed Details | Review

Description Oswald Buddenhagen 2015-04-29 08:28:23 UTC
a crash in midnight commander turned out to be caused by invalid utf-8 being passed to to g_regex_match(). that's obviously a bug in mc, but the glib documentation doesn't exactly help identifying it. https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html says only this:

"Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8."

it's about expectations. i'd expect invalid utf-8 sequences to just not match, or throw some error messages, but not crash. such unexpected behavior with far-reaching consequences should be rather clearly documented if it's intentional. additionally, a link to a suggested validation function (g_utf8_validate()) should be provided.

pcre does it right: "If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the result is undefined and your program may crash." - that means that it both clearly documents the issue, *and* provides a solution (at some cost, obviously).
Comment 1 Dan Winship 2015-04-29 12:34:03 UTC
This isn't specific to g_regex though; *all* GLib functions (and Gtk+ functions, etc) that take UTF-8 strings have undefined behavior if you pass invalid UTF-8 (except for functions like g_utf8_validate() that explicitly document otherwise). We don't really want to specify this on every single function. (It's documented somewhere... I don't know where though. Maybe we need a better "general GLib conventions" section in the docs.)
Comment 2 Egmont Koblinger 2015-04-29 12:49:29 UTC
The reason why I'd consider making an exception and making it a bit more explicit at g_regex_*() is because its argument's semantics depend on a flag. If G_REGEX_RAW is specified, it's perfectly valid for the argument to be invalid UTF-8. I guess this doesn't quite really happen with other methods, right? (In a more strongly typed language there'd probably be a different type for bytestream and for UTF-8 stream, and then you'd have to have two methods, or overload by the argument's type, or something like this... Yeah it's plain C with char*, but still, the expectation gets different based on the flag.)

(Purely for future reference: the mentioned mc bug is https://www.midnight-commander.org/ticket/3449. It has a very poor signal:noise ratio and doesn't add anything to this discussion right now.)
Comment 3 Oswald Buddenhagen 2015-04-29 21:07:53 UTC
i'd argue from a different perspective: people sometimes use regexes to validate untrusted input. so a small reminder that you really need to utf8-validate the input first sounds like a very good idea.

also note that most people don't read the whole docs from scratch, so "it's somewhere" doesn't really cut it if you don't link to it.

lastly, it's a matter of impact. most functions fed with invalid utf-8 will misbehave only mildly, if at all. i'd definitely mark every function that can crash (or worse).
Comment 4 Dan Winship 2015-04-30 13:03:33 UTC
(In reply to Oswald Buddenhagen from comment #3)
> also note that most people don't read the whole docs from scratch, so "it's
> somewhere" doesn't really cut it if you don't link to it.

We don't specify the rules for GErrors, GCancellables, and GAsyncReadyCallbacks everywhere either. (Let alone "you can't pass a NULL pointer unless it explicitly says you can" and stuff like that.) The docs would become unwieldy if we specified every rule in every place it was appropriate.

I agree that we need to make sure that the behavior is known, but I don't think that mentioning it on every single API is the right way to do that.

> lastly, it's a matter of impact. most functions fed with invalid utf-8 will
> misbehave only mildly, if at all. i'd definitely mark every function that
> can crash (or worse).

It's undefined behavior. There may be functions that mildly misbehave in 2.42 but crash in 2.44 due to a small change to the code. Or that mildly misbehave on Linux but crash on Windows, etc.
Comment 5 Philip Withnall 2018-04-23 11:09:30 UTC
Created attachment 371266 [details] [review]
gregex: Clarify units in documentation

Make it a bit clearer that all lengths passed to GRegex methods are in
bytes (not characters). This is mentioned in the section overview, but
who reads that?

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 6 Philip Withnall 2018-04-23 11:09:37 UTC
Created attachment 371267 [details] [review]
gregex: Highlight some argument names in the documentation

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 7 Philip Withnall 2018-04-23 11:09:44 UTC
Created attachment 371268 [details] [review]
gregex: Highlight in the docs that input must be in UTF-8

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 8 Emmanuele Bassi (:ebassi) 2018-05-08 11:24:46 UTC
Review of attachment 371266 [details] [review]:

Okay
Comment 9 Emmanuele Bassi (:ebassi) 2018-05-08 11:25:12 UTC
Review of attachment 371267 [details] [review]:

Okay
Comment 10 Emmanuele Bassi (:ebassi) 2018-05-08 11:25:35 UTC
Review of attachment 371268 [details] [review]:

Okay
Comment 11 Philip Withnall 2018-05-08 11:29:04 UTC
Attachment 371266 [details] pushed as fe35f57 - gregex: Clarify units in documentation
Attachment 371267 [details] pushed as f75624f - gregex: Highlight some argument names in the documentation
Attachment 371268 [details] pushed as 0adbeac - gregex: Highlight in the docs that input must be in UTF-8