Bug 748620 – g_regex_* utf-8 validity requirements are not stated clearly

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 748620 - g_regex_* utf-8 validity requirements are not stated clearly


Summary:	g_regex_* utf-8 validity requirements are not stated clearly


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	docs
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2015-04-29 08:28 UTC by Oswald Buddenhagen
Modified:	2018-05-08 11:29 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gregex: Clarify units in documentation (4.63 KB, patch) 2018-04-23 11:09 UTC, Philip Withnall	committed	Details \| Review
gregex: Highlight some argument names in the documentation (1.99 KB, patch) 2018-04-23 11:09 UTC, Philip Withnall	committed	Details \| Review
gregex: Highlight in the docs that input must be in UTF-8 (1.97 KB, patch) 2018-04-23 11:09 UTC, Philip Withnall	committed	Details \| Review

Description Oswald Buddenhagen 2015-04-29 08:28:23 UTC

a crash in midnight commander turned out to be caused by invalid utf-8 being passed to to g_regex_match(). that's obviously a bug in mc, but the glib documentation doesn't exactly help identifying it. https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html says only this:

"Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8."

it's about expectations. i'd expect invalid utf-8 sequences to just not match, or throw some error messages, but not crash. such unexpected behavior with far-reaching consequences should be rather clearly documented if it's intentional. additionally, a link to a suggested validation function (g_utf8_validate()) should be provided.

pcre does it right: "If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the result is undefined and your program may crash." - that means that it both clearly documents the issue, *and* provides a solution (at some cost, obviously).

Comment 1 Dan Winship 2015-04-29 12:34:03 UTC

This isn't specific to g_regex though; *all* GLib functions (and Gtk+ functions, etc) that take UTF-8 strings have undefined behavior if you pass invalid UTF-8 (except for functions like g_utf8_validate() that explicitly document otherwise). We don't really want to specify this on every single function. (It's documented somewhere... I don't know where though. Maybe we need a better "general GLib conventions" section in the docs.)

Comment 2 Egmont Koblinger 2015-04-29 12:49:29 UTC

The reason why I'd consider making an exception and making it a bit more explicit at g_regex_*() is because its argument's semantics depend on a flag. If G_REGEX_RAW is specified, it's perfectly valid for the argument to be invalid UTF-8. I guess this doesn't quite really happen with other methods, right? (In a more strongly typed language there'd probably be a different type for bytestream and for UTF-8 stream, and then you'd have to have two methods, or overload by the argument's type, or something like this... Yeah it's plain C with char*, but still, the expectation gets different based on the flag.)

(Purely for future reference: the mentioned mc bug is https://www.midnight-commander.org/ticket/3449. It has a very poor signal:noise ratio and doesn't add anything to this discussion right now.)

Comment 3 Oswald Buddenhagen 2015-04-29 21:07:53 UTC

i'd argue from a different perspective: people sometimes use regexes to validate untrusted input. so a small reminder that you really need to utf8-validate the input first sounds like a very good idea.

also note that most people don't read the whole docs from scratch, so "it's somewhere" doesn't really cut it if you don't link to it.

lastly, it's a matter of impact. most functions fed with invalid utf-8 will misbehave only mildly, if at all. i'd definitely mark every function that can crash (or worse).

Comment 4 Dan Winship 2015-04-30 13:03:33 UTC

(In reply to Oswald Buddenhagen from comment #3)
> also note that most people don't read the whole docs from scratch, so "it's
> somewhere" doesn't really cut it if you don't link to it.

We don't specify the rules for GErrors, GCancellables, and GAsyncReadyCallbacks everywhere either. (Let alone "you can't pass a NULL pointer unless it explicitly says you can" and stuff like that.) The docs would become unwieldy if we specified every rule in every place it was appropriate.

I agree that we need to make sure that the behavior is known, but I don't think that mentioning it on every single API is the right way to do that.

> lastly, it's a matter of impact. most functions fed with invalid utf-8 will
> misbehave only mildly, if at all. i'd definitely mark every function that
> can crash (or worse).

It's undefined behavior. There may be functions that mildly misbehave in 2.42 but crash in 2.44 due to a small change to the code. Or that mildly misbehave on Linux but crash on Windows, etc.

Comment 5 Philip Withnall 2018-04-23 11:09:30 UTC

Created attachment 371266 [details] [review]
gregex: Clarify units in documentation

Make it a bit clearer that all lengths passed to GRegex methods are in
bytes (not characters). This is mentioned in the section overview, but
who reads that?

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 6 Philip Withnall 2018-04-23 11:09:37 UTC

Created attachment 371267 [details] [review]
gregex: Highlight some argument names in the documentation

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 7 Philip Withnall 2018-04-23 11:09:44 UTC

Created attachment 371268 [details] [review]
gregex: Highlight in the docs that input must be in UTF-8

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 8 Emmanuele Bassi (:ebassi) 2018-05-08 11:24:46 UTC

Review of attachment 371266 [details] [review]:

Okay

Comment 9 Emmanuele Bassi (:ebassi) 2018-05-08 11:25:12 UTC

Review of attachment 371267 [details] [review]:

Okay

Comment 10 Emmanuele Bassi (:ebassi) 2018-05-08 11:25:35 UTC

Review of attachment 371268 [details] [review]:

Okay

Comment 11 Philip Withnall 2018-05-08 11:29:04 UTC

Attachment 371266 [details] pushed as fe35f57 - gregex: Clarify units in documentation
Attachment 371267 [details] pushed as f75624f - gregex: Highlight some argument names in the documentation
Attachment 371268 [details] pushed as 0adbeac - gregex: Highlight in the docs that input must be in UTF-8