GNOME Bugzilla – Bug 725072
G_REGEX_RAW isn't truly "raw" in the PCRE sense
Last modified: 2018-05-24 16:20:07 UTC
It's debatable if this is a bug vs. enhancement, but I think of it as a bug from a "user of the glib library" perspective. Currently, glib always sets the PCRE_UCP (unicode properties) compile flag in g_regex_new(), regardless of G_REGEX_RAW being set or not. That has side-effects when searching strings with chars > 127, because pattern assertions like '\w' and '\b', as well as POSIX character classes like '[:alpha:]' and '[:lower:]', will match some of the characters in the > 127 range. (in fact, they match a lot of such characters) I think most users would expect G_REGEX_RAW to actually mean "normal PCRE mode", which PCRE_UCP is not. (it's also slower, fwiw) Since it's been this way forever, I'm not recommending changing G_REGEX_RAW behavior. Instead, I suggest a new GRegexCompileFlags enum value of 'G_REGEX_NO_UCP' be added, which if set will not set the PCRE_UCP flag in g_regex_new().
Created attachment 270183 [details] [review] Patch to add new G_REGEX_NO_UCP compile flag
Review of attachment 270183 [details] [review]: The patch looks fine to me with the one nit fixed. I'm don't however have an opinion on whether we should add this feature or not. ::: glib/gregex.h @@ +291,3 @@ + * and some POSIX character classes such as '[:alpha:]', will match non-ASCII + * characters. If this option is set, GRegex will classify characters using + * normal ASCII tables instead. Since: 2.39 Since: 2.42 now (use the stable version).
(In reply to Christian Persch from comment #2) > Review of attachment 270183 [details] [review] [review]: > > The patch looks fine to me with the one nit fixed. I'm don't however have an > opinion on whether we should add this feature or not. > > ::: glib/gregex.h > @@ +291,3 @@ > + * and some POSIX character classes such as '[:alpha:]', will match > non-ASCII > + * characters. If this option is set, GRegex will classify characters > using > + * normal ASCII tables instead. Since: 2.39 > > Since: 2.42 now (use the stable version). This patch still applies on top of 2.46. Can it be backported or does it only have a chance to make it in 2.48? With (presumably) PCRE 8.34, the [[:punct:]] class unexpectedly matches even more characters even with G_REGEX_RAW set. From http://www.pcre.org/original/changelog.txt (Version 8.34 15-December-2013): 31. Upgraded the handling of the POSIX classes [:graph:], [:print:], and [:punct:] when PCRE_UCP is set so as to include the same characters as Perl does in Unicode mode. What do you think of making G_REGEX_RAW control the PCRE_UCP flag too without exposing a new G_REGEX_NOUCP option to the user? The only reason why you would not do this is when you try to create a "strings"-like utility that scrapes for data out of a binary. I don't know whether something like [\uabcd] would work though with G_REGEX_RAW. The ability to disable PCRE_UCP is important given that we work with binary data and are only interested in matching ASCII with character classes.
Ping... this is breaking tests in Wireshark (a consumer of GRegex)...
My advice would be to port away from GRegex and use PCRE2 (not PCRE which is deprecated) directly.
GRegex is being used in the hope that it is available on platforms without having to worry about maintaining additional libraries. Are you suggesting that GRegex is being deprecated?
(In reply to Peter Wu from comment #6) > GRegex is being used in the hope that it is available on platforms without > having to worry about maintaining additional libraries. Sadly, PCRE has the tendency to change behaviour even in stable API series, which means that the only safe way for you to depend on specific behaviour is to have a strict dependency not only on a version of GLib but also on a version of PCRE. For some platforms this can be achieved by depending on the internal copy of PCRE but: * GLib does not update the internal copy of PCRE that often, which means that some features, bug fixes, or security updates may not be available until GLib itself is released * GLib is being increasingly built and tested with a system copy of PCRE — especially on Linux — instead of the internal copy, because Linux distributors favour unbundling This means that you still need to worry about PCRE, without having the ability to control how your project responds to eventual changes. > Are you suggesting that GRegex is being deprecated? GRegex cannot be ported to PCRE2 in a compatible way; the more common PCRE2 becomes, the less useful GRegex is. Whether or not GRegex is going to be deprecated has been a topic of discussion, but the GLib developers haven't found a resolution, yet.
The original PCRE library should be receiving bugfixes only now, PCRE2 will have the new features according to www.pcre.org. Unbundling is a good thing, but if it breaks GRegex that is unfortunate. Would it be possible to incorporate patch comment 1 or is something else blocking it? (If the version needs to be changed, you can do it directly.)
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/837.