After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 725072 - G_REGEX_RAW isn't truly "raw" in the PCRE sense
G_REGEX_RAW isn't truly "raw" in the PCRE sense
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: gregex
2.38.x
Other Mac OS
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2014-02-24 16:03 UTC by Hadriel Kaplan
Modified: 2018-05-24 16:20 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to add new G_REGEX_NO_UCP compile flag (3.74 KB, patch)
2014-02-24 20:37 UTC, Hadriel Kaplan
reviewed Details | Review

Description Hadriel Kaplan 2014-02-24 16:03:58 UTC
It's debatable if this is a bug vs. enhancement, but I think of it as a bug from a "user of the glib library" perspective.

Currently, glib always sets the PCRE_UCP (unicode properties) compile flag in g_regex_new(), regardless of G_REGEX_RAW being set or not. That has side-effects when searching strings with chars > 127, because pattern assertions like '\w' and '\b', as well as POSIX character classes like '[:alpha:]' and '[:lower:]', will match some of the characters in the > 127 range. (in fact, they match a lot of such characters)

I think most users would expect G_REGEX_RAW to actually mean "normal PCRE mode", which PCRE_UCP is not. (it's also slower, fwiw)

Since it's been this way forever, I'm not recommending changing G_REGEX_RAW behavior. Instead, I suggest a new GRegexCompileFlags enum value of 'G_REGEX_NO_UCP' be added, which if set will not set the PCRE_UCP flag in g_regex_new().
Comment 1 Hadriel Kaplan 2014-02-24 20:37:24 UTC
Created attachment 270183 [details] [review]
Patch to add new G_REGEX_NO_UCP compile flag
Comment 2 Christian Persch 2014-06-10 10:44:11 UTC
Review of attachment 270183 [details] [review]:

The patch looks fine to me with the one nit fixed. I'm don't however have an opinion on whether we should add this feature or not.

::: glib/gregex.h
@@ +291,3 @@
+ *     and some POSIX character classes such as '[:alpha:]', will match non-ASCII
+ *     characters. If this option is set, GRegex will classify characters using
+ *     normal ASCII tables instead. Since: 2.39

Since: 2.42 now (use the stable version).
Comment 3 Peter Wu 2015-10-02 14:12:40 UTC
(In reply to Christian Persch from comment #2)
> Review of attachment 270183 [details] [review] [review]:
> 
> The patch looks fine to me with the one nit fixed. I'm don't however have an
> opinion on whether we should add this feature or not.
> 
> ::: glib/gregex.h
> @@ +291,3 @@
> + *     and some POSIX character classes such as '[:alpha:]', will match
> non-ASCII
> + *     characters. If this option is set, GRegex will classify characters
> using
> + *     normal ASCII tables instead. Since: 2.39
> 
> Since: 2.42 now (use the stable version).

This patch still applies on top of 2.46. Can it be backported or does it only have a chance to make it in 2.48?

With (presumably) PCRE 8.34, the [[:punct:]] class unexpectedly matches even more characters even with G_REGEX_RAW set.

From http://www.pcre.org/original/changelog.txt (Version 8.34 15-December-2013):
31. Upgraded the handling of the POSIX classes [:graph:], [:print:], and
    [:punct:] when PCRE_UCP is set so as to include the same characters as Perl
    does in Unicode mode.

What do you think of making G_REGEX_RAW control the PCRE_UCP flag too without exposing a new G_REGEX_NOUCP option to the user? The only reason why you would not do this is when you try to create a "strings"-like utility that scrapes for data out of a binary. I don't know whether something like [\uabcd] would work though with G_REGEX_RAW.

The ability to disable PCRE_UCP is important given that we work with binary data and are only interested in matching ASCII with character classes.
Comment 4 Peter Wu 2016-08-08 12:08:54 UTC
Ping... this is breaking tests in Wireshark (a consumer of GRegex)...
Comment 5 Christian Persch 2016-08-08 14:23:55 UTC
My advice would be to port away from GRegex and use PCRE2 (not PCRE which is deprecated) directly.
Comment 6 Peter Wu 2016-08-08 14:47:44 UTC
GRegex is being used in the hope that it is available on platforms without having to worry about maintaining additional libraries. Are you suggesting that GRegex is being deprecated?
Comment 7 Emmanuele Bassi (:ebassi) 2016-08-08 15:08:19 UTC
(In reply to Peter Wu from comment #6)
> GRegex is being used in the hope that it is available on platforms without
> having to worry about maintaining additional libraries.

Sadly, PCRE has the tendency to change behaviour even in stable API series, which means that the only safe way for you to depend on specific behaviour is to have a strict dependency not only on a version of GLib but also on a version of PCRE. For some platforms this can be achieved by depending on the internal copy of PCRE but:

 * GLib does not update the internal copy of PCRE that often, which means that some features, bug fixes, or security updates may not be available until GLib itself is released
 * GLib is being increasingly built and tested with a system copy of PCRE — especially on Linux — instead of the internal copy, because Linux distributors favour unbundling

This means that you still need to worry about PCRE, without having the ability to control how your project responds to eventual changes.

> Are you suggesting that GRegex is being deprecated?

GRegex cannot be ported to PCRE2 in a compatible way; the more common PCRE2 becomes, the less useful GRegex is.

Whether or not GRegex is going to be deprecated has been a topic of discussion, but the GLib developers haven't found a resolution, yet.
Comment 8 Peter Wu 2016-08-08 15:25:38 UTC
The original PCRE library should be receiving bugfixes only now, PCRE2 will have the new features according to www.pcre.org.

Unbundling is a good thing, but if it breaks GRegex that is unfortunate. Would it be possible to incorporate patch comment 1 or is something else blocking it? (If the version needs to be changed, you can do it directly.)
Comment 9 GNOME Infrastructure Team 2018-05-24 16:20:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/837.