Bug 50075 – Unicode regular expression engine

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 50075 - (gregex) Unicode regular expression engine

(gregex)
Summary:	Unicode regular expression engine


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	348349 (view as bug list)
Depends on:	348348
Blocks:	144322 397549

Reported:	2000-12-15 23:38 UTC by Havoc Pennington
Modified:	2011-02-18 15:55 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
gregex PCRE wrapper (48.07 KB, application/x-gtar) 2001-10-16 14:34 UTC, Matthias Clasen		Details
Tar file with new stuff (84.69 KB, application/octet-stream) 2004-02-27 16:24 UTC, Morten Welinder		Details
Diffs to existing files (2.27 KB, patch) 2004-02-27 16:25 UTC, Morten Welinder	none	Details \| Review
new attempt (89.15 KB, application/octet-stream) 2004-06-04 00:42 UTC, Matthias Clasen		Details
new version (5.25 KB, text/plain) 2004-06-05 07:11 UTC, Matthias Clasen		Details
new version (17.87 KB, text/plain) 2004-06-05 07:13 UTC, Matthias Clasen		Details
gregex.c (19.36 KB, text/plain) 2004-06-06 05:32 UTC, Matthias Clasen		Details
New version (24.93 KB, text/plain) 2004-06-11 01:41 UTC, Matthias Clasen		Details
Patch to add GRegex (182.36 KB, application/x-gzip) 2006-10-20 20:45 UTC, Marco Barisione		Details
Patch to add GRegex (182.67 KB, application/x-gzip) 2006-10-21 17:02 UTC, Marco Barisione		Details
Patch to add GRegex (as a separate library) (183.04 KB, application/x-gzip) 2006-10-25 09:56 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (174.02 KB, application/x-gzip) 2006-10-27 19:15 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (173.80 KB, application/x-gzip) 2006-11-05 23:52 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (173.29 KB, application/x-gzip) 2006-11-07 22:54 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (174.78 KB, application/x-gzip) 2006-11-11 18:47 UTC, Marco Barisione		Details
Use G_GNUC_MAY_ALIAS instead of an ugly hack (174.59 KB, application/x-gzip) 2006-12-01 15:10 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (174.67 KB, application/x-gzip) 2006-12-17 16:23 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (174.67 KB, application/x-gzip) 2006-12-17 19:11 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (181.68 KB, application/x-gzip) 2006-12-19 15:21 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (185.14 KB, application/x-gzip) 2007-01-17 11:54 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (185.21 KB, application/x-gzip) 2007-01-18 10:06 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (185.52 KB, application/x-gzip) 2007-01-25 23:27 UTC, Marco Barisione		Details
another patch (7.70 KB, patch) 2007-01-26 21:11 UTC, Matthias Clasen	none	Details \| Review
Patch to add GRegex (inside libglib) (185.77 KB, application/x-gzip) 2007-01-27 15:15 UTC, Marco Barisione		Details
Patch to add GRegex (inside libglib) (185.73 KB, application/x-gzip) 2007-01-31 11:58 UTC, Marco Barisione		Details

Description Havoc Pennington 2000-12-15 23:38:06 UTC

Implement a regex engine for UTF-8

Comment 1 Havoc Pennington 2001-01-29 19:27:48 UTC

Setting all outstanding bugs against 1.3.x to be due for the 2.0.0 milestone;
will go through and move some of them to API freeze milestone and set milestone
to none for punted features.

Comment 2 Havoc Pennington 2001-01-29 19:29:16 UTC

Moving GLib bugs with API keyword to the API freeze milestone

Comment 3 Havoc Pennington 2001-02-15 21:19:59 UTC

This has no chance of making 2.0

Comment 4 Matthias Clasen 2001-10-16 14:30:08 UTC

I have finally been able to track down an archive of Scott Wimers
PCRE wrapper (dating back to 1999), which I'll attach below. Now
that PCRE-3.5 has reasonable UTF-8 support, maybe this would be an
option ?

Though, admittedly, the UTF-8 support in PCRE is more limited than
what glib has to offer, and a tighter integration would be nice
(eg PCRE's character classes can match only single byte characters,
while glib has nice Unicode character classifcations in the form
of g_unichar_is...).

Comment 5 Matthias Clasen 2001-10-16 14:34:55 UTC

Created attachment 5838 [details]
gregex PCRE wrapper

Comment 6 Owen Taylor 2001-10-17 16:57:14 UTC

This is defintiely still a 2.2 feature not a 2.0 feature.

I haven't looked at PCRE recently - if it now has UTF-8 support, 
using it is definitely a good possibility. (I actually made
some stabs at adding UTF-8 support myself at one point,
but gave up because of other more pressing issues.)

But I would like to see a large subset of the Perl Unicode
regex's features if possible; single byte character classes
do sound too limiting. Well, the PCRE codebase is pretty
nice in my experience so extending it shouldn't be hard.

Comment 7 Matthias Clasen 2002-03-15 08:54:22 UTC

Update: PRCE is at version 3.9 now, and the UTF-8 support seems to
have been fixed up a bit, but the same limitations still apply
(e.g. only ASCII character classes)

Comment 8 Paolo Maggi 2002-10-09 12:41:37 UTC

Any news about this RFE?

Comment 9 Matthias Clasen 2002-10-09 13:14:06 UTC

I'm not working on this. Feel free to give it a try.

Comment 10 Matthias Clasen 2003-03-03 22:43:24 UTC

PCRE is at 4.0 now, and the UTF-8 support seems to mostly complete.
The only real limitation is that case-insensitive matching is
restricted to single-byte characters.

Comment 11 Matthias Clasen 2003-03-04 09:44:53 UTC

And here is an old mail from Havoc concerning issues with Scotts
PCRE 
wrapper:

http://mail.gnome.org/archives/gtk-devel-list/2000-
May/msg00039.html

Comment 12 Morten Welinder 2004-02-27 14:44:13 UTC

For the record, I imported PCRE into Gnumeric (in the form of making
UTF-8 versions of regcomp, regexec, etc.)  There were a few problems
(aka bugs), but the author addressed them quickly.

If there is interest, I could adapt my import script to glib.
Clearing target milestone.

Comment 13 Matthias Clasen 2004-02-27 14:46:51 UTC

Did you adapter PCRE to make use of the GLib Unicode facilities, 
rather than duplicating them ?

Comment 14 Morten Welinder 2004-02-27 14:49:21 UTC

"Adapt", if I understand you right.  Here's a sniplet...
	s/\b(to(lower|upper)|is(lower|upper|digit|xdigit|space|print|cntrl|alpha|alnum|graph|punct))\b/g_unichar_$1/g;

Comment 15 Matthias Clasen 2004-02-27 14:52:16 UTC

I would be interested, yes.

Comment 16 Morten Welinder 2004-02-27 16:24:29 UTC

Created attachment 24858 [details]
Tar file with new stuff

Comment 17 Morten Welinder 2004-02-27 16:25:11 UTC

Created attachment 24859 [details] [review]
Diffs to existing files

Comment 18 Morten Welinder 2004-02-27 16:26:41 UTC

That wasn't really hard.  This import is from PCRE 4.5.

The API to study is in gregex.h.

Comment 19 Matthias Clasen 2004-06-04 00:42:58 UTC

Created attachment 28314 [details]
new attempt

Here is a first cut at this, based on Scotts old stuff. I have reduced the api
a bit and made it more g-esk. 

The included PCRE 4.5 has the following modifications:
- underscore-prefix all symbols
- use g_malloc and friends

Remaining tasks:
- make PCRE use g_utf8
- add a testsuite

Comment 20 Morten Welinder 2004-06-04 01:25:20 UTC

Very interesting.  (And very g-esk!)

You should force SUPPORT_UTF8=1 and force all the checks to true somewhere --
see my perl script, for example.

Re g_regex_replace_eval:

* what is the "g" option?
* consider calling the replacement callback with the regex too.  It's going
  to need it to refer back to $n.  (That could be encoded in the user pointer,
  but I don't consider that elegant.)
* having the replacement function create a new string (by copying, for example)
  and then copy that string in replace_eval sounds like a lot of work.  Maybe
  send the GString* to the callback.


Re g_regex_split_next: have a look at g_strndup.

Comment 21 Morten Welinder 2004-06-04 11:45:45 UTC

Also, the GRegex struct needs to have a warning that all such structs need to
be created with g_regex_new as the real struct contains additional members.
I'm not sure why GRegex is exposed at all.

How do you propose to keep up with changes to PCRE with this approach?  (With
my approach it is a most often a matter of re-running the script.)

Comment 22 Matthias Clasen 2004-06-04 13:05:42 UTC

Thanks for the review. Some of your points (e.g then mention of the "g" option,
and the non-use of gstrndup) are just due to me starting from Scotts 99-vintage
patch. 

You are probably right about the struct, it should probably be completely hidden, 
and a getter for the match position should be added. I have not yet given much
thought to the issue of automating the PCRE import, because I really want to
look into making PCRE use the glib unicode support. That will likely make the
modifications too invasive for automatic import to work well. Also, PCRE isn't
exactly released frequently...

Comment 23 Matthias Clasen 2004-06-05 07:11:25 UTC

Created attachment 28363 [details]
new version

Comment 24 Matthias Clasen 2004-06-05 07:13:39 UTC

Created attachment 28364 [details]
new version

Here is an update, taking your comments into account. 
Note that the eval callback changed it a bit. It gets the regex,
the original string and the GString now, can fetch the last match
if it wants to, can append to the result if it wants to, and can
stop further replacements by returning TRUE.

Comment 25 Matthias Clasen 2004-06-06 05:32:10 UTC

Created attachment 28387 [details]
gregex.c

Here is another update. g_regex_replace() now interpolates backreferences of
the form $n in the replacement text

Comment 26 Morten Welinder 2004-06-08 00:57:17 UTC

While it's cute to use the regexp engine to split the replacement string I don't
think it's a good idea.  You will not be able to handle escapes.

What is the exact replacement semantics?  (What if I want to have a dollar sign
in the replacement text?  What if I want to follow $1 by a number?  What about
case in the replacement text? - s/foo/bar/i on FOO might yield BAR, depending
on options.)

For inspiration, fire up search/replace in Gnumeric or read about Emacs' engine.

Comment 27 Matthias Clasen 2004-06-08 01:24:57 UTC

Well, the code I currently have on my disk makes an attempt at handling escapes
by replacing $$ with a literal $. It doesn't handle either reference-followed by
number nor case-sensitive replacement, though. I'm actually thinking about using
the python syntax for references instead of the perl syntax currently, ie
\num or \g<num> and \g<name> for named references. 

Regarding exact semantics, the goal must be to match the perl/python semantics
for replacement as close as possible, I'd say (if only to be able to refer to
the extensive documentation about regex's in these languages). I don't know if
any of these actually support changing the case of the replacement depending on
the case of the match. I know emacs does that...

Comment 28 Matthias Clasen 2004-06-11 01:41:01 UTC

Created attachment 28576 [details]
New version

This version uses python escape sequences in the replacement text; I'm not 100%
sure that is the right thing to do, since PCRE is perl-compatible, not
python-compatible, but it allows for nice things like named captures, which
perl doesn't provide. To that end, I've also added g_regex_fetch_named()

Comment 29 Matthias Clasen 2004-06-29 14:21:00 UTC

We decided to move this into libegg first, to gain some experience with the api
and work on more complete unicode support.

Comment 30 Marco Barisione 2006-07-30 11:17:29 UTC

[This is the comment of bug #348349 that I marked as duplicate of this]

I propose to add a wrapper around PCRE to glib, this should be a separated
libgregex library to mantain the core libglib as small as possible.

The relevant thread in gtk-devel-list is archived at
http://mail.gnome.org/archives/gtk-devel-list/2006-July/thread.html#00099

The source code in at http://techn.ocracy.org/eggregex/, a copy of the
documentation is at http://www.barisione.org/eggregex/.

The code is based (with many modifications) on EggRegex from the libegg module.

(eg)gregex needs to know the script of a gunichar to allow \p{script-name} in
patterns. For now there is an internal copy of some code from Pango, see bug #348348. Please someone can make depend this bug on #348348? I do not have enough permissions.

Comment 31 Marco Barisione 2006-07-30 11:18:21 UTC

*** Bug 348349 has been marked as a duplicate of this bug. ***

Comment 32 Marco Barisione 2006-09-20 14:07:31 UTC

PCRE can be configured at compile time to recognize \n or \r or \r\n as the end of a line. This parameter can be overwritten at compile or match time passing the flags PCRE_NEWLINE_(CR|LF|CRLF).
What should we do in EggRegex? Having \n as the default is useful even on Windows as text files are opened in text mode, but it could be a problem in some cases. The ideal thing is having (\n|\r|\r\n) as newline, but, AFAIK, it's not possible.

egg_regex_replace() accepts an utf-8 replacement string. What should I do when the EGG_REGEX_RAW flag is set? We have the same problem with egg_regex_replace_literal() and egg_regex_expand_references().

egg_regex_match_simple() returns a gboolean. Maybe it would be better to return an int with the same meaning of the return value of egg_regex_get_match_count(). What is your opinion?

What should egg_regex_new() and egg_regex_match*() do when they receive an unknown flag? If the flag is not valid for PCRE then the pcre_* functions just fail. However it could be a flag introduced by a newer version of PCRE or not wrapped by EggRegex, causing a unexpected PCRE behavior that could crash EggRegex.

Comment 33 Matthias Clasen 2006-10-08 18:14:22 UTC

I agree that \n|\r|\r\n would be ideal. If we can't get that we should probably
expose it as match flags. 

For egg_regex_replace() I think we should enforce that the replacement is utf-8
(seems like a good idea in the light of the escape processing that is happening).
The documentation should probably point to egg_regex_replace_literal() for replacing with non-utf-8, and replace_literal() should explicitly allow 
non-utf-8. Since expand_references() is also doing splitting and reference processing on the string_to_expand, it should probably require it to be utf-8.

I think the simple boolean return value from match_simple() is very appealing.

egg_regex_new should refuse unknown flags.

Comment 34 Marco Barisione 2006-10-11 13:52:42 UTC

I wrote to PCRE author about the problem with newlines, this is his response:

This idea is already on a list to consider, but it isn't as simple as
you make it sound, because, in addition to \r and \n there are a number
of other Unicode characters that are considered to be line terminators.

Also, doing it would *not* mean that \n in a pattern matched any of
these "newlines". There's a web page about regex and Unicode

http://unicode.org/unicode/reports/tr18/

and (unless I'm misremembering it), it is clear that \n matches
character U000A only. I think there's a suggestion for a possible
different escape sequence to match "any line terminator". What *would*
happen with PCRE_NEWLINE_ANY is that ^ and $ would match after/before
any of the recognized line terminators.

I propose to keep '\n' as the default, even on Windows and MacOS. This make sense because strings are usually read from a file opened in text mode.

At compile-time this setting can be changed passing EGG_REGEX_NEWLINE_CR or EGG_REGEX_NEWLINE_CRLF to egg_regex_new(). EGG_REGEX_NEWLINE_LF is not needed as it's the default.

At match-time this setting can be changed passing EGG_REGEX_MATCH_NEWLINE_CR, EGG_REGEX_MATCH_NEWLINE_CRLF or EGG_REGEX_MATCH_NEWLINE_LF.

A line break is relevant when compiling a pattern only if EGG_REGEX_EXTENDED is set, because an unescaped # outside a character class begins a comment that lasts until after the next newline. In this case the pattern string must use the newline specified as a compile-time option.

It's not so easy but the default (\n) should work in most cases. What do you think?

Comment 35 Marco Barisione 2006-10-20 20:45:16 UTC

Created attachment 75104 [details]
Patch to add GRegex

This patch adds a seprarate GRegex library. It contains makefiles for Windows but they are not tested.

Comment 36 Marco Barisione 2006-10-21 17:02:39 UTC

Created attachment 75140 [details]
Patch to add GRegex

Fixed a couple of minor win32 things

Comment 37 Marco Barisione 2006-10-25 09:56:22 UTC

Created attachment 75354 [details]
Patch to add GRegex (as a separate library)

Fixed:
- configure checks if the system installed PCRE supports UTF-8 and Unicode properties
- Added a border to the tables in the documentation
- Allow tags in the documentation extracted from .c files
- Use gettext instead of dummy defines

Comment 38 Marco Barisione 2006-10-27 19:15:27 UTC

Created attachment 75526 [details]
Patch to add GRegex (inside libglib)

To avoid exporting symbols from the internal PCRE I changed LIBTOOL_EXPORT_OPTIONS to '-export-symbols-regex "^g.*"', so it exports only symbols starting with "g". Is this acceptable or there are cases when libglib exports symbols not starting with "g"?

When glib is compiled against an external PCRE library, pltcheck.sh fails with this output:
Checking .libs/libglib-2.0.so for local PLT entries
000a6214  00001407 R_386_JUMP_SLOT   000631ad   g_bit_storage
000a6424  00042e07 R_386_JUMP_SLOT   0006316e   g_bit_nth_msf
I don't understand the reason, note that GRegex is not using g_bit_storage or g_bit_nth_msf.

I cannot understand how glib is built under Windows. glib/makefile.msc.in is wrong (it's the makefile for gobject, not glib) and glib/makefile.mingw is never generated from glib/makefile.mingw.in (or at least it's not generated by configure or by win32-fixup.pl).

Comment 39 Marco Barisione 2006-11-05 23:52:26 UTC

Created attachment 76056 [details]
Patch to add GRegex (inside libglib)

Tor Lillqvist removed the old and unused MinGW files from HEAD and Hans Breuer restored the correct glib/makefile.msc.in. So now my Windows problems are resolved.

The only remaining problem is the failure of pltcheck.sh when libglib does not use the internal PCRE.

Comment 40 Havoc Pennington 2006-11-06 01:40:47 UTC

It might be helpful to document some pros and cons of using --with-pcre=system, otherwise distribution packagers will use the wrong one, if there is a wrong one... the packager default is usually to use system copies but in this case it sounds like that would bloat or even break things for most gtk apps since the unicode tables wouldn't be shared?

Also, I mentioned this briefly on the list, there may be language binding concerns with something refcounted that lacks object data. My knowledge may be out of date but last time I understood this stuff the language-bindable patterns were:
 - "value object" (has copy/free)
 - "GObject" (has ref/unref/object-data)
 - "immutable value object" (has ref/unref, but they are just an optimization of copy - think "copy on write where writing never happens")

It may be that most languages already have regexes so there's no point binding GRegex, similarly to how GSList doesn't really get direct language bindings. On the other hand, if any higher-level APIs were based on taking or returning a GRegex (e.g. in an object property) it could get tricky to bind those APIs.

Without object data you can't keep object identity in the binding, i.e. the binding has to create multiple wrapper/proxy objects for one C object. This means that equality tests (and hash tables based on them) for example can break, or that the language's equivalent of GObject object data could break.
Something like:
 regex = new Regex();
 gobject.setProperty("regex", regex);
 assert regex == gobject.getProperty("regex"); // this would fail

A possible fix is to have just free() instead of unref(), but implement copy() internally as a copy-on-write, so copy()ing the regex as it's passed around in GBoxed is cheap. If an object had a regex property, it would conceptually store a copy of the regex that was passed in i.e. value semantics. (Possibly what's wanted anyway, since GRegex has all the match state in it - seems like you pretty much always want a new copy anytime you use the regex to match.)

Comment 41 Behdad Esfahbod 2006-11-06 05:05:41 UTC

(In reply to comment #38)
> Created an attachment (id=75526) [edit]
> Patch to add GRegex (inside libglib)
> 
> To avoid exporting symbols from the internal PCRE I changed
> LIBTOOL_EXPORT_OPTIONS to '-export-symbols-regex "^g.*"', so it exports only
> symbols starting with "g". Is this acceptable or there are cases when libglib
> exports symbols not starting with "g"?

"^g_.*" should be ok.

> When glib is compiled against an external PCRE library, pltcheck.sh fails with
> this output:
> Checking .libs/libglib-2.0.so for local PLT entries
> 000a6214  00001407 R_386_JUMP_SLOT   000631ad   g_bit_storage
> 000a6424  00042e07 R_386_JUMP_SLOT   0006316e   g_bit_nth_msf
> I don't understand the reason, note that GRegex is not using g_bit_storage or
> g_bit_nth_msf.

This doesn't make sense.  Are you sure that this doesn't happen without your patch?  I can only see this happen if for some reason glib thinks your compiler cannot inline.

Comment 42 Marco Barisione 2006-11-06 11:43:19 UTC

(In reply to comment #40)
> It might be helpful to document some pros and cons of using --with-pcre=system,
> otherwise distribution packagers will use the wrong one, if there is a wrong
> one... the packager default is usually to use system copies but in this case it
> sounds like that would bloat or even break things for most gtk apps since the
> unicode tables wouldn't be shared?

The Unicode tables used by PCRE should be identical to the tables used by glib, so using PCRE's tables is (probably) not a problem. All the tests in tests/regex-test.c pass using the system PCRE.
BTW Unicode tables are huge, so having a single copy is a good idea. Moreover glib's implementation is probably faster.

PCRE has a feature called callouts (see man pcrecallout) that relies on a global variable. I didn't wrapped this feature in GRegex to avoid problems if using a shared PCRE library, but, if we will suggest to use the internal PCRE, I'll wrap it.

A problem due to the use of a system-supplied PCRE is that it could add new features. Suppose that PCRE 6.8 will add a new construct to the regular expression syntax, then a developer writing a program on a distro shipping glib 2.14 linked against a shared libpcre 6.8 could use this construct. This means that his program will not work as expected on another distro using glib 2.14 linked against libpcre 6.7 or using the internal copy.

I will add some comments on --with-pcre=system to docs/reference/glib/building.sgml.

> Also, I mentioned this briefly on the list, there may be language binding
> concerns with something refcounted that lacks object data.

In GRegex I use the same approach used by other ref counted structures in GRegex.

> It may be that most languages already have regexes so there's no point binding
> GRegex, similarly to how GSList doesn't really get direct language bindings.

PCRE has some features such as DFA matching and partial matching that are not supported by other languages. So it could be useful to have bindings for GRegex.

> A possible fix is to have just free() instead of unref(), but implement copy()
> internally as a copy-on-write, so copy()ing the regex as it's passed around in
> GBoxed is cheap. If an object had a regex property, it would conceptually store
> a copy of the regex that was passed in i.e. value semantics. (Possibly what's
> wanted anyway, since GRegex has all the match state in it - seems like you
> pretty much always want a new copy anytime you use the regex to match.)

Match functions modify the internal state so copy-on-write is not really useful as a real copy would happen as soon as you use a match or split function.

If ref counting is a real problem I can remove it, but GtkSourceView and MooEdit, the only two programs that are using GRegex at the moment, use ref counting.

However the code I wrote for GtkSourceView could live without ref counting because it wraps GRegex in another structure, so ref counting can be implemented in this struct.


(In reply to comment #41)
> This doesn't make sense.  Are you sure that this doesn't happen without your
> patch?  I can only see this happen if for some reason glib thinks your compiler
> cannot inline.

This happens only when compiling glib with my patch using --with-pcre=system.

I'm using "gcc (GCC) 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)", from Debian unstable.

configure.in thinks that my gcc supports inlining:
checking for __inline... yes
checking for __inline__... yes
checking for inline... yes
checking if inline functions in headers work... yes

Can you try "./autogen.sh --with-pcre=system"?

Comment 43 Havoc Pennington 2006-11-06 15:31:57 UTC

> In GRegex I use the same approach used by other ref counted structures in
> GRegex.

This list of refcounted things in libglib is very short though, order of 5-ish; and I think they mostly _are_ a little problematic binding-wise, when it comes up, though it happens that GIOChannel and GSource are unlikely to be the values of an object property for example.

> Match functions modify the internal state so copy-on-write is not really
> useful as a real copy would happen as soon as you use a match or split 
> function.

Say you have a terminal widget like VTE and it has an "URL regex" property, if you set this property I'd expect something like:

  GRegex *regex = g_regex_new(...);
  g_object_set(vte, "url-regex", regex, NULL);
  g_regex_free(regex);

There are three copies here, the original, the one in the GValue created by g_object_set, and the one that the VTE widget keeps.

The original and the one in the GValue are freed immediately and never used, then probably later back in the main loop VTE might match or split.

It would also be possible to have an "immutable part" and a "match part" perhaps, where the immutable part (the pattern) was never copied even on write, and the "match part" was copied only when each instance does a match.

Comment 44 Yevgen Muntyan 2006-11-06 17:52:25 UTC

(In reply to comment #40)
> Also, I mentioned this briefly on the list, there may be language binding
> concerns with something refcounted that lacks object data. My knowledge may be
> out of date but last time I understood this stuff the language-bindable
> patterns were:
>  - "value object" (has copy/free)
>  - "GObject" (has ref/unref/object-data)
>  - "immutable value object" (has ref/unref, but they are just an optimization
> of copy - think "copy on write where writing never happens")
> 
> It may be that most languages already have regexes so there's no point binding
> GRegex, similarly to how GSList doesn't really get direct language bindings. On
> the other hand, if any higher-level APIs were based on taking or returning a
> GRegex (e.g. in an object property) it could get tricky to bind those APIs.
> 
> Without object data you can't keep object identity in the binding, i.e. the
> binding has to create multiple wrapper/proxy objects for one C object. This
> means that equality tests (and hash tables based on them) for example can
> break, or that the language's equivalent of GObject object data could break.
> Something like:
>  regex = new Regex();
>  gobject.setProperty("regex", regex);
>  assert regex == gobject.getProperty("regex"); // this would fail

Why would it fail in the case when GRegex has Boxed type with ref/unref as copy/free? In other words, what exactly is the bindings problem here?

There may be a problem when you perform search, and save GRegex structure for later use of the *match*. But, it's a problem because there's no GRegexMatch structure; and use cases where you need such a thing should be really rare. Normal use case is "do search; use match". In this "normal" case you have no problems with refcounting. And, it's no language-bindings-specific at all.

Comment 45 Yevgen Muntyan 2006-11-06 18:03:17 UTC

(In reply to comment #44)
> (In reply to comment #40)
> > Also, I mentioned this briefly on the list, there may be language binding
> > concerns with something refcounted that lacks object data. My knowledge may be
> > out of date but last time I understood this stuff the language-bindable
> > patterns were:
> >  - "value object" (has copy/free)
> >  - "GObject" (has ref/unref/object-data)
> >  - "immutable value object" (has ref/unref, but they are just an optimization
> > of copy - think "copy on write where writing never happens")
> > 
> > It may be that most languages already have regexes so there's no point binding
> > GRegex, similarly to how GSList doesn't really get direct language bindings. On
> > the other hand, if any higher-level APIs were based on taking or returning a
> > GRegex (e.g. in an object property) it could get tricky to bind those APIs.
> > 
> > Without object data you can't keep object identity in the binding, i.e. the
> > binding has to create multiple wrapper/proxy objects for one C object. This
> > means that equality tests (and hash tables based on them) for example can
> > break, or that the language's equivalent of GObject object data could break.
> > Something like:
> >  regex = new Regex();
> >  gobject.setProperty("regex", regex);
> >  assert regex == gobject.getProperty("regex"); // this would fail
> 
> Why would it fail in the case when GRegex has Boxed type with ref/unref as
> copy/free? In other words, what exactly is the bindings problem here?

Oops, I got it. getProperty() will create a new wrapper since it has no idea that the boxed it returns is the same one, so wrapper object will be different.
Well, ref/unref vs copy/free has nothing to do with it.

Comment 46 Havoc Pennington 2006-11-06 18:56:36 UTC

> getProperty() will create a new wrapper since it has no idea
> that the boxed it returns is the same one, so wrapper object will be 
> different.

Exactly

> Well, ref/unref vs copy/free has nothing to do with it.

As I outlined, the issue is "by-value" vs. "by-reference" objects (in Java or Python, think primitive types such as int, vs. Object). If you have an object that's refcounted but has no object data, then there are problems wrapping it as a reference type since you can't have a single proxy.

There are also problems wrapping such a thing as a value type, because APIs using the type in C might assume it is a reference type. To pick an extreme example, if I container_add(container, widget) then the API strongly assumes that "widget" is a reference type and I'll be able to modify it by reference later. i.e. it would very much break the container_add() API if a language binding tried to copy the widget (treat it as a value type).

If APIs stick to the three patterns I mentioned (copy/free, ref/unref+objectdata, ref/unref+immutable) then the usage in C will always correspond nicely to the usage in another language. Other patterns may also be possible, but often risk allowing C APIs that don't make sense when bound.

Comment 47 Yevgen Muntyan 2006-11-06 20:13:53 UTC

(In reply to comment #46)
> > getProperty() will create a new wrapper since it has no idea
> > that the boxed it returns is the same one, so wrapper object will be 
> > different.
> 
> Exactly
> 
> > Well, ref/unref vs copy/free has nothing to do with it.
> 
> As I outlined, the issue is "by-value" vs. "by-reference" objects (in Java or
> Python, think primitive types such as int, vs. Object). If you have an object
> that's refcounted but has no object data, then there are problems wrapping it
> as a reference type since you can't have a single proxy.

No, you do not wrap it as a "reference type", you wrap it as an opaque Boxed type, and you don't have any problem at all. That assert will fail, but it's not a problem since nobody said it should not fail (if binding wants it not to fail, the binding should be clever and employ non-existing g_regex_equal method in *any* case).

> There are also problems wrapping such a thing as a value type, because APIs
> using the type in C might assume it is a reference type. To pick an extreme
> example, if I container_add(container, widget) then the API strongly assumes
> that "widget" is a reference type and I'll be able to modify it by reference
> later. i.e. it would very much break the container_add() API if a language
> binding tried to copy the widget (treat it as a value type).

It's all about some sort of guarantees provided by certain functions. In this example, it's known that widget is not duplicated. In GRegex case it's the same thing; or it's known that it's copied by value if decided so. But there is no problem for bindings which would exist in case of ref-counted GRegex and would not exist in case of copied-by-value GRegex.

> If APIs stick to the three patterns I mentioned (copy/free,
> ref/unref+objectdata, ref/unref+immutable) then the usage in C will always
> correspond nicely to the usage in another language. Other patterns may also be
> possible, but often risk allowing C APIs that don't make sense when bound.

This is the third pattern - immutable ref-counted structure. But again, binding should not care about this. All it should know is how to wrap Boxed types.

And actually, why wrap GRegex in bindings at all? Shouldn't it be a thing like GList, and shouldn't nice objects have "regex-pattern" properties instead of G_TYPE_REGEX properties? Are there languages other than C++ (which is C anyway) which need GRegex?

Comment 48 Havoc Pennington 2006-11-06 20:43:09 UTC

> you do not wrap it as a "reference type", you wrap it as an opaque Boxed
> type, and you don't have any problem at all

Just not the case. This leaves you with an object that doesn't behave in the same way as any normal object in the language. Boxed types can be generically marshaled but not generically proxied/wrapped - to proxy/wrap you need to know whether the proxy/wrapper has value or reference semantics.

> It's all about some sort of guarantees provided by certain functions.

Language bindings have to choose between value or reference on the basis of the type being wrapped, not on the basis of the function being called.

If the binding chooses value semantics then any C API that relies on reference semantics won't be bindable.

The point about ref/unref without object data is that it allows a C API to use reference semantics but makes it hard for a binding to support reference semantics, which leads to an unbindable API.

If you have only copy/free then C APIs are forced to use value semantics, which is easy to bind; if you have ref/unref + object-data then it's easy to bind with reference semantics. ref/unref without object data can be cleanly bound as a value object, but allows C APIs to rely on it as a reference object and create an unbindable API.

> actually, why wrap GRegex in bindings at all?

I already asked that question in an earlier comment, and it's a fair question.

Given that there's no meaningful performance or ease-of-use penalty for sticking to one of the three "bindable" patterns though, why not just use one of them.

Comment 49 Marco Barisione 2006-11-07 22:54:14 UTC

Created attachment 76187 [details]
Patch to add GRegex (inside libglib)

- Removed g_regex_ref and g_regex_unref, and added g_regex_free.
- Explained why it's better to use the internal PCRE.

Comment 50 Marco Barisione 2006-11-08 14:41:01 UTC

Yevgen Muntyan disagrees with the removal of _ref/_unref. Personally I don't care as my code can live without it and I'm not a language binding expert.

I would like to hear the last word on ref couting by GLib maintainers.

Comment 51 Yevgen Muntyan 2006-11-08 14:47:54 UTC

Let me point out that

1) ref/unref doesn't create a problem for bindings alone. It may create a problem if C api is like gtk_radio_stuff which uses GList, i.e. if it uses actual C pointer to GRegex structure with no reason. Note that in VTE example GRegex property is not plausible, it'd rather continue have "pattern" property or something; thing is that GRegex is not an alone structure with single match() method - match() takes lots of options.
2) ref/unref allows to avoid copying regexes - in case when pattern is, say, 1k 
characters long (GtkSourceView easily has such patterns), copying regexes around is pointless.

Not really a biggest deal, but why change it?

Comment 52 Havoc Pennington 2006-11-08 16:46:32 UTC

Yevgen, if you are worried about efficiency, just make copy() fast; there's no reason to _ever_ copy the "pattern" part of the regex since it's immutable, and the "match results" part can be copied-on-write or just always made NULL on new copies.

I agree one wants to treat pattern and match results a little differently ideally (which is why most languages have two objects), but it's simpler in C to have only one object so GRegex combines them; still, internally GRegex can think of these two parts of GRegex separately and treat them separately.

If they were separate then GPattern could follow the "immutable value object with ref/unref" pattern and GMatch could follow the "copy/free" pattern. With GRegex combined, it could follow "copy/free" but have an immutable ref/unref GPattern as an internal implementation detail.

(not a glib maintainer, if it isn't obvious)

Comment 53 Yevgen Muntyan 2006-11-08 17:52:35 UTC

(In reply to comment #52)
> Yevgen, if you are worried about efficiency, just make copy() fast; there's no
> reason to _ever_ copy the "pattern" part of the regex since it's immutable, and
> the "match results" part can be copied-on-write or just always made NULL on new
> copies.

This basically means moving ref/unref inside the GRegex - ref-count the pattern + compiled pcre structure. If it's better for bindings to have internal ref/unref, then so be it; just don't copy pattern and pcre structure when not needed (note, you have more than pattern there - there is also compiled pattern and result of pcre_study). 

> I agree one wants to treat pattern and match results a little differently
> ideally (which is why most languages have two objects), but it's simpler in C
> to have only one object so GRegex combines them; still, internally GRegex can
> think of these two parts of GRegex separately and treat them separately.
> 
> If they were separate then GPattern could follow the "immutable value object
> with ref/unref" pattern and GMatch could follow the "copy/free" pattern. With
> GRegex combined, it could follow "copy/free" but have an immutable ref/unref
> GPattern as an internal implementation detail.

It'd suck for C, GRegex combining pattern and match is not very nice, but it's easy.

Comment 54 Marco Barisione 2006-11-11 18:47:49 UTC

Created attachment 76399 [details]
Patch to add GRegex (inside libglib)

GRegex now contains only two pointers to the private structs GRegexPattern and GRegexMatch.

GRegexPattern is shared among different regexes, GRegexMatch is created only when needed.

Comment 55 Marco Barisione 2006-12-01 15:10:50 UTC

Created attachment 77488 [details]
Use G_GNUC_MAY_ALIAS instead of an ugly hack

Is there any news on GRegex inclusion in glib?

Comment 56 Marco Barisione 2006-12-17 16:23:41 UTC

Created attachment 78518 [details]
Patch to add GRegex (inside libglib)

I did some minor changes:
- g_regex_copy() returns NULL when the regex argument is NULL
- removed an unused parameter from split_replacemente() and expand_escape()
- changed the refentry id of regex-syntax.sgml from gregex-syntax.html to glib-regex-syntax.html

The EggRegex code (used by some branches of gtksourceview) at http://techn.ocracy.org/eggregex has been updated too.

Comment 57 Marco Barisione 2006-12-17 19:11:05 UTC

Created attachment 78528 [details]
Patch to add GRegex (inside libglib)

The previous patch didn't changed the refentry id of regex-syntax.sgml (so the html file was named gregex-syntax.html instead of glib-regex-syntax.html.)

Comment 58 Marco Barisione 2006-12-19 15:21:09 UTC

Created attachment 78628 [details]
Patch to add GRegex (inside libglib)

Today Philip Hazel released PCRE 7, this new version of the patch is using PCRE 7 as the internal copy of PCRE.

The new PCRE 7 has bug fixes and some refactoring of its internals, probabily the only visible new things are:
 - Added some syntactic alternatives for features that PCRE had previously implemented using the Python syntax or some other invented syntax. This alternatives are offered to match the new features added in Perl 5.10.
 - Implemented PCRE_NEWLINE_ANY to recognize any of the Unicode newline sequences (http://unicode.org/unicode/reports/tr18/) as "newline" when processing dot, circumflex, or dollar metacharacters, or #-comments in extended mode.
 - Added \R to match any Unicode newline sequence, as suggested in the Unicode report.

Currently GRegex can use PCRE 6.7 or 7 if compiled with --with-pcre=system. If you agree I would require PCRE 7, document the new syntax features and use by default PCRE_NEWLINE_ANY.

Comment 59 Marco Barisione 2006-12-19 15:46:10 UTC

I tested the new newline handling, it recognizes "\r\n", "\n" and "\r" as a single newline character, "\r\r", "\n\r", "\n\n" are recognized correctly as two newline characters.
\R and PCRE_NEWLINE_ANY implementation seems fast (but I didn't tested it), so I think that we really need to use PCRE_NEWLINE_ANY as the default and to write in big bold letters to always use \R when the input string may contain \r\n or other newlines.

Muntyan: using \R should solve the newline problems in gtksourceview, ASAP I'm going to update eggregex so you can test it.

Comment 60 Daniel Elstner 2007-01-16 14:40:06 UTC

Regarding the question whether language bindings would want to wrap GRegex:
glibmm will definitely wrap it.

Comment 61 Matthias Clasen 2007-01-16 17:57:57 UTC

Referring back to the discussion about ref/unref/object data vs language bindings,
I'd like to point out that 

"Datasets associate groups of data elements with particular memory locations. These are useful if you need to associate data with a structure returned from an external library. Since you cannot modify the structure, you use its location in memory as the key into a dataset, where you can associate any number of data elements with it."

So, even if gregex does not object data itself, that should not hinder language bindings from associating extra data with a GRegex...

Comment 62 Daniel Elstner 2007-01-16 18:19:05 UTC

While that is generally true, it's also a pain in the ass.  Not to mention the performance impact.  Granted though, attaching extra data to GObject instances also involves a hash table lookup.

Anyway, I vote for GRegex to implement value semantics.

Comment 63 Havoc Pennington 2007-01-16 18:53:25 UTC

In the "ref/unref/object-data" reference object pattern, you need destroy notification in addition to object data, so datasets are not quite sufficient for that pattern. There are two issues 1) always use the same proxy object for the same original object 2) destroy the proxy object along with the original object (and avoid destroying it while the original is still live)

Comment 64 Matthias Clasen 2007-01-16 19:03:03 UTC

Ah, right. I had forgotten about destroy notification. 
Marcos latest patches seem to have switched to value semantics anyway,
with copy/free and without ref/unref.

Comment 65 Marco Barisione 2007-01-17 11:54:55 UTC

Created attachment 80496 [details]
Patch to add GRegex (inside libglib)

This is an updated patch the uses the new newline handling features added to PCRE 7.0.

The only differences in the GRegex code are the new G_REGEX_NEWLINE_LF and G_REGEX_MATCH_NEWLINE_ANY.

The docs have been updated to explain the new syntax added to PCRE 7.

Comment 66 Marco Barisione 2007-01-18 10:06:52 UTC

Created attachment 80573 [details]
Patch to add GRegex (inside libglib)

In the previous patch I forgot to document G_REGEX_NEWLINE_LF and G_REGEX_MATCH_NEWLINE_ANY, and I forgot to add the description of two error codes.

The EggRegex code at http://techn.ocracy.org/eggregex/ is in sync with GRegex.

Comment 67 Matthias Clasen 2007-01-19 03:05:58 UTC

Before:
   text    data     bss     dec     hex filename
 605806    1432     936  608174   947ae .libs/libglib-2.0.so
.libs/libglib-2.0.so: 90 relocations, 63 relative (70%), 131 PLT entries, 1 for local syms (0%), 0 users

After:
   text    data     bss     dec     hex filename
 726625    2760     940  730325   b24d5 .libs/libglib-2.0.so
.libs/libglib-2.0.so: 290 relocations, 263 relative (90%), 131 PLT entries, 1 for local syms (0%), 0 userstable


So even with sharing Unicode data, this is still a significant size increase
for GLib. We may not be able to do much about the code size, but it should be
possible to get rid of most of the added relocations, a good deal of which 
probably come from the strings in the _pcre_utt table.

Comment 68 Marco Barisione 2007-01-23 10:23:25 UTC

Removing the content of _pcre_utt I get:
libglib-2.0.so.0.1300.0: 186 relocations, 159 relative (85%), 131 PLT entries, 1 for local syms (0%), 0 users

How can I reduce the number of relocations without removing _pcre_utt?

Comment 69 Matthias Clasen 2007-01-23 14:38:33 UTC

Replace the strings in the table by offsets into one big string constant. 
You can find examples of this technique in various places in gtk and gdk,
e.g. gdk_settings_names in gdk/x11/gdksettings.c

Comment 70 Marco Barisione 2007-01-25 23:27:31 UTC

Created attachment 81234 [details]
Patch to add GRegex (inside libglib)

Now the internal pcre is using a single string for the names in pcre_utt to reduce relocations:

libglib-2.0.so.0.1300.0: 187 relocations, 160 relative (85%), 131 PLT entries, 1 for local syms (0%), 0 users

Comment 71 Matthias Clasen 2007-01-26 21:11:28 UTC

Created attachment 81295 [details] [review]
another patch

Here is another patch that further reduces the amount of relocations in pcre/.
Sorry that I didn't take the time to figure out how to plug it into your
update-pcre/ framework.

Another change to reduce reductions and make the code more readable is to replace
match_error by a simple switch:

static const gchar *
match_error (gint errcode)
{
  switch (errcode)
    {
    case PCRE_ERROR_NOMATCH:
      break;
    case PCRE_ERROR_NULL:
      break;
    case PCRE_ERROR_BADOPTION:
      return "";
    case PCRE_ERROR_BADMAGIC:
      return _("corrupted object");
    case PCRE_ERROR_UNKNOWN_OPCODE:
      break;
    case PCRE_ERROR_NOMEMORY:
      return _("out of memory");
    case PCRE_ERROR_NOSUBSTRING:
      break;
    case PCRE_ERROR_MATCHLIMIT:
      return _("backtracking limit reached");
    case PCRE_ERROR_CALLOUT:
      break;
    case PCRE_ERROR_BADUTF8:
      break;
    case PCRE_ERROR_BADUTF8_OFFSET:
      break;
    case PCRE_ERROR_PARTIAL:
      break;
    case PCRE_ERROR_BADPARTIAL:
      return _("the pattern contains items not supported for partial matching");
    case PCRE_ERROR_INTERNAL:
      return _("internal error");
    case PCRE_ERROR_BADCOUNT:
      break;
    case PCRE_ERROR_DFA_UITEM:
      return _("the pattern contains items not supported for partial matching");
    case PCRE_ERROR_DFA_UCOND:
      return _("back references as conditions are not supported for partial matching");
    case PCRE_ERROR_DFA_UMLIMIT:
      break;
    case PCRE_ERROR_DFA_WSSIZE:
      break;
    case PCRE_ERROR_DFA_RECURSE:
    case PCRE_ERROR_RECURSIONLIMIT:
      return _("recursion limit reached");
    case PCRE_ERROR_NULLWSLIMIT:
      return _("workspace limit for empty substrings reached");
    case PCRE_ERROR_BADNEWLINE:
      return _("invalid combination of newline flags");
    default:
      break;
    }
  return _("unknown error");
}


With those changes, I think we come pretty close to the old number of relocations.

Comment 72 Marco Barisione 2007-01-27 15:15:33 UTC

Created attachment 81326 [details]
Patch to add GRegex (inside libglib)

libglib-2.0.so.0.1300.0: 102 relocations, 75 relative (73%), 131 PLT entries, 1 for local syms (0%), 0 users

Comment 73 Behdad Esfahbod 2007-01-27 19:22:25 UTC

The "1 for local syms" should be fixed too.  Run "make check" in glib/glib/ to see which symbol is not hidden well enough.

Comment 74 Matthias Clasen 2007-01-28 06:02:27 UTC

That 1 is not something introduced by the regex patch, though. It is present in 2.12, too.

Comment 75 Behdad Esfahbod 2007-01-28 08:54:02 UTC

Right.  Forgot about it.

Comment 76 Marco Barisione 2007-01-31 11:58:07 UTC

Created attachment 81582 [details]
Patch to add GRegex (inside libglib)

Declared _pcre_ucp_names as char[] instead of char* as suggested by Ulrich Drepper.

Comment 77 Yevgen Muntyan 2007-02-12 16:22:58 UTC

Would be nice to wrap pcre_fullinfo() in some way. gtksourceview needs PCRE_INFO_BACKREFMAX and PCRE_INFO_CAPTURECOUNT; PCRE_INFO_FIRSTBYTE and PCRE_INFO_LASTLITERAL look useful too.

Comment 78 Matthias Clasen 2007-03-15 04:23:39 UTC

Yevgen, where exactly is gtksourceview using these ? In fact, I cannot even see it use pcre at all...

Anyway, I looked over the patch again, and think we should commit this finally. 
It is an impressive amount of work, and keeping it in bugzilla is not making it 
any better. 

Marco, can you commit to trunk ?

Comment 79 Yevgen Muntyan 2007-03-15 05:58:40 UTC

(In reply to comment #78)
> Yevgen, where exactly is gtksourceview using these ? In fact, I cannot even see
> it use pcre at all...

It's soc-2006 branch. gtksourceview/libegg/regex/...

> Anyway, I looked over the patch again, and think we should commit this finally. 
> It is an impressive amount of work, and keeping it in bugzilla is not making it 
> any better. 

EggRegex rules!

Comment 80 Matthias Clasen 2007-03-15 06:11:29 UTC

Ah, ok. I think we can easily add accessors like

gint g_regex_get_max_backref  ( GRegex *regex)
gint g_regex_get_capture_count (GRegex *regex)

for useful fullinfo bits.

Comment 81 Marco Barisione 2007-03-15 13:09:43 UTC

Committed to trunk. Please close this bug as I can't.

Comment 82 Matthias Clasen 2007-03-15 17:58:46 UTC

Thanks a lot, Marco.

Yevgen, can you file a new bug for whatever additional api the pcre-branch of gtksourceview needs ?