Bug 419368 – Split GRegex in GRegex and GRegexMatch

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 419368 - Split GRegex in GRegex and GRegexMatch


Summary:	Split GRegex in GRegex and GRegexMatch


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	2.13.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-03-17 14:47 UTC by Marco Barisione
Modified:	2007-05-01 22:23 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Log of the discussion on #gtk+ (5.46 KB, text/plain) 2007-03-19 17:16 UTC, Marco Barisione		Details
Proposed API (7.03 KB, application/octet-stream) 2007-03-20 13:01 UTC, Marco Barisione		Details
New API (132.14 KB, patch) 2007-03-28 15:51 UTC, Marco Barisione	committed	Details \| Review
new GRegex API (132.61 KB, patch) 2007-04-13 19:28 UTC, Marco Barisione	rejected	Details \| Review

Description Marco Barisione 2007-03-17 14:47:53 UTC

GRegex is a single struct that contains the immutable data (pattern, default options, etc) and the last match info, but it seems that a lot of people prefer two separate struct, see http://mail.gnome.org/archives/gtk-devel-list/2007-March/msg00133.html.

Comment 1 Matthias Clasen 2007-03-18 04:22:52 UTC

I'd also like to see the separate optimize function be replaced by a a compile flag that gets used in the constructor. That makes the data really immutable.
Also, we can then avoid the complication of atomic assignment.

Comment 2 Marco Barisione 2007-03-18 20:51:48 UTC

Match objects/matchers can be relatively big objects so I would like to recycle at least one freed match(er) object to cover probably the most common use cases, so I can kill _optimize but probably GRegex cannot be immutable if we to store the freed object.

Comment 3 Yevgen Muntyan 2007-03-18 22:43:08 UTC

As far as I understood, the common case is

GPattern *p = g_pattern_new ();
GMatcher *m = g_matcher_new (p);
/* use m now as you used GRegex object in old code */

i.e. you don't have to cache/reuse anything. I'd guess someone needs to tell what API should be; it should *not* be just "split GRegex into two", since just splitting doesn't solve any problems.

Owen?

Comment 4 Matthias Clasen 2007-03-18 23:16:04 UTC

Yes, thats how I would expect the split to work. 
There will probably be a few accessors that take a pattern a
argument, but most of the API will be taking a matcher.

Please don't call it GPattern, though. GLib already contains
a g_pattern_ api.

Comment 5 Marco Barisione 2007-03-18 23:41:10 UTC

How would you modify g_regex_split_next() and g_regex_match_next() for the new API? If possibile I would like to remove g_regex_clear().

I prefer a Python-like match object over a matcher, but maybe only because I usually use regexps in Python:

GRegex *regex;
GMatch *match;

regex = g_regex_new("^([a-zA-Z0-9-]+)=(.*)$", 0, 0, NULL);

if (g_regex_match (regex, &match, string, 0))
{
    gchar *key = g_match_fetch (match, 0, string);
    g_print ("Key is '%s'\n", key);
    g_free (key);
    g_match_free (match);
}

/* I don't need match, so I can pass NULL. */
if (g_regex_match (regex, NULL, another_string, 0))
{
    g_print ("Matched\n");
}

g_regex_free (regex);

Comment 6 Yevgen Muntyan 2007-03-19 00:53:50 UTC

(In reply to comment #5)
> How would you modify g_regex_split_next() and g_regex_match_next() for the new
> API? If possibile I would like to remove g_regex_clear().
> 
> I prefer a Python-like match object over a matcher, but maybe only because I
> usually use regexps in Python:

Match objects are no good for match_next(), that's why it should be Matcher.

GFoo *foo = g_foo_new("\\w*");
GBar *bar = g_bar_new(foo);

if (g_bar_match (bar, string, 0))
{
  char *s = g_bar_fetch (bar, 0, string);
  ...
}

while (g_bar_match_next(bar, ...))
{
   int start, end;
   g_bar_fetch_pos (bar, 0, &start, &end);
   ...
}

Here "foo" stands for "pattern", and "bar" stands for "matcher" (no idea about good names). In other words, you replace 'regex' with '???' everywhere (except egg_regex_new()), where '???' is the name for matcher object. Pattern should be dealt with once, when you create matcher object; later you work with matcher, just like you work with GRegex now.
In particular, it doesn't make sense to do this split now if noone knows good names for these classes (GRegex is such a good name now!).

Comment 7 Marco Barisione 2007-03-19 11:08:52 UTC

I'm going to call the structs GRegex and GMatcher but I'm open to suggestions.

Muntyan your code is broken because you cannot call g_matcher_match_next() after a g_matcher_match() without a call to g_matcher_clear(). g_matcher_clear() sucks so I want to remove it.

Now you have something like:

GRegex *regex = g_regex_new (...);
GMatcher *matcher = g_matcher_new (regex);

/* Use the matcher... */

g_matcher_clear ();
while (g_matcher_match_next (matcher, string, 0))
{
    ...
}

I would prefer to remove g_matcher_clear() and g_matcher_match_next(), adding g_matcher_next(), this would be similar to using G(S)List calling _next() at the end of the loop body:

g_matcher_match (matcher, string, 0);
while (g_matcher_matched (matcher))
{
    ...
    g_matcher_next (matcher, string);
}


_split() should be a method of GRegex as in Java because you do not need the matcher after a split. The problem is that g_regex_split_next() (used when you want a token per call instead of an array) needs to keep its status somewhere, so we can add a GRegexSplitter struct or even remove g_regex_split_next():

GRegex *regex = g_regex_new (",\*", 0, 0, NULL);
GRegexSplitter *splitter = g_regex_splitter_new (regex);
gchar *token;

while ((token = g_regex_splitter_next_token (splitter, string)))
{
    g_print ("Token: %s\n", token);
}


g_regex_replace*(), g_regex_get_string_number() and g_regex_escape_string() will remain methods of GRegex.

Comment 8 Marco Barisione 2007-03-19 17:16:40 UTC

Created attachment 84892 [details]
Log of the discussion on #gtk+

Comment 9 Marco Barisione 2007-03-20 13:01:47 UTC

Created attachment 84966 [details]
Proposed API

/* How to use the proposed API */
static const char string[] = "hello 123 world 456 789";
GRegex *regex = g_regex_new ("\d+", G_REGEX_OPTIMIZE, 0, NULL);
GMatchInfo *match_info;

/* If you do not care about subpatterns, reference expansion, etc.
 * you can pass a NULL match_info. */
if (g_regex_match (regex, string, 0, NULL))
{
    g_print ("Matched\n");
}

/* You only want the first match: */
if (g_regex_match (regex, string, &match_info, NULL))
{
    g_print ("Number: %s\n", number);
    g_match_info_free (match_info);
}

/* Retrieve all the non-overlapping matches. */
g_regex_match (regex, string, 0, &match_info);
while (g_match_info_matches (match_info))
{
    gchar *number = g_match_info_fetch (match_info, 0);
    g_print ("Number: %s\n", number);
    g_match_info_next (match_info);
}
g_match_info_free (match_info);

g_regex_free (regex);

Comment 10 Matthias Clasen 2007-03-20 13:39:01 UTC

Does the example miss some 

    gchar *number = g_match_info_fetch (match_info, 0);

in the second if ?

Comment 11 Marco Barisione 2007-03-20 13:42:33 UTC

Yes and I'm not freeing number :)

Comment 12 Matthias Clasen 2007-03-21 13:46:54 UTC

Looking at the proposed api made me wonder: do we want/need a replace-variant
that only replaces a single match ? (it should take a match as an argument then, probably)

Also, slightly odd that we have a g_match_free() but no g_match_new().

Comment 13 Marco Barisione 2007-03-21 18:14:41 UTC

I prefer to add a max_replacements argument or keeping the API as it is.
g_regex_replace() and g_regex_replace_literal() are only convenience functions to avoid writing lots of callbacks for g_regex_replace_eval(), to replace a single match you can use this:

static gboolean
replace_foo_with_bar_cb (const GRegex     *regex,
			 const GMatchInfo *match_info,
			 const gchar      *string,
			 GString          *result,
			 gpointer          data)
{
  g_string_append (result, "bar");
  return TRUE; /* stop after the first iteration */
}

static gchar *
replace_foo_with_bar (const gchar *string)
{
  /* replace "foo" with "bar". */
  GRegex *regex = g_regex_new ("foo");
  gchar *ret = g_regex_replace_eval (regex, string, -1, 0, 0,
				     replace_foo_with_bar_cb,
				     NULL, NULL);
  g_regex_free (regex);
  return ret;
}


g_regex_match*() are factories for GMatchInfo, there are other structs in glib without a real _new(), such as G(S)List and GStaticMutex.
Do you have a better idea? I do not like matchers, they would be used for matching but not for replacing or splitting.

Comment 14 Matthias Clasen 2007-03-22 00:07:45 UTC

Ah, I had forgotten that replace_eval had TRUE-stops return value semantics.
Thats good enough, but is probably worth a note in the docs (ie. "Note: To replace only the first match, use...")

I think the proposed api for match is fine.

Comment 15 Marco Barisione 2007-03-23 11:58:57 UTC

I'm trying to implement the API but I have some problems with error handling, the simple case, without GError is easy:

g_regex_match (regex, string, 0, &match_info);
while (g_match_info_matches (match_info))
{
    ...
    g_match_info_next (match_info);
}
g_match_info_free (match_info);


However if you use g_regex_match_full() you have:

g_regex_match_full (regex, string, -1, 0, 0, &match_info, error);
while (g_match_info_matches (match_info))
{
    ...
    g_match_info_next (match_info);
}
g_match_info_free (match_info);

How can we handle errors generated by g_match_info_next()? Adding an error argument to g_match_info_next() is ugly:

g_regex_match_full (regex, string, -1, 0, 0, &match_info, &error);
if (error != NULL)
{
    ...
}
while (g_match_info_matches (match_info))
{
    ...
    g_match_info_next (match_info, &error);
    if (error != NULL)
    {
        ...
    }
}
g_match_info_free (match_info);

Another option is:
g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL);
while (g_match_info_matches (match_info))
{
    ...
    g_match_info_next (match_info);
}
if (!g_match_info_is_ok (match_info, &error)
{
    ...
}
g_match_info_free (match_info);

But this is ugly too, I would prefer to ignore errors in g_match_info_next() as most errors can be generated only by the first call to pcre_exec, that is in g_regex_match_full().
The possible errors for g_match_info_next() are:
- PCRE_ERROR_NOMEMORY: we can avoid propagating this as glib aborts when g_malloc() fails
- PCRE_ERROR_INTERNAL: a bug in PCRE, we can just print an error message
- PCRE_ERROR_MATCHLIMIT, PCRE_ERROR_RECURSIONLIMIT: internal limits reached

So the possible real errors are PCRE_ERROR_MATCHLIMIT and PCRE_ERROR_RECURSIONLIMIT, caused only by pathological regexps but note that some pathological regexps can crash PCRE, Perl, etc.
An example of a pathological regexp is "(.|.)*[xy]" matched against the string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". You can modify it to fail only in the _next() method using the pattern "A|((.|.)*[xy])" and the string "Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", so the first iteration finds "A", the second one does not find an "A" so it can only match the pathological regexp.

Comment 16 Marco Barisione 2007-03-23 13:21:12 UTC

Another possibility is to add an error argument to g_match_info_matches:

g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL);
while (g_match_info_matches (match_info, &error))
{
    ...
    g_match_info_next (match_info);
}
g_match_info_free (match_info);
if (error != NULL)
{
    ...
}

Comment 17 Yevgen Muntyan 2007-03-23 13:39:50 UTC

(In reply to comment #16)
> Another possibility is to add an error argument to g_match_info_matches:
> 
> g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL);
> while (g_match_info_matches (match_info, &error))
> {
>     ...
>     g_match_info_next (match_info);
> }
> g_match_info_free (match_info);

What does g_match_info_matches do here? Does it call pcre_exec()? Then it should not be "matches". If it's next() that calls pcre_exec, then it should be called match_next. In any case it's no good - first Regex method searches for match, then MatchInfo.
So, rename MatchInfo to Matcher and you're done ;)

Comment 18 Marco Barisione 2007-03-23 13:50:06 UTC

No, g_match_info_matches() returns TRUE if the previous match operation matched. Do you have a better name? Maybe g_match_info_matched()? I'm not able to chose names...

Comment 19 Matthias Clasen 2007-03-23 14:34:38 UTC

i think g_match_info_matches() is ok, or maybe g_match_info_has_match() ?

Wrt to error handling, why do you have to handle the error returned by
match_info_next() inside the loop ? Certainly, match_info_matches will
return FALSE if an error occurred, so you could write:

g_regex_match_info_full (..., &error);

while (g_match_info_matches (match_info))
  {
    ...
    g_match_info_next (match_info, &error);
  }

if (error) 
  {
     /* handle error */
  }


Maybe thats ugly too, but I think it is tolerable.

Comment 20 Yevgen Muntyan 2007-03-23 15:53:40 UTC

(In reply to comment #18)
> No, g_match_info_matches() returns TRUE if the previous match operation
> matched. Do you have a better name? Maybe g_match_info_matched()? I'm not able
> to chose names...

My objection is not about names (choosing good names should be left to native speakers ;), it was "if .. then .. else .. then ..". Thing is: g_match_info_next() calls pcre_exec() and g_regex_match_full() calls pcre_exec, i.e. we have two objects which both perform search operation. It doesn't look nice: either GRegex should search or GMatcher should search, or something like that; MatchInfo should be *info*, about already performed search.

Comment 21 Marco Barisione 2007-03-28 15:51:42 UTC

Created attachment 85466 [details] [review]
New API

Comment 22 Matthias Clasen 2007-04-02 16:00:41 UTC

Small comments from looking through the patch:

+ * The match is done on the string passed to the match function, so you
+ * cannot free it before calling this function.

I think this is somewhat misleading; one could think that it says you have
to call g_match_info_next() before freeing the string. How about

   The match is done on the string passed to the match function, so you
   cannot call this function after freeing the string.

There are similar formulations in other places. I think the api looks ok.
At the risk of delaying the final conclusion here further, I'm coming back to the earlier discussion again. Another option more in line with YMs complaints would
be to do away with the distinction between first and subsequent matches, and do

 match = g_match_info_new ();
 while (g_regex_match_full (regex, match, ..., &error)) 
   {
      ...
   }

 if (error) 
   {
     ...
   }

 g_match_info_free (match);

Comment 23 Marco Barisione 2007-04-02 17:47:27 UTC

I like your API proposal and it should be easy to implement it without lots of changes to the code and I prefer it over matchers because you can pass a null GMatchInfo if you do not need it. The thing I don't like is that you can do:

g_regex_match (regex, match_info, "hello world", ...);
g_regex_match (regex, match_info, "ciao", ...);

What about a g_return_val_if_fail() to check if the string and options have been changed?

What should i do if the same GMatchInfo is used for different regexes?

g_regex_match(regex1, match_info, ...);
g_regex_match(regex2, match_info, ...);

Do you prefer GMatchInfo as the last parameter before the GError or as the second parameter after the GRegex?

Comment 24 Matthias Clasen 2007-04-02 18:42:50 UTC

thats the downside of this proposal, that you can specify different string and option arguments to consecutive regex_match calls. Don't know if we can easily
do return_if_fail for these situations. In any case, all of these should be considered errors.

Comment 25 Matthias Clasen 2007-04-11 12:16:00 UTC

In the interest of finalizing this api, I think we should go with the api you proposed. New inspiration is not likely to happen anytime soon.

Comment 26 Yevgen Muntyan 2007-04-11 16:13:05 UTC

Is this API really what "most people want"? It's still not nice, you have to free match_info (and it can be biggish, like a kilobyte), you got two structures which performs match; and it's certainly not what people on the list proposed (kind of).  Why not leave old API as it is, with some nicified copy() method, so nobody is confused about its purpose? Since nobody bothered to propose "real nice" API, maybe it's not a good idea to introduce something different but uglier?

Comment 27 Marco Barisione 2007-04-11 17:29:01 UTC

If you want I can keep the last freed GMatchInfo so it can be "recycled".

Comment 28 Yevgen Muntyan 2007-04-11 20:23:16 UTC

(In reply to comment #27)
> If you want I can keep the last freed GMatchInfo so it can be "recycled".

Can't say I do want it or I don't. Probably it's not a bad thing. I do hate having to free the match_info thing while with EggRegex it was all hidden inside (for zero benefit, that is).

Comment 29 Marco Barisione 2007-04-13 19:28:10 UTC

Created attachment 86309 [details] [review]
new GRegex API

This patch implements the API proposed in comment 22.

Comment 30 Matthias Clasen 2007-04-25 05:10:51 UTC

Marco, sorry to change my mind again, but considering the problem pointed out in comment #23, I'd rather stay with your patch in comment #21. Just fix the type 
s/nedd/need/ in 2 places, and commit it.

Just one more variation to propose (can't stop...): We could address the
"recycle match" issue by letting the use allocate the match object if it needs it:

match = g_match_info_new ();
g_regex_match_info_full (regex, match, ...);

while (g_match_info_matches (match))
  {
    ...
    g_match_info_next (match, NULL);
  }


/* reuse the same match with a new regex */
g_regex_match_info_full (regex2, match...);

...

g_match_info_free (match);

Comment 31 Yevgen Muntyan 2007-04-29 14:02:42 UTC

(In reply to comment #30)
> Just one more variation to propose (can't stop...): We could address the
> "recycle match" issue by letting the use allocate the match object if it needs
> it:

Note that it was done before by GRegex by wrapping match structure. Given that you can't use one MatchInfo structure with different GRegex objects, you would need to store somewhere both GRegex and GMatchInfo, and then you come to some struct MyRegex {GRegex regex; GMatchInfo match_info;} which is kind of strange. If a user needs to babysit GRegex, he won't babysit it, he will do "normal" thing (allocating-freeing GMatchInfo all the time).

Comment 32 Matthias Clasen 2007-04-29 15:45:59 UTC

> Given that you can't use one MatchInfo structure with different GRegex 
> objects...

Of course, the idea of having caller-allocated matchinfo objects only makes sense
if we do allow reusing matchinfos with different regex objects. We would also need a match_info_clear() then, I guess.

Comment 33 Yevgen Muntyan 2007-04-29 16:25:04 UTC

MatchInfo contains an int array which pcre fills in with match offsets, and that array is what actually takes the space. When you use a MatchInfo with different Regex objects that array may need to be expanded. So single clear() isn't enough; MatchInfo methods also should take care of reallocating the array if needed. Quite doable of course. 
Still, it's nasty. All this "nice and clear" business because of one single method with a bad name!

Comment 34 Matthias Clasen 2007-04-30 16:02:06 UTC

I went ahead and committed Marcos patch as-is.

2007-04-30  Matthias Clasen  <mclasen@redhat.com>

        * glib/glib.symbols:
        * glib/gregex.[hc]: Split GRegex into GRegex and GMatchInfo.
        (#419368, Marco Barisione)

        * tests/regex-test.c: Adapt.

Comment 35 Marco Barisione 2007-05-01 20:42:15 UTC

Matthias, the patch you committed does not contain the change from "The string is fetched from the string passed to the match function, so you cannot free it before calling this function." to "... so you cannot call this function after freeing the string." and there is the typo you found ("nedd" instead of "need"). Can I fix them and commit my changes?

Comment 36 Matthias Clasen 2007-05-01 22:23:05 UTC

Sure, go ahead, I must have overlooked that.