GNOME Bugzilla – Bug 419368
Split GRegex in GRegex and GRegexMatch
Last modified: 2007-05-01 22:23:05 UTC
GRegex is a single struct that contains the immutable data (pattern, default options, etc) and the last match info, but it seems that a lot of people prefer two separate struct, see http://mail.gnome.org/archives/gtk-devel-list/2007-March/msg00133.html.
I'd also like to see the separate optimize function be replaced by a a compile flag that gets used in the constructor. That makes the data really immutable. Also, we can then avoid the complication of atomic assignment.
Match objects/matchers can be relatively big objects so I would like to recycle at least one freed match(er) object to cover probably the most common use cases, so I can kill _optimize but probably GRegex cannot be immutable if we to store the freed object.
As far as I understood, the common case is GPattern *p = g_pattern_new (); GMatcher *m = g_matcher_new (p); /* use m now as you used GRegex object in old code */ i.e. you don't have to cache/reuse anything. I'd guess someone needs to tell what API should be; it should *not* be just "split GRegex into two", since just splitting doesn't solve any problems. Owen?
Yes, thats how I would expect the split to work. There will probably be a few accessors that take a pattern a argument, but most of the API will be taking a matcher. Please don't call it GPattern, though. GLib already contains a g_pattern_ api.
How would you modify g_regex_split_next() and g_regex_match_next() for the new API? If possibile I would like to remove g_regex_clear(). I prefer a Python-like match object over a matcher, but maybe only because I usually use regexps in Python: GRegex *regex; GMatch *match; regex = g_regex_new("^([a-zA-Z0-9-]+)=(.*)$", 0, 0, NULL); if (g_regex_match (regex, &match, string, 0)) { gchar *key = g_match_fetch (match, 0, string); g_print ("Key is '%s'\n", key); g_free (key); g_match_free (match); } /* I don't need match, so I can pass NULL. */ if (g_regex_match (regex, NULL, another_string, 0)) { g_print ("Matched\n"); } g_regex_free (regex);
(In reply to comment #5) > How would you modify g_regex_split_next() and g_regex_match_next() for the new > API? If possibile I would like to remove g_regex_clear(). > > I prefer a Python-like match object over a matcher, but maybe only because I > usually use regexps in Python: Match objects are no good for match_next(), that's why it should be Matcher. GFoo *foo = g_foo_new("\\w*"); GBar *bar = g_bar_new(foo); if (g_bar_match (bar, string, 0)) { char *s = g_bar_fetch (bar, 0, string); ... } while (g_bar_match_next(bar, ...)) { int start, end; g_bar_fetch_pos (bar, 0, &start, &end); ... } Here "foo" stands for "pattern", and "bar" stands for "matcher" (no idea about good names). In other words, you replace 'regex' with '???' everywhere (except egg_regex_new()), where '???' is the name for matcher object. Pattern should be dealt with once, when you create matcher object; later you work with matcher, just like you work with GRegex now. In particular, it doesn't make sense to do this split now if noone knows good names for these classes (GRegex is such a good name now!).
I'm going to call the structs GRegex and GMatcher but I'm open to suggestions. Muntyan your code is broken because you cannot call g_matcher_match_next() after a g_matcher_match() without a call to g_matcher_clear(). g_matcher_clear() sucks so I want to remove it. Now you have something like: GRegex *regex = g_regex_new (...); GMatcher *matcher = g_matcher_new (regex); /* Use the matcher... */ g_matcher_clear (); while (g_matcher_match_next (matcher, string, 0)) { ... } I would prefer to remove g_matcher_clear() and g_matcher_match_next(), adding g_matcher_next(), this would be similar to using G(S)List calling _next() at the end of the loop body: g_matcher_match (matcher, string, 0); while (g_matcher_matched (matcher)) { ... g_matcher_next (matcher, string); } _split() should be a method of GRegex as in Java because you do not need the matcher after a split. The problem is that g_regex_split_next() (used when you want a token per call instead of an array) needs to keep its status somewhere, so we can add a GRegexSplitter struct or even remove g_regex_split_next(): GRegex *regex = g_regex_new (",\*", 0, 0, NULL); GRegexSplitter *splitter = g_regex_splitter_new (regex); gchar *token; while ((token = g_regex_splitter_next_token (splitter, string))) { g_print ("Token: %s\n", token); } g_regex_replace*(), g_regex_get_string_number() and g_regex_escape_string() will remain methods of GRegex.
Created attachment 84892 [details] Log of the discussion on #gtk+
Created attachment 84966 [details] Proposed API /* How to use the proposed API */ static const char string[] = "hello 123 world 456 789"; GRegex *regex = g_regex_new ("\d+", G_REGEX_OPTIMIZE, 0, NULL); GMatchInfo *match_info; /* If you do not care about subpatterns, reference expansion, etc. * you can pass a NULL match_info. */ if (g_regex_match (regex, string, 0, NULL)) { g_print ("Matched\n"); } /* You only want the first match: */ if (g_regex_match (regex, string, &match_info, NULL)) { g_print ("Number: %s\n", number); g_match_info_free (match_info); } /* Retrieve all the non-overlapping matches. */ g_regex_match (regex, string, 0, &match_info); while (g_match_info_matches (match_info)) { gchar *number = g_match_info_fetch (match_info, 0); g_print ("Number: %s\n", number); g_match_info_next (match_info); } g_match_info_free (match_info); g_regex_free (regex);
Does the example miss some gchar *number = g_match_info_fetch (match_info, 0); in the second if ?
Yes and I'm not freeing number :)
Looking at the proposed api made me wonder: do we want/need a replace-variant that only replaces a single match ? (it should take a match as an argument then, probably) Also, slightly odd that we have a g_match_free() but no g_match_new().
I prefer to add a max_replacements argument or keeping the API as it is. g_regex_replace() and g_regex_replace_literal() are only convenience functions to avoid writing lots of callbacks for g_regex_replace_eval(), to replace a single match you can use this: static gboolean replace_foo_with_bar_cb (const GRegex *regex, const GMatchInfo *match_info, const gchar *string, GString *result, gpointer data) { g_string_append (result, "bar"); return TRUE; /* stop after the first iteration */ } static gchar * replace_foo_with_bar (const gchar *string) { /* replace "foo" with "bar". */ GRegex *regex = g_regex_new ("foo"); gchar *ret = g_regex_replace_eval (regex, string, -1, 0, 0, replace_foo_with_bar_cb, NULL, NULL); g_regex_free (regex); return ret; } g_regex_match*() are factories for GMatchInfo, there are other structs in glib without a real _new(), such as G(S)List and GStaticMutex. Do you have a better idea? I do not like matchers, they would be used for matching but not for replacing or splitting.
Ah, I had forgotten that replace_eval had TRUE-stops return value semantics. Thats good enough, but is probably worth a note in the docs (ie. "Note: To replace only the first match, use...") I think the proposed api for match is fine.
I'm trying to implement the API but I have some problems with error handling, the simple case, without GError is easy: g_regex_match (regex, string, 0, &match_info); while (g_match_info_matches (match_info)) { ... g_match_info_next (match_info); } g_match_info_free (match_info); However if you use g_regex_match_full() you have: g_regex_match_full (regex, string, -1, 0, 0, &match_info, error); while (g_match_info_matches (match_info)) { ... g_match_info_next (match_info); } g_match_info_free (match_info); How can we handle errors generated by g_match_info_next()? Adding an error argument to g_match_info_next() is ugly: g_regex_match_full (regex, string, -1, 0, 0, &match_info, &error); if (error != NULL) { ... } while (g_match_info_matches (match_info)) { ... g_match_info_next (match_info, &error); if (error != NULL) { ... } } g_match_info_free (match_info); Another option is: g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL); while (g_match_info_matches (match_info)) { ... g_match_info_next (match_info); } if (!g_match_info_is_ok (match_info, &error) { ... } g_match_info_free (match_info); But this is ugly too, I would prefer to ignore errors in g_match_info_next() as most errors can be generated only by the first call to pcre_exec, that is in g_regex_match_full(). The possible errors for g_match_info_next() are: - PCRE_ERROR_NOMEMORY: we can avoid propagating this as glib aborts when g_malloc() fails - PCRE_ERROR_INTERNAL: a bug in PCRE, we can just print an error message - PCRE_ERROR_MATCHLIMIT, PCRE_ERROR_RECURSIONLIMIT: internal limits reached So the possible real errors are PCRE_ERROR_MATCHLIMIT and PCRE_ERROR_RECURSIONLIMIT, caused only by pathological regexps but note that some pathological regexps can crash PCRE, Perl, etc. An example of a pathological regexp is "(.|.)*[xy]" matched against the string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". You can modify it to fail only in the _next() method using the pattern "A|((.|.)*[xy])" and the string "Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", so the first iteration finds "A", the second one does not find an "A" so it can only match the pathological regexp.
Another possibility is to add an error argument to g_match_info_matches: g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL); while (g_match_info_matches (match_info, &error)) { ... g_match_info_next (match_info); } g_match_info_free (match_info); if (error != NULL) { ... }
(In reply to comment #16) > Another possibility is to add an error argument to g_match_info_matches: > > g_regex_match_full (regex, string, -1, 0, 0, &match_info, NULL); > while (g_match_info_matches (match_info, &error)) > { > ... > g_match_info_next (match_info); > } > g_match_info_free (match_info); What does g_match_info_matches do here? Does it call pcre_exec()? Then it should not be "matches". If it's next() that calls pcre_exec, then it should be called match_next. In any case it's no good - first Regex method searches for match, then MatchInfo. So, rename MatchInfo to Matcher and you're done ;)
No, g_match_info_matches() returns TRUE if the previous match operation matched. Do you have a better name? Maybe g_match_info_matched()? I'm not able to chose names...
i think g_match_info_matches() is ok, or maybe g_match_info_has_match() ? Wrt to error handling, why do you have to handle the error returned by match_info_next() inside the loop ? Certainly, match_info_matches will return FALSE if an error occurred, so you could write: g_regex_match_info_full (..., &error); while (g_match_info_matches (match_info)) { ... g_match_info_next (match_info, &error); } if (error) { /* handle error */ } Maybe thats ugly too, but I think it is tolerable.
(In reply to comment #18) > No, g_match_info_matches() returns TRUE if the previous match operation > matched. Do you have a better name? Maybe g_match_info_matched()? I'm not able > to chose names... My objection is not about names (choosing good names should be left to native speakers ;), it was "if .. then .. else .. then ..". Thing is: g_match_info_next() calls pcre_exec() and g_regex_match_full() calls pcre_exec, i.e. we have two objects which both perform search operation. It doesn't look nice: either GRegex should search or GMatcher should search, or something like that; MatchInfo should be *info*, about already performed search.
Created attachment 85466 [details] [review] New API
Small comments from looking through the patch: + * The match is done on the string passed to the match function, so you + * cannot free it before calling this function. I think this is somewhat misleading; one could think that it says you have to call g_match_info_next() before freeing the string. How about The match is done on the string passed to the match function, so you cannot call this function after freeing the string. There are similar formulations in other places. I think the api looks ok. At the risk of delaying the final conclusion here further, I'm coming back to the earlier discussion again. Another option more in line with YMs complaints would be to do away with the distinction between first and subsequent matches, and do match = g_match_info_new (); while (g_regex_match_full (regex, match, ..., &error)) { ... } if (error) { ... } g_match_info_free (match);
I like your API proposal and it should be easy to implement it without lots of changes to the code and I prefer it over matchers because you can pass a null GMatchInfo if you do not need it. The thing I don't like is that you can do: g_regex_match (regex, match_info, "hello world", ...); g_regex_match (regex, match_info, "ciao", ...); What about a g_return_val_if_fail() to check if the string and options have been changed? What should i do if the same GMatchInfo is used for different regexes? g_regex_match(regex1, match_info, ...); g_regex_match(regex2, match_info, ...); Do you prefer GMatchInfo as the last parameter before the GError or as the second parameter after the GRegex?
thats the downside of this proposal, that you can specify different string and option arguments to consecutive regex_match calls. Don't know if we can easily do return_if_fail for these situations. In any case, all of these should be considered errors.
In the interest of finalizing this api, I think we should go with the api you proposed. New inspiration is not likely to happen anytime soon.
Is this API really what "most people want"? It's still not nice, you have to free match_info (and it can be biggish, like a kilobyte), you got two structures which performs match; and it's certainly not what people on the list proposed (kind of). Why not leave old API as it is, with some nicified copy() method, so nobody is confused about its purpose? Since nobody bothered to propose "real nice" API, maybe it's not a good idea to introduce something different but uglier?
If you want I can keep the last freed GMatchInfo so it can be "recycled".
(In reply to comment #27) > If you want I can keep the last freed GMatchInfo so it can be "recycled". Can't say I do want it or I don't. Probably it's not a bad thing. I do hate having to free the match_info thing while with EggRegex it was all hidden inside (for zero benefit, that is).
Created attachment 86309 [details] [review] new GRegex API This patch implements the API proposed in comment 22.
Marco, sorry to change my mind again, but considering the problem pointed out in comment #23, I'd rather stay with your patch in comment #21. Just fix the type s/nedd/need/ in 2 places, and commit it. Just one more variation to propose (can't stop...): We could address the "recycle match" issue by letting the use allocate the match object if it needs it: match = g_match_info_new (); g_regex_match_info_full (regex, match, ...); while (g_match_info_matches (match)) { ... g_match_info_next (match, NULL); } /* reuse the same match with a new regex */ g_regex_match_info_full (regex2, match...); ... g_match_info_free (match);
(In reply to comment #30) > Just one more variation to propose (can't stop...): We could address the > "recycle match" issue by letting the use allocate the match object if it needs > it: Note that it was done before by GRegex by wrapping match structure. Given that you can't use one MatchInfo structure with different GRegex objects, you would need to store somewhere both GRegex and GMatchInfo, and then you come to some struct MyRegex {GRegex regex; GMatchInfo match_info;} which is kind of strange. If a user needs to babysit GRegex, he won't babysit it, he will do "normal" thing (allocating-freeing GMatchInfo all the time).
> Given that you can't use one MatchInfo structure with different GRegex > objects... Of course, the idea of having caller-allocated matchinfo objects only makes sense if we do allow reusing matchinfos with different regex objects. We would also need a match_info_clear() then, I guess.
MatchInfo contains an int array which pcre fills in with match offsets, and that array is what actually takes the space. When you use a MatchInfo with different Regex objects that array may need to be expanded. So single clear() isn't enough; MatchInfo methods also should take care of reallocating the array if needed. Quite doable of course. Still, it's nasty. All this "nice and clear" business because of one single method with a bad name!
I went ahead and committed Marcos patch as-is. 2007-04-30 Matthias Clasen <mclasen@redhat.com> * glib/glib.symbols: * glib/gregex.[hc]: Split GRegex into GRegex and GMatchInfo. (#419368, Marco Barisione) * tests/regex-test.c: Adapt.
Matthias, the patch you committed does not contain the change from "The string is fetched from the string passed to the match function, so you cannot free it before calling this function." to "... so you cannot call this function after freeing the string." and there is the typo you found ("nedd" instead of "need"). Can I fix them and commit my changes?
Sure, go ahead, I must have overlooked that.