GNOME Bugzilla – Bug 607702
[Patch] Adds native regular expression literals support
Last modified: 2010-03-25 09:45:45 UTC
Created attachment 151962
Vala regular expression literals support

The attached patch adds native regular expression literals to Vala. This makes it possible to check regular expressions at compile time instead of at run time. Regex literals also make code easier to read and fit well into the language syntax. See the example below.

-----
using GLib;

#if false
// Using GLib.Regex library from Vala
static int main (string[] args) {
    stdout.printf ("%s\n", args[1]);

    try {
        var re = new Regex ("""ab/(\d+)""", RegexCompileFlags.CASELESS);
        MatchInfo info;
        if (re.match (args[1], 0, out info)) {
            stdout.printf ("Matches ('%s')...\n", info.fetch (1));
        } else {
            stdout.printf ("Does not match.\n");
        }
    } catch (RegexError err) {
        stderr.printf ("Invalid regex.\n");
        return 1;
    }
    return 0;
}
#else
// Using regular expression literals
static int main (string[] args) {
    stdout.printf ("%s\n", args[1]);

    MatchInfo info;
    Regex re;
    if ((re = @/ab\/(\d+)/i).match (args[1], 0, out info)) {
        stdout.printf ("Matches ('%s')...\n", info.fetch (1));
    } else {
        stdout.printf ("Does not match.\n");
    }
    return 0;
}
#endif
------

In the future it could also be possible to optimize this so that the regex is stored in a static variable and created only once, when the block is entered for the first time. This patch, however, does not do that yet.
Here is another example:

using GLib;

static int main (string[] args) {
    MatchInfo info;

    // Simple greedy regular expression matching.
    var str1 = "mississippi";
    if (@/(is.*ip)/.match (str1, 0, out info)) {
        stdout.printf ("Part of %s is '%s'...\n", str1, info.fetch (1));
    } else {
        stdout.printf ("Did not match at all.\n");
    }

    // Match caseless.
    var str2 = "demonStration";
    if (@/mon(str.*o)n/i.match (str2, 0, out info)) {
        stdout.printf ("Part of %s is '%s'...\n", str2, info.fetch (1));
    } else {
        stdout.printf ("%s did not match at all.\n", str2);
    }

    // Match and pick substrings.
    var ts = "Time: 10:42:12";
    if (@/Time: (..):(..):(..)/.match (ts, 0, out info)) {
        stdout.printf ("%s\n\thours = %s\n\tminutes = %s\n\tseconds = %s\n\n",
                       ts, info.fetch (1), info.fetch (2), info.fetch (3));
    }

    // Replace demo: word swapping
    try {
        var str = "apple grape";
        stdout.printf ("'%s' becomes '%s'\n", str,
                       @/^([^ ]*) *([^ ]*)/.replace (str, -1, 0, """\2 \1"""));
    } catch (RegexError err) {
        // Replacing still needs exception catching
        message (err.message);
    }
    return 0;
}
That's one of the things I love from Perl and Ruby :) and I was also thinking about having support for this in Vala. Good point!

But it would probably be nice to reduce the syntax required to use them. The MatchInfo can be allocated by Vala instead of by the user, and the info.fetch() calls can be wrapped by special variables $1, $2, ... This will make regular expression support a bit harder to implement, but it will make the syntax much simpler and more readable.

We can also add the =~ operator to do things like:

    if ("Time: 10:42:12" =~ @/Time: (..):(..):(..)/) {
        print (@"hours = $1\n");
        print (@"minutes = $2\n");
        print (@"seconds = $3\n");
    }

One of the problems I have always seen in regular expression support in languages is that the expressions are compiled every time you use them, instead of the compiled regexp being cached and then reused. This makes execution a bit slower.

Replacing strings can be something like:

    var str = "apple grape";
    str =~ @s/^([^ ]*) *([^ ]*)/\2 \1/;
    print ("%s\n", str); // must print 'grape apple'
Yes, I think $<n> would be a very nice extension for matching. There are, of course, several details to settle in supporting this, such as:

* is the value of $3 (from a previous match) changed (or cleared) if the current match only captures two substrings
* what is the scope of these $-variables
* error handling, e.g. when the regex was not created from a plain regex literal

I like the idea of replacements too. I just wanted to start from something simple (plain literals), which is the easiest case but eliminates most of the unnecessary exception handling from the source by leaving it to the compiler. I would also like to cache the regex in a static variable, but I didn't find an easy way to do it in the Vala compiler source (no local constants are supported at the moment).
*** Bug 584968 has been marked as a duplicate of this bug. ***
Thanks for the patch. Did you run into syntax conflicts, or is there another reason why you used @/.../ instead of the more common /.../?
Actually, I'm not sure how hard it would be to scan /.../ compared to @/.../, but I wanted to be on the safe side. At least in the case below, the scanner does not have to know the context if there is this additional @ character.

    var x = b/a/c;
In Perl each variable has a $ prefix. Thus, $b/$a/$c is not a problem in Perl. I think you either have to separate the division operator from the regular expression literal syntax, or separate variables somehow so that they will not be mixed up with regular expression content (and the latter is not possible without breaking Vala syntax).
Yes, it's certainly not trivial to parse this without @ due to ambiguities. However, JavaScript does this as well, for example. Assuming there is no technical reason why this should be more problematic in Vala than in JavaScript, I'd really like to avoid the extra @.

A question regarding the patch: what's the reason to have separate Regex and RegexLiteral classes? Are they ever used separately?
I'm happy if we can get rid of @. I just thought it would be nicer if the scanner did not have to be aware of the parsing context, but it is probably not a big issue.

RegexLiteral and Regex can be combined. Regex inherits from Expression, and RegexLiteral inherits from Literal, which inherits from Expression. Probably RegexLiteral would be the one to keep?
(In reply to comment #9)
> I'm happy if we can get rid of @. I just thought it would be nicer if the
> scanner does not have to be aware of parsing context but it is probably not a
> big issue.

Yes, keeping the scanner simple is certainly desirable. However, I think it's worth it in this case to achieve the same syntax as in JavaScript (among others).

> RegexLiteral and Regex can be combined. Regex inherits Expression and
> RegexLiteral inherits Literal which inherits Expression. Probably RegexLiteral
> would be the one to keep?

Sounds good to me.
Created attachment 156768
Second version of regular expression literals for Vala

This version removes the redundant class Regex. It also uses static variables to store the literals; they are created only once. The syntax is still @/.../ but I will try to fix that next.
Created attachment 156801
Third version of regular expression literals

This patch uses /.../ syntax. It was much easier than I originally thought. I only needed to add a "previous" property to the scanner and check that the previous token is in a set. At least it works for me :-)

If this is now OK for inclusion, I could start working on the rest of the regexp support. So maybe we keep this bug open?

What would be the matching syntax? Taken from Perl it could be

    if (s=~/(foo)(\w+)/) {
        print ($1 + $2 + "=" + $$);
    }

Replacements would be

    var replaced=~s/aa/bb/;
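The previous-token check mentioned in this comment can be sketched as follows. This is a minimal illustration, not code from the patch: the `Tok` enum, the function name `slash_starts_regex`, and the particular set of tokens are all assumptions, since the thread does not say which tokens the patch puts in its set. The idea is that a `/` following an operand (identifier, literal, closing parenthesis) is division, while a `/` following a token after which an expression may begin starts a regex literal.

```vala
// Hypothetical token kinds for illustration; the real Vala scanner
// has its own, much larger token enumeration.
enum Tok {
    NONE,             // start of input
    IDENTIFIER,
    INTEGER_LITERAL,
    OPEN_PARENS,
    CLOSE_PARENS,
    ASSIGN,
    COMMA,
    RETURN
}

// Decide how to scan a '/' from the token that precedes it.
// The set of tokens below is an assumption, not the patch's actual set.
bool slash_starts_regex (Tok previous) {
    switch (previous) {
    case Tok.NONE:
    case Tok.OPEN_PARENS:
    case Tok.ASSIGN:
    case Tok.COMMA:
    case Tok.RETURN:
        // An expression may start here, so '/' opens a regex literal.
        return true;
    default:
        // After an operand, '/' can only be the division operator.
        return false;
    }
}

int main () {
    // var x = b/a/c;   each '/' follows an identifier
    stdout.printf ("after identifier: %s\n",
                   slash_starts_regex (Tok.IDENTIFIER).to_string ());
    // var re = /ab+/;  the '/' follows '='
    stdout.printf ("after '=': %s\n",
                   slash_starts_regex (Tok.ASSIGN).to_string ());
    return 0;
}
```

With a rule of this shape, `var x = b/a/c;` still scans as two divisions, because each `/` follows an identifier.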
Is everyone convinced that this should be included in Vala? I am not quite convinced myself, but I might be missing something.
(In reply to comment #13)
> Is everyone convinced that this should be included in Vala ? I am not quite
> convinced myself, but I might be missing something.

I'm planning to add it to Vala, however, marked as experimental syntax. This means that you get a warning when you use it without --enable-experimental. This should allow us to get a better feeling for a definite decision at a later point.
(In reply to comment #12)
> Created an attachment (id=156801)
> Third version of regular expression literals
>
> This patch uses /.../ syntax. It was much easier than I originally thought. I
> only needed to add previous property for the scanner and check that the
> previous token is in a set. At least works for me :-)

Great, I'll try to review the patch as soon as possible.

> If this is now OK for inclusion I could start working on the rest of the regexp
> support. So maybe we keep this bug open?
>
> What would be the matching syntax? Taken from Perl it could be
>
> if (s=~/(foo)(\w+)/) {
> print ($1 + $2 + "=" + $$);
> }
>
> Replacements would be
>
> var replaced=~s/aa/bb/;

I'm not sure whether more syntactic sugar is necessary than the main /.../ syntax from your current patch. The main /.../ regex syntax has, in my opinion, significant advantages, such as possible compile-time checking of constant regular expressions, which avoids RegexError handling. It also makes it easy to use regular expressions efficiently (that is, without recompiling them every time you use them). For these reasons, I think it makes sense to support special syntax for it.

On the other hand, the advantages of matching and replacing syntax are less important, in my opinion, yet they are quite invasive, as in the above example.
I forgot to mention that another important advantage of regular expression literals is that they avoid the awkward double escaping caused by backslash being the escape character in both string literals and regular expressions.
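To illustrate the double-escaping point, here is a small sketch using only the plain GLib.Regex library, so it compiles without the experimental literal syntax. The helper name `first_group` is made up for this example. It shows the same pattern written as an ordinary string literal (backslash doubled) and as a verbatim string literal; the regex literal form from the first comment, `/ab\/(\d+)/`, needs no doubling either, although the `/` delimiter itself must be escaped.

```vala
// Hypothetical helper: returns capture group 1 of the first match,
// or null if the pattern is invalid or does not match.
string? first_group (string pattern, string input) {
    try {
        var re = new Regex (pattern);
        MatchInfo info;
        if (re.match (input, 0, out info)) {
            return info.fetch (1);
        }
    } catch (RegexError err) {
        stderr.printf ("Invalid regex: %s\n", err.message);
    }
    return null;
}

int main () {
    // Ordinary string literal: the regex backslash must be doubled,
    // because the string literal consumes one level of escaping.
    string escaped = "ab/(\\d+)";

    // Verbatim string literal: no string-level escaping at all.
    string verbatim = """ab/(\d+)""";

    // Both spellings produce the same pattern text.
    stdout.printf ("same pattern: %s\n", (escaped == verbatim).to_string ());

    string group = first_group (escaped, "ab/42") ?? "(no match)";
    stdout.printf ("group 1: %s\n", group);
    return 0;
}
```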
Created attachment 156870
Fourth version of regular expression literals

This version fixes the regex variable naming issues that occur when the same file has multiple regex literals in different methods. In addition, a regex literal can now be returned directly from a method and placed into an array.
Created attachment 156875
Fifth version of regular expression literals

This one adds support for regular expression literals to be used inside ?? (coalescing) expressions.
(In reply to comment #14)
> (In reply to comment #13)
> > Is everyone convinced that this should be included in Vala ? I am not quite
> > convinced myself, but I might be missing something.
>
> I'm planning to add it to Vala, however, marked as experimental syntax. This
> means that you get a warning when you use it without --enable-experimental.
> This should allow us to get a better feeling for a definite decision at a later
> point.

If Vala gets support for regex syntax, can we also expect support for inline XML and JSON, and also a syntax for parsing binary data?
Created attachment 156947
Sixth version of regular expression literals

This fixes several missing escape sequences. In addition, the test example is extended quite a lot. I think this is complete now (the famous last words) :-)
Created attachment 156949
Missing new files for the sixth version

The previous patch was missing the new files. They are here.
Created attachment 156956
git patch

Thanks for the updates. I've marked the feature as experimental and attached it as a git patch.

One remaining issue we have here is that this makes Vala require GLib 2.14, while we currently only require GLib 2.12. Maybe it's time to update that requirement, as we've accidentally broken (and later fixed) it various times already, and it also can't generate thread-safe get_type functions when targeting 2.12.
> One remaining issue we have here is that this makes Vala require GLib 2.14,
> while we currently only require GLib 2.12.

I think the problem was Maemo/Diablo still using GLib 2.12, but the Fremantle release changed that.
Created attachment 157022
Performance demonstration

Native regex literals make the code run over 4x faster compared to the case where the regex is built every time and then matched once. See the attached demo. Of course, the same can be done manually with the plain library, but that makes the code even longer and harder to read.

library: 4.76206 sec => 104996 per sec
native: 1.11075 sec => 450147 per sec (+328.7%)

Measured on a 2 GHz Intel Core 2 Duo with 2 GB of 1067 MHz DDR3 (Mac OS X 10.5.8).
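For comparison, doing the caching by hand with the plain library might look roughly like the sketch below. This is not the attached demo and not what the compiler generates; the class name `Matcher`, its methods, and the loop count are invented for illustration. It only contrasts building the regex on every call with building it once and keeping it in a static field.

```vala
// Hypothetical helper class; names are made up for this sketch.
class Matcher {
    // Compiled on first use and reused afterwards (null until then).
    private static Regex? cached;

    // Builds the regex once, then only matches.
    public static bool matches_cached (string input) {
        try {
            if (cached == null) {
                cached = new Regex ("""ab/(\d+)""", RegexCompileFlags.CASELESS);
            }
            return cached.match (input);
        } catch (RegexError err) {
            stderr.printf ("Invalid regex: %s\n", err.message);
            return false;
        }
    }

    // Rebuilds the regex on every call: the slow "library" case.
    public static bool matches_uncached (string input) {
        try {
            var re = new Regex ("""ab/(\d+)""", RegexCompileFlags.CASELESS);
            return re.match (input);
        } catch (RegexError err) {
            stderr.printf ("Invalid regex: %s\n", err.message);
            return false;
        }
    }
}

int main () {
    int hits = 0;
    // Arbitrary iteration count, just to exercise both paths.
    for (int i = 0; i < 1000; i++) {
        if (Matcher.matches_uncached ("AB/123")) {
            hits++;
        }
        if (Matcher.matches_cached ("AB/123")) {
            hits++;
        }
    }
    stdout.printf ("%d matches\n", hits);
    return 0;
}
```

The cached variant still carries the try/catch and the null check that a regex literal would leave to the compiler, which is the extra length the comment refers to.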
commit 1afe020286302dcce26abc19ed559da05d21e3eb
Author: Jukka-Pekka Iivonen <jp0409@jippii.fi>
Date:   Wed Mar 24 10:07:32 2010 +0100

    Add experimental support for regular expression literals

    Fixes bug 607702.