Bug 607702 - [Patch] Adds native regular expression literals support
Status: RESOLVED FIXED
Product: vala
Classification: Core
Component: general
Version: 0.7.x
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Vala maintainers
QA Contact: Vala maintainers
Duplicates: 584968
Depends on:
Blocks: 546123
Reported: 2010-01-21 19:37 UTC by Jukka-Pekka Iivonen
Modified: 2010-03-25 09:45 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Vala regular expression literals support (20.39 KB, patch)
2010-01-21 19:37 UTC, Jukka-Pekka Iivonen (reviewed)
Second version of regular expression literals for Vala (19.34 KB, patch)
2010-03-22 16:47 UTC, Jukka-Pekka Iivonen
Third version of regular expression literals (15.84 KB, patch)
2010-03-22 20:36 UTC, Jukka-Pekka Iivonen
Fourth version of regular expression literals (16.31 KB, patch)
2010-03-23 14:12 UTC, Jukka-Pekka Iivonen
Fifth version of regular expression literals (16.35 KB, patch)
2010-03-23 14:30 UTC, Jukka-Pekka Iivonen
Sixth version of regular expression literals (16.75 KB, patch)
2010-03-24 08:12 UTC, Jukka-Pekka Iivonen
Missing new files for the sixth version (5.20 KB, patch)
2010-03-24 08:31 UTC, Jukka-Pekka Iivonen
git patch (21.46 KB, patch)
2010-03-24 09:52 UTC, Jürg Billeter
Performance demonstration (1.69 KB, text/plain)
2010-03-24 21:01 UTC, Jukka-Pekka Iivonen

Description Jukka-Pekka Iivonen 2010-01-21 19:37:40 UTC
Created attachment 151962
Vala regular expression literals support

The attached patch adds native regular expression literals to Vala. This makes it possible to check regular expressions at compile time instead of at run time. Regex literals also make code easier to read and fit well into the language syntax. See the example below.

-----

using GLib;

#if false

// Using GLib.Regex library from Vala
static int main (string[] args)
{
        stdout.printf ("%s\n", args[1]);
        try {
                var re = new Regex ("""ab/(\d+)""", RegexCompileFlags.CASELESS);

                MatchInfo info;
                if (re.match (args[1], 0, out info)) {
                        stdout.printf ("Matches ('%s')...\n", info.fetch (1));
                } else {
                        stdout.printf ("Does not match.\n");
                }
        } catch (RegexError err) {
                stderr.printf ("Invalid regex.\n");
                return 1;
        }
        return 0;
}
#else

// Using regular expression literals
static int main (string[] args)
{
        stdout.printf ("%s\n", args[1]);

        MatchInfo info;
        Regex re;
        if ((re = @/ab\/(\d+)/i).match (args[1], 0, out info)) {
                stdout.printf ("Matches ('%s')...\n", info.fetch (1));
        } else {
                stdout.printf ("Does not match.\n");
        }
        return 0;
}

#endif

------

In the future it could also be possible to make some optimizations so that the regex is stored into a static variable and created only once when entering the block the first time. This patch, however, does not do it yet.
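
A minimal sketch of that caching done by hand with the plain GLib.Regex library, to show what the compiler could generate later. The RegexCache class and its lookup() method are hypothetical names for illustration and are not part of the patch; the sketch is also not thread-safe.

```vala
// Hypothetical helper class: compile the pattern on first use, then reuse it.
class RegexCache {
    private static Regex? re;

    public static Regex lookup () throws RegexError {
        if (re == null) {
            // Verbatim string, so the backslash reaches the regex engine unchanged.
            re = new Regex ("""ab/(\d+)""", RegexCompileFlags.CASELESS);
        }
        return re;
    }
}

static int main (string[] args) {
    string[] inputs = { "AB/42", "ab/7" };
    try {
        MatchInfo info;
        // Every iteration shares one compiled Regex; only the first lookup compiles it.
        foreach (string s in inputs) {
            if (RegexCache.lookup ().match (s, 0, out info)) {
                stdout.printf ("%s matches ('%s')\n", s, info.fetch (1));
            }
        }
    } catch (RegexError err) {
        stderr.printf ("Invalid regex: %s\n", err.message);
        return 1;
    }
    return 0;
}
```

With a regex literal the pattern is already checked by the compiler, so the try/catch around the lookup would no longer be needed.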
Comment 1 Jukka-Pekka Iivonen 2010-01-22 20:22:12 UTC
Here is another example:

using GLib;


static int main (string[] args)
{
        MatchInfo info;

        // Simple greedy regular expression matching.
        var str1 = "mississippi";
        if (@/(is.*ip)/.match (str1, 0, out info)) {
                stdout.printf ("Part of %s is '%s'...\n", str1, info.fetch (1));
        } else {
                stdout.printf ("Did not match at all.\n");
        }

        // Match caseless.
        var str2 = "demonStration";
        if (@/mon(str.*o)n/i.match (str2, 0, out info)) {
                stdout.printf ("Part of %s is '%s'...\n", str2, info.fetch (1));
        } else {
                stdout.printf ("%s did not match at all.\n", str2);
        }

        // Match and pick substrings.
        var ts   = "Time: 10:42:12";
        if (@/Time: (..):(..):(..)/.match (ts, 0, out info)) {
                stdout.printf ("%s\n\thours = %s\n\tminutes = %s\n\tseconds = %s\n\n", ts, info.fetch (1), info.fetch (2), info.fetch (3));
        }

        // Replace demo: word swapping
        try {
                var str = "apple grape";
                stdout.printf ("'%s' becomes '%s'\n", str, @/^([^ ]*) *([^ ]*)/.replace (str, -1, 0, """\2 \1"""));
        } catch (RegexError err) {
                // Replacing still needs exception catching
                message (err.message);
        }

        return 0;
}
Comment 2 pancake 2010-01-25 16:23:42 UTC
That's one of the things I love about Perl and Ruby :) and I was also thinking about
having support for this in Vala. Good point!

But it would probably be nice to reduce the syntax required to use them.

The MatchInfo can be allocated by Vala instead of by the user, and the info.fetch() calls can be wrapped by special variables $1, $2, ... This will make the support of regular expressions a bit harder to implement, but will make the syntax much simpler and more readable.

We can also add the =~ operator to do things like:

if ("Time: 10:42:12" =~ @/Time: (..):(..):(..)/) {
  print (@"hours   = $1\n");
  print (@"minutes = $2\n");
  print (@"seconds = $3\n");
}

One of the problems I have always seen in regular expression support in languages is that the expressions are compiled every time you use them, instead of caching the compiled regexp and then just reusing that instance. This makes execution a bit slower.

Replacing strings can be something like:

var str = "apple grape";
str =~ @s/^([^ ]*) *([^ ]*)/\2 \1/;
print ("%s\n", str); // must print 'grape apple'
Comment 3 Jukka-Pekka Iivonen 2010-01-25 19:12:14 UTC
Yes, I think $<n> would be a very nice extension for matching. There are, of course, several details to settle in supporting this, like:

 * is the value of $3 (from a previous match) changed (or cleared) if the current match only captures two substrings?
 * what is the scope of these $-variables?
 * error handling, e.g. when the regex was not created from a plain regex literal

I like the idea of replacements too. I just wanted to start from something simple (plain literals), which is the easiest case but eliminates most of the unnecessary exception handling from the source by letting the compiler do the checking. I would also like to cache the regex in a static variable, but I didn't find my way in the Vala compiler source to do it easily (no local constants supported at the moment).
Comment 4 Jürg Billeter 2010-01-29 18:35:27 UTC
*** Bug 584968 has been marked as a duplicate of this bug. ***
Comment 5 Jürg Billeter 2010-03-19 18:33:40 UTC
Thanks for the patch. Did you encounter syntax conflicts, or why did you use @/.../ instead of the more common /.../?
Comment 6 Jukka-Pekka Iivonen 2010-03-20 13:40:10 UTC
Actually I'm not sure how easy it would be to scan /.../ compared to @/.../ but I wanted to be sure. At least in the case below the scanner does not have to know the context if there is this additional @-character.

var x = b/a/c;
Comment 7 Jukka-Pekka Iivonen 2010-03-20 15:09:22 UTC
In Perl each variable has a $-prefix. Thus, $b/$a/$c is not a problem in Perl. I think you either have to separate the division operator from the reg.expr. literal syntax, or separate variables somehow so that they will not be mixed with reg.expr. content (and the latter is not possible without breaking Vala syntax).
Comment 8 Jürg Billeter 2010-03-21 19:42:24 UTC
Yes, it's certainly not trivial to parse this without @ due to ambiguities. However, JavaScript does this as well, for example. Assuming there is no technical reason why this should be more problematic in Vala than in JavaScript, I'd really like to avoid the extra @.

A question regarding the patch, what's the reason to have separate Regex and RegexLiteral classes? Are they ever used separately?
Comment 9 Jukka-Pekka Iivonen 2010-03-21 20:48:29 UTC
I'm happy if we can get rid of @. I just thought it would be nicer if the scanner does not have to be aware of the parsing context, but it is probably not a big issue.

RegexLiteral and Regex can be combined. Regex inherits Expression and RegexLiteral inherits Literal which inherits Expression. Probably RegexLiteral would be the one to keep?
Comment 10 Jürg Billeter 2010-03-21 21:23:54 UTC
(In reply to comment #9)
> I'm happy if we can get rid of @. I just thought it would be nicer if the
> scanner does not have to be aware of the parsing context, but it is probably
> not a big issue.

Yes, keeping the scanner simple is certainly desirable. However, I think it's worth it in this case to achieve the same syntax as in JavaScript (among others).

> RegexLiteral and Regex can be combined. Regex inherits Expression and
> RegexLiteral inherits Literal which inherits Expression. Probably RegexLiteral
> would be the one to keep?

Sounds good to me.
Comment 11 Jukka-Pekka Iivonen 2010-03-22 16:47:39 UTC
Created attachment 156768
Second version of regular expression literals for Vala

This version removes the redundant class Regex. It also uses static variables to store the literals; they are created only once.

The syntax is still @/.../ but I will try to fix that next.
Comment 12 Jukka-Pekka Iivonen 2010-03-22 20:36:24 UTC
Created attachment 156801
Third version of regular expression literals

This patch uses /.../ syntax. It was much easier than I originally thought. I only needed to add a previous property to the scanner and check that the previous token is in a set. At least it works for me :-)
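
The previous-token check can be sketched roughly as follows. This is only an illustration under assumptions: the Tok enum and slash_starts_regex() are hypothetical names, and the set of tokens the patch actually tests is not listed in this thread.

```vala
// Hypothetical token kinds for illustration; not the real Vala.TokenType enum.
enum Tok {
    NONE,           // start of input
    IDENTIFIER,
    NUMBER,
    CLOSE_PARENS,
    CLOSE_BRACKET,
    ASSIGN,
    OPEN_PARENS,
    COMMA
}

// Returns true when a '/' seen right after `previous` should start a regex literal.
static bool slash_starts_regex (Tok previous) {
    bool value_just_ended = false;
    switch (previous) {
    case Tok.IDENTIFIER:
    case Tok.NUMBER:
    case Tok.CLOSE_PARENS:
    case Tok.CLOSE_BRACKET:
        // A value has just ended, so '/' can only be the division operator.
        value_just_ended = true;
        break;
    default:
        // After an operator, '(' or ',' an expression is expected instead.
        break;
    }
    return !value_just_ended;
}

static int main (string[] args) {
    // var x = b/a/c;   the first '/' follows the identifier b
    stdout.printf ("after identifier: %s\n", slash_starts_regex (Tok.IDENTIFIER).to_string ());
    // var re = /a/;    the '/' follows '='
    stdout.printf ("after '=': %s\n", slash_starts_regex (Tok.ASSIGN).to_string ());
    return 0;
}
```

The scanner only has to remember one token of history, so it stays free of real parser context.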

If this is now OK for inclusion I could start working on the rest of the regexp support. So maybe we keep this bug open?

What would be the matching syntax? Taken from Perl it could be

  if (s=~/(foo)(\w+)/) {
     print ($1 + $2 + "=" + $$);
  }

Replacements would be

  var replaced=~s/aa/bb/;
Comment 13 Ali Sabil 2010-03-22 20:39:34 UTC
Is everyone convinced that this should be included in Vala? I am not quite convinced myself, but I might be missing something.
Comment 14 Jürg Billeter 2010-03-22 20:45:39 UTC
(In reply to comment #13)
> Is everyone convinced that this should be included in Vala? I am not quite
> convinced myself, but I might be missing something.

I'm planning to add it to Vala, however, marked as experimental syntax. This means that you get a warning when you use it without --enable-experimental. This should allow us to get a better feeling for a definite decision at a later point.
Comment 15 Jürg Billeter 2010-03-22 20:54:59 UTC
(In reply to comment #12)
> Created an attachment (id=156801)
> Third version of regular expression literals
> 
> This patch uses /.../ syntax. It was much easier than I originally thought. I
> only needed to add a previous property to the scanner and check that the
> previous token is in a set. At least it works for me :-)

Great, I'll try to review the patch as soon as possible.

> If this is now OK for inclusion I could start working on the rest of the regexp
> support. So maybe we keep this bug open?
> 
> What would be the matching syntax? Taken from Perl it could be
> 
>   if (s=~/(foo)(\w+)/) {
>      print ($1 + $2 + "=" + $$);
>   }
> 
> Replacements would be
> 
>   var replaced=~s/aa/bb/;

I'm not sure whether more syntactic sugar is necessary than the main /.../ syntax from your current patch. The main /.../ regex syntax has, in my opinion, significant advantages such as possible compile-time checking of constant regular expressions and thus avoiding RegexError handling. It also makes it easy to use regular expressions efficiently (that is, without recompiling them every time you use them). For these reasons, I think it makes sense to support special syntax for it.

On the other hand, the advantages of matching and replacing syntax are less important, in my opinion, yet they are quite invasive as in the above example.
Comment 16 Jürg Billeter 2010-03-22 21:04:10 UTC
I forgot to mention that another important advantage of regular expression literals is to avoid the awkward double escaping due to backslash being used as the escape character in both string literals and regular expressions.
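
To illustrate the double escaping, here is a small sketch using only the plain GLib.Regex library. The matches() helper is a hypothetical name introduced for this example.

```vala
// Hypothetical helper: compile `pattern` and test it against `subject`.
static bool matches (string pattern, string subject) throws RegexError {
    var re = new Regex (pattern);
    MatchInfo info;
    return re.match (subject, 0, out info);
}

static int main (string[] args) {
    // Regular string literal: the backslash must itself be escaped for the Vala scanner.
    string escaped = "ab/(\\d+)";
    // Verbatim string literal: no string-level escaping is applied.
    string verbatim = """ab/(\d+)""";
    // The regex literal form from this bug would be /ab\/(\d+)/,
    // where only the '/' delimiter needs a backslash.

    stdout.printf ("same pattern: %s\n", (escaped == verbatim).to_string ());
    try {
        stdout.printf ("matches ab/7: %s\n", matches (escaped, "ab/7").to_string ());
    } catch (RegexError err) {
        stderr.printf ("Invalid regex: %s\n", err.message);
        return 1;
    }
    return 0;
}
```

Both string forms hand the same pattern to the regex engine; the difference is only in how much escaping the source code needs.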
Comment 17 Jukka-Pekka Iivonen 2010-03-23 14:12:06 UTC
Created attachment 156870
Fourth version of regular expression literals

This version fixes the regex variable naming issues when the same file has multiple regex literals in different methods. In addition, a regex literal can now be returned directly from a method and placed into an array.
Comment 18 Jukka-Pekka Iivonen 2010-03-23 14:30:31 UTC
Created attachment 156875
Fifth version of regular expression literals

This one adds support for regular expression literals to be used inside ?? (coalescing) expressions.
Comment 19 Ali Sabil 2010-03-23 14:32:37 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > Is everyone convinced that this should be included in Vala? I am not quite
> > convinced myself, but I might be missing something.
> 
> I'm planning to add it to Vala, however, marked as experimental syntax. This
> means that you get a warning when you use it without --enable-experimental.
> This should allow us to get a better feeling for a definite decision at a later
> point.

If Vala gets support for regex syntax, can we also expect support for inline XML and JSON, and also a syntax for parsing binary data?
Comment 20 Jukka-Pekka Iivonen 2010-03-24 08:12:40 UTC
Created attachment 156947
Sixth version of regular expression literals

This fixes several missing escape sequences. In addition, the test example is extended quite a lot. I think this is complete now (the famous last words) :-)
Comment 21 Jukka-Pekka Iivonen 2010-03-24 08:31:59 UTC
Created attachment 156949
Missing new files for the sixth version

The previous patch was missing the new files. They are here.
Comment 22 Jürg Billeter 2010-03-24 09:52:13 UTC
Created attachment 156956
git patch

Thanks for the updates. I've marked the feature as experimental and attached it as a git patch. One remaining issue we have here is that this makes Vala require GLib 2.14, while we currently only require GLib 2.12. Maybe it's time to update that requirement as we've accidentally broken (and later fixed) it various times already and it also can't generate thread-safe get_type functions when targeting 2.12.
Comment 23 Jukka-Pekka Iivonen 2010-03-24 13:23:32 UTC
> One remaining issue we have here is that this makes Vala
> require GLib 2.14, while we currently only require GLib 2.12.

I think the problem was maemo/Diablo still using GLib 2.12, but the Fremantle release changed that.
Comment 24 Jukka-Pekka Iivonen 2010-03-24 21:01:07 UTC
Created attachment 157022
Performance demonstration

Native regex literals make the code run over 4x faster compared to the case where the regex is built every time and then matched once. See the attached demo. Of course it can be done manually with the plain library, but that makes the code even longer and more complex to read.

library:	4.76206 sec => 104996 per sec
native: 	1.11075 sec => 450147 per sec (+328.7%)

On 2GHz Intel Core 2 Duo with 2GB 1067 MHz DDR3 (Mac OS X 10.5.8).
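
The attached demo is not reproduced in this thread. A rough sketch of the kind of comparison described, assuming hypothetical helpers count_rebuilt() and count_cached() and an arbitrary iteration count, could look like this; the timings it prints will differ from the figures above.

```vala
// Rebuilds the Regex on every iteration, as naive library code would.
static int count_rebuilt (string subject, int iterations) throws RegexError {
    int hits = 0;
    MatchInfo info;
    for (int i = 0; i < iterations; i++) {
        var re = new Regex ("Time: (..):(..):(..)");
        if (re.match (subject, 0, out info)) {
            hits++;
        }
    }
    return hits;
}

// Compiles the Regex once and reuses it, which is what a cached literal amounts to.
static int count_cached (string subject, int iterations) throws RegexError {
    int hits = 0;
    MatchInfo info;
    var re = new Regex ("Time: (..):(..):(..)");
    for (int i = 0; i < iterations; i++) {
        if (re.match (subject, 0, out info)) {
            hits++;
        }
    }
    return hits;
}

static int main (string[] args) {
    int iterations = 100000; // arbitrary value chosen for this sketch
    string subject = "Time: 10:42:12";
    try {
        var timer = new Timer (); // a new Timer starts running immediately
        int rebuilt_hits = count_rebuilt (subject, iterations);
        double rebuilt_secs = timer.elapsed ();

        timer.start (); // restart the clock for the second measurement
        int cached_hits = count_cached (subject, iterations);
        double cached_secs = timer.elapsed ();

        stdout.printf ("library: %f sec (%d matches)\n", rebuilt_secs, rebuilt_hits);
        stdout.printf ("cached:  %f sec (%d matches)\n", cached_secs, cached_hits);
    } catch (RegexError err) {
        stderr.printf ("Invalid regex: %s\n", err.message);
        return 1;
    }
    return 0;
}
```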
Comment 25 Jürg Billeter 2010-03-25 09:45:45 UTC
commit 1afe020286302dcce26abc19ed559da05d21e3eb
Author: Jukka-Pekka Iivonen <jp0409@jippii.fi>
Date:   Wed Mar 24 10:07:32 2010 +0100

    Add experimental support for regular expression literals
    
    Fixes bug 607702.