GNOME Bugzilla – Bug 577676
XPath 2.0-style regular expressions (also used in EXSLT, XSLT 2.0, XQuery 1.0)
Last modified: 2021-07-05 13:20:40 UTC
I know that libxml already provides code for XML Schema Datatypes-style regular expressions [XSD], but it would really be nice if it could also provide support for the slightly different form(s) of regular expressions used in JavaScript [ECMA-262 pages 129-145], EXSLT (which just references JavaScript's RegExp syntax -- see [EXSLT-regexp]), and XSLT 2.0/XPath 2.0/XQuery 1.0 [xpath-functions]. It appears that both types are closely based on Perl's regular expression syntax (big surprise!), but more closely in the JavaScript/XPath cases.

As far as the regular expressions themselves go, the XSD regexps [XSD] are missing:

* "^" and "$" (regexps are implicitly anchored at the beginning and end of the string in XSD). These should be fairly trivial to support with the existing code.
* "Reluctant" versions of quantifiers (??, *?, +?, etc.) (Perl calls these "non-greedy"). These don't do anything interesting when just checking for matches, but are quite important when doing search or capturing sub-expression matches.
* Sub-expression (group) capture of the text matched by parenthesized portions of the regular expression.
* Back references to captured text (to match it again).

The following JavaScript regexp features seem to be missing even from [xpath-functions]:

* "\b" and "\B" assertions (for "at word boundary" and "not at word boundary", respectively, based on \w and \W).
* "(?: ... )", "(?= ... )", and "(?! ... )" atoms. "(?: ... )" is just a capture-free version of "( ... )". "(?= ... )" and "(?! ... )" assert that the nested regexp does or does not (respectively) match the next portion of the input, but do not consume it.

Both [xpath-functions] and JavaScript have "flag characters" that affect the semantics of the match:

* "i": ignore case in matching.
* "m": multi-line; ^ and $ match immediately after/before a newline, not just at beginning/end of string.
* "g" (JavaScript only): complicated stateful behavior, probably not intended to apply to the EXSLT case.
* "s" ([xpath-functions] only): single-line; allow "." to match newlines.
* "x" ([xpath-functions] only): remove (most) whitespace from the regexp before matching. (Unlike Perl, it doesn't support the # metacharacter here, though ...)

References

[ECMA-262] Standard ECMA-262: ECMAScript Language Specification. http://www.ecma-international.org/publications/standards/Ecma-262.htm
[EXSLT-regexp] EXSLT - regexp:match. http://www.exslt.org/regexp/functions/match/index.html
[xpath-functions] XQuery 1.0 and XPath 2.0 Functions and Operators. http://www.w3.org/TR/2007/REC-xpath-functions-20070123/#regex-syntax
[XSD] XML Schema Part 2: Datatypes Second Edition. http://www.w3.org/TR/xmlschema-2/#regexs
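
For illustration only, the differences listed in this request can be seen with Python's re module, which is also Perl-derived. This is a sketch and not libxml2 code: Python's syntax is not identical to the XPath 2.0 or JavaScript dialects, re.fullmatch is only an approximation of XSD's implicit anchoring, and the helper names xsd_style_match and search_style_match are made up for this example.

```python
import re

# XSD-style matching is implicitly anchored: the whole string must match.
# re.fullmatch approximates that; re.search finds a match anywhere,
# which is what "^" and "$" become necessary for.
def xsd_style_match(pattern, text):
    return re.fullmatch(pattern, text) is not None

def search_style_match(pattern, text):
    return re.search(pattern, text) is not None

text = "<b>one</b><b>two</b>"

# Greedy vs. reluctant quantifiers: both patterns "match" the text,
# but the captured sub-expression differs.
greedy = re.search(r"<b>(.*)</b>", text).group(1)      # 'one</b><b>two'
reluctant = re.search(r"<b>(.*?)</b>", text).group(1)  # 'one'

# Back reference: \1 must repeat whatever group 1 captured.
# \b keeps the match on word boundaries.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group(1)  # 'was'

# Lookahead asserts that "bar" follows, without consuming it.
lookahead = re.search(r"foo(?=bar)", "foobar").group(0)  # 'foo'

# Flag characters, using Python's equivalents of i, m, s and x.
ignore_case = re.fullmatch(r"abc", "ABC", re.I) is not None
multi_line = re.findall(r"^\w+$", "one\ntwo", re.M)
dot_all = re.search(r"a.b", "a\nb", re.S) is not None
extended = re.fullmatch(r"a b c", "abc", re.X) is not None
```

Note that the greedy and reluctant patterns agree on whether the text matches at all, which is why reluctant quantifiers only start to matter once search or group capture is supported.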
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.