GNOME Bugzilla – Bug 419376
Functions using named subpatterns behave inconsistently when G_REGEX_DUPNAMES is used
Last modified: 2007-05-29 09:40:39 UTC
It's not clear how g_regex_fetch_named(), g_regex_fetch_named_pos() and g_regex_get_string_number() should behave when G_REGEX_DUPNAMES is used, and PCRE's behavior seems inconsistent.
The way things currently are, you get one of the matches, which one being undefined, according to the pcre docs. I think this is fine, but the documentation needs to point it out. It would also be very good to include an example of appropriate uses of DUPNAMES in the docs, e.g. a(?'middle' c+)b|b(?'middle' d+)a .
From man pcrepattern:
---------------------
By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the PCRE_DUPNAMES option at compile time. This can be useful for patterns where only one instance of the named parentheses can match. ... The convenience function for extracting the data by name returns the substring for the first, and in this example, the only, subpattern of that name that matched.
---------------------
In any case, the current GRegex problem is (was) that it gets the named subpattern in the wrong way. Matthias, what do you mean by "this is fine"? I.e. what exactly is fine?
Both g_regex_fetch_named() and g_regex_fetch_named_pos() are broken because they were written before PCRE 6.7, the version that added PCRE_DUPNAMES. pcre_get_stringtable_entries() can be used to retrieve every subpattern with a given name. I'm not sure we need to wrap it, but we can use it internally.
pcre_get_stringnumber() returns one of the numbers that are associated with the name, but it is undefined which one. The man page says that pcre_get_named_substring() and pcre_copy_named_substring() call pcre_get_stringnumber(), and if it succeeds, they then call pcre_copy_substring() or pcre_get_substring(), as appropriate. If the return value of pcre_get_stringnumber() is undefined, then the return value of pcre_get_named_substring() and pcre_copy_named_substring() is undefined as well.
We could just say that the return value of the functions using named patterns is undefined, but I don't like that, so I'm going to look for a nicer solution (maybe using pcre_get_stringtable_entries()), but only after fixing bug #419368.
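To make the idea of retrieving every subpattern with a given name concrete, here is a minimal sketch in plain C. It is a simplified model, not PCRE's real name table layout or API: NameEntry and name_range() are hypothetical names introduced only for illustration, and the table is assumed to be sorted by name so that duplicates sit next to each other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified name table entry (not PCRE's real layout). */
typedef struct {
    const char *name;  /* subpattern name */
    int group;         /* capture group number */
} NameEntry;

/*
 * Find every entry called `name` in a table that is sorted by name,
 * so that duplicate names are adjacent.
 * Stores the index of the first and last matching entry and returns
 * how many entries matched; returns 0 if the name is not in the table
 * (in which case *first and *last are left untouched).
 */
static int name_range(const NameEntry *tab, int n, const char *name,
                      int *first, int *last)
{
    int count = 0;

    assert(tab != NULL && name != NULL && first != NULL && last != NULL);

    for (int i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        if (count == 0)
            *first = i;   /* remember where the run of duplicates starts */
        *last = i;        /* keep extending the run */
        count++;
    }
    return count;
}
```

A caller holding such a range can then inspect each group number in turn instead of relying on a single, arbitrarily chosen one.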
> Matthias, what do you mean by "this is fine"? I.e. what exactly is fine?

From my reading of the PCRE docs, DUPNAMES is only intended to be used if the pattern is such that only one of the identically named matches can happen at a time. If you have multiple matches with the same name, you must have violated that constraint, therefore it is fine to return an undefined result.
(In reply to comment #4)
> > Matthias, what do you mean by "this is fine"? I.e. what exactly is fine?
>
> From my reading of the PCRE docs, DUPNAMES is only intended to be used if the
> pattern is such that only one of the identically named matches can happen at
> a time. If you have multiple matches with the same name, you must have
> violated that constraint, therefore it is fine to return an undefined result.

OK, I thought you were talking about the general case, because what Marco said wasn't clear either:

> pcre_get_stringnumber() returns one of the numbers that are associated with
> the name, but it is undefined which it is.

Indeed it looks so (I bet it's the first subpattern with this name, and I bet it's just undocumented, not left undefined as some evil/wrong thing).

> The man page says that pcre_get_named_substring() and
> pcre_copy_named_substring() call pcre_get_stringnumber(), and if it succeeds,
> they then call pcre_copy_substring() or pcre_get_substring(), as appropriate.
> If the return value of pcre_get_stringnumber() is undefined, then also the
> return value of pcre_get_named_substring() and pcre_copy_named_substring() is
> undefined.

This is not so: the docs explicitly say get_named_substring() will return the first one matched. There is indeed something strange in the man page: it says "If the name is known to be unique (PCRE_DUPNAMES was not set), you can find the number from the name by calling pcre_get_stringnumber()." and it says "These functions call pcre_get_stringnumber(), and if it succeeds, they then call pcre_copy_substring() or pcre_get_substring(), as appropriate." The last sentence is probably just a leftover from an old version. get_named_substring() does the right thing (tested), and it is clearly intended to do the right thing.

Finally, using DUPNAMES when named matches are unique is not something illegal, and the documentation doesn't say it is. The docs say when it could be useful, as they are nice docs; they don't say "it is for this case", they say "it can be useful in this case".
Oops, pcre_get_stringnumber() actually returns a randomish subpattern: it uses binary search to find the name. But indeed get_named_substring() does not use get_stringnumber().
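The difference between the two lookups can be sketched in plain C. This is a simplified model written for illustration, not PCRE's source: NameEntry, lookup_any() and lookup_first_set() are hypothetical names, the table is assumed to be sorted by name, and matched[g] is assumed to be nonzero when group g took part in the match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified name table entry (not PCRE's real layout). */
typedef struct {
    const char *name;  /* subpattern name */
    int group;         /* capture group number */
} NameEntry;

/*
 * Binary search by name. With duplicate names this returns the group of
 * whichever entry the search happens to land on, matched or not.
 * Returns -1 if the name is not in the table.
 */
static int lookup_any(const NameEntry *tab, int n, const char *name)
{
    int lo = 0, hi = n;

    assert(tab != NULL && name != NULL);

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strcmp(name, tab[mid].name);
        if (c == 0)
            return tab[mid].group;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}

/*
 * Walk the table in order and return the first group with this name
 * that actually took part in the match. Returns -1 if there is none.
 */
static int lookup_first_set(const NameEntry *tab, int n, const char *name,
                            const int *matched)
{
    assert(tab != NULL && name != NULL && matched != NULL);

    for (int i = 0; i < n; i++) {
        if (strcmp(name, tab[i].name) == 0 && matched[tab[i].group])
            return tab[i].group;
    }
    return -1;
}
```

With two entries named "middle" for groups 1 and 2, and only group 1 set, the binary search in this model lands on the entry for group 2, while the in-order scan returns group 1. That is the kind of mismatch described above.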
Just to clarify:

What I wanted to say is that if you have a pattern of

a(?'m' b)c(?'m' d)e

I think it is fine for g_regex_fetch_named (regex, "m", "abcde") to return either "b" or "d". The documentation should clearly indicate that this pattern violates the constraints of DUPNAMES.
(In reply to comment #7)
> Just to clarify:
>
> What I wanted to say is that if you have a pattern of
>
> a(?'m' b)c(?'m' d)e
>
> I think it is fine for g_regex_fetch_named (regex, "m", "abcde")
> to return either "b" or "d".

No, it's not fine. pcre returns the first subpattern matched, so glib should do the same.

> The documentation should clearly
> indicate that this pattern violates the constraints of DUPNAMES.

There is no constraint like that (unless glib introduces its own constraints). That's what I was trying to say.
Patch at #419376 fixes this, everything is consistent (== works as pcre).