GNOME Bugzilla – Bug 496847
GMarkup utility function for attribute collection
Last modified: 2007-11-27 01:41:23 UTC
when using GMarkup you get callbacks for start element that contain the attribute names and values in strv's. this is utterly inconvenient. it would be nice to have a utility function that you could use like: static void my_start_element (GMarkupParseContext *context, const gchar *element_name, const gchar **attribute_names, const gchar **attribute_values, gpointer user_data, GError **error) { if (!strcmp (element_name, "mytag")) { gboolean flag1, flag2; char *attr1, *attr2; if (!g_markup_collect_attrs (element_name, attribute_names, attribute_values, error, "attr1", &attr1 "?attr2", &attr2, "?attr3", NULL, "~flag1", &flag1, "?~flag2", &flag2, NULL)) return; ... do stuff ... } } "attr1" means that attr1 is a required tag and it will be stored into the char *attr1. if attr1 is not found then 'error' will be set accordingly. "?attr2" means that attr2 is optional -- if the tag is not found then NULL is stored instead. "?attr3" (and the NULL pointer) means that optional attribute attr3 is accepted but ignored. "~flag1" means that flag1 is a mandatory boolean value. "mandatory" has the slightly unusual meaning that, if not present, FALSE is assumed. "?~flag2" means that flag2 is an optional boolean value. "optional" means that if the value is not present then much like NULL is stored for strings a value that is neither TRUE not FALSE is stored into flag2. the meanings of "optional" and "mandatory" for booleans are a bit whacked, but i think these are the two most important cases to support. perhaps they can be supported through other prefixes... if any attribute is encountered that isn't in the list of acceptable attributes then an error is also set for that case. if a boolean field fails to parse then an error is set for that case. in all cases of error FALSE it returned. with no errors, TRUE. you could then further proceed to define a rather bling macro #define G_MARKUP_COLLECT(args...) \ g_markup_collect_attrs (element_name, \ attribute_names, \ attribute_values, \ error, args, NULL)) effectively gaining you the advantage of not having to type in "element_name", "attribute_names", "attribute_values" and "error" when you use the standard names for these variables as arguments to your start_element function ((it would be cool if GMarkupParser had a interface where all of these things were passed in as part of a single structure!)). additional note: sometimes you want the results to be g_strdup()'d for you and other times not. maybe it makes sense to have another prefix character for this...
+1. If we get into having flags then ints and doubles are needed too. If I was to refine your proposed api, instead of inventing tri-state vars, I'd make it just not touch flag1 if the attr is not present. Then you can set it to -1, FALSE, or TRUE as you need. I don't like the idea of inventing yet another parsing syntax though. Maybe we can reuse parts of the GOption API, but if you look at the data involved, they are pretty much the same thing, one working on an array of input options, the other on part of an XML element.
problem with not touching the value is that the "default" value of a boolean attribute may depend on the other attributes in the tag consider GtkBuilder. the swapped='' boolean on a signal connection is TRUE by default if connecting to an object but false otherwise...
oh wait. now i understand what you mean. you yourself could set it to -1 before calling in and then see if it still had the -1 value on return. cool idea. seems like much less of a hack than mine :)
I'm not really convinced. The ?~ thing and interpreting booleans does not really belong on this layer and looks like a big hack to me. "?attr3" (and the NULL pointer) means that optional attribute attr3 is accepted but ignored. How is that useful ? If anything, I'd move the ? and similar modifiers out of the attribute name and into a separate argument, like if (!g_markup_collect_attrs (element_name, attribute_names, attribute_values, error, MANDATORY, "attr1", &attr1 OPTIONAL, "attr2", &attr2, BOOLEAN, "flag1", &flag1, OPTIONAL|BOOLEAN, "flag2", &flag2, NULL)) But I am not convinced that the syntax for attributes that are interpreted as booleans is uniform enough to warrant special support for booleans here. Also, this doesn't address the one performance issue that gmarkup has, namely that handling markup with lots of attributes is not very efficient (ebassi did some performance tests on this while working on the gtkrecent xbel parser).
you're probably right about the booleans-not-belonging-here thing. it really adds an awful lot of cruft to the API. if that leaves the only prefix as "?" for optional i think that's more reasonable than having a separate flags field... the use for NULL pointer is for when you want to allow but ignore certain attributes (like translation context or this strange "last_modified_time" on <signal> in GtkBuilder).
Optional attrs are useful when you want to make it (have an option for) erring on unknown attributes. As for the performance problem, you are right. And Pango switches on the first letter of the attribute first and strcmp after. I still think something ala GOption with an struct would be more useful. I may give it a try to reuse the same struct for this kind of thing.
Created attachment 99499 [details] [review] g_markup_collect_attributes() Modified from my original suggestion to get rid of the "?~magic". Instead, use an explicit "type" field. More typing, but saner. If you don't like the long names of the properly-namespaced enum you can always invent your own #define's. Has a magical "tristate" feature that you might not like but I really think is important....
Looks much better to me than the first version; here some more pedantic criticism: The docs for the collect type enum should perhaps explain the difference between G_MARKUP_COLLECT_BOOLEAN|G_MARKUP_COLLECT_OPTIONAL and G_MARKUP_COLLECT_TRISTATE. Does G_MARKUP_COLLECT_TRISTATE|G_MARKUP_COLLECT_OPTIONAL make sense ? The docs for G_MARKUP_COLLECT_BOOLEAN need to explain the exact attribute values that are accepted as TRUE/FALSE. The docs for G_MARKUP_COLLECT_OPTIONAL should state if the collection arguments stay untouched in case the argument is missing, or if they are set to some default value such as NULL or FALSE. Maybe you want to add a disclaimer that the GMarkupCollectType enumeration may grow in the future ? I dislike @returns in doc comments... I think the doc comments need to emphasize the point that you have to collect all arguments when using this function (ie the need to specify silently ignored attributes, too). It also feels like a shortcoming that there is no way to ignore all unknown attributes. Not clear to me what the last sentence in the doc comment means. The doc comments could do with an instructive example. Ah, while reading the implementation, I now understand that the last sentence is explaining the bug that the function may return an error after copying a number of attributes. It would clearly be preferable to clean that up in the error case. Also, the implementation is clearly not going to speed up any parsers, looping over the attribute names again and again. In practice, it is probably not going to be noticeable. The optimization you have to skip already-collected attributes is going to break obscure cases like g_markup_collect_attributes (elt, atts, vals, error, G_MARKUP_COLLECT_STRING, "foo", &foo, G_MARKUP_COLLECT_BOOLEAN, "foo", &boo, G_MARKUP_COLLECT_INVALID); g_print ("foo boolean value: %d (exact string was: %s)\n", boo, foo); I think we can live with this restriction, but it should be noticed in the documentation ("attributes can not be collected multiple times") /* constructivists rejoice! * neither false nor true... */ The law of the excluded middle is hardly a constructivist invention... Being an old SGML guy, I'd prefer if the error messages spoke about element "foo", not tag <foo>, because elements have attributes, not tags... In the same vein, I consider "<%s %s='%s'> is not a boolean value", a confusing error message, unless we actually know that the attribute in question was the only one in the parsed start tag. Better say something like "element '%s', attribute '%s': value '%s' cannot be parsed as a boolean value" /* i wonder what valgrind says... */ not a useful comment to put there. Just run valgrind to find out... + for (j = 0; j < i; j++) + if (strcmp (attribute_names[i], attribute_names[j]) == 0) + /* duplicate! */ + break; I believe this check is wrong. What will be duplicate is the attribute names in the varargs, not the attribute names given to you by the parser. The patch is missing the addition of the symbol to glib.symbols. It would be great to have a few tests for this new api somewhere in tests/
The reason I avoid the frees is because it would be expensive to, for example, maintain a linked list of the allocations that have been performed thus far... It also strikes me as strange that you've modified the pointer but then the result of your modification has become invalid. I suppose you could deal with this by keeping a linked list of the pointer addresses (rather than the pointer itself) and then free and replace with NULL... An idea I had just now is that instead of keeping this list perhaps I could just do a second va_start() and sort of replay my actions... is multiple va_start() portable? I don't understand your comment about my duplicate check. I believe it to be correct -- i'm iterating over the attribute_names[] array which comes from GMarkup from the XML file...
We already use multiple va_start() in g_strconcat(), so I'd say it is portable enough for glib's purposes. Regarding the duplicate check, the thing you are trying to catch is the one I mentioned above, right ? g_markup_collect_attributes (elt, atts, vals, error, G_MARKUP_COLLECT_STRING, "foo", &foo, G_MARKUP_COLLECT_BOOLEAN, "foo", &boo, G_MARKUP_COLLECT_INVALID); Not this one: <foo bar="bla" bar="bla"> If you are in fact trying to catch the latter, then I'm kind of opposed to add this check in a convenience function. It is fine if GMarkup is tightened up to catch this wellformedness violation, but that belongs in the parser itself, not an optional convenience function. The former should indeed be catched by the collection function, but to catch it, you need to check duplicates in the va_list, not in element_names.
I agree with you completely about the proper place to add the check. The reason I assumed that it wasn't there was because maybe some others did not agree with you -- specifically, those who parse large documents with large numbers of attributes. Having duplicate attribute checking in the parser itself would cause an O(n^2) (triangular) slowdown on building the attribute list (since you'd have to backcheck)... unless we did something fancy like hashtables... In any case, you'd introduce a lot of strcmps... What do you think? As for programmer error... this is C... I think a note in the docs saying "don't do this!" is enough.... :)
Yeah, we'd have go to hash tables or so for doing the duplicate attribute check. If we do that, we could possibly also allow the start_element function access to that hash table, addressing the performance problem of iterating the string lists. Anyway, I'm fine with treating the dupe collection issue as a documentation problem.
how do you propose we would allow the start_element function access to the hash table? new API? we're getting closer to a parser overhaul here :p
clarification: by 'new api' i meant to ask "new parser context call" vs. "new user callback that takes different arguments than the current one" also... about the docs: the gmarkup documentation is quite clear that even though gmarkup may accept documents that are not valid xml, that doesn't mean that they are valid input.... duplicate attributes are not valid xml. we're well within our rights to change this behaviour :)
Oh, I didn't mean to handle "attribute hash tables" in this bug report. It is more like a side topic. GMarkupParser is not really extensible at all. So you'd probably have to add something like g_markup_context_get_attributes_table() that would return NULL unless called inside the start_element callback. It would get a bit ugly, certainly. Lets maybe get back to attribute collection...
no matter how you slice it, the collecting and dupe-noticing activity should be coupled for performance reasons, i so i guess the dup-collection can stay here? i'll cook a new version
Created attachment 99645 [details] [review] new g_markup_collect_attributes with suggested changes
please ignore all changes to docs/reference/glib/tmpl/ i don't understand why these files are even under version control these days...
They are under version control because they contain valuable content... The patch looks generally good to me, just a few things remaining: - Will the mixed enum/flags status of the collection type enum give bindings heartburn ? Probably not... - Do you want to add a remark stating that the collection type enum may be extended ? - While I am sure that it is no real problem to have individual tests under different licenses, I'd ideally like to avoid it.
Created attachment 99692 [details] [review] final patch 2007-11-26 Ryan Lortie <desrt@desrt.ca> Add new function g_markup_collect_attributes (bug #496847). * glib/glib.symbols: add g_markup_collect_attributes * docs/reference/glib/glib-sections.txt: * glib/gmarkup.h: * glib/gmarkup.c: add g_markup_collect_attributes and new enumerated type GMarkupCollectType. Add new error code G_MARKUP_ERROR_MISSING_ATTRIBUTE that is thrown by the attribute collector. Committed revision 5947.