GNOME Bugzilla – Bug 304020
better parsing error handling in domparser
Last modified: 2012-02-15 09:54:33 UTC
Version details: libxml++-2.9.2 Distribution/Version: debian to obtain a message more precise than "Document not well formed." if a parsing error occured in DomParser i slightly modified the parse_context function of the pbject and add a fonction to format an error message given a xmlErrorPtr. Modified parse_context function: void DomParser::parse_context() { KeepBlanks k(KeepBlanks::Default); //The following is based on the implementation of xmlParseFile(), in xmlSAXParseFileWithData(): //and the implementation of xmlParseMemory(), in xmlSaxParseMemoryWithData(). initialize_context(); xmlParseDocument(context_); check_for_exception(); if((!context_->wellFormed)||(context_->errNo != 0)) { std::string errStr="Document not well-formed."; xmlErrorPtr tmpXmlErrorPtr=xmlCtxtGetLastError(context_); errStr+=" "+xml_format_error(tmpXmlErrorPtr); xmlpp::Parser::release_underlying(); throw xmlpp::parse_error(errStr); } doc_ = new Document(context_->myDoc); // This is to indicate to release_underlying that we took the // ownership on the doc. context_->myDoc = NULL; //Free the parse context, but keep the document alive so people can navigate the DOM tree: //TODO: Why not keep the context alive too? Parser::release_underlying(); check_for_exception(); } added xml_format_error: std::string DomParser::xml_format_error(xmlErrorPtr err) { if(err==NULL) return ""; std::ostringstream returned_error_string; switch(err->domain) { case XML_FROM_NONE: returned_error_string << "XML_FROM_NONE";break; case XML_FROM_PARSER: returned_error_string << "XML_FROM_PARSER";break; case XML_FROM_TREE: returned_error_string << "XML_FROM_TREE";break; case XML_FROM_NAMESPACE: returned_error_string << "XML_FROM_NAMESPACE";break; case XML_FROM_DTD: returned_error_string << "XML_FROM_DTD";break; case XML_FROM_HTML: returned_error_string << "XML_FROM_HTML";break; case XML_FROM_MEMORY: returned_error_string << "XML_FROM_MEMORY";break; case XML_FROM_OUTPUT: returned_error_string << "XML_FROM_OUTPUT";break; case XML_FROM_IO: returned_error_string << "XML_FROM_IO";break; case XML_FROM_FTP: returned_error_string << "XML_FROM_FTP";break; case XML_FROM_HTTP: returned_error_string << "XML_FROM_HTTP";break; case XML_FROM_XINCLUDE: returned_error_string << "XML_FROM_XINCLUDE";break; case XML_FROM_XPATH: returned_error_string << "XML_FROM_XPATH";break; case XML_FROM_XPOINTER: returned_error_string << "XML_FROM_XPOINTER";break; case XML_FROM_REGEXP: returned_error_string << "XML_FROM_REGEXP";break; case XML_FROM_DATATYPE: returned_error_string << "XML_FROM_DATATYPE";break; case XML_FROM_SCHEMASP: returned_error_string << "XML_FROM_SCHEMASP";break; case XML_FROM_SCHEMASV: returned_error_string << "XML_FROM_SCHEMASV";break; case XML_FROM_RELAXNGP: returned_error_string << "XML_FROM_RELAXNGP";break; case XML_FROM_RELAXNGV: returned_error_string << "XML_FROM_RELAXNGV";break; case XML_FROM_CATALOG: returned_error_string << "XML_FROM_CATALOG";break; case XML_FROM_C14N: returned_error_string << "XML_FROM_C14N";break; case XML_FROM_XSLT: returned_error_string << "XML_FROM_XSLT";break; case XML_FROM_VALID: returned_error_string << "XML_FROM_VALID";break; case XML_FROM_CHECK: returned_error_string << "XML_FROM_CHECK";break; case XML_FROM_WRITER: returned_error_string << "XML_FROM_WRITER";break; } switch (err->level) { case XML_ERR_NONE: returned_error_string << "(): "; break; case XML_ERR_WARNING: returned_error_string << "(warning): "; break; case XML_ERR_ERROR: returned_error_string << "(error): "; break; case XML_ERR_FATAL: returned_error_string << "(fatal): "; break; } if(err->message!=NULL) returned_error_string << err->message; else returned_error_string << "unknown error\n"; if(err->file!=NULL) { returned_error_string << " in file " << err->file ; } else { returned_error_string << " in Entity"; } returned_error_string << " on line " << err->line; return returned_error_string.str(); } in domparser.h: ... protected: virtual std::string xml_format_error(xmlErrorPtr err); ...
Could we have an actual patch please? That is much easier to review.
Created attachment 46538 [details] [review] targzipped file containing the two diffs beetween domparser.cc and domparser.h and the versions i use (prefixed with new_)
The patch looks ok, minus these 2 points : - The domparser.cc patch does not apply properly on cvs head version. - The domparser.h patch make domparser.h include a libxml header. This is not good : none of the libxml++ header does include a libxml header. The things we do when we need a pointer on libxml type in the header is writting a forward declaration of the needed structure. Otherwise it looks good.
ok, could you give me some more hint about the "forward declaration" stuff? I download the cvs head version now and make a new patch for it.
for example, in parser.h : 25 extern "C" { 26 struct _xmlParserCtxt; 27 } ... 99 _xmlParserCtxt* context_; instead of using directly xmlParserCtxtPtr.
It should also use Glib::ustring instead of std::string and the white space should be cleaned up to match the existing code. I'd also prefer to see human-readable errors in libxml rather than libxml++, if possible.
Having a closer look to the xmlError api (http://www.xmlsoft.org/html/libxml-xmlerror.html), I think there is something we can do to extend the error-reporting to the whole library. xmlpp::exception, or one of it's children, should take a xmlErrorPtr as parameter, and do the error message formatting in what(). This would put the error formatting code in a better place than domparser. This exception, in it's constructor/destructor, would use xmlCopyError and xmlResetError, so there is no memory handling to do with the result of getlasterror. Thoughts ?
yes, this would be the better way, as this error handling would be available for all the parsers. But doesn't it means that every call to throw exception will have to be changed?
Sooner or later yes. But as long as it's not breaking the ABI, it's not a problem (I count 45 occurences of the token "throw" in the source code).
Created attachment 47710 [details] [review] a tar bzipped file of patches for exception and parser_error classes I modify the exception and parse_error file following your idea; i cant test it now but it compiles. can you tell me if it'ok for you
I don't have time for a good review right now. However, - I see in exception.cc the string "Document not well-formed.". This is probably incorrect in most uses of xmlpp::exception. - The exception contructor takes a _xmlParserCtxt*. Again, if the goal is to make it generic, we should be able to use it on any error, not only parse errors. This means it should take a _xmlError*. Last thing. If we are going to make all the error management a better wrapper around libxml2 error handling, I think a free function like "wrap_libxmlpp_error(_xmlError const *)" should be usefull. It would, considering the content of the error, choose the right exception, instanciate it and throw it. Thoughts ?
I'd rather not add something that should be in libxml without even trying to add it to libxml. We'd just have to retrofit it again later.
libxml has a pretty good error handling system. The idea is just to make a better wrapping of it. So I don't really see what you mean by adding something that should be in libxml.
Oh, I see, this seems to be mostly about using xmlCtxtGetLastError(), which we do not use yet. However, this needs to be called at the time the expection is thrown, not some time later when the exception is examined. And this would be much easier to understand as a regular cvs patch. Please resubmit it using cvs diff -up > something.patch
Exactly. As said in comment #11, I'd except a function like wrap_libxmlpp_error(_xmlError const *), or a throw_ctxt_last_error() to throw the correct libxml++ exception.
Christophe, do you feel like updating your patch with this improvement?
Christophe?
(In reply to comment #17) > Christophe? > Hi Murray, The patch is not mine in the first place, and to be franc I have not touched libxml++ nor a c++ line in the past 3 years. Moreover I would hardly find some time to do this. So I do not think I am the right person to do this (and although I am officially still maintainer of libxml++ you have been doing the job since I announce my will to resign from this role - a big thank for that by the way). Best regards, Christophe
I'd still really like someone to provide a patch.
Created attachment 204589 [details] [review] patch: Parser: Throw more detailed error messages. Here's a fairly big patch. It turned out to be a bigger job than I thought at first (as often happens with software jobs). Formatting the last error after parsing is certainly an improvement, but it's still not nearly as good as it can be. libxml contains quite an advanced error handling. It can report all errors, if there are several errors in a parsed file. Certainly you want to see all errors at once, not just the last one. The problem with libxml's error reports is that everything is written to stderr. That's not bad for a program that is started from a command-line, but if it's started from a GUI, the user will probably not see stderr. Messages from validation are caught by callback functions Parser::callback_validity_error() and callback_validity_warning(). I have added similar callback functions to catch parser messages. All these callbacks get more information than callback_validity_*() have previously saved. I improved that a bit. Unfortunately it's not possible to take advantage of the functions in libxml that format the messages. E.g. xmlReportError() is a local (static) function in error.c. Formatting is not clearly separated from printing. (Should be in separate functions.) And some of the contents of struct _xmlError are poorly documented. The result is that the callbacks in xmlpp::Parser save messages that are not quite as good as what libxml can write to stderr, but it's still a great improvement from just "Document not well formed." --- An example of output from libxml++/examples/dom_parser: Exception caught: Parser error: File example_invalid2.xml, line 12, column 33 (fatal): EntityRef: expecting ';' Validity error: File example_invalid2.xml, line 10, column 5 (error): Element examplechild content does not follow the DTD, expecting (child_of_child)+, got (CDATA child_of_child ) --- If libxml writes to stderr, the result is: example_invalid2.xml:10: element examplechild: validity error : Element examplechild content does not follow the DTD, expecting (child_of_child)+, got (CDATA child_of_child ) </examplechild> ^ example_invalid2.xml:12: parser error : EntityRef: expecting ';' Some content. &wwwmurrayc ^ Exception caught: Document not well-formed. File example_invalid2.xml, line 12, column 33 (fatal): EntityRef: expecting ';'
OK. Many thanks. Please push and let's see how it feels.
http://git.gnome.org/browse/libxml++/commit/?id=a6ac0b12f2574570f101450b880329eba7cfd6f2
Forgot to mark new public functions with @newin{2,36}. Fixed with this commit http://git.gnome.org/browse/libxml++/commit/?id=c32f340e9c78c4cf795df12c67432c3c77cedc15