GNOME Bugzilla – Bug 539368
3-byte direct UTF8 triggers 2 calls to xmlpp::SaxParser::on_characters(const Glib::ustring&)
Last modified: 2020-11-12 09:29:07 UTC
Encountered with 3-byte diacriticized IPA symbols. Example: <label>g̚</label> results in a first call for "g", and a second call for "̚". This does not happen with 4-byte coded diacriticized symbols.
I think I can confirm this. With the attached patch to the SaxParser example, I get this output:
  node name=gjob:Application
  on_characters(): g
  on_characters():
  on_end_element()
  on_characters():
  node name=gjob:Category
  MySaxParser::on_characters(): Exception caught while converting text for std::cout: Invalid byte sequence in conversion input
  on_characters(): Development
  on_end_element()
  on_characters():
Presumably something is looking at the number of characters when it should be looking at the number of bytes, or vice versa.
Created attachment 113155 [details] sax_parser_bug.patch
CCing Daniel in case he can figure it out.
Note that I don't see that exception any more, probably because of my recent locale initialization fix in the examples, but I do see the two on_characters() calls where there should be one.
Caveat: I don't know the libxml or libxml++ API, so I have no idea how it is meant to work. That being said, I think what you are seeing here is a bug in libxml, if on_characters() is supposed to deliver blocks of text rather than single code points. It's not a question of bytes vs characters, but one of code points vs glyphs, i.e. if and how combining characters are handled.
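To make the byte / code point / glyph distinction concrete, here is a small sketch in plain C++ (no libxml++ needed). The helper name count_code_points is made up for illustration, and it assumes the input is already valid UTF-8. It counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of libxml2 or libxml++.
// Counts Unicode code points in a string that is assumed to be valid UTF-8:
// every byte that is NOT a continuation byte (10xxxxxx) starts a code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// The label text from the report: 'g' (U+0067, 1 byte) followed by
// the combining mark U+031A (2 bytes: 0xCC 0x9A).
const std::string reported_label = "g\xCC\x9A";
```

For the reported label, reported_label.size() is 3 (bytes) while count_code_points(reported_label) is 2, although the text renders as a single glyph. The two on_characters() calls in the report line up with the two code points.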
Daniel, please add this to your list of things to look at for Openismus. I doubt that libxml (compared to libxml++) would get this wrong, but I guess you can find out.
It's a libxml bug; I reproduced it with git master of libxml2 and the testSAX program.
I know it's a 2010 bug but... is this really a bug? Has this been reported to libxml2 and/or fixed? In the Java SAX API, the documentation for both org.xml.sax.DocumentHandler.characters() [0] and org.xml.sax.ContentHandler.characters() [1] says that "The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks". I got the same issue with two-byte UTF-8 characters (e.g. "é"), and "fixed" it by using a stringstream to append every chunk received between on_start_element and on_end_element.
[0] http://docs.oracle.com/javase/6/docs/api/org/xml/sax/DocumentHandler.html#characters(char[], int, int)
[1] http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html#characters(char[], int, int)
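A minimal sketch of that accumulate-until-end-element workaround, for anyone landing here. Assumptions: TextCollector is a made-up class that only mirrors the callback names of xmlpp::SaxParser and does not derive from it, so it compiles without libxml++. It takes std::string rather than Glib::ustring, appends to a std::string instead of a stringstream, and handles only leaf elements with no nested markup. The split delivery is simulated by hand rather than produced by a real parser.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical handler: buffers every chunk passed to on_characters()
// and only treats the text as complete in on_end_element().
class TextCollector
{
public:
    void on_start_element(const std::string& name)
    {
        current_element_ = name;
        buffer_.clear();  // start a fresh buffer for this element
    }

    void on_characters(const std::string& chunk)
    {
        buffer_ += chunk;  // may be called several times per text node
    }

    void on_end_element(const std::string& name)
    {
        assert(name == current_element_);  // leaf elements only in this sketch
        texts_.push_back(buffer_);         // the text is complete now
        buffer_.clear();
    }

    const std::vector<std::string>& texts() const { return texts_; }

private:
    std::string current_element_;
    std::string buffer_;
    std::vector<std::string> texts_;
};

// Simulates the delivery described in this report: the content of
// <label> arrives as "g" and then the combining mark U+031A (0xCC 0x9A).
std::string collect_split_label()
{
    TextCollector collector;
    collector.on_start_element("label");
    collector.on_characters("g");
    collector.on_characters("\xCC\x9A");
    collector.on_end_element("label");
    return collector.texts().front();
}
```

With this pattern the handler sees the complete 3-byte text once per element, no matter how many chunks the parser delivers. In a real xmlpp::SaxParser subclass the same buffering would go into the overridden callbacks.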
(In reply to comment #8)
> is this really a bug? Has this been reported to libxml2 and/or fixed?
It has been reported as libxml2 bug 619302. I don't think it has been fixed; that bug is still open. No libxml2 developer has commented on it. I don't know whether the libxml2 developers consider it a bug, but I guess not.
libxml++ has moved to https://github.com/libxmlplusplus/libxmlplusplus. If this ticket is still valid in a recent version of libxml++, then please create a ticket at https://github.com/libxmlplusplus/libxmlplusplus/issues - thanks a lot!