Bug 539368 - 3-byte direct UTF8 triggers 2 calls to xmlpp::SaxParser::on_characters(const Glib::ustring&)
Status: RESOLVED OBSOLETE
Product: libxml++
Classification: Bindings
Component: SAX Parser
Version: 2.6.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Christophe de Vienne
QA Contact: Christophe de Vienne
Depends on: 619302
Blocks:
 
 
Reported: 2008-06-20 20:59 UTC by Elaine Tsiang YueLien
Modified: 2020-11-12 09:29 UTC
See Also:
GNOME target: ---
GNOME version: 2.21/2.22


Attachments
sax_parser_bug.patch (386 bytes, text/plain)
2008-06-21 08:57 UTC, Murray Cumming

Description Elaine Tsiang YueLien 2008-06-20 20:59:24 UTC
Encountered with 3-byte diacriticized IPA symbols.

Example:

<label>g̚</label>

results in a first call for "g", and a second call for "̚". This does not happen with 4-byte coded diacriticized symbols.
Comment 1 Murray Cumming 2008-06-21 08:56:13 UTC
I think I can confirm this. With the attached patch to the SaxParser example, I get this output:

node name=gjob:Application
on_characters(): g
on_characters(): on_end_element()
on_characters(): 
      
node name=gjob:Category
MySaxParser::on_characters(): Exception caught while converting text for std::cout: Invalid byte sequence in conversion input
on_characters(): Development
on_end_element()
on_characters():

Presumably something is looking at the number of characters when it should be looking at the number of bytes, or vice versa.

Comment 2 Murray Cumming 2008-06-21 08:57:40 UTC
Created attachment 113155
sax_parser_bug.patch
Comment 3 Murray Cumming 2010-03-30 16:13:09 UTC
CCing Daniel in case he can figure it out.
Comment 4 Murray Cumming 2010-03-30 16:14:14 UTC
Note that I don't see that exception any more, probably because of my recent locale initialization fix in the examples, but I do see the two on_characters() calls where there should be one.
Comment 5 Daniel Elstner 2010-03-30 17:00:25 UTC
Caveat: I don't know the libxml or libxml++ API, so I have no idea how it is meant to work.

That being said, I think what you are seeing here is a bug in libxml, if on_characters() is supposed to deliver blocks of text rather than single code points.  It's not a question of bytes vs characters, but one of code points vs glyphs, i.e. if and how combining characters are handled.
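
The distinction between bytes, code points and glyphs can be checked on the example from the description. The following is only a minimal sketch: it assumes the label text is "g" followed by U+031A (COMBINING LEFT ANGLE ABOVE), which in UTF-8 is the byte sequence 67 CC 9A. The helper name count_code_points and the variable label_text are hypothetical, and the helper assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
  std::size_t count = 0;
  for (unsigned char byte : utf8)
  {
    if ((byte & 0xC0) != 0x80)
      ++count;
  }
  return count;
}

// Assumed content of <label>: "g" (U+0067) followed by
// U+031A COMBINING LEFT ANGLE ABOVE, written out as raw UTF-8 bytes.
const std::string label_text = "g\xCC\x9A";
```

Here label_text.size() is 3 (bytes) while count_code_points(label_text) is 2 (code points), and the two code points are displayed as one glyph. A split between the base letter and the combining mark therefore falls on a code point boundary, and each piece is still valid UTF-8 on its own.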
Comment 6 Murray Cumming 2010-05-04 15:23:31 UTC
Daniel, please add this to your list of things to look at for Openismus. I doubt that libxml (compared to libxml++) would get this wrong, but I guess you can find out.
Comment 7 Daniel Elstner 2010-05-21 15:42:27 UTC
It's a libxml bug, I reproduced it with git master of libxml2 and the testSAX program.
Comment 8 Quentin Pradet 2012-10-25 06:56:01 UTC
I know it's a 2010 bug but... is this really a bug? Has this been reported to libxml2 and/or fixed?

In the Java SAX API, both org.xml.sax.DocumentHandler.characters() [0] and org.xml.sax.ContentHandler.characters() [1] documentations say that "The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks".

I got the same issue with two-byte UTF8 characters (e.g. "é"), and "fixed" it by using a stringstream to append every chunk received between on_start_element and on_end_element.

[0] http://docs.oracle.com/javase/6/docs/api/org/xml/sax/DocumentHandler.html#characters(char[], int, int)
[1] http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html#characters(char[], int, int)
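
The chunk-appending workaround described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code actually used: TextCollector is a hypothetical stand-in class that does not derive from xmlpp::SaxParser, its callbacks only mirror the libxml++ callback names (taking std::string and omitting the attribute list), it appends to a std::string rather than a stringstream, and it only handles leaf elements without nested children.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a SAX handler. It mimics the callback names
// used by libxml++ but is NOT derived from xmlpp::SaxParser.
class TextCollector
{
public:
  // Start of an element: remember its name and begin with an empty buffer.
  void on_start_element(const std::string& name)
  {
    current_name_ = name;
    buffer_.clear();
  }

  // May be called several times for one run of text: append, never replace.
  void on_characters(const std::string& chunk)
  {
    buffer_ += chunk;
  }

  // End of the element: only now is the accumulated text complete.
  void on_end_element(const std::string& name)
  {
    if (name == current_name_)
      results_.emplace_back(name, buffer_);
    current_name_.clear();
    buffer_.clear();
  }

  // Pairs of (element name, complete text) for every leaf element seen.
  const std::vector<std::pair<std::string, std::string>>& results() const
  {
    return results_;
  }

private:
  std::string current_name_;
  std::string buffer_;
  std::vector<std::pair<std::string, std::string>> results_;
};
```

Feeding it on_start_element("label"), then on_characters("g") and on_characters("\xCC\x9A"), then on_end_element("label") yields a single result whose text is the full three-byte string, regardless of how the parser chose to split the character data.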
Comment 9 Kjell Ahlstedt 2012-12-09 19:23:35 UTC
(In reply to comment #8)
> is this really a bug? Has this been reported to libxml2 and/or fixed?

It has been reported as libxml2 bug 619302. I don't think it has been fixed;
that bug is still open. No libxml2 developer has commented on it. I don't know
whether the libxml2 developers consider it a bug, but I guess not.
Comment 10 André Klapper 2020-11-12 09:29:07 UTC
libxml++ has moved to https://github.com/libxmlplusplus/libxmlplusplus

If this ticket is still valid in a recent version of libxml++, then please create a ticket at https://github.com/libxmlplusplus/libxmlplusplus/issues - thanks a lot!