Bug 539368 – 3-byte direct UTF8 triggers 2 calls to xmlpp::SaxParser::on_characters(const Glib::ustring&)

Summary:	3-byte direct UTF8 triggers 2 calls to xmlpp::SaxParser::on_characters(const ...


Status:	RESOLVED OBSOLETE

Product:	libxml++
Classification:	Bindings
Component:	SAX Parser
Version:	2.6.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Christophe de Vienne
QA Contact:	Christophe de Vienne

URL:
Whiteboard:

Depends on:	619302
Blocks:

Reported:	2008-06-20 20:59 UTC by Elaine Tsiang YueLien
Modified:	2020-11-12 09:29 UTC

See Also:
GNOME target:	---
GNOME version:	2.21/2.22

Attachments
sax_parser_bug.patch (386 bytes, text/plain) 2008-06-21 08:57 UTC, Murray Cumming

Description Elaine Tsiang YueLien 2008-06-20 20:59:24 UTC

Encountered with 3-byte diacriticized IPA symbols.

Example:

<label>g̚</label>

results in a first call for "g", and a second call for "̚". This does not happen with 4-byte coded diacriticized symbols.

Comment 1 Murray Cumming 2008-06-21 08:56:13 UTC

I think I can confirm this. With the attached patch to the SaxParser example, I get this output:

node name=gjob:Application
on_characters(): g
on_characters(): on_end_element()
on_characters(): 
      
node name=gjob:Category
MySaxParser::on_characters(): Exception caught while converting text for std::cout: Invalid byte sequence in conversion input
on_characters(): Development
on_end_element()
on_characters():

Presumably something is looking at the number of characters when it should be looking at the number of bytes, or vice versa.

Comment 2 Murray Cumming 2008-06-21 08:57:40 UTC

Created attachment 113155
sax_parser_bug.patch

Comment 3 Murray Cumming 2010-03-30 16:13:09 UTC

CCing Daniel in case he can figure it out.

Comment 4 Murray Cumming 2010-03-30 16:14:14 UTC

Note that I don't see that exception any more, probably because of my recent locale initialization fix in the examples, but I do still see two on_characters() calls where there should be one.

Comment 5 Daniel Elstner 2010-03-30 17:00:25 UTC

Caveat: I don't know the libxml or libxml++ API, so I have no idea how it is meant to work.

That being said, I think what you are seeing here is a bug in libxml, if on_characters() is supposed to deliver blocks of text rather than single code points.  It's not a question of bytes vs characters, but one of code points vs glyphs, i.e. if and how combining characters are handled.
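
To make the bytes / code points / glyphs distinction concrete: assuming the diacritic in the reported label is U+031A, the text g̚ is three UTF-8 bytes and two code points (U+0067 followed by the combining mark), rendered as a single glyph. A minimal standalone sketch that counts the first two; it uses neither libxml++ nor Glib, the name count_code_points is made up for illustration, and it assumes well-formed UTF-8 input:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libxml++ or Glib): counts Unicode code
// points in well-formed UTF-8 by skipping continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}

// "g" (U+0067, 1 byte) followed by U+031A (assumed; 2 bytes: 0xCC 0x9A).
const std::string kLabelText = "g\xCC\x9A";
```

Here kLabelText.size() gives the byte count and count_code_points(kLabelText) the code point count. A split between the two on_characters() calls that falls after "g" lands on a code point boundary, which fits the observation that each call still receives valid text.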

Comment 6 Murray Cumming 2010-05-04 15:23:31 UTC

Daniel, please add this to your list of things to look at for Openismus. I doubt that libxml (compared to libxml++) would get this wrong, but I guess you can find out.

Comment 7 Daniel Elstner 2010-05-21 15:42:27 UTC

It's a libxml bug, I reproduced it with git master of libxml2 and the testSAX program.

Comment 8 Quentin Pradet 2012-10-25 06:56:01 UTC

I know it's a 2010 bug but... is this really a bug? Has this been reported to libxml2 and/or fixed?

In the Java SAX API, the documentation for both org.xml.sax.DocumentHandler.characters() [0] and org.xml.sax.ContentHandler.characters() [1] says that "The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks".

I got the same issue with two-byte UTF8 characters (e.g. "é"), and "fixed" it by using a stringstream to append every chunk received between on_start_element and on_end_element.

[0] http://docs.oracle.com/javase/6/docs/api/org/xml/sax/DocumentHandler.html#characters(char[], int, int)
[1] http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html#characters(char[], int, int)
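
The buffering workaround described above could look roughly like this. It is a sketch only: the TextAccumulator class is hypothetical and deliberately does not derive from xmlpp::SaxParser, so that it compiles on its own, and its callbacks take std::string rather than Glib::ustring. It also appends to a plain std::string instead of a stringstream, which has the same effect here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a SAX handler; the method names mirror the
// callbacks mentioned in this report, but the signatures are simplified.
class TextAccumulator {
public:
    void on_start_element(const std::string& /*name*/) {
        buffer_.clear();  // start collecting text for the new element
    }

    void on_characters(const std::string& chunk) {
        buffer_ += chunk;  // the parser may deliver the text in several chunks
    }

    void on_end_element(const std::string& name) {
        // Only now is the element's text known to be complete.
        results_.emplace_back(name, buffer_);
        buffer_.clear();
    }

    // (element name, full text) pairs in the order the elements were closed.
    const std::vector<std::pair<std::string, std::string>>& results() const {
        return results_;
    }

private:
    std::string buffer_;
    std::vector<std::pair<std::string, std::string>> results_;
};
```

Feeding it on_start_element("label"), then on_characters("g") and on_characters("\xCC\x9A"), then on_end_element("label") yields a single entry whose text is the whole label. This simple version only handles leaf elements; mixed content or nested elements would need a stack of buffers.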

Comment 9 Kjell Ahlstedt 2012-12-09 19:23:35 UTC

(In reply to comment #8)
> is this really a bug? Has this been reported to libxml2 and/or fixed?

It has been reported as libxml2 bug 619302. I don't think it has been fixed;
that bug is still open. No libxml2 developer has commented on it. I don't know
if the libxml2 developers consider it a bug, but I guess not.

Comment 10 André Klapper 2020-11-12 09:29:07 UTC

libxml++ has moved to https://github.com/libxmlplusplus/libxmlplusplus

If this ticket is still valid in a recent version of libxml++, then please create a ticket at https://github.com/libxmlplusplus/libxmlplusplus/issues - thanks a lot!