Bug 619302 - Characters content split just before combining character
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks: 539368
 
 
Reported: 2010-05-21 15:29 UTC by Daniel Elstner
Modified: 2021-07-05 13:22 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Test XML file with combining character sequence (69 bytes, text/xml)
2010-05-21 15:29 UTC, Daniel Elstner
Test XML file with combining character sequences and other stuff (107 bytes, application/xml)
2011-12-13 15:09 UTC, Kjell Ahlstedt

Description Daniel Elstner 2010-05-21 15:29:54 UTC
Created attachment 161657
Test XML file with combining character sequence

If a text element contains the combining character U+031A (COMBINING LEFT ANGLE ABOVE), the SAX parser splits the text at that point. The characters callback is invoked twice: the first call supplies the portion of the text that ends with the base character just before U+031A, and the second call supplies the remainder, beginning with the combining character.

There is no reason why the text should be split at that point, and splitting combining sequences in half makes subsequent processing really difficult.

The attached test XML file can be used to demonstrate the bug when fed into testSAX.
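
Because the characters callback can deliver one run of text in several pieces, a consumer can sidestep the problem by concatenating the pieces and processing the text only when the element ends. Below is a minimal sketch of that accumulation. The `text_buf` type and the `on_characters` and `text_buf_reset` functions are hypothetical names, `unsigned char` stands in for `xmlChar`, and the wiring into a libxml2 SAX handler is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator for the text of one element. */
typedef struct {
    unsigned char *data;  /* collected bytes, NUL-terminated once non-NULL */
    size_t len;           /* number of bytes collected, excluding the NUL */
    size_t cap;           /* allocated size of data */
} text_buf;

/* Same shape as a SAX characters callback: (ctx, chunk, length).
 * Appends the chunk instead of processing it immediately. */
static void on_characters(void *ctx, const unsigned char *ch, int len)
{
    text_buf *buf = ctx;
    assert(buf != NULL && len >= 0);

    size_t need = buf->len + (size_t)len + 1;  /* +1 for the terminating NUL */
    if (need > buf->cap) {
        size_t cap = buf->cap ? buf->cap : 64;
        while (cap < need)
            cap *= 2;
        unsigned char *grown = realloc(buf->data, cap);
        if (grown == NULL)
            abort();  /* out of memory: kept simple for the sketch */
        buf->data = grown;
        buf->cap = cap;
    }
    if (len > 0)
        memcpy(buf->data + buf->len, ch, (size_t)len);
    buf->len += (size_t)len;
    buf->data[buf->len] = '\0';
}

/* Call from the end-element handler once the complete text has been used. */
static void text_buf_reset(text_buf *buf)
{
    free(buf->data);
    buf->data = NULL;
    buf->len = buf->cap = 0;
}
```

With this approach the two calls described above simply end up in one buffer, so the combining sequence is intact again by the time the end-element handler reads it.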
Comment 1 Murray Cumming 2010-06-13 19:48:35 UTC
Daniel, is this something you can fix?
Comment 2 Daniel Elstner 2010-06-13 20:37:44 UTC
No idea, I'll have to take a look at the code.  If any of the libxml developers could provide me with hints as to where to look, it would be very much appreciated.
Comment 3 Kjell Ahlstedt 2011-12-13 15:09:39 UTC
Created attachment 203344
Test XML file with combining character sequences and other stuff

I'm not a libxml developer, but I've used gdb to see from where the characters
callback function charactersDebug() in testSAX.c is called. The text is split
by some functions in parser.c:
  xmlParseCharData()
  xmlParseCharDataComplex()
  xmlParseReference()

There are several locations where the text is split.
- Each character reference in the form &qqq; is converted to a UTF-8 character
  and sent in a separate call to charactersDebug().
- If the text starts with ASCII characters, followed by one or more non-ASCII
  characters, the text is split before the first non-ASCII character.
- xmlParseCharDataComplex() copies the parsed characters to a local fixed-size
  buffer. When the buffer is full, charactersDebug() is called with the
  characters parsed so far.
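
The second case can be modelled in a few lines. This is only a sketch of the observed split position, not libxml2 code; `ascii_prefix_len` is a hypothetical helper that returns the index of the first non-ASCII byte.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of leading ASCII bytes (< 0x80) in text[0..len).
 * In the behaviour described above, this is where the text gets split. */
static size_t ascii_prefix_len(const unsigned char *text, size_t len)
{
    assert(text != NULL || len == 0);
    size_t i = 0;
    while (i < len && text[i] < 0x80)
        i++;
    return i;
}
```

For the bytes of "g", U+031A, "bar" (67 CC 9A 62 61 72) the helper returns 1, which corresponds to a first chunk of length 1 and a second chunk of length 5.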

libxml does not handle combining characters specially. If a combining character
is preceded by only ASCII characters, the text will be split before the
combining character.
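
A consumer that wants to notice such a split could test whether a chunk begins with a combining mark. The following is a rough sketch, assuming UTF-8 input and checking only the Combining Diacritical Marks block (U+0300 to U+036F), which contains U+031A. `starts_with_combining_mark` is a hypothetical helper, not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the UTF-8 chunk starts with a code point in the
 * Combining Diacritical Marks block (U+0300..U+036F), else 0.
 * Combining characters in other blocks are deliberately not covered. */
static int starts_with_combining_mark(const unsigned char *ch, size_t len)
{
    assert(ch != NULL || len == 0);
    if (len < 2)
        return 0;
    /* Every code point in the block is a two-byte sequence: 110xxxxx 10xxxxxx */
    if ((ch[0] & 0xE0) != 0xC0 || (ch[1] & 0xC0) != 0x80)
        return 0;
    unsigned int cp = ((unsigned int)(ch[0] & 0x1F) << 6) | (ch[1] & 0x3F);
    return cp >= 0x0300 && cp <= 0x036F;
}
```

A chunk for which this returns 1 belongs with the end of the previous chunk, so it should be joined to it before any further processing.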

The command
  testSAX --sax2 testcase3.xml
where testcase3.xml is the attached XML file, produces the following output:

SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElementNs(testcase, NULL, NULL, 0, 0, 0)
SAX.characters(
	, 2)
SAX.startElementNs(test, NULL, NULL, 0, 0, 0)
SAX.characters(f, 1)
SAX.characters(€, 3)
SAX.characters(>, 1)
SAX.characters(g, 1)
SAX.characters(̚bar, 5)
SAX.endElementNs(test, NULL, NULL)
SAX.characters(
	, 2)
SAX.startElementNs(test, NULL, NULL, 0, 0, 0)
SAX.characters( f o, 4)
SAX.characters(ög̚bar , 9)
SAX.endElementNs(test, NULL, NULL)
SAX.characters(
, 1)
SAX.endElementNs(testcase, NULL, NULL)
SAX.endDocument()
Comment 4 GNOME Infrastructure Team 2021-07-05 13:22:09 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.