GNOME Bugzilla – Bug 619302
Characters content split just before combining character
Last modified: 2021-07-05 13:22:09 UTC
Created attachment 161657: Test XML file with combining character sequence

If a text element contains the combining character U+031A, the SAX parser splits the text at that point. The characters callback is invoked twice: the first call supplies the text up to and including the base character just before U+031A, and the second call supplies the remainder, beginning with the combining character. There is no reason for the text to be split there, and splitting a combining sequence in half makes subsequent processing difficult. The attached test XML file demonstrates the bug when fed into testSAX.
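To make the symptom easier to spot in a handler, here is a minimal sketch of a check that a characters chunk begins with a combining mark. The helper name starts_with_combining_mark() is hypothetical and not part of libxml2. It only covers the Combining Diacritical Marks block (U+0300 to U+036F), which contains U+031A, and it assumes the chunk is UTF-8, as passed to the SAX characters callback.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Returns 1 if the UTF-8 chunk begins with a code point in
 * U+0300..U+036F (Combining Diacritical Marks), otherwise 0.
 * Other combining ranges are deliberately not covered in this sketch. */
static int starts_with_combining_mark(const unsigned char *ch, int len)
{
    if (ch == NULL || len < 2)
        return 0;
    /* U+0300..U+033F encode as 0xCC 0x80..0xBF */
    if (ch[0] == 0xCC && ch[1] >= 0x80 && ch[1] <= 0xBF)
        return 1;
    /* U+0340..U+036F encode as 0xCD 0x80..0xAF */
    if (ch[0] == 0xCD && ch[1] >= 0x80 && ch[1] <= 0xAF)
        return 1;
    return 0;
}
```

A chunk for which this returns 1 cannot be processed on its own, because its base character was delivered in the previous call.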
Daniel, is this something you can fix?
No idea, I'll have to take a look at the code. If any of the libxml developers could provide me with hints as to where to look, it would be very much appreciated.
Created attachment 203344: Test XML file with combining character sequences and other stuff

I'm not a libxml developer, but I've used gdb to see where the characters callback function charactersDebug() in testSAX.c is called from. The text is split by these functions in parser.c:

    xmlParseCharData()
    xmlParseCharDataComplex()
    xmlParseReference()

There are several places where text is split:

 - Each character reference of the form &qqq; is converted to a UTF-8 character and sent in a separate call to charactersDebug().
 - If the text starts with ASCII characters followed by one or more non-ASCII characters, the text is split before the first non-ASCII character.
 - xmlParseCharDataComplex() copies the parsed characters to a local fixed-size buffer. When the buffer is full, charactersDebug() is called with the characters parsed so far.

libxml does not handle combining characters specially. If a combining character is preceded only by ASCII characters, the text is split before the combining character.

The command

    testSAX --sax2 testcase3.xml

where testcase3.xml is the attached XML file, produces the following output:

    SAX.setDocumentLocator()
    SAX.startDocument()
    SAX.startElementNs(testcase, NULL, NULL, 0, 0, 0)
    SAX.characters( , 2)
    SAX.startElementNs(test, NULL, NULL, 0, 0, 0)
    SAX.characters(f, 1)
    SAX.characters(€, 3)
    SAX.characters(>, 1)
    SAX.characters(g, 1)
    SAX.characters(̚bar, 5)
    SAX.endElementNs(test, NULL, NULL)
    SAX.characters( , 2)
    SAX.startElementNs(test, NULL, NULL, 0, 0, 0)
    SAX.characters( f o, 4)
    SAX.characters(ög̚bar , 9)
    SAX.endElementNs(test, NULL, NULL)
    SAX.characters( , 1)
    SAX.endElementNs(testcase, NULL, NULL)
    SAX.endDocument()
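Since the parser may deliver one run of text in several chunks, a consumer can work around the split by collecting chunks until the next element event. The following is only a sketch of that idea, not a fix in libxml itself. TextAccum, accum_characters() and accum_take() are hypothetical names. The callback has the same shape as the SAX characters callback (context pointer, UTF-8 bytes, byte length), with xmlChar written out as unsigned char so the block compiles without the libxml headers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator, not part of libxml2. */
typedef struct {
    unsigned char *data; /* NUL-terminated UTF-8 bytes collected so far */
    size_t len;          /* number of bytes, excluding the terminator */
    size_t cap;          /* allocated size of data */
} TextAccum;

/* Same shape as a SAX characters callback: append the chunk
 * instead of processing it immediately. */
static void accum_characters(void *ctx, const unsigned char *ch, int len)
{
    TextAccum *acc = ctx;
    size_t need;

    if (len <= 0)
        return;
    need = acc->len + (size_t)len + 1;
    if (need > acc->cap) {
        size_t cap = acc->cap ? acc->cap : 64;
        unsigned char *p;
        while (cap < need)
            cap *= 2;
        p = realloc(acc->data, cap);
        if (p == NULL)
            return; /* out of memory: this sketch simply drops the chunk */
        acc->data = p;
        acc->cap = cap;
    }
    memcpy(acc->data + acc->len, ch, (size_t)len);
    acc->len += (size_t)len;
    acc->data[acc->len] = '\0';
}

/* Call from the start/end element handlers: hands over the complete
 * text run and resets the accumulator. The caller frees the result,
 * which is NULL if no text was collected. */
static unsigned char *accum_take(TextAccum *acc, size_t *out_len)
{
    unsigned char *text = acc->data;

    if (out_len != NULL)
        *out_len = acc->len;
    acc->data = NULL;
    acc->len = 0;
    acc->cap = 0;
    return text;
}
```

With this in place, the two calls for "g" and "̚bar" from the output above arrive at the application as one 6-byte string, so the combining sequence stays intact regardless of where the parser happened to split.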
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.