GNOME Bugzilla – Bug 633166
XInclude large text: invalid character
Last modified: 2012-08-17 15:03:27 UTC
Created attachment 173245
main.xml and xincluded text.txt that demonstrate the bug

The bug appears when xincluding a particular text file with <xi:include href="text.txt" parse="text"/>.

I have reduced the text file as far as I could; the simplest version that still fails is attached. It contains only:
- 'x' characters (replacing the original alphanumeric and special characters)
- 's' characters (replacing the original whitespace)
- newline characters
- accented characters in UTF-8 encoding (ěščřžý...)

The bug also appears when the file is included with <xi:include href="text.txt" parse="text" encoding="utf-8"/>.

I find the bug mysterious, because the text file is included correctly when text.txt is modified in ANY of the following ways:
- the very first character is removed
- the first line is removed (it contains no special characters)
- the first special character is removed
- the first two special characters are removed
- the first line is moved to the end of the file (including the newline character)

It seems that the size of the included file matters...

I'm testing xinclusion with:
xmllint --xinclude main.xml

I'm using libxml2 v. 2.7.7
Created attachment 214302
Testcase for xinclude parse text
I confirm that the bug is reproducible.

$ xmllint --xinclude test.xml
test.xml:5: element include: XInclude error : test.txt contains invalid char
test.xml:5: element include: XInclude error : could not load test.txt, and no fallback was found
<?xml version="1.0" encoding="utf-8"?>
<test xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="test.txt" parse="text"/>
</test>

Testcase attached.

xmllint: using libxml version 20708
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
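This is a bug in xinclude.c, in xmlXIncludeLoadTxt(). The problem occurs when a multibyte character crosses the boundary of the internal buffer (4000 bytes). The current buffer then ends with an incomplete byte sequence, so the IS_CHAR test fails on it and the file is reported as containing an invalid character. This also explains why removing a single character before the boundary makes the inclusion succeed: it shifts the multibyte character so that it no longer straddles the boundary.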
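The failure mode described above can be illustrated with a small, self-contained sketch. This is not the libxml2 code: utf8_seq_len() and validate_chunked_naive() are hypothetical helpers written for illustration, and the chunk size is simply passed in as a parameter. The sketch assumes that each chunk is validated on its own, with no knowledge of the bytes that follow it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length (1-4) of a complete UTF-8
 * sequence starting at p, or 0 if the 'avail' bytes visible here do not
 * form a complete, well-formed sequence. Only the lead byte and the
 * continuation bytes are checked; this is a simplification. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    size_t need, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        need = 1;
    else if ((p[0] & 0xE0) == 0xC0)
        need = 2;
    else if ((p[0] & 0xF0) == 0xE0)
        need = 3;
    else if ((p[0] & 0xF8) == 0xF0)
        need = 4;
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (need > avail)
        return 0;               /* sequence is cut off by the end of the chunk */
    for (i = 1; i < need; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;
    return need;
}

/* Hypothetical illustration of the bug: every chunk is validated in
 * isolation, so a sequence split across two chunks is rejected.
 * Returns 1 if all chunks validate, 0 otherwise. */
static int validate_chunked_naive(const unsigned char *data, size_t len,
                                  size_t chunk_size)
{
    size_t off = 0;

    assert(chunk_size > 0);
    while (off < len) {
        size_t n = (len - off < chunk_size) ? len - off : chunk_size;
        size_t i = 0;

        while (i < n) {
            size_t l = utf8_seq_len(data + off + i, n - i);
            if (l == 0)
                return 0;       /* "contains invalid char" */
            i += l;
        }
        off += n;
    }
    return 1;
}
```

With a chunk size of 4000, an input of 3999 'x' bytes followed by the two-byte character 'ě' (0xC4 0x9B) is rejected, because the first chunk ends with the lone byte 0xC4. Dropping the first 'x' moves the whole character inside the first chunk and the same input is accepted, which matches the behaviour seen in the original report.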
Created attachment 214484
Patch

Added a fallback for a multibyte character at the buffer boundary. If IS_CHAR fails at the current position and that position is within 4 bytes of the end of the buffer, the buffer read is restarted from that position.
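The fallback idea can be sketched in the same simplified setting. Again, this is not the attached patch and not libxml2 code: utf8_seq_len() and validate_chunked_carry() are hypothetical names, and the sketch validates an in-memory array rather than reading from a parser input buffer. The assumption is that a failure within the last few bytes of a chunk, with more input still to come, is treated as "incomplete" rather than "invalid", and the next chunk starts at the failing position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length (1-4) of a complete UTF-8 sequence at p,
 * or 0 if the 'avail' visible bytes do not form one (simplified check). */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    size_t need, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        need = 1;
    else if ((p[0] & 0xE0) == 0xC0)
        need = 2;
    else if ((p[0] & 0xF0) == 0xE0)
        need = 3;
    else if ((p[0] & 0xF8) == 0xF0)
        need = 4;
    else
        return 0;

    if (need > avail)
        return 0;
    for (i = 1; i < need; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;
    return need;
}

/* Hypothetical sketch of the fallback: when validation fails fewer than
 * 4 bytes before the end of a chunk and more input remains, restart the
 * next chunk at the failing position instead of reporting an error.
 * Returns 1 if the whole input validates, 0 otherwise. */
static int validate_chunked_carry(const unsigned char *data, size_t len,
                                  size_t chunk_size)
{
    size_t off = 0;

    assert(chunk_size >= 4);    /* a chunk must be able to hold one sequence */
    while (off < len) {
        size_t n = (len - off < chunk_size) ? len - off : chunk_size;
        size_t i = 0;

        while (i < n) {
            size_t l = utf8_seq_len(data + off + i, n - i);
            if (l == 0) {
                /* Possibly a sequence cut by the chunk boundary: retry
                 * from this position with a fresh chunk. */
                if (i > 0 && n - i < 4 && off + n < len)
                    break;
                return 0;       /* genuinely invalid or truncated input */
            }
            i += l;
        }
        off += i;               /* consume only the bytes that were validated */
    }
    return 1;
}
```

Because only validated bytes are consumed, nothing is processed twice when a chunk is restarted. Input that is really malformed is still rejected: the retry happens at most once per position, and a failure at the start of a chunk or at the end of the input is reported as an error.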
Created attachment 215055
Actual patch

Attached a fixed version of the patch (it no longer duplicates buffer content). Thanks to Alexey Ponomarev <ponomarev@yandex-team.ru> for his comments.
Patch applied, thanks a lot Vitaly!

http://git.gnome.org/browse/libxml2/commit/?id=dce1c8baaeaa4f23874c59da91d9ecc0e31a787c

Daniel