After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 739574 - Check serialized character data for validity
Check serialized character data for validity
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
git master
Other Linux
: Normal normal
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
: 337766 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2014-11-03 16:33 UTC by mirabilos
Modified: 2021-07-05 13:25 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test program / reproducer (891 bytes, text/x-csrc)
2014-11-03 16:33 UTC, mirabilos
  Details
patch v1 fixing this issue (9.22 KB, patch)
2014-11-24 13:16 UTC, mirabilos
none Details | Review
patch v2 fixing this issue (9.33 KB, patch)
2016-06-07 13:47 UTC, mirabilos
none Details | Review

Description mirabilos 2014-11-03 16:33:18 UTC
Created attachment 289919 [details]
test program / reproducer

I was debugging, for hours, why a PHP application (using the built-in SOAP extension) produced the following exception on a Java™ server:

org.jboss.ws.core.CommonSOAPFaultException: org.xml.sax.SAXParseException; lineNumber: 11; columnNumber: 388; An invalid XML character (Unicode: 0x8) was found in the element content of the document.

I’m aware that some chars (such as most C0 control characters, and others) need to be escaped as e.g. “” in XML, but I thought the SOAP extension would do that for me. So I asked coworkers, dug out SoapUI, and debugged further. Turns out that the on-the-wire SOAP that PHP created was bad.

So I dug into the PHP source and saw it using libxml2 functions for that: xmlNewTextLen and xmlDocDumpMemory, specifically.

This led to me writing a small test program (copying insignificantly from the libxml documentation and PHP) to exercise the encoding of text. Turns out that things like “<” are properly escaped, but C0 control characters (at least those I tested) aren’t.

tglase@tglase:~ $ printf '<m\bh&n>' | hd
00000000  3c 6d 08 68 26 6e 3e                              |<m.h&n>|
00000007
tglase@tglase:~ $ ./x $(printf '<m\bh&n>') | hd
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  65 6e 76 3a 45 6e 76 65  |F-8"?>.<env:Enve|
00000030  6c 6f 70 65 20 78 6d 6c  6e 73 3a 65 6e 76 3d 22  |lope xmlns:env="|
00000040  68 74 74 70 3a 2f 2f 77  77 77 2e 77 33 2e 6f 72  |http://www.w3.or|
00000050  67 2f 32 30 30 33 2f 30  35 2f 73 6f 61 70 2d 65  |g/2003/05/soap-e|
00000060  6e 76 65 6c 6f 70 65 22  3e 3c 65 6e 76 3a 42 6f  |nvelope"><env:Bo|
00000070  64 79 3e 26 6c 74 3b 6d  08 68 26 61 6d 70 3b 6e  |dy>&lt;m.h&amp;n|
00000080  26 67 74 3b 3c 2f 65 6e  76 3a 42 6f 64 79 3e 3c  |&gt;</env:Body><|
00000090  2f 65 6e 76 3a 45 6e 76  65 6c 6f 70 65 3e 0a     |/env:Envelope>.|
0000009f

As you can see in the output, “<m[BS]h&n>” becomes “&lt;m[BS]h&amp;n&gt;”. I would have expected, for example, “&lt;m&#8;h&amp;n&gt;”.

The test program is attached.
Comment 1 mirabilos 2014-11-03 16:46:34 UTC
(In reply to comment #0)

> As you can see in the output, “<m[BS]h&n>” becomes “&lt;m[BS]h&amp;n&gt;”. I
> would have expected, for example, “&lt;m&#8;h&amp;n&gt;”.

Or a CDATA section. Or &#x8; or something.
Comment 2 mirabilos 2014-11-19 10:40:00 UTC
After some research – http://stackoverflow.com/a/5742563/2171120 – I found out that the standard – http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char – says that, besides the XML text being required to be valid UTF-8 (or UTF-16), even after CDATA and entity decoding, only the following Unicode codepoints are valid in the document:

• U+0009
• U+000A
• U+000D
• U+0020‥U+D7FF
• U+E000‥U+FFFD
• U-00010000‥U-0010FFFF

Another author – http://blog.mark-mclaren.info/2007/02/invalid-xml-characters-when-valid-utf8_5873.html – recommends just stripping out the offending characters.

I’d suggest libxml should do the same and silently lose these characters when passed from an application.
Comment 3 mirabilos 2014-11-21 14:16:14 UTC
OK, I can see two codepaths in xmlsave.c which convert to entities:

• xmlEscapeEntities(), triggered by the example program already

• xmlAttrSerializeTxtContent(), not triggered by it.

This diff makes the example program do trigger the second case:

--- attachment.cgi      2014-11-21 15:15:10.578833538 +0100
+++ x.c 2014-11-21 15:15:50.907376752 +0100
@@ -26,2 +26,3 @@
         body = xmlNewChild(envelope, ns, "Body", NULL);
+       xmlNewProp(body, "fnord", argv[2]);
        text = xmlNewTextLen(argv[1], strlen(argv[1]));

Now it requires two arguments.
Comment 4 mirabilos 2014-11-21 14:41:01 UTC
• xmlEncodeEntitiesInternal() in entities.c may not issue &#x8; in latin1 mode either

• xmlEscapeContent() in xmlIO.c is the actual culprit for argv[1]

Working on it…
Comment 5 mirabilos 2014-11-21 14:52:53 UTC
Looking further: several places are missing the UTF-8 “decoder” completely, *and* it doesn’t ensure minimal encoding, making libxml2 vulnerable to character aliasing attacks.
Comment 6 mirabilos 2014-11-24 13:16:56 UTC
Created attachment 291355 [details] [review]
patch v1 fixing this issue

I’ve attached a patch replacing many uses of an ad-hōc UTF-8 decoder with a sane one, and using this decoder in more places to detect invalid things (not IS_CHAR or IS_BYTE_CHAR).

Applying this patch to Debian’s libxml2 (2.9.2+dfsg1-1) makes the test program I’ve attached earlier succeed.

Development of this patch has been sponsored by tarent solutions GmbH and one of its customers, and thus, German tax money ☺
Comment 7 mirabilos 2014-11-27 09:47:37 UTC
The patch needs feedback/review. One thing I found (backporting this to squeeze and wheezy in the meantime) is the patch to xmlIO.c:

+               if (outend - out < wl) break;

This should be (-Wsign-compare):

+               if ((size_t)(outend - out) < wl) break;


Also, I’ve been thinking that the “just skip the offending character” should be moved a bit higher, to _before_ doc->encoding is changed. (That got to be a huge memory leak, by the way. The more I look at libxml2 code………)
Comment 8 Kurt Roeckx 2015-09-15 18:04:15 UTC
I actually want to store some of those control codes in an XML file.  I find it unfortunate how the XML specification is written to exclude all of the C0 codes except for tab, CR and LF.  The unicode stardard it refers to is very explicit that it should just work and assigns a specific meaning to some of them, see chapter 23.1.

Skipping those characters silently corrupts data, I think that it just unacceptable behaviour.  The current version allows me to write it, it just doesn't want to read it back in.
Comment 9 mirabilos 2015-09-16 07:44:32 UTC
Yes, me too… at least escaped in the &#x0008; form… but, meh.

The bigger problem at hand here is that *other* implementations reject it too, such as several Java™ webservices (and the XML standard is the more specific one in here, overriding Unicode). libxml2 is buried beneath layers upon layers in the application I have to maintain, which sends SOAP calls from PHP.

Skipping those characters silently corrupts the input, but the input was not valid XML data in the first place, unfortunately. I thought long to consider aborting instead, but that would cause more trouble than removal of characters that should never have been there in the first place.
Comment 10 Kurt Roeckx 2015-09-17 12:15:37 UTC
&#x0008; doesn't work either, neither libxml2 nor other implementations wants to read that.

However, it seem Unicode has all the characters <= 0x1F also at U+2400 - U+241F, and I can just make use of those for my needs.
Comment 11 mirabilos 2015-09-17 12:20:04 UTC
On Thu, 17 Sep 2015, libxml2 wrote:

> &#x0008; doesn't work either, neither libxml2 nor other implementations wants
> to read that.

Yes, because it’s illegal in XML too. I wrote this because I agree
that outright forbiddiing these chars even encoded like this is… meh.

> However, it seem Unicode has all the characters <= 0x1F also at U+2400 -
> U+241F, and I can just make use of those for my needs.

OK.

(stupid bugzilla not accepting eMail replies…)
Comment 12 mirabilos 2016-06-07 13:47:04 UTC
Created attachment 329275 [details] [review]
patch v2 fixing this issue

I’ve indeed accidentally introduced a regression in the patch.

The attached patch v2 fixes the regression and is rebased against latest Debian sid upload (some fuzziness wrt. the recent security-related fixes, as I also covered one of them).

Please apply!
Comment 13 Daniel Veillard 2016-06-07 14:41:47 UTC
There is already code in libxml2 parser doing what xmlInternalUTF8decode tries to do.
I don't like the idea of silently dropping characters. When someone use libxml2
API and it takes an xmlChar * the contract is that the string is UTF-8 with chars
in the acceptable range for XML.
If you end up with characters in the tree that are not serializable an error should be raised, either relying on the parser on the receiving end (XML is strict for this) or failing the conversion. Dropping is not acceptable for XML
it's okay for HTML though.

Now the question is should libxml2 check all its API entry point to verify 
that xmlChar* strings handled are compliant. I didn't do that for performance,
I rely on the client respecting the API contract, it's way easier and cheaper
for the client side to implement this in general.

http://xmlsoft.org/FAQ.html#Developer question 12.

Now if you passed that data to libxml2 with an API not taking an xmlChar *
but a char * then I would have to fix this.

  thanks,

Daniel
Comment 14 Daniel Veillard 2016-06-07 14:47:28 UTC
BTW the reproducer pass argv[1] to xmlNewTextLen ie does an implicit
cast of char * to xmlChar *and violates the libxml2 API.
Since that code generate like 100 lines of warning and errors (it doesn't
even include the libxml2 headers) that's not a reproducer, sorry ...

Daniel

hinkpad2:~/XML -> make tst
gcc -g   -pedantic -W -Wformat -Wno-format-extra-args -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -Wno-long-long -Iinclude -o tst tst.c .libs/libxml2.a -lpthread -lm -lz -llzma
tst.c: In function ‘main’:
tst.c:4:9: error: unknown type name ‘xmlDoc’
         xmlDoc *doc;
         ^
tst.c:5:9: error: unknown type name ‘xmlNodePtr’
         xmlNodePtr envelope = NULL, body, method = NULL, head = NULL, text;
         ^
tst.c:5:31: error: ‘NULL’ undeclared (first use in this function)
         xmlNodePtr envelope = NULL, body, method = NULL, head = NULL, text;
                               ^
tst.c:5:31: note: each undeclared identifier is reported only once for each function it appears in
tst.c:6:9: error: unknown type name ‘xmlNsPtr’
         xmlNsPtr ns = NULL;
         ^
tst.c:7:5: error: unknown type name ‘xmlChar’
     xmlChar *buf;
     ^
tst.c:10:15: warning: implicit declaration of function ‘xmlNewDoc’ [-Wimplicit-function-declaration]
         doc = xmlNewDoc("1.0");
               ^
tst.c:10:9: warning: nested extern declaration of ‘xmlNewDoc’ [-Wnested-externs]
         doc = xmlNewDoc("1.0");
         ^
tst.c:10:13: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
         doc = xmlNewDoc("1.0");
             ^
tst.c:11:12: error: request for member ‘encoding’ in something not a structure or union
         doc->encoding = xmlCharStrdup("UTF-8");
            ^
tst.c:11:25: warning: implicit declaration of function ‘xmlCharStrdup’ [-Wimplicit-function-declaration]
         doc->encoding = xmlCharStrdup("UTF-8");
                         ^
tst.c:11:9: warning: nested extern declaration of ‘xmlCharStrdup’ [-Wnested-externs]
         doc->encoding = xmlCharStrdup("UTF-8");
         ^
tst.c:12:12: error: request for member ‘charset’ in something not a structure or union
         doc->charset = XML_CHAR_ENCODING_UTF8;
            ^
tst.c:12:24: error: ‘XML_CHAR_ENCODING_UTF8’ undeclared (first use in this function)
         doc->charset = XML_CHAR_ENCODING_UTF8;
                        ^
tst.c:13:16: warning: implicit declaration of function ‘xmlNewDocNode’ [-Wimplicit-function-declaration]
     envelope = xmlNewDocNode(doc, NULL, "Envelope", NULL);
                ^
tst.c:13:5: warning: nested extern declaration of ‘xmlNewDocNode’ [-Wnested-externs]
     envelope = xmlNewDocNode(doc, NULL, "Envelope", NULL);
     ^
tst.c:14:10: warning: implicit declaration of function ‘xmlNewNs’ [-Wimplicit-function-declaration]
     ns = xmlNewNs(envelope, "http://www.w3.org/2003/05/soap-envelope", "env");
          ^
tst.c:14:5: warning: nested extern declaration of ‘xmlNewNs’ [-Wnested-externs]
     ns = xmlNewNs(envelope, "http://www.w3.org/2003/05/soap-envelope", "env");
     ^
tst.c:15:5: warning: implicit declaration of function ‘xmlSetNs’ [-Wimplicit-function-declaration]
     xmlSetNs(envelope, ns);
     ^
tst.c:15:5: warning: nested extern declaration of ‘xmlSetNs’ [-Wnested-externs]
tst.c:16:9: warning: implicit declaration of function ‘xmlDocSetRootElement’ [-Wimplicit-function-declaration]
         xmlDocSetRootElement(doc, envelope);
         ^
tst.c:16:9: warning: nested extern declaration of ‘xmlDocSetRootElement’ [-Wnested-externs]
tst.c:18:16: warning: implicit declaration of function ‘xmlNewChild’ [-Wimplicit-function-declaration]
         body = xmlNewChild(envelope, ns, "Body", NULL);
                ^
tst.c:18:9: warning: nested extern declaration of ‘xmlNewChild’ [-Wnested-externs]
         body = xmlNewChild(envelope, ns, "Body", NULL);
         ^
tst.c:19:12: warning: implicit declaration of function ‘xmlNewTextLen’ [-Wimplicit-function-declaration]
     text = xmlNewTextLen(argv[1], strlen(argv[1]));
            ^
tst.c:19:5: warning: nested extern declaration of ‘xmlNewTextLen’ [-Wnested-externs]
     text = xmlNewTextLen(argv[1], strlen(argv[1]));
     ^
tst.c:19:35: warning: implicit declaration of function ‘strlen’ [-Wimplicit-function-declaration]
     text = xmlNewTextLen(argv[1], strlen(argv[1]));
                                   ^
tst.c:19:35: warning: incompatible implicit declaration of built-in function ‘strlen’
tst.c:19:35: note: include ‘<string.h>’ or provide a declaration of ‘strlen’
tst.c:20:5: warning: implicit declaration of function ‘xmlAddChild’ [-Wimplicit-function-declaration]
     xmlAddChild(body, text);
     ^
tst.c:20:5: warning: nested extern declaration of ‘xmlAddChild’ [-Wnested-externs]
tst.c:22:5: warning: implicit declaration of function ‘xmlDocDumpMemory’ [-Wimplicit-function-declaration]
     xmlDocDumpMemory(doc, &buf, &len);
     ^
tst.c:22:5: warning: nested extern declaration of ‘xmlDocDumpMemory’ [-Wnested-externs]
tst.c:23:5: warning: implicit declaration of function ‘write’ [-Wimplicit-function-declaration]
     write(1, buf, len);
     ^
tst.c:23:5: warning: nested extern declaration of ‘write’ [-Wnested-externs]
tst.c:5:58: warning: unused variable ‘head’ [-Wunused-variable]
         xmlNodePtr envelope = NULL, body, method = NULL, head = NULL, text;
                                                          ^
tst.c:5:43: warning: unused variable ‘method’ [-Wunused-variable]
         xmlNodePtr envelope = NULL, body, method = NULL, head = NULL, text;
                                           ^
tst.c:2:10: warning: unused parameter ‘argc’ [-Wunused-parameter]
 main(int argc, char *argv[])
          ^
Makefile:2917: recipe for target 'tst' failed
make: *** [tst] Error 1
thinkpad2:~/XML ->
Comment 15 mirabilos 2016-06-07 14:56:13 UTC
As for point 1, xmlInternalUTF8decode does not “try” but does its job rather well, unlike the numerous embedded code copies in libxml2 that are all different from each other and just *invite* security bugs.

As for point 2, silently dropping characters was the fastest way I could get the application I was working on to, well, work. If you read above, I did in fact suggest erroring out as an alternative.

As for the API, I actually do not know, as there are several levels of abstractions between the application in question and libxml2 – it’s parsing Kolab and iCalendar/iMip data and sending SOAP requests to some server, and more related things, but my contact with XML is only via (a thin wrapper around) PHP’s SOAP functionality.

The interesting thing here is that
① the iMip input *can* contain some control characters, and
② http://php.net/manual/en/soapclient.soapcall.php does *not* explicitly forbid them. Of course, this is not *your* fault again, just trying to paint the whole picture first.

Dropping the invalid-in-XML valid Unicode characters in libxml2 was the safest way to implement this (it’s running in production, has been for a while) given that libxml2 accepted these characters before and does not always have explicit, easy to call, error handling in those places. Furthermore, doing it in this central place was less code than handling each individual string assigned to an array/structure that COULD, POSSIBLY, be involved in an application-level SOAP call (PHP *is* a pain in the arse, after all).

I’m actually okay with libxml2 returning errors *if* the way to do so was well-defined before already (so that the existing stack on top of libxml2 picks those errors up and handles them correctly). But do note that, even in the case you should decide to do that, my changes to UTF-8 handling are still required, from a security PoV.

However!

As things are, I’m looking at it like this:

• Current upstream/Debian libxml2 silently accepts these characters and sends them out, breaking XML on the wire.
• My patch improves this situation, to continue silently accepting these characters, but making the XML on the wire valid; for valid inputs, nothing changes.
• Returning errors for these characters would be a “level 2” correction with deeper impact on applications.

Therefore I propose to accept my patch as-is as a “first measure”, including to ensure it’ll be included in the _next_ Debian stable release (I’ve already backported it to as far as squeeze, but I’d prefer stopping to build my own libxml2 packages), as it doesn’t break anything on the application layer, and *then* look into changing all the “silent accepts” into error returns *later*, in a development version that needs widespread testing.

The patch v2 as-is could even be considered a security fix… after all, there’s some DoS potential, asides from correcting UTF-8 input decoding.
Comment 16 mirabilos 2016-06-07 14:57:52 UTC
(In reply to Daniel Veillard from comment #14)

> BTW the reproducer pass argv[1] to xmlNewTextLen ie does an implicit
> cast of char * to xmlChar *and violates the libxml2 API.

Ye gods, then please just write one that uses the correct API. As I said, there are tons of layers between me and that arcane API of yours, and I merely tried to read enough of the confounding documentation to get it working at all, in order to make my point visible.
Comment 17 Nick Wellnhofer 2017-06-14 11:50:20 UTC
*** Bug 337766 has been marked as a duplicate of this bug. ***
Comment 18 Nick Wellnhofer 2017-06-14 11:57:33 UTC
I agree with Daniel that we shouldn't change the default behavior of the serializer. Ideally, there would be two new serialization modes. One that checks for UTF-8 and Char validity and returns an error if invalid data was found. Then another one that tries to sanitize invalid data. This is best done by replacing invalid characters with the Unicode replacement character U+FFFD.
Comment 19 GNOME Infrastructure Team 2021-07-05 13:25:21 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.