GNOME Bugzilla – Bug 631551
xmlTextWriterWriteAttribute leaks memory contents when processing malformed utf-8
Last modified: 2011-12-27 09:39:47 UTC
Created attachment 171840 [details]
proof of concept leaker

When xmlTextWriterWriteAttribute is passed invalid UTF-8, it will walk past the end of the string and continue processing the bytes beyond it, leading to a leak of memory contents.

$ cc -I/usr/include/libxml2 xmlwriter.c -o xmlwriter -lxml2
$ ./xmlwriter
<input value="@super secret info"/>

The above should only have shown à and not leaked past it into other memory.
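(The attachment itself is not reproduced here; the following is a minimal sketch of what such a proof of concept presumably looks like. The struct layout and the 0xC3 byte are illustrative assumptions: the attribute value ends in a lone UTF-8 lead byte, and the "secret" string placed next to it in memory stands in for whatever adjacent data would really be leaked.)

/* Hypothetical reconstruction of the PoC: the attribute value is a lone
 * UTF-8 lead byte (0xC3) with no continuation byte before the
 * terminating NUL, so the writer keeps reading into whatever happens to
 * sit next in memory. */
#include <libxml/xmlwriter.h>

int main(void)
{
    /* "secret" sits immediately after "value" so the over-read has
     * something recognisable to leak. */
    struct {
        char value[2];
        char secret[32];
    } buf = { { (char)0xC3, 0 }, "super secret info" };

    xmlTextWriterPtr writer = xmlNewTextWriterFilename("/dev/stdout", 0);
    if (writer == NULL)
        return 1;

    xmlTextWriterStartElement(writer, BAD_CAST "input");
    /* Invalid UTF-8: the 0xC3 lead byte promises a continuation byte
     * that never comes. */
    xmlTextWriterWriteAttribute(writer, BAD_CAST "value", BAD_CAST buf.value);
    xmlTextWriterEndElement(writer);
    xmlTextWriterEndDocument(writer);
    xmlFreeTextWriter(writer);
    return 0;
}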
Seen via PHP too: http://bugs.php.net/bug.php?id=52998
Well, the libxml2 API expects 0-terminated UTF-8 input strings wherever xmlChar * is used for a parameter; the caller has to check this. Checking is actually fairly expensive, and there is no reason to make people who have a correct UTF-8 framework pay the price of that check just because others don't. xmlTextWriterWriteAttribute is just one of the thousand places in the API where that expectation is in place; see item 12 at the end of http://xmlsoft.org/FAQ.html#Developer

Unfortunately, this is not a bug from the libxml2 perspective, but it is an indication that the PHP bindings don't do the UTF-8 checking on input, and that's bad/dangerous.

Daniel
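(For what it's worth, libxml2 itself ships a validity check, xmlCheckUTF8 in xmlstring.h, that a caller or binding could run before handing a string to the writer. A minimal sketch of such a wrapper follows; the helper name write_attribute_checked is made up for illustration and is not libxml2 API.)

#include <libxml/xmlwriter.h>
#include <libxml/xmlstring.h>

/* Hypothetical caller-side wrapper: refuse the call instead of letting
 * the writer walk past a truncated multi-byte sequence. */
static int
write_attribute_checked(xmlTextWriterPtr writer,
                        const xmlChar *name, const xmlChar *content)
{
    if (!xmlCheckUTF8((const unsigned char *) content))
        return -1;                     /* reject invalid UTF-8 up front */
    return xmlTextWriterWriteAttribute(writer, name, content);
}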
Is the proposed solution for PHP to verify every string passed to any libxml2 function for UTF-8 validity? That seems to make no sense: it would be prohibitively expensive and the wrong place to do it. I also don't see how "people who have a correct UTF-8 framework" avoid paying the price of the check, since the only way you suggest to use this API is exactly to run that check every time something is passed to libxml2 - given that it's not going to check itself?

I would suggest instead that libxml2 make a pass through the data while outputting it and ensure the UTF-8 is valid (i.e. does not contain unexpected \0s). Since libxml2 accesses the data anyway (as demonstrated by it walking past the \0 and writing out other data that should not be written), adding checks for valid UTF-8 sequences should not be too expensive, and the checks themselves are not complicated. Is there any reason why this can not be done?
An example of one such framework is GLib-based strings, which are all UTF-8. I don't know how PHP stores the encoding of its strings; if it doesn't, then that's just random data, because an array of bytes without an associated encoding doesn't have any semantics. I usually point at http://www.joelonsoftware.com/articles/Unicode.html which makes a good case for this and explains it better than I can.

So basically, if your programming framework uses a unified encoding for all strings, then you always have to convert to UTF-8 (and if that framework uses UTF-8 you have nothing to do); conversion and checking go along together fine. That's basically something you should do on data input in PHP (i.e. check the encoding of strings, possibly convert). And if PHP has 'strings' but doesn't know what their encoding is, then I'm afraid it's a problem, as there is no way to make good use of the libxml2 API in that context.

The problem is not to try to catch a possible error on output; it's to make sure of the semantics of the operations. At the XML level the encoding must be known, to guarantee that we are using characters from the allowed character set of XML: http://www.w3.org/TR/REC-xml/#NT-Char

Daniel
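(In a GLib-based program the "check or convert at the input boundary" pattern Daniel describes is a couple of calls; a rough illustration, assuming the raw bytes arrive with a known source encoding and with error handling trimmed:)

#include <glib.h>

/* Validate incoming bytes as UTF-8, or convert them from their declared
 * encoding; returns NULL if they are not valid in that encoding either. */
static gchar *
to_utf8_or_null(const gchar *raw, const gchar *from_encoding)
{
    if (g_utf8_validate(raw, -1, NULL))
        return g_strdup(raw);          /* already valid UTF-8 */

    return g_convert(raw, -1, "UTF-8", from_encoding, NULL, NULL, NULL);
}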
You can not know what a string contains if you accept data from outside, unless you specifically check it. In the PHP context that means you are saying that in order to use the libxml2 API we would have to check every string passed to libxml2 for UTF-8 validity. Attaching or not attaching anything to these strings is irrelevant - though PHP strings, just like C strings, have nothing attached to them; it's up to the application to interpret the data in a string. PHP does not enforce, and can not enforce, any encoding, meaning or standard on all strings in the language, since different applications use strings for different purposes.

I do not see why it is not possible for libxml2 to detect the simple situation where it tries to read a UTF-8 sequence and sees \0 instead - \0 is not a valid byte inside a UTF-8 sequence, and UTF-8 is a simple enough encoding that this could be done with a small number of ifs (most of which would never even be executed, since 6-byte UTF-8 sequences are very rare in practice and most sequences are 2-3 bytes, so in practice only 2 or 3 ifs would run). This would also not incur additional performance cost, as libxml2 makes a pass over the data anyway and actually knows enough about UTF-8 to ignore the \0 at the end of the string and read past it when interpreting the data as UTF-8 - so most of the proposed checks are done anyway; it just doesn't check for that specific \0.

This solution would make it possible to operate libxml2 safely, while what you are suggesting - checking every input passed into libxml2 - would have a very negative impact on performance and would still not ensure secure operation of libxml2, as you can never be sure that some string did not slip into libxml2 without verification through some unexpected code path. It would be much more secure to detect the problem at the point where it happens - when libxml2 reads past a \0 while interpreting UTF-8.

Of course, it would be better if libxml2 accepted a string length and ensured that it does not read past it, but if it relies on \0 to signify the end of the string, it should not read past it in any situation. Even if the data passed to it is not valid for XML, it should fail gracefully and not just read random memory. That is not always possible, but in this case it definitely looks possible to check for it.
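(The kind of check this comment has in mind is roughly the following sketch. It only verifies that every lead byte is followed by the continuation bytes it promises before the terminating NUL - the specific over-read in this bug - and is not a full validity check: overlong forms, surrogates and range limits are ignored.)

#include <stddef.h>

/* Walk a NUL-terminated buffer and make sure no multi-byte sequence runs
 * into the terminator.  Not full UTF-8 validation, just the bounds part
 * relevant to this bug. */
static int
utf8_sequences_complete(const unsigned char *s)
{
    while (*s != 0) {
        size_t len;

        if (*s < 0x80)        len = 1;   /* ASCII */
        else if (*s < 0xC0)   return 0;  /* stray continuation byte */
        else if (*s < 0xE0)   len = 2;
        else if (*s < 0xF0)   len = 3;
        else if (*s < 0xF8)   len = 4;
        else                  return 0;  /* 0xF8-0xFF: not valid lead bytes */

        /* Every continuation byte must exist and look like 10xxxxxx;
         * hitting the NUL here is exactly the over-read in this bug. */
        for (size_t i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80)
                return 0;
        s += len;
    }
    return 1;
}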