GNOME Bugzilla – Bug 631551
xmlTextWriterWriteAttribute leaks memory contents when processing malformed utf-8
Last modified: 2011-12-27 09:39:47 UTC
Created attachment 171840 [details]
proof of concept leaker

When xmlTextWriterWriteAttribute is passed invalid UTF-8, it will walk past the end of the string and continue processing the bytes beyond it, leading to a leak of memory contents.

$ cc -I/usr/include/libxml2 xmlwriter.c -o xmlwriter -lxml2
$ ./xmlwriter
<input value="@super secret info"/>

The above should only have shown à and not leaked past it into other memory.
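(The attachment itself is not reproduced here; the following is a minimal sketch of what such a proof of concept presumably looks like. The struct layout and the 0xC3 byte are illustrative assumptions: the attribute value ends in a lone UTF-8 lead byte, and the "secret" string placed next to it in memory stands in for whatever adjacent data would really be leaked.)

/* Hypothetical reconstruction of the PoC: the attribute value is a lone
 * UTF-8 lead byte (0xC3) with no continuation byte before the
 * terminating NUL, so the writer keeps reading into whatever happens to
 * sit next in memory. */
#include <libxml/xmlwriter.h>

int main(void)
{
    /* "secret" sits immediately after "value" so the over-read has
     * something recognisable to leak. */
    struct {
        char value[2];
        char secret[32];
    } buf = { { (char)0xC3, 0 }, "super secret info" };

    xmlTextWriterPtr writer = xmlNewTextWriterFilename("/dev/stdout", 0);
    if (writer == NULL)
        return 1;

    xmlTextWriterStartElement(writer, BAD_CAST "input");
    /* Invalid UTF-8: the 0xC3 lead byte promises a continuation byte
     * that never comes. */
    xmlTextWriterWriteAttribute(writer, BAD_CAST "value", BAD_CAST buf.value);
    xmlTextWriterEndElement(writer);
    xmlTextWriterEndDocument(writer);
    xmlFreeTextWriter(writer);
    return 0;
}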
Seen via PHP too: http://bugs.php.net/bug.php?id=52998
Well, the libxml2 API expects 0-terminated UTF-8 input strings wherever xmlChar * is used for a parameter; the caller has to check this. Checking is actually fairly expensive, and there is no reason to make people who have a correct UTF-8 framework pay the price of that check just because others don't. xmlTextWriterWriteAttribute is just one of the thousand places in the API where that expectation is in place; see item 12 at the end of http://xmlsoft.org/FAQ.html#Developer

Unfortunately, this is not a bug from the libxml2 perspective, but it is an indication that the PHP bindings don't do the UTF-8 checking on input, and that's bad/dangerous.

Daniel
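(For what it's worth, libxml2 itself ships a validity check, xmlCheckUTF8 in xmlstring.h, that a caller or binding could run before handing a string to the writer. A minimal sketch of such a wrapper follows; the helper name write_attribute_checked is made up for illustration and is not libxml2 API.)

#include <libxml/xmlwriter.h>
#include <libxml/xmlstring.h>

/* Hypothetical caller-side wrapper: refuse the call instead of letting
 * the writer walk past a truncated multi-byte sequence. */
static int
write_attribute_checked(xmlTextWriterPtr writer,
                        const xmlChar *name, const xmlChar *content)
{
    if (!xmlCheckUTF8((const unsigned char *) content))
        return -1;                     /* reject invalid UTF-8 up front */
    return xmlTextWriterWriteAttribute(writer, name, content);
}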
Is the proposed solution for PHP to verify every string passed to any libxml2 function for UTF-8 validity? That seems to make no sense: it would be prohibitively expensive and the wrong place to do it. I also don't see how "people who have a correct UTF-8 framework" avoid paying the price of the check, since the only way you suggest to use this API is exactly to run that check every time something is passed to libxml2 - given that it's not going to check itself?

I would suggest instead that libxml2 make a pass through the data while outputting it and ensure the UTF-8 is valid (i.e. does not contain unexpected \0s). Since libxml2 accesses the data anyway (as demonstrated by it walking past the \0 and writing out other data that should not be written), adding checks for valid UTF-8 sequences should not be too expensive, and the checks themselves are not complicated. Is there any reason why this can not be done?
An example of one such framework is GLib-based strings, which are all UTF-8. I don't know how PHP stores the encoding of its strings; if it doesn't, then that's just random data, because an array of bytes without an associated encoding doesn't have any semantics. I usually point at http://www.joelonsoftware.com/articles/Unicode.html which makes a good case for this and explains it better than I can.

So basically, if your programming framework uses a unified encoding for all strings, then you always have to convert to UTF-8 (and if that framework uses UTF-8 you have nothing to do); conversion and checking go along together fine. That's basically something you should do on data input in PHP (i.e. check the encoding of strings, possibly convert). And if PHP has 'strings' but doesn't know what their encoding is, then I'm afraid it's a problem, as there is no way to make good use of the libxml2 API in that context.

The problem is not to try to catch a possible error on output; it's to make sure of the semantics of the operations. At the XML level the encoding must be known, to guarantee that we are using characters from the allowed character set of XML: http://www.w3.org/TR/REC-xml/#NT-Char

Daniel
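(In a GLib-based program the "check or convert at the input boundary" pattern Daniel describes is a couple of calls; a rough illustration, assuming the raw bytes arrive with a known source encoding and with error handling trimmed:)

#include <glib.h>

/* Validate incoming bytes as UTF-8, or convert them from their declared
 * encoding; returns NULL if they are not valid in that encoding either. */
static gchar *
to_utf8_or_null(const gchar *raw, const gchar *from_encoding)
{
    if (g_utf8_validate(raw, -1, NULL))
        return g_strdup(raw);          /* already valid UTF-8 */

    return g_convert(raw, -1, "UTF-8", from_encoding, NULL, NULL, NULL);
}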
You can not know what a string contains if you accept data from outside, unless you specifically check it. In the PHP context that means you are saying that in order to use the libxml2 API we would have to check every string passed to libxml2 for UTF-8 validity. Attaching or not attaching anything to these strings is irrelevant - though PHP strings, just like C strings, have nothing attached to them; it's up to the application to interpret the data in a string. PHP does not enforce, and can not enforce, any encoding, meaning or standard on all strings in the language, since different applications use strings for different purposes.

I do not see why it is not possible for libxml2 to detect the simple situation where it tries to read a UTF-8 sequence and sees \0 instead - \0 is not a valid byte inside a UTF-8 sequence, and UTF-8 is a simple enough encoding that this could be done with a small number of ifs (most of which would never even be executed, since 6-byte UTF-8 sequences are very rare in practice and most sequences are 2-3 bytes, so in practice only 2 or 3 ifs would run). This would also not incur additional performance cost, as libxml2 makes a pass over the data anyway and actually knows enough about UTF-8 to ignore the \0 at the end of the string and read past it when interpreting the data as UTF-8 - so most of the proposed checks are done anyway; it just doesn't check for that specific \0.

This solution would make it possible to operate libxml2 safely, while what you are suggesting - checking every input passed into libxml2 - would have a very negative impact on performance and would still not ensure secure operation of libxml2, as you can never be sure that some string did not slip into libxml2 without verification through some unexpected code path. It would be much more secure to detect the problem at the point where it happens - when libxml2 reads past a \0 while interpreting UTF-8.

Of course, it would be better if libxml2 accepted a string length and ensured that it does not read past it, but if it relies on \0 to signify the end of the string, it should not read past it in any situation. Even if the data passed to it is not valid for XML, it should fail gracefully and not just read random memory. That is not always possible, but in this case it definitely looks possible to check for it.
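(The kind of check this comment has in mind is roughly the following sketch. It only verifies that every lead byte is followed by the continuation bytes it promises before the terminating NUL - the specific over-read in this bug - and is not a full validity check: overlong forms, surrogates and range limits are ignored.)

#include <stddef.h>

/* Walk a NUL-terminated buffer and make sure no multi-byte sequence runs
 * into the terminator.  Not full UTF-8 validation, just the bounds part
 * relevant to this bug. */
static int
utf8_sequences_complete(const unsigned char *s)
{
    while (*s != 0) {
        size_t len;

        if (*s < 0x80)        len = 1;   /* ASCII */
        else if (*s < 0xC0)   return 0;  /* stray continuation byte */
        else if (*s < 0xE0)   len = 2;
        else if (*s < 0xF0)   len = 3;
        else if (*s < 0xF8)   len = 4;
        else                  return 0;  /* 0xF8-0xFF: not valid lead bytes */

        /* Every continuation byte must exist and look like 10xxxxxx;
         * hitting the NUL here is exactly the over-read in this bug. */
        for (size_t i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80)
                return 0;
        s += len;
    }
    return 1;
}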