GNOME Bugzilla – Bug 148115
xmlCheckUTF8() bug
Last modified: 2004-12-22 21:47:04 UTC
If xmlCheckUTF8() finds a sequence whose binary representation is 10xxxxxx 10xxxxxx, it accepts it as a valid 2-byte UTF-8 character. The test is wrong: it only checks that the MSB of the first byte is set ( c & 0x80 ), then checks that the following byte starts with 10xxxxxx (which is true here), and then checks whether it is a 3-byte code (first byte starts with 111xxxxx). That last check fails, so the sequence is immediately accepted as a 2-byte code.
To fix this, change this:
    if (c & 0x80) {
        if ((utf[ix + 1] & 0xc0) != 0x80)
            return(0);
to this:
    if (c & 0x80) {
        if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
            return(0);
Hope it helps.
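To make the reported behaviour concrete, here is a minimal sketch of the check as the report describes it. It is not the libxml2 source: `old_two_byte_check` is a hypothetical name, and the function only models the decision made on the first two bytes.

```c
/*
 * Hypothetical, simplified model of the check described in the report.
 * It is NOT the libxml2 source and only inspects the first two bytes.
 * Returns  1 if the pair is accepted as a 2-byte character,
 *          0 if it is rejected,
 *         -1 if the lead byte is ASCII or belongs to the longer-sequence branch.
 */
int old_two_byte_check(const unsigned char *utf)
{
    unsigned char c = utf[0];

    if (!(c & 0x80))
        return -1;                  /* plain ASCII, not a multi-byte lead */
    if ((utf[1] & 0xc0) != 0x80)
        return 0;                   /* second byte is not 10xxxxxx */
    if ((c & 0xe0) == 0xe0)
        return -1;                  /* 111xxxxx: 3- or 4-byte branch */
    return 1;                       /* everything else is taken as 2-byte */
}
```

Under these assumptions, the malformed pair 0x80 0x80 (10000000 10000000) gets the same answer as the well-formed pair 0xC3 0xA9 (U+00E9): both are accepted, because the lead byte is never required to start with 11.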
Okay, it seems to make sense, so I applied it, but I would feel better if you could give two XML test cases, one passing that test and one failing it, so we can add them to the test suite and confirm the behaviour. Thanks, Daniel
Daniel, I was using that function standalone to verify a (buggy) Jabber stream before feeding it to the parser (www.neosmt.com). The parser detected the error and failed, while xmlCheckUTF8() did not detect the malformation, giving me no chance to fix it beforehand. I found the bug chasing a real-life situation. I don't think two XML test cases, one passing and one failing, would be suitable, since the parser detected the malformation while the function did not. Simply feed xmlCheckUTF8() a sequence like 10xxxxxx 10xxxxxx (in binary) and the previous version will accept it as a 2-byte character code, which is wrong (a valid 2-byte character has the form 110xxxxx 10xxxxxx). Let me know if I can be of any help. Diego
Okay, no problem. Then it's fixed, thanks, Daniel
Hmmm, sorry, but I can't agree with the fix. I think that if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80) should be if ((c & 0xc0) == 0x80 || (utf[ix + 1] & 0xc0) != 0x80) in order to accept 2-, 3- or 4-byte UTF-8 sequences (see the xml mailing list, http://mail.gnome.org/archives/xml/2004-August/msg00191.html, for discussion on this). I have committed this change to CVS. Bill
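The difference between the two conditions is easy to miss, so here is a small sketch that isolates them. The helper names are hypothetical and the functions cover only the rejection test on the first two bytes, not the rest of xmlCheckUTF8(). Each returns 1 when the condition would reject the pair.

```c
/* Hypothetical helper: the condition from the first proposed fix.
 * It rejects every lead byte that is NOT of the form 10xxxxxx. */
int rejects_first_fix(unsigned char c, unsigned char next)
{
    return (c & 0xc0) != 0x80 || (next & 0xc0) != 0x80;
}

/* Hypothetical helper: the condition as committed to CVS.
 * It rejects a lead byte of the form 10xxxxxx (a continuation byte),
 * or a second byte that is not a continuation byte. */
int rejects_committed_fix(unsigned char c, unsigned char next)
{
    return (c & 0xc0) == 0x80 || (next & 0xc0) != 0x80;
}
```

For the well-formed pair 0xC3 0xA9, `rejects_first_fix` returns 1 (a valid character is refused) while `rejects_committed_fix` returns 0. For the malformed pair 0x80 0x80 the results are reversed: the first condition lets it through and the committed one rejects it. Lead bytes of longer sequences, such as 0xE2, also pass the committed condition, since it only excludes continuation bytes in the lead position.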
Bill, I think you are completely right, my mistake!!! At least we finally got to a fix as the original code was wrong. Thank you, Diego