Bug 148115 - xmlCheckUTF8() bug
Status: RESOLVED FIXED
Product: libxml
Classification: Deprecated
Component: general
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
Daniel Veillard
Depends on:
Blocks:
 
 
Reported: 2004-07-21 19:16 UTC by dtartara
Modified: 2004-12-22 21:47 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description dtartara 2004-07-21 19:16:25 UTC
If xmlCheckUTF8() finds a character whose binary representation is:

10xxxxxx 10xxxxxx

it accepts it as a valid 2-byte UTF-8 character. The test is wrong: it only
checks that the MSB is set ( c & 0x80 ), then checks that the following
byte starts with 10xxxxxx (which is true). The next check, for a 3-byte
code (starts with 111xxxxx), fails, so the sequence is immediately accepted
as a 2-byte code.
To fix this, change this:

        if (c & 0x80) {
            if ((utf[ix + 1] & 0xc0) != 0x80)
                return(0);

to this:
        if (c & 0x80) {
            if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
                return(0);


Hope it helps.
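
To make the reported behaviour concrete, here is a minimal sketch of just the original two-line test quoted in the description. The helper names `original_prefix_ok` and `check_original_prefix` are hypothetical and are not libxml2 functions; the sketch mirrors only the single `if` shown in the report and ignores the rest of xmlCheckUTF8().

```c
#include <assert.h>

/* Hypothetical helper: mirrors only the original test quoted in the report.
 * Returns 1 if the pair (c, next) gets past that test, 0 if it is rejected. */
int original_prefix_ok(unsigned char c, unsigned char next)
{
    if (c & 0x80) {                 /* high bit set: treated as multi-byte */
        if ((next & 0xc0) != 0x80)  /* second byte must look like 10xxxxxx */
            return 0;
    }
    return 1;
}

/* Small self-check of the behaviour described in the report. */
void check_original_prefix(void)
{
    /* 0x80 0x80 is 10000000 10000000: a continuation byte in the lead
     * position, yet the original test lets it through. */
    assert(original_prefix_ok(0x80, 0x80) == 1);
    /* A genuine two-byte sequence (U+00E9, bytes 0xC3 0xA9) also passes. */
    assert(original_prefix_ok(0xc3, 0xa9) == 1);
    /* A lead byte followed by a non-continuation byte is rejected. */
    assert(original_prefix_ok(0xc3, 0x41) == 0);
}
```

Because the original test never looks at the top two bits of `c`, it cannot tell a lead byte (11xxxxxx) from a continuation byte (10xxxxxx).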
Comment 1 Daniel Veillard 2004-08-14 18:22:53 UTC
Okay, it seems to make sense, so I applied it, but I would feel better if
you could give two XML test cases, one passing that test and one failing it,
so we can add them to the test suite and confirm the behaviour.

  thanks,

Daniel
Comment 2 dtartara 2004-08-15 11:41:45 UTC
Daniel,
I was using that function standalone to verify a (buggy) Jabber stream before
feeding it to the parser (www.neosmt.com). The parser seemed to detect the
error and fail, while xmlCheckUTF8() could not detect the malformation, giving
me no chance to fix it beforehand. I found the bug while chasing a real-life
situation.
I don't think two XML files, one passing and one failing, would be suitable,
as the parser seemed to detect the malformation while the function did not.
Simply feed xmlCheckUTF8() a character like 10xxxxxx 10xxxxxx (in binary) and
the previous version will accept it as a 2-byte character code, which is wrong
(it should be 110xxxxx 10xxxxxx).
Let me know if I can be of any help.
Diego
Comment 3 Daniel Veillard 2004-08-15 12:10:08 UTC
Okay, no problem.
Then it's fixed,

  thanks,

Daniel
Comment 4 William M. Brack 2004-08-28 01:33:03 UTC
Hmmm, sorry, but I can't agree with the fix.  I think that
  if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
should be
  if ((c & 0xc0) == 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
in order to accept 2-, 3- or 4-byte UTF-8 sequences (see the xml mailing list,
http://mail.gnome.org/archives/xml/2004-August/msg00191.html, for discussion
on this).  I have committed this change to CVS.

Bill
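
As a rough illustration of the difference between the two conditions, the sketch below evaluates each one on a stray continuation pair and on a valid two-byte sequence. The helper names are hypothetical and are not part of libxml2; each helper returns 1 when the corresponding `if` would fire and the input would be rejected.

```c
#include <assert.h>

/* Hypothetical helpers, not libxml2 functions. */

/* Condition from the first fix: fires unless c is 10xxxxxx. */
int first_fix_rejects(unsigned char c, unsigned char next)
{
    return (c & 0xc0) != 0x80 || (next & 0xc0) != 0x80;
}

/* Condition committed to CVS: fires when c is 10xxxxxx. */
int committed_fix_rejects(unsigned char c, unsigned char next)
{
    return (c & 0xc0) == 0x80 || (next & 0xc0) != 0x80;
}

void check_fixes(void)
{
    /* Stray continuation byte in the lead position: 0x80 0x80. */
    assert(first_fix_rejects(0x80, 0x80) == 0);      /* still accepted */
    assert(committed_fix_rejects(0x80, 0x80) == 1);  /* rejected */

    /* Valid two-byte sequence 0xC3 0xA9 (U+00E9). */
    assert(first_fix_rejects(0xc3, 0xa9) == 1);      /* wrongly rejected */
    assert(committed_fix_rejects(0xc3, 0xa9) == 0);  /* accepted */
}
```

With `!=`, every real lead byte (11xxxxxx) trips the check, so all multi-byte sequences are refused; with `==`, only a continuation byte in the lead position is refused.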
Comment 5 dtartara 2004-08-28 22:36:16 UTC
Bill, 
I think you are completely right, my mistake!!! At least we finally got to a 
fix as the original code was wrong.
Thank you,
Diego