Bug 148115 – xmlCheckUTF8() bug




Summary:	xmlCheckUTF8() bug


Status:	RESOLVED FIXED

Product:	libxml
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	Daniel Veillard

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-07-21 19:16 UTC by dtartara
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description dtartara 2004-07-21 19:16:25 UTC

If xmlCheckUTF8() finds a byte pair with a binary representation such as:

10xxxxxx 10xxxxxx 

it accepts it as a valid 2-byte UTF-8 character. The test is wrong: it only 
checks that the MSB of the first byte is set ( c & 0x80 ), then that the 
following byte starts with 10xxxxxx (which is true here). The next step checks 
whether this is a 3-byte code (first byte starts with 111xxxxx), which fails, 
so the pair is immediately accepted as a 2-byte code.
To fix this, change this:

        if (c & 0x80) {
            if ((utf[ix + 1] & 0xc0) != 0x80)
                return(0);

to this:

        if (c & 0x80) {
            if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
                return(0);


Hope it helps.
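
For readers following the bit patterns, here is a minimal sketch of how each byte of a UTF-8 stream can be classified, together with the pre-fix test isolated as a predicate. The names byte_kind, classify_byte and original_test_passes are hypothetical helpers for illustration only; they are not part of libxml, and the predicate reproduces only the two conditions quoted in this report, not the whole function.

```c
/* Hypothetical helper types and names; not part of libxml. */
typedef enum {
    BYTE_ASCII,        /* 0xxxxxxx: 1-byte code            */
    BYTE_CONTINUATION, /* 10xxxxxx: may only follow a lead */
    BYTE_LEAD2,        /* 110xxxxx: starts a 2-byte code   */
    BYTE_LEAD3,        /* 1110xxxx: starts a 3-byte code   */
    BYTE_LEAD4,        /* 11110xxx: starts a 4-byte code   */
    BYTE_INVALID       /* 11111xxx: never valid in UTF-8   */
} byte_kind;

/* Classify a single byte by its leading bits. */
static byte_kind classify_byte(unsigned char c)
{
    if ((c & 0x80) == 0x00) return BYTE_ASCII;
    if ((c & 0xc0) == 0x80) return BYTE_CONTINUATION;
    if ((c & 0xe0) == 0xc0) return BYTE_LEAD2;
    if ((c & 0xf0) == 0xe0) return BYTE_LEAD3;
    if ((c & 0xf8) == 0xf0) return BYTE_LEAD4;
    return BYTE_INVALID;
}

/*
 * The pre-fix test from the report, isolated: returns nonzero when the
 * check lets the pair through. It looks at the high bit of the first
 * byte and at the second byte only, so it never asks whether the first
 * byte is actually a lead byte.
 */
static int original_test_passes(unsigned char c, unsigned char next)
{
    return (c & 0x80) && (next & 0xc0) == 0x80;
}
```

With these helpers, classify_byte(0x80) reports a continuation byte, yet original_test_passes(0x80, 0x80) is nonzero, which is the acceptance described above.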

Comment 1 Daniel Veillard 2004-08-14 18:22:53 UTC

Okay, it seems to make sense, so I applied it, but I would feel better if
you could give two XML test cases, one passing that test and one failing it,
so we can add them to the test suite and confirm the behaviour.

  thanks,

Daniel

Comment 2 dtartara 2004-08-15 11:41:45 UTC

Daniel,
I was using that function standalone to verify a (buggy) Jabber stream before 
feeding it to the parser (www.neosmt.com). The parser detected the error and 
failed, while xmlCheckUTF8() did not detect the malformation, giving me no 
chance to fix it beforehand. I found the bug while chasing a real-life 
situation. I don't think two XML test cases, one passing and one failing, 
would be suitable, since the parser detects the malformation while the 
function does not. Simply feed xmlCheckUTF8() a byte pair like 
10xxxxxx 10xxxxxx (in binary) and the previous version will accept it as a 
2-byte character code, which is wrong (a valid 2-byte code is 
110xxxxx 10xxxxxx).
Let me know if I can be of any help.
Diego

Comment 3 Daniel Veillard 2004-08-15 12:10:08 UTC

Okay, no problem.
Then it's fixed,

  thanks,

Daniel

Comment 4 William M. Brack 2004-08-28 01:33:03 UTC

Hmmm, sorry, but I can't agree with the fix.  I think that
  if ((c & 0xc0) != 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
should be
  if ((c & 0xc0) == 0x80 || (utf[ix + 1] & 0xc0) != 0x80)
in order to accept 2, 3 or 4-byte UTF-8 sequences (see the xml mailing list,
http://mail.gnome.org/archives/xml/2004-August/msg00191.html, for discussion on this).
I have committed this change to CVS.

Bill
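
To make the difference between the three conditions discussed in this report concrete, the sketch below isolates each one as a predicate that returns 1 when the check would reject the input (that is, when xmlCheckUTF8() would return 0). The function names are hypothetical and exist only for this comparison; each predicate assumes the outer `c & 0x80` test has already passed.

```c
/* Hypothetical predicates; each returns 1 when the pair is rejected. */

/* Original condition: only the second byte is inspected. */
static int rejects_original(unsigned char c, unsigned char next)
{
    (void)c; /* the first byte is not examined beyond c & 0x80 */
    return (next & 0xc0) != 0x80;
}

/* First proposed fix: requires the FIRST byte to look like 10xxxxxx. */
static int rejects_first_fix(unsigned char c, unsigned char next)
{
    return (c & 0xc0) != 0x80 || (next & 0xc0) != 0x80;
}

/* Committed fix: rejects a first byte that looks like 10xxxxxx. */
static int rejects_committed(unsigned char c, unsigned char next)
{
    return (c & 0xc0) == 0x80 || (next & 0xc0) != 0x80;
}
```

Taking 0xC3 0xA9 (U+00E9, a valid 2-byte sequence) and 0x80 0x80 (two stray continuation bytes) as sample inputs: the original condition rejects neither, the first proposed fix rejects the valid pair and still lets the stray pair through, and the committed condition lets the valid pair through and rejects the stray one.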

Comment 5 dtartara 2004-08-28 22:36:16 UTC

Bill, 
I think you are completely right, my mistake!!! At least we finally got to a 
fix as the original code was wrong.
Thank you,
Diego