Bug 310229 – htmlParseScript bad parse UTF-8 characters


Summary:	htmlParseScript bad parse UTF-8 characters


Status:	VERIFIED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.20
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-07-13 14:39 UTC by Jiri Netolicky
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch to correct parsing UTF-8 characters in htmlParseScript (1.06 KB, patch) 2005-07-13 14:41 UTC, Jiri Netolicky
C code and HTML page which exhibit the problem (997 bytes, application/x-compressed-tar) 2005-07-13 15:43 UTC, Jiri Netolicky

Description Jiri Netolicky 2005-07-13 14:39:32 UTC

Please describe the problem:
When htmlParseScript finds multibyte characters (for example 'ří', which is 0xc5
0x99 0xc3 0xad), the NEXT macro, which calls xmlNextChar, correctly moves to the
next char, but:

buf[nbchar++] = cur;

on line 2664 copies ONLY ONE BYTE of the two-byte sequence to the buffer, which is
then passed to the cdataBlockSAXFunc function. I think the code for htmlParseScript
should be similar to htmlParseCharData.
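
The effect described above can be reproduced without libxml2. The following is only a minimal sketch: utf8_seq_len, copy_lead_bytes_only and copy_whole_chars are hypothetical helper names, not libxml2 functions, and the loops merely imitate the "advance by one character, store one byte" pattern versus storing the whole sequence.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 API): length in bytes of the UTF-8
 * sequence introduced by lead byte c. Invalid lead bytes count as 1 so
 * the caller always makes progress. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Buggy pattern: advance by a whole character but store only its first
 * byte, as "buf[nbchar++] = cur;" followed by NEXT does. */
static size_t copy_lead_bytes_only(const unsigned char *in, size_t len,
                                   unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        size_t l = utf8_seq_len(in[i]);
        if (l > len - i) l = len - i;   /* clamp a truncated sequence */
        out[n++] = in[i];               /* only one byte is stored */
        i += l;                         /* but the whole character is skipped */
    }
    return n;
}

/* Corrected pattern: store every byte of the character, then advance. */
static size_t copy_whole_chars(const unsigned char *in, size_t len,
                               unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        size_t l = utf8_seq_len(in[i]);
        if (l > len - i) l = len - i;   /* clamp a truncated sequence */
        for (size_t k = 0; k < l; k++)
            out[n++] = in[i + k];
        i += l;
    }
    return n;
}
```

Fed the four bytes 0xc5 0x99 0xc3 0xad, the first function stores only 0xc5 0xc3, while the second stores all four bytes.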


Steps to reproduce:
1. Try parsing any HTML document with some multibyte characters in a script tag.
htmlCreatePushParserCtxt must be called with XML_CHAR_ENCODING_UTF8.

Actual results:


Expected results:


Does this happen every time?
yes

Other information:

Comment 1 Jiri Netolicky 2005-07-13 14:41:18 UTC

Created attachment 49093 [details] [review]
Patch to correct parsing UTF-8 characters in htmlParseScript

Comment 2 Daniel Veillard 2005-07-13 15:00:37 UTC

If you have a single standalone test example exhibiting the problem that
would help :-)

Daniel

Comment 3 Jiri Netolicky 2005-07-13 15:43:04 UTC

Created attachment 49102 [details]
C code and HTML page which exhibit the problem

Compare the output of this example program with, for example, 'od -tx1z index.html'.
You will see that the second byte of every 2-byte UTF-8 sequence is missing from
the buffer passed to cdataBlockSAXFunc.
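
For a byte-level comparison of this kind, a callback can print its buffer in the same hex notation that 'od -tx1' uses. This is only a sketch of such a formatter: hex_bytes is a hypothetical name and is not taken from the attached example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write len bytes of data into out as space-separated
 * two-digit hex values. Returns the number of characters written (without
 * the terminating NUL), or 0 with an empty string if out is too small. */
static size_t hex_bytes(const unsigned char *data, size_t len,
                        char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        size_t need = i ? 3 : 2;        /* " xx" after the first byte */
        if (n + need + 1 > outsz) {     /* +1 for the terminating NUL */
            out[0] = '\0';
            return 0;
        }
        n += (size_t)snprintf(out + n, outsz - n,
                              i ? " %02x" : "%02x", (unsigned)data[i]);
    }
    return n;
}
```

Calling it on the bytes a cdataBlockSAXFunc handler receives gives a string that can be compared by eye with the od dump of the input file.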

Comment 4 Daniel Veillard 2005-07-13 16:38:22 UTC

Excellent, patch looks fine, applied and committed. I also added the
example to the regression test suite,

  thanks a lot,

Daniel

Comment 5 qiuyingbo 2005-07-14 08:32:56 UTC

It seems that the patch introduces a new bug.

>>>>>>>>>>>>>
+	COPY_BUF(l,buf,nbchar,cur);
 	if (nbchar >= HTML_PARSER_BIG_BUFFER_SIZE) {
>>>>>>>>>>>>>

If nbchar == HTML_PARSER_BIG_BUFFER_SIZE-1, the loop continues. If the next
character is multibyte with a UTF-8 length of 3, then

COPY_BUF(3, buf, nbchar, cur) will overflow 'buf'.

htmlParseCharData() defines: xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
but htmlParseScript() defines: xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];

The patch should be:

@@ -2629,10 +2629,10 @@
 htmlParseScript(htmlParserCtxtPtr ctxt) {
-     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];
+     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
.................
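
The arithmetic behind this can be checked with a small model of the copy-then-flush loop. This is a sketch under stated assumptions: BIG_BUFFER_SIZE is a stand-in constant for HTML_PARSER_BIG_BUFFER_SIZE, peak_fill and fits are hypothetical helpers, and the model only counts bytes instead of touching a real buffer.

```c
#include <stddef.h>

/* Stand-in for HTML_PARSER_BIG_BUFFER_SIZE; the value is an assumption
 * made for this illustration. */
#define BIG_BUFFER_SIZE 1000

/* Highest fill level reached when nchars characters of seq_len bytes each
 * are appended, and the buffer is flushed only AFTER an append has brought
 * the count to BIG_BUFFER_SIZE or more. */
static size_t peak_fill(size_t start_fill, size_t seq_len, size_t nchars)
{
    size_t nbchar = start_fill, peak = start_fill;

    for (size_t i = 0; i < nchars; i++) {
        nbchar += seq_len;               /* models COPY_BUF(l,buf,nbchar,cur) */
        if (nbchar > peak)
            peak = nbchar;
        if (nbchar >= BIG_BUFFER_SIZE)   /* the flush check runs after the copy */
            nbchar = 0;
    }
    return peak;
}

/* Nonzero if a buffer declared as buf[BIG_BUFFER_SIZE + slack] can hold
 * peak bytes. */
static int fits(size_t peak, size_t slack)
{
    return peak <= BIG_BUFFER_SIZE + slack;
}
```

Starting from BIG_BUFFER_SIZE-1, one 3-byte character raises the fill level to BIG_BUFFER_SIZE+2, which does not fit with a slack of 1 but does fit with a slack of 5. A 4-byte character peaks at BIG_BUFFER_SIZE+3 and still fits with a slack of 5.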

Comment 6 Daniel Veillard 2005-07-14 08:57:58 UTC

Right! Like htmlParseCharData(), good catch!

  Fixed in CVS, thanks a lot!

Daniel

Comment 7 Daniel Veillard 2005-09-05 09:00:41 UTC

This should be closed by release of libxml2-2.6.21,

  thanks,

Daniel