Bug 310229 - htmlParseScript bad parse UTF-8 characters
Status: VERIFIED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.6.20
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-07-13 14:39 UTC by Jiri Netolicky
Modified: 2009-08-15 18:40 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to correct parsing UTF-8 characters in htmlParseScript (1.06 KB, patch)
2005-07-13 14:41 UTC, Jiri Netolicky
C code and HTML page which exhibit the problem (997 bytes, application/x-compressed-tar)
2005-07-13 15:43 UTC, Jiri Netolicky

Description Jiri Netolicky 2005-07-13 14:39:32 UTC
Please describe the problem:
When htmlParseScript finds multibyte characters (for example 'ří', which is 0xc5
0x99 0xc3 0xad), the NEXT macro, which calls xmlNextChar, correctly moves to the next
character, but

buf[nbchar++] = cur;

on line 2664 copies ONLY ONE BYTE of the two-byte sequence to the buffer, which is then
passed to the cdataBlockSAXFunc function. I think the code in htmlParseScript should
be similar to the code in htmlParseCharData.
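
The effect described above can be sketched in plain C, independent of libxml2. This is only an illustration under stated assumptions: utf8_seq_len, copy_first_byte_only and copy_whole_char are hypothetical names, not libxml2 functions, and the real parser uses its own macros rather than these loops. The sketch shows a loop that advances by a whole character but stores a single byte, next to one that stores every byte of the sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libxml2): length in bytes of the UTF-8
 * sequence introduced by lead byte c. Invalid lead bytes are treated as
 * length 1 so that the caller always makes progress. */
static size_t utf8_seq_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Pattern analogous to the reported bug: the input position advances by a
 * whole character, but only the first byte of that character is stored. */
static size_t copy_first_byte_only(const unsigned char *in, size_t len,
                                   unsigned char *out) {
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < len) {
        size_t l = utf8_seq_len(in[i]);
        if (l > len - i) l = len - i;   /* truncated trailing sequence */
        out[n++] = in[i];               /* continuation bytes are dropped */
        i += l;
    }
    return n;
}

/* Pattern analogous to the fix: every byte of the character is stored.
 * The caller must provide an output buffer of at least len bytes. */
static size_t copy_whole_char(const unsigned char *in, size_t len,
                              unsigned char *out) {
    size_t i = 0, n = 0, k;
    assert(in != NULL && out != NULL);
    while (i < len) {
        size_t l = utf8_seq_len(in[i]);
        if (l > len - i) l = len - i;   /* truncated trailing sequence */
        for (k = 0; k < l; k++)
            out[n++] = in[i + k];
        i += l;
    }
    return n;
}
```

Feeding the four bytes 0xc5 0x99 0xc3 0xad to copy_first_byte_only stores only 0xc5 0xc3, while copy_whole_char stores all four bytes.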


Steps to reproduce:
1. Try parsing any HTML document with some multibyte characters inside a script tag.
htmlCreatePushParserCtxt must be called with XML_CHAR_ENCODING_UTF8.

Actual results:


Expected results:


Does this happen every time?
yes

Other information:
Comment 1 Jiri Netolicky 2005-07-13 14:41:18 UTC
Created attachment 49093
Patch to correct parsing UTF-8 characters in htmlParseScript
Comment 2 Daniel Veillard 2005-07-13 15:00:37 UTC
If you have a single standalone test example exhibiting the problem that
would help :-)

Daniel
Comment 3 Jiri Netolicky 2005-07-13 15:43:04 UTC
Created attachment 49102
C code and HTML page which exhibit the problem

Compare the output of this example program with, for example, 'od -tx1z index.html'.
You will see that the second byte of every 2-byte UTF-8 sequence is missing from
the buffer passed to cdataBlockSAXFunc.
Comment 4 Daniel Veillard 2005-07-13 16:38:22 UTC
Excellent, the patch looks fine; applied and committed. I also added the
example to the regression test suite.

  thanks a lot,

Daniel
Comment 5 qiuyingbo 2005-07-14 08:32:56 UTC
It seems that the patch introduces a new bug.

>>>>>>>>>>>>>
+	COPY_BUF(l,buf,nbchar,cur);
 	if (nbchar >= HTML_PARSER_BIG_BUFFER_SIZE) {
>>>>>>>>>>>>>

If nbchar == HTML_PARSER_BIG_BUFFER_SIZE-1, the loop continues. If the next
character is multibyte with a UTF-8 length of 3, then

COPY_BUF(3, buf, nbchar, cur) will overflow 'buf'.

htmlParseCharData() defines: xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
but htmlParseScript() defines: xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];

The patch should be:

@@ -2629,10 +2629,10 @@
 htmlParseScript(htmlParserCtxtPtr ctxt) {
-     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];
+     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
.................
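
The bounds arithmetic behind this report can be checked with a small sketch in plain C. It is illustrative only: last_index_written and worst_case_fits are hypothetical helpers, not libxml2 code, and the sketch assumes (as the quoted snippet suggests) that the flush check runs after each copy, so the largest nbchar seen before a copy is the threshold minus one. It also assumes a UTF-8 character occupies at most 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: highest buffer index touched when a character of
 * char_len bytes is appended starting at position nbchar. */
static size_t last_index_written(size_t nbchar, size_t char_len) {
    assert(char_len >= 1);
    return nbchar + char_len - 1;
}

/* Hypothetical helper: returns 1 if a buffer declared as
 * buf[flush_threshold + slack] can absorb a character of max_char_len bytes
 * appended at the worst-case position, i.e. the last nbchar value that has
 * not yet triggered a flush. Returns 0 if the write would run past the end. */
static int worst_case_fits(size_t flush_threshold, size_t slack,
                           size_t max_char_len) {
    size_t capacity, worst_nbchar;
    assert(flush_threshold >= 1);
    capacity = flush_threshold + slack;
    worst_nbchar = flush_threshold - 1;
    return last_index_written(worst_nbchar, max_char_len) < capacity;
}
```

For any threshold T, worst_case_fits(T, 1, 3) is 0, because the 3-byte write ends at index T+1 while the last valid index is T. With a slack of 5, both 3-byte and 4-byte characters fit, which is consistent with the size used in htmlParseCharData().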
Comment 6 Daniel Veillard 2005-07-14 08:57:58 UTC
Right! Same as htmlParseCharData(), good catch!

  Fixed in CVS, thanks a lot!

Daniel
Comment 7 Daniel Veillard 2005-09-05 09:00:41 UTC
This should be closed by the release of libxml2-2.6.21,

  thanks,

Daniel