GNOME Bugzilla – Bug 310229
htmlParseScript bad parse UTF-8 characters
Last modified: 2009-08-15 18:40:50 UTC
Please describe the problem:
When htmlParseScript finds multibyte characters (for example 'ří', which is 0xc5 0x99 0xc3 0xad), the NEXT macro, which calls xmlNextChar, correctly moves to the next character, but
    buf[nbchar++] = cur;
on line 2664 copies ONLY ONE BYTE of the two-byte sequence to the buffer, which is then passed to the cdataBlockSAXFunc function. I think the code for htmlParseScript should be similar to htmlParseCharData.

Steps to reproduce:
1. Try parsing any HTML document with some multibyte characters in a script tag. htmlCreatePushParserCtxt must be called with XML_CHAR_ENCODING_UTF8.

Actual results:

Expected results:

Does this happen every time?
yes

Other information:
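To make the reported behaviour concrete, here is a minimal standalone sketch of the two copy strategies. It is not the libxml2 code: `utf8_seq_len`, `copy_lead_bytes_only` and `copy_whole_sequences` are hypothetical helpers written for illustration, and the length detection assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 sequence starting with
 * `lead`. Returns 1 for ASCII and for invalid lead bytes. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Models the reported bug: the read position advances by a whole
 * character, but only one byte per character is stored. */
static size_t copy_lead_bytes_only(const unsigned char *in, size_t in_len,
                                   unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; i += utf8_seq_len(in[i]))
        out[n++] = in[i];
    return n;
}

/* Models the intended behaviour: every byte of each sequence is stored.
 * `out` must have room for at least in_len bytes. */
static size_t copy_whole_sequences(const unsigned char *in, size_t in_len,
                                   unsigned char *out)
{
    size_t n = 0, i = 0;
    assert(in != NULL && out != NULL);
    while (i < in_len) {
        size_t l = utf8_seq_len(in[i]);
        if (l > in_len - i)
            l = in_len - i;          /* truncated trailing sequence */
        memcpy(out + n, in + i, l);
        n += l;
        i += l;
    }
    return n;
}
```

For the input bytes 0xc5 0x99 0xc3 0xad ('ří'), the first function stores only 0xc5 0xc3, while the second stores all four bytes.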
Created attachment 49093
Patch to correct parsing of UTF-8 characters in htmlParseScript
If you have a single standalone test example exhibiting the problem that would help :-) Daniel
Created attachment 49102
C code and HTML page which exhibit the problem

Compare the output of this example program with, for example, 'od -tx1z index.html'. You will see that the second byte of every 2-byte UTF-8 sequence is missing from the buffer passed to cdataBlockSAXFunc.
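For anyone without the attachment at hand, a comparison of this kind only needs the received bytes printed as hex. The following is a rough sketch of such a helper, not the attached program; `hex_dump` is a hypothetical name, and calling it on the buffer a CDATA callback receives is an assumption about how one would wire it up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the bytes of `buf` into `out` as
 * space-separated lowercase hex pairs, similar to one row of
 * 'od -tx1' output. Stops early if `out` is too small.
 * Returns the number of characters written, excluding the NUL. */
static size_t hex_dump(const unsigned char *buf, size_t len,
                       char *out, size_t out_size)
{
    size_t used = 0;
    assert(buf != NULL && out != NULL);
    if (out_size == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* each byte needs at most 3 characters plus the terminator */
        if (out_size - used < 4)
            break;
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02x" : "%02x", buf[i]);
    }
    return used;
}
```

Dumping the bytes of 'ří' this way gives "c5 99 c3 ad"; a buffer that lost the continuation bytes would show only "c5 c3".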
Excellent, patch looks fine, applied and committed. I also added the example to the regression test suite. Thanks a lot, Daniel
It seems that the patch introduces a new bug.

>>>>>>>>>>>>>
+        COPY_BUF(l,buf,nbchar,cur);
         if (nbchar >= HTML_PARSER_BIG_BUFFER_SIZE) {
>>>>>>>>>>>>>

If nbchar == HTML_PARSER_BIG_BUFFER_SIZE-1, the loop continues. If we then meet a multibyte character whose UTF-8 length is 3, COPY_BUF(3, buf, nbchar, cur) will overflow 'buf'.

htmlParseCharData() defines:
    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
but htmlParseScript() defines:
    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];

The patch should be:

@@ -2629,10 +2629,10 @@ htmlParseScript(htmlParserCtxtPtr ctxt) {
-    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 1];
+    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
.................
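The arithmetic behind this can be checked with a small simulation. This is only a sketch of the copy-then-check loop pattern described above, not the parser itself: `BIG_BUFFER_SIZE` is a stand-in constant whose value is an assumption, and `peak_fill` is a hypothetical helper that tracks how many bytes the buffer would have to hold without writing to any real buffer.

```c
#include <assert.h>
#include <stddef.h>

#define BIG_BUFFER_SIZE 1000  /* stand-in value, assumed for illustration */

/* Simulates a loop that appends a character and only afterwards checks
 * whether the buffer should be flushed. First `prefix` single-byte
 * characters are appended, then `count` characters of `seq_len` bytes.
 * Returns the largest fill level reached, i.e. the minimum buffer
 * capacity the loop needs to stay in bounds. */
static size_t peak_fill(size_t prefix, size_t seq_len, size_t count)
{
    size_t nbchar = 0, peak = 0;
    assert(seq_len >= 1 && seq_len <= 4);
    for (size_t i = 0; i < prefix + count; i++) {
        nbchar += (i < prefix) ? 1 : seq_len;   /* the copy step */
        if (nbchar > peak)
            peak = nbchar;
        if (nbchar >= BIG_BUFFER_SIZE)          /* the flush check */
            nbchar = 0;
    }
    return peak;
}
```

With BIG_BUFFER_SIZE - 1 single-byte characters followed by one 3-byte character, the peak is BIG_BUFFER_SIZE + 2, which does not fit in a buffer of BIG_BUFFER_SIZE + 1 bytes but does fit in one of BIG_BUFFER_SIZE + 5. The same holds for a 4-byte character, where the peak is BIG_BUFFER_SIZE + 3.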
Right! Like htmlParseCharData(), good catch! Fixed in CVS, thanks a lot! Daniel
This should be closed by release of libxml2-2.6.21, thanks, Daniel