GNOME Bugzilla – Bug 310333
HTMLtree.c: htmlDocDump() cannot dump as UTF-8 encoding
Last modified: 2009-08-15 18:40:50 UTC
Please describe the problem:
I have written a simple program, 'htmllint', to convert an HTML page to an XML page. I want the output to be UTF-8 encoded, but instead it is 'HTML' encoded: non-ASCII text is written as numeric character references, so a string such as '控制脚本底' comes out as '&#25511;&#21046;&#33050;&#26412;&#24213;'.

>>>>>> htmllint.c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

int main(int argc, char *argv[])
{
    htmlDocPtr doc;

    if (argc < 3) {
        printf("usage: %s filename encoding [encoding2]\n", argv[0]);
        return 1;
    }

    /* argv[2]: encoding of the input file */
    doc = htmlParseFile(argv[1], argv[2]);
    if (doc == NULL)
        return 1;
    /* argv[3]: desired output encoding (NULL when omitted) */
    htmlSetMetaEncoding(doc, (const xmlChar *) argv[3]);
    htmlDocDump(stdout, doc);
    xmlFreeDoc(doc);
    return 0;
}

Steps to reproduce:
1. Compile htmllint.c.
2. wget a multibyte HTML page, e.g. "http://www.sina.com.cn".
3. Run "#./htmllint index.html GB18030 UTF-8 2>/dev/null >index.xml".
4. index.xml then contains only character references such as "&#26032;&#28010;&#39318;&#39029;" (新浪首页) instead of UTF-8 text.
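To make the difference between the expected and the actual output concrete, here is a minimal standalone sketch (it does not use libxml2) that serializes one code point both ways: as raw UTF-8 bytes and as a decimal numeric character reference. The helper names encode_utf8 and encode_char_ref are hypothetical and exist only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8 and NUL-terminate it.
 * Returns the number of bytes written (1-4), or 0 if cp is not a
 * valid scalar value. out must have room for at least 5 bytes. */
static int encode_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        out[1] = '\0';
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        out[2] = '\0';
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not encodable */
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        out[3] = '\0';
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        out[4] = '\0';
        return 4;
    }
    return 0;
}

/* Write cp as a decimal numeric character reference ("&#NNNNN;").
 * Returns the length snprintf reports for the formatted string. */
static int encode_char_ref(unsigned long cp, char *out, size_t size)
{
    return snprintf(out, size, "&#%lu;", cp);
}
```

For U+65B0 (新), the first helper yields the three bytes E6 96 B0, which is what the report expects in index.xml, while the second yields the ASCII-only form that actually appears there.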
Created attachment 49155
the patch for the bug

Because the document's internal charset is already 'UTF-8', setting the meta encoding to 'UTF-8' makes the program skip the 'if (enc != cur->charset) { ... }' block, so 'handler' remains NULL and the output is not written as UTF-8.
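The control flow described in this comment can be modelled in a few lines of plain C. This is only a sketch under stated assumptions, not libxml2's actual code: encoder names stand in for real handler objects, the internal charset is hard-coded, the fallback to an escaping "HTML" encoder is inferred from the symptom in the report, and pick_handler_fixed shows one possible correction rather than the change that was committed. Both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: the parsed document is always held as UTF-8 internally. */
#define INTERNAL_CHARSET "UTF-8"

/* Model of the behaviour described in the comment: a handler is only
 * looked up when the requested encoding differs from the internal one.
 * Returns the name of the encoder that would end up being used. */
static const char *pick_handler_buggy(const char *meta_encoding)
{
    const char *handler = NULL;

    if (meta_encoding != NULL &&
        strcmp(meta_encoding, INTERNAL_CHARSET) != 0) {
        handler = meta_encoding; /* stands in for a handler lookup */
    }
    /* Assumed fallback: with no handler, an escaping encoder is used. */
    if (handler == NULL)
        handler = "HTML";
    return handler;
}

/* One possible correction: honour any explicitly requested encoding,
 * including the one that matches the internal charset. */
static const char *pick_handler_fixed(const char *meta_encoding)
{
    if (meta_encoding != NULL)
        return meta_encoding;
    return "HTML";
}
```

With "UTF-8" as the requested encoding, the first function ends up at the escaping fallback, which matches the character references seen in index.xml; the second returns "UTF-8".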
Okay, fixed in a similar way; this is in CVS. Thanks,
Daniel
This should be closed by the release of libxml2-2.6.21. Thanks,
Daniel