Bug 310333 - HTMLtree.c: htmlDocDump() cannot dump as UTF-8 encoding
Status: VERIFIED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.6.20
Hardware: Other All
Importance: Normal minor
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-07-14 10:43 UTC by qiuyingbo
Modified: 2009-08-15 18:40 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
the patch for the bug (425 bytes, patch)
2005-07-14 10:49 UTC, qiuyingbo

Description qiuyingbo 2005-07-14 10:43:02 UTC
Please describe the problem:
I have written a simple program, 'htmllint', to convert an HTML page to an XML
page. I want the output to be UTF-8 encoded, but it is 'HTML' encoded instead:
a Unicode string such as '控制脚本底' is written as numeric character references.

>>>>>> htmllint.c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

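int main(int argc, char *argv[])
{
    htmlDocPtr  doc;

    if (argc < 3) {
        printf("usage: %s filename [encoding] [encoding2]\n", argv[0]);
        return 1;
    }
    /* argv[2] is the encoding of the input file. */
    doc = htmlParseFile(argv[1], argv[2]);
    if (doc == NULL) {
        fprintf(stderr, "%s: failed to parse %s\n", argv[0], argv[1]);
        return 1;
    }
    /* argv[3] is the encoding wanted for the output; it is NULL when
       encoding2 is omitted, because argv[argc] is always NULL. */
    htmlSetMetaEncoding(doc, (const xmlChar *) argv[3]);
    htmlDocDump(stdout, doc);
    xmlFreeDoc(doc);
    return 0;
}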


Steps to reproduce:
1. Compile htmllint.c.
2. Use wget to fetch a multibyte HTML page, e.g. "http://www.sina.com.cn".
3. Run "./htmllint index.html GB18030 UTF-8 2>/dev/null >index.xml".
4. index.xml then contains only character references such as "&#26032;&#28010;&#39318;&#39029;"...

Actual results:


Expected results:


Does this happen every time?


Other information:
Comment 1 qiuyingbo 2005-07-14 10:49:43 UTC
Created attachment 49155
the patch for the bug

Because the xmlDocument's internal charset encoding is 'UTF-8', if the
MetaEncoding is set to 'UTF-8',

the program skips the '''if (enc != cur->charset) { ... }''' block, and the
'handler' remains 'NULL'.
Comment 2 Daniel Veillard 2005-08-08 14:12:37 UTC
Okay, fixed in a similar way; this is in CVS,

 thanks,

Daniel
Comment 3 Daniel Veillard 2005-09-05 08:59:46 UTC
This should be closed by the release of libxml2-2.6.21,

  thanks,

Daniel