Bug 310333 – HTMLtree.c: htmlDocDump() cannot dump as UTF-8 encoding



Summary:	HTMLtree.c: htmlDocDump() cannot dump as UTF-8 encoding


Status:	VERIFIED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.20
Hardware:	Other All

Importance:	Normal minor
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-07-14 10:43 UTC by qiuyingbo
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
the patch for the bug (425 bytes, patch) 2005-07-14 10:49 UTC, qiuyingbo

Description qiuyingbo 2005-07-14 10:43:02 UTC

Please describe the problem:
I have written a simple program, 'htmllint', to convert an HTML page to an XML
page. I want the output to be UTF-8 encoded, but it is 'HTML' encoded instead:
non-ASCII characters come out as numeric character references such as
'&#25511;&#21046;&#33050;&#26412;&#24213;'.

>>>>>> htmllint.c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

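int main(int argc, char *argv[])
{
    htmlDocPtr doc;

    /* argv[2] is the input encoding; argv[3] (optional) is the encoding
       to record in the document's <meta> declaration. */
    if (argc < 3) {
        fprintf(stderr, "usage: %s filename encoding [encoding2]\n", argv[0]);
        return 1;
    }
    doc = htmlParseFile(argv[1], argv[2]);
    if (doc == NULL) {
        fprintf(stderr, "%s: failed to parse %s\n", argv[0], argv[1]);
        return 1;
    }
    /* When encoding2 is omitted, argv[3] is NULL. */
    htmlSetMetaEncoding(doc, (const xmlChar *) argv[3]);
    htmlDocDump(stdout, doc);
    xmlFreeDoc(doc);
    return 0;
}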


Steps to reproduce:
1. Compile htmllint.c.
2. Fetch a multibyte HTML page, e.g. "wget http://www.sina.com.cn".
3. Run "#./htmllint index.html GB18030 UTF-8 2>/dev/null >index.xml".
4. index.xml then contains only numeric character references such as
"&#26032;&#28010;&#39318;&#39029;" instead of UTF-8 text.
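
To make the difference concrete: the references in step 4 are decimal Unicode code points, and the expected output is those same code points written out as UTF-8 bytes. The sketch below shows that conversion for a single code point. The helper name codepoint_to_utf8 is made up for illustration and is not part of libxml2.

```c
#include <stddef.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF.
 */
size_t codepoint_to_utf8(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not valid scalar values */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, code point 26032 (the first reference in step 4) becomes the three bytes E6 96 B0.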

Actual results:


Expected results:


Does this happen every time?


Other information:

Comment 1 qiuyingbo 2005-07-14 10:49:43 UTC

Created attachment 49155
the patch for the bug

Because the xmlDocument's internal charset encoding is 'UTF-8', setting the
MetaEncoding to 'UTF-8' makes the program skip the
'if (enc != cur->charset) { ... }' block, so 'handler' remains NULL.
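
The control flow described above can be modelled without libxml2. The sketch below is a simplified stand-in, not the library's source and not the attached patch: the names enc_t, parse_encoding, pick_handler_buggy and pick_handler_fixed are hypothetical, handlers are represented by plain strings, and the fallback to an "HTML" handler (one that escapes non-ASCII characters as numeric references) is an assumption inferred from the reported output.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the library's encoding enumeration. */
typedef enum { ENC_UTF8, ENC_GB18030, ENC_OTHER } enc_t;

/* Map an encoding name to the stand-in enum (exact match only). */
static enc_t parse_encoding(const char *name)
{
    if (strcmp(name, "UTF-8") == 0)
        return ENC_UTF8;
    if (strcmp(name, "GB18030") == 0)
        return ENC_GB18030;
    return ENC_OTHER;
}

/*
 * Reported behaviour: a handler is looked up only when the requested
 * encoding differs from the document's internal charset.
 */
const char *pick_handler_buggy(const char *meta_encoding, enc_t doc_charset)
{
    const char *handler = NULL;

    if (meta_encoding != NULL) {
        enc_t enc = parse_encoding(meta_encoding);
        if (enc != doc_charset)
            handler = meta_encoding;
        /* enc == doc_charset: nothing assigns handler, it stays NULL */
    }
    if (handler == NULL)
        handler = "HTML";           /* assumed fallback: numeric references */
    return handler;
}

/* One possible fix: also look up the handler when the encodings match. */
const char *pick_handler_fixed(const char *meta_encoding, enc_t doc_charset)
{
    const char *handler = NULL;

    if (meta_encoding != NULL) {
        enc_t enc = parse_encoding(meta_encoding);
        (void) enc;                 /* any mismatch handling is omitted here */
        (void) doc_charset;
        handler = meta_encoding;
    }
    if (handler == NULL)
        handler = "HTML";
    return handler;
}
```

In this model, asking for "UTF-8" on a UTF-8 document yields "HTML" from pick_handler_buggy and "UTF-8" from pick_handler_fixed, while a request for "GB18030" is unaffected in both.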

Comment 2 Daniel Veillard 2005-08-08 14:12:37 UTC

Okay, fixed in a similar way; this is in CVS,

 thanks,

Daniel

Comment 3 Daniel Veillard 2005-09-05 08:59:46 UTC

This should be closed by release of libxml2-2.6.21,

  thanks,

Daniel