Bug 159547 - escaping versus UTF8 in xmlNodeDump
Status: VERIFIED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2004-11-26 11:41 UTC by Petr Pajas
Modified: 2011-02-25 02:33 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Petr Pajas 2004-11-26 11:41:33 UTC
Calling xmlNodeDump on the root node of a UTF-8 encoded document, para0.xml:
<?xml version='1.0' encoding='utf-8'?>
<para>...some UTF8 characters here...</para>

behaves differently under libxml2 <= 2.6.8 and >= 2.6.15:

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.8 ./parseprint para0.xml
<para>ěščřžřýáíůúťňď</para>

$ LD_PRELOAD=/net/su/h/local2-rh8/lib/libxml2.so.2.6.15 ./parseprint para0.xml
<para>&#x11B;&#x161;&#x10D;&#x159;&#x17E;&#x159;&#xFD;&#xE1;&#xED;&#x16F;&#xFA;&#x165;&#x148;&#x10F;</para>

i.e. up to 2.6.8, non-ASCII characters are output as UTF-8; from 2.6.15 on,
non-ASCII characters are escaped as character references.
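
To make the difference between the two outputs concrete, here is a minimal sketch of the escaping step in plain C, with no libxml2 dependency. It is not libxml2's code: the helper names utf8_decode and escape_non_ascii are hypothetical, the decoder does not reject overlong or surrogate sequences, and markup characters such as < and & are left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is malformed.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Copy the NUL-terminated UTF-8 string `in` to `out`, replacing every
 * non-ASCII character with a hexadecimal character reference (&#x...;).
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the input is malformed or `out` is too small. */
static int escape_non_ascii(const char *in, char *out, size_t out_size)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    assert(in != NULL && out != NULL);

    while (*s != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);
        int w;

        if (n == 0 || used >= out_size)
            return -1;
        if (cp < 0x80)
            w = snprintf(out + used, out_size - used, "%c", (int)cp);
        else
            w = snprintf(out + used, out_size - used, "&#x%lX;", cp);
        if (w < 0 || (size_t)w >= out_size - used)
            return -1;
        used += (size_t)w;
        s += n;
    }
    if (used >= out_size)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

Passing the text content of para0.xml through escape_non_ascii should give character references in the same form as the 2.6.15 output above, while copying the bytes unchanged corresponds to the 2.6.8 output.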

parseprint.c is as follows:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/parserInternals.h>
#include <libxml/tree.h>
int
main(int argc, char **argv)
{
    xmlDoc *doc = NULL;
    xmlNode *root_element = NULL;
    const xmlChar *ret = NULL;
    xmlParserCtxtPtr ctxt;

    if (argc != 2) return(1);

    LIBXML_TEST_VERSION

    /* libxml2-2.4 API, so that we can link against older versions too */
    ctxt = xmlCreateFileParserCtxt(argv[1]);
    if (ctxt == NULL) return(1);
    xmlParseDocument(ctxt);
    doc = ctxt->myDoc;
    ctxt->myDoc = NULL;
    xmlFreeParserCtxt(ctxt);

    if (doc == NULL) {
        printf("error: could not parse file %s\n", argv[1]);
        return(1);
    }

    /*Get the root element node */
    root_element = xmlDocGetRootElement(doc);
    xmlBufferPtr buffer;
    buffer = xmlBufferCreate();
    xmlNodeDump( buffer,
                 doc,
                 root_element, 0, 0);

    /* only print when something was dumped, so ret is never NULL here */
    if ( xmlBufferLength(buffer) > 0 ) {
      ret = xmlBufferContent( buffer );
      printf("%s\n",ret);
    }
    xmlBufferFree(buffer);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
Comment 1 Daniel Veillard 2005-03-31 15:26:05 UTC
In practice character references are safer since they will work even
if the encoding is mislabelled for example in the HTTP headers (which
is the case most of the time). Defaulting to UTF-8 makes some sense,
but also carries some risks.
The real solution is to use APIs that define the encoding.
I tentatively reverted the behaviour in CVS to follow your suggestion
but this is a risky business and this may be changed again if this breaks
too many users.

Daniel
Comment 2 Daniel Veillard 2005-09-05 08:59:58 UTC
This should be closed by the release of libxml2-2.6.21,

  thanks,

Daniel
Comment 3 Rodrigo Kellermann Ferreira 2011-02-24 18:11:13 UTC
Hi,

I'm trying to move my servers from CentOS 4 (libxml2-2.6.16-12.8.i386.rpm) to CentOS 5 (libxml2-2.6.26-2.1.2.8.el5_5.1.i386.rpm), and I have problems with applications that use libxml2.

These problems are caused by the behavior change made to libxml2 in this bug.

In my opinion this is a very serious change and shouldn't have been made; it ended compatibility between versions.

It's too late to modify it for compatibility with the much older version 2.6.8.


Regards,

Rodrigo Kellermann Ferreira
Comment 4 Daniel Veillard 2011-02-25 02:33:35 UTC
Sorry, no, I won't backport any of this on RHEL, especially that late,
and since there is no RHEL bug against this behaviour.
Just grab a more recent source rpm and rebuild it locally, that's my answer

Daniel