Bug 389843 – SAX Parser: any entity in an attribute value converted to & when getEntity handler is provided



Status:	VERIFIED WONTFIX

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

Reported:	2006-12-26 21:32 UTC by Oleg.Kravtsov
Modified:	2007-06-17 17:22 UTC

Description Oleg.Kravtsov 2006-12-26 21:32:43 UTC

Please describe the problem:
I use the SAX parser and would like entities not to be replaced.
To prevent entity expansion, I add a getEntity callback to the
xmlSAXHandler structure that is passed to xmlSAXUserParseFile().

The my_get_entity() callback might look like the following:

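static xmlEntityPtr
my_get_entity(void *user_data, const xmlChar *name)
{
    /* Static storage: the parser reads the entity after we return. */
    static xmlEntity  ent;

    ent.etype = XML_INTERNAL_PREDEFINED_ENTITY;

    /* xmlChar is unsigned char, so cast explicitly for strcmp(). */
    if (strcmp((const char *)name, "lt") == 0)
    {
        ent.name = (const xmlChar *)"lt";
        ent.orig = (xmlChar *)"&lt;";
        ent.content = (xmlChar *)"&lt;";
    }
    else if (strcmp((const char *)name, "gt") == 0)
    {
        ent.name = (const xmlChar *)"gt";
        ent.orig = (xmlChar *)"&gt;";
        ent.content = (xmlChar *)"&gt;";
    }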
    else if (...)
    {
        ...
    }

    return &ent;
}

I do not know of any other way to prevent expansion; if there is one,
I would gladly use it instead.

This works fine while the SAX parser processes DATA parts, but it breaks
when the parser reaches entities in attribute values.



Steps to reproduce:
1. Modify getEntityDebug() function in testSAX.c file as follows:

static xmlEntityPtr
getEntityDebug(void *ctx ATTRIBUTE_UNUSED, const xmlChar *name)
{
    callbacks++;
    if (quiet)
        return(NULL);
    fprintf(stdout, "SAX.getEntity(%s)\n", name);
#if 0
    return(NULL);
#else
    {
    static xmlEntity  ent;

    ent.etype = XML_INTERNAL_PREDEFINED_ENTITY;

    if (strcmp(name, "lt") == 0)
    {
        ent.name = (const char *)"lt";
        ent.orig = (char *)"&lt;";
        ent.content = (char *)"&lt;";
    }
    else if (strcmp(name, "gt") == 0)
    {
        ent.name = (const char *)"gt";
        ent.orig = (char *)"&gt;";
        ent.content = (char *)"&gt;";
    }
    else if (strcmp(name, "amp") == 0)
    {
        ent.name = (const char *)"amp";
        ent.orig = (char *)"&amp;";
        ent.content = (char *)"&amp;";
    }
    else if (strcmp(name, "quot") == 0)
    {
        ent.name = (const char *)"quot";
        ent.orig = (char *)"&quot;";
        ent.content = (char *)"&quot;";
    }
    else if (strcmp(name, "apos") == 0)
    {
        ent.name = (const char *)"apos";
        ent.orig = (char *)"&apos;";
        ent.content = (char *)"&apos;";
    }
    else
        return NULL;

    return &ent;
    }
#endif
}

2. Rebuild the testSAX application.

3. Run testSAX with the following XML file:

<?xml version="1.0"?>
<test>
  <tag attr="&lt; &amp; &gt; &quot; &apos;"/>
</test>


Actual results:

4. The result will be:

SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(test)
SAX.characters(
  , 3)
SAX.getEntity(lt)
SAX.getEntity(amp)
SAX.getEntity(gt)
SAX.getEntity(quot)
SAX.getEntity(apos)
SAX.startElement(tag, attr='&#38; &#38; &#38; &#38; &#38;')
SAX.endElement(tag)
SAX.characters(
, 1)
SAX.endElement(test)
SAX.endDocument()

=========

As you can see, every entity was expanded to &#38;, which is the character
reference for the ampersand. I do not mind that the &amp; entity is converted
in such a crude way, but it is definitely incorrect to convert all entities
to ampersands!


Expected results:


Does this happen every time?


Other information:
Why it happens:
In parser.c, the function xmlParseAttValueComplex() contains the following lines:

...
            } else {
                ent = xmlParseEntityRef(ctxt);
                if ((ent != NULL) &&
                    (ent->etype == XML_INTERNAL_PREDEFINED_ENTITY)) {
                    if (len > buf_size - 10) {
                        growBuffer(buf);
                    }
                    if ((ctxt->replaceEntities == 0) &&
                        (ent->content[0] == '&')) {
                        buf[len++] = '&';
                        buf[len++] = '#';
                        buf[len++] = '3';
                        buf[len++] = '8';
                        buf[len++] = ';';
                    } else {
                        buf[len++] = ent->content[0];
                    }
                } else if ((ent != NULL) &&
...

Here we can see that when the condition
"ent->content[0] == '&'" is true, the parser outputs &#38;.
But this condition is true for every entity, because the first character of
the content when the entity is parsed is always '&' (I checked this with GDB).
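The effect of that branch can be reproduced outside the parser. The following is a minimal sketch, not libxml2 code: toy_entity and toy_append_predefined() are hypothetical names, and the sketch assumes that the built-in predefined entities carry the replacement character itself as content (for example "<" for lt), while the callback above returns content that starts with '&'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the xmlEntity fields relevant to the branch. */
typedef struct {
    const char *name;
    const char *content;
} toy_entity;

/*
 * Mirrors the quoted branch: append the serialized form of a predefined
 * entity to buf (capacity cap) and return the new length.
 */
static size_t toy_append_predefined(char *buf, size_t len, size_t cap,
                                    const toy_entity *ent,
                                    int replace_entities)
{
    assert(len + 6 <= cap);  /* room for "&#38;" plus the terminator */
    if (!replace_entities && ent->content[0] == '&') {
        /* Only the first content character is inspected. */
        memcpy(buf + len, "&#38;", 5);
        len += 5;
    } else {
        buf[len++] = ent->content[0];
    }
    buf[len] = '\0';
    return len;
}
```

Called with an entity whose content is "<", the sketch appends a single '<'. Called with content "&lt;", "&gt;" or "&quot;" and replace_entities set to 0, it appends "&#38;" every time, which matches the startElement line in the output above.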

Any comments will be appreciated, as I am stuck on entity conversion.
The behaviour I would regard as the right solution is to output
entities according to the information I provide in my callback; otherwise
the getEntity callback is of no real use for attributes.

Some more words to explain why this is a problem:
When I want to define some private entities and the way they are converted,
I would define a getEntity() callback and put those mappings there, as I
currently do with the standard ones.

I have just checked that this works for DATA regions, but not for attributes.

Comment 1 Daniel Veillard 2007-06-17 13:28:33 UTC

"I use SAX parser and would like any entities not to be replaced."

 That's just not possible. SAX was not designed with this in mind,
and that's why libxml2's SAX processing of attribute values is so strange
when you don't ask for entity substitution.

Sorry, there is no way to fix SAX. The only way to preserve entities in
attribute values is to NOT use SAX.

Daniel

Comment 2 Oleg.Kravtsov 2007-06-17 15:52:34 UTC

Well, I opened this half a year ago, and I really think it is a bug.
I spent some time investigating it and wrote a detailed report, but
of course it is up to you whether to fix it or not.

I would mark this as WONTFIX to emphasize that it is still a BUG, so that
people have a chance to look at the bug report; marking it as NOTABUG would
just hide my report.

Comment 3 Oleg.Kravtsov 2007-06-17 15:53:00 UTC

Changing state to WONTFIX...

Comment 4 Daniel Veillard 2007-06-17 17:22:04 UTC

&amp; is one of the 5 predefined entities; they can't be overridden.
They are quite specific: neither SAX nor any other parser interface ever
expects to see them referenced in the data flowing back.

Suppose you define an entity "name" (in the internal subset to make
things simpler) with a value of "veillard".
Now tell me how you would expect SAX to report the entities in the attribute
values in the start-element event for doc:

<!DOCTYPE doc [
<!ENTITY name "veillard">
]>
<doc a1="&name;" a2="&amp;name;"/>
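One way to read the question is sketched below, assuming the two attributes are meant to reference the declared entity as "&name;" and "&amp;name;". This is a toy single-pass expander, not libxml2 code; toy_lookup() and toy_expand() are hypothetical names, and the entity table holds only the two entries needed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entity table: <!ENTITY name "veillard"> plus &amp;. */
static const char *toy_lookup(const char *name, size_t n)
{
    if (n == 4 && memcmp(name, "name", 4) == 0) return "veillard";
    if (n == 3 && memcmp(name, "amp", 3) == 0) return "&";
    return NULL;
}

/* Expand each known &ref; in src exactly once into dst (capacity cap). */
static void toy_expand(const char *src, char *dst, size_t cap)
{
    size_t len = 0;
    while (*src != '\0') {
        const char *semi = (*src == '&') ? strchr(src, ';') : NULL;
        const char *val = semi
            ? toy_lookup(src + 1, (size_t)(semi - src - 1)) : NULL;
        if (val != NULL) {
            size_t n = strlen(val);
            assert(len + n < cap);
            memcpy(dst + len, val, n);   /* replacement is not rescanned */
            len += n;
            src = semi + 1;
        } else {
            assert(len + 1 < cap);
            dst[len++] = *src++;         /* ordinary character */
        }
    }
    dst[len] = '\0';
}
```

With substitution, a1 becomes "veillard" and a2 becomes the literal text "&name;". If SAX instead handed a1 back unexpanded, its flat string would also read "&name;", so a handler receiving only strings could not tell the preserved reference in a1 from the literal text in a2.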

Basically, SAX was not designed to preserve entities,
especially in attribute values. When you run libxml2 with your own SAX
handlers you should ask for entity substitution; otherwise you hit
libxml2's internal behaviour, which tries to work around the fact that:
   - SAX was not designed to allow keeping entities in attribute
     values
   - libxml2 being an editor toolkit, I wanted to preserve entity references.

What you're complaining about is a mode of operation I do not expect people
to use. It is not a bug; SAX is simply not designed for what you want.
If you want me to acknowledge a bug in SAX mode (there certainly are some),
please report on the processing when entities are being substituted.
SAX and entity processing is a mess; that's one of the main reasons
I tell people not to use SAX unless they need to shave microseconds, and
ask them to use the reader instead.

Daniel