GNOME Bugzilla – Bug 389843
SAX Parser: any entity in an attribute value converted to &#38; when getEntity handler is provided
Last modified: 2007-06-17 17:22:04 UTC
Please describe the problem:
I use the SAX parser and would like entities not to be replaced. To prevent entity expansion I add a getEntity callback to the xmlSAXHandler data structure, which is passed as a parameter to xmlSAXUserParseFile(). The my_get_entity() function might look like the following:

    static xmlEntityPtr my_get_entity(void *user_data, const xmlChar *name)
    {
        static xmlEntity ent;
        ent.etype = XML_INTERNAL_PREDEFINED_ENTITY;
        if (strcmp(name, "lt") == 0) {
            ent.name = "lt";
            ent.orig = (char *)"&lt;";
            ent.content = (char *)"&lt;";
        } else if (strcmp(name, "gt") == 0) {
            ent.name = "gt";
            ent.orig = (char *)"&gt;";
            ent.content = (char *)"&gt;";
        } else if (...) {
            ...
        }
        return &ent;
    }

I do not know of any other way to prevent expansion; if I did, I would use it. This works fine while the SAX parser processes DATA parts, but it works badly when it parses entities in attribute values.

Steps to reproduce:
1. Modify the getEntityDebug() function in testSAX.c as follows:

    static xmlEntityPtr getEntityDebug(void *ctx ATTRIBUTE_UNUSED, const xmlChar *name)
    {
        callbacks++;
        if (quiet)
            return(NULL);
        fprintf(stdout, "SAX.getEntity(%s)\n", name);
    #if 0
        return(NULL);
    #else
        {
            static xmlEntity ent;
            ent.etype = XML_INTERNAL_PREDEFINED_ENTITY;
            if (strcmp(name, "lt") == 0) {
                ent.name = (const char *)"lt";
                ent.orig = (char *)"&lt;";
                ent.content = (char *)"&lt;";
            } else if (strcmp(name, "gt") == 0) {
                ent.name = (const char *)"gt";
                ent.orig = (char *)"&gt;";
                ent.content = (char *)"&gt;";
            } else if (strcmp(name, "amp") == 0) {
                ent.name = (const char *)"amp";
                ent.orig = (char *)"&amp;";
                ent.content = (char *)"&amp;";
            } else if (strcmp(name, "quot") == 0) {
                ent.name = (const char *)"quot";
                ent.orig = (char *)"&quot;";
                ent.content = (char *)"&quot;";
            } else if (strcmp(name, "apos") == 0) {
                ent.name = (const char *)"apos";
                ent.orig = (char *)"&apos;";
                ent.content = (char *)"&apos;";
            } else
                return NULL;
            return &ent;
        }
    #endif
    }

2. Rebuild the testSAX application.
3. Run testSAX with the following XML file:

    <?xml version="1.0"?>
    <test>
    <tag attr="&lt; &amp; &gt; &quot; &apos;"/>
    </test>

Actual results:
4. The result will be:

    SAX.setDocumentLocator()
    SAX.startDocument()
    SAX.startElement(test)
    SAX.characters( , 3)
    SAX.getEntity(lt)
    SAX.getEntity(amp)
    SAX.getEntity(gt)
    SAX.getEntity(quot)
    SAX.getEntity(apos)
    SAX.startElement(tag, attr='&#38; &#38; &#38; &#38; &#38;')
    SAX.endElement(tag)
    SAX.characters( , 1)
    SAX.endElement(test)
    SAX.endDocument()
    =========

So you see that all the entities were expanded as &#38;, which is the character reference for the ampersand. I do not mind that the &amp; entity was converted in such a dirty way, but it is definitely incorrect to convert all the entities to ampersands!

Expected results:

Does this happen every time?

Other information:
Why it happens: in parser.c, the function xmlParseAttValueComplex() contains the following lines:

    ...
    } else {
        ent = xmlParseEntityRef(ctxt);
        if ((ent != NULL) &&
            (ent->etype == XML_INTERNAL_PREDEFINED_ENTITY)) {
            if (len > buf_size - 10) {
                growBuffer(buf);
            }
            if ((ctxt->replaceEntities == 0) &&
                (ent->content[0] == '&')) {
                buf[len++] = '&';
                buf[len++] = '#';
                buf[len++] = '3';
                buf[len++] = '8';
                buf[len++] = ';';
            } else {
                buf[len++] = ent->content[0];
            }
        } else if ((ent != NULL) && ...

Here we can see that whenever the condition "ent->content[0] == '&'" is true, we output &#38;. But for any entity this is always true, because the first character of content when we parse an entity is '&' (I checked this with GDB).

Any comments will be appreciated, as I am stuck on converting entities. The best behaviour, which I would regard as the right solution, is to output entities according to the information I provide in my callback; otherwise the getEntity callback is just hype when it is used for attributes.
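To make the branch quoted from xmlParseAttValueComplex() easier to follow, here is a minimal stand-alone sketch of the same decision. It does not use libxml2 at all: append_predefined() and its parameters are hypothetical names invented for this illustration, and the function only mirrors the test on content[0] shown in the excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the predefined-entity branch quoted in the
 * report. It is NOT libxml2 code; it only reproduces the decision on
 * content[0]. The caller must provide room for at least 6 more bytes.
 */
static size_t append_predefined(char *buf, size_t len,
                                const char *content, int replace_entities)
{
    assert(buf != NULL && content != NULL && content[0] != '\0');

    if (replace_entities == 0 && content[0] == '&') {
        /* Any content starting with '&' is emitted as "&#38;". */
        memcpy(buf + len, "&#38;", 5);
        len += 5;
    } else {
        /* Otherwise only the first byte of content is copied. */
        buf[len++] = content[0];
    }
    buf[len] = '\0';
    return len;
}
```

Called with content set to "&lt;", "&gt;" or "&quot;" and replace_entities set to 0, this sketch produces "&#38;" every time, which matches the attr value in the output reported above. With content set to the single character "<" it copies that character instead.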
Some more words to explain why this is a problem: when I want to define some private entities and the way they are converted, I would define a getEntity() callback and put those mappings there, as I currently do with the standard ones. I've just checked that this works for DATA regions, but not for attributes.
"I use SAX parser and would like any entities not to be replaced." That's just not possible. SAX was not designed with this in mind, and that's why libxml2's SAX processing of attribute values is so strange when you don't ask for entity substitution. Sorry, there is no way to fix SAX, and the only way to preserve entities in attribute values is to NOT use SAX. Daniel
Well, I opened it half a year ago, and I really think this is a bug. I spent some time investigating it and gave a detailed report, but of course it is up to you whether to fix it or not. I would mark this as WONTFIX to emphasize that it is still a BUG and to give people a chance to look at the bug report, because being marked as NOTABUG will just hide my report.
Changing state to WONTFIX...
&amp; is one of the five predefined entities. They can't be overridden and they are quite specific: neither SAX nor any other parser interface ever expects to see them referenced in the flow of data handed back.

Suppose you define an entity "name" (in the internal subset, to make things simpler) with a value of "veillard". Now tell me how you would expect SAX to report entities in the attribute values at the element start for doc:

    <!DOCTYPE doc [ <!ENTITY name "veillard"> ]>
    <doc a1="&veillard;" a2="&veillard;"/>

Basically, SAX was not designed to preserve entities, especially in attribute values. When you run libxml2 with your own SAX handlers you should ask for entity substitution; otherwise you hit libxml2's internal behaviour, which tries to work around the fact that:
- SAX was not designed to allow keeping entities in attribute values
- libxml2 being an editor toolkit, I wanted to preserve entity references.

What you're complaining about is a mode of operation I do not expect people to use. It is not a bug; SAX is simply not designed for what you want. If you want me to acknowledge a bug in SAX mode (there certainly are some), please report on the processing when entities are being substituted.

SAX and entity processing is a mess. That's one of the main reasons I tell people not to use SAX unless they need to shave microseconds, and ask them to use the reader instead. Daniel
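One way to read the point about attribute values: SAX hands each value to the handler as a single flat string, so once references are substituted, a value written with a reference and a value written literally look the same. The toy expander below sketches that under stated assumptions. It is not libxml2 code; lookup_entity() and expand_attr() are hypothetical helpers, the one-entry table reuses the "name" entity from the comment above, and the inputs "&name;" and "veillard" are chosen for this illustration rather than taken from the example document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-entry entity table: &name; maps to "veillard". */
static const char *lookup_entity(const char *name, size_t n)
{
    if (n == 4 && strncmp(name, "name", 4) == 0)
        return "veillard";
    return NULL;
}

/*
 * Expand known &xxx; references from src into dst (capacity cap).
 * Unknown references are copied through unchanged.
 * Returns 0 on success, -1 if dst is too small.
 */
static int expand_attr(const char *src, char *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL && cap > 0);
    while (*src != '\0') {
        const char *semi = (*src == '&') ? strchr(src, ';') : NULL;
        const char *val = (semi != NULL)
            ? lookup_entity(src + 1, (size_t)(semi - src - 1))
            : NULL;

        if (val != NULL) {
            size_t n = strlen(val);
            if (out + n >= cap)
                return -1;
            memcpy(dst + out, val, n);   /* substituted text */
            out += n;
            src = semi + 1;              /* skip past the ';' */
        } else {
            if (out + 1 >= cap)
                return -1;
            dst[out++] = *src++;         /* ordinary character */
        }
    }
    dst[out] = '\0';
    return 0;
}
```

Expanding "&name;" and expanding "veillard" both yield the string "veillard", so a handler that only receives the expanded value has no means of telling which form the document used.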