GNOME Bugzilla – Bug 541529
xsl:output/@encoding may produce character references in element and attribute names
Last modified: 2021-07-05 13:24:18 UTC
Please describe the problem:
Using xsl:output/@encoding, we can control the output encoding. Characters in text nodes that are not available in the selected encoding are converted to character references. Characters in element or attribute names that are not available in the selected encoding, however, must result in run-time errors. LibXSLT does not report such errors. Instead, it writes these element and attribute names using character references, which are illegal in XML element and attribute names.

Steps to reproduce:
Run an identity transform on an XML document whose element and attribute names contain characters that are not available in the specified output encoding.

mludwig@forelle:~/Werkstatt/xsl > cat Uebelkeit.xml
<Urmel>
  <Vorspeise>Süßkirschen mit Käsesoße</Vorspeise>
  <Übelkeit möglicherweise="beträchtlich"/>
</Urmel>

mludwig@forelle:~/Werkstatt/xsl > cat Uebelkeit-output-encoding.xsl
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="US-ASCII"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>

mludwig@forelle:~/Werkstatt/xsl > xsltproc Uebelkeit-output-encoding.xsl Uebelkeit.xml

Actual results:
The output is not well-formed XML: it contains character references in element and attribute names.

<?xml version="1.0" encoding="US-ASCII"?>
<Urmel>
  <Vorspeise>S&#252;&#223;kirschen mit K&#228;seso&#223;e</Vorspeise>
  <&#220;belkeit m&#246;glicherweise="betr&#228;chtlich"/>
</Urmel>

Expected results:
A run-time error should be reported. For example, the processor Saxon (version 9.0.0.4, Java) says:

SERE0008: Element name contains a character (decimal + 220) not available in the selected encoding
Transformation failed: Run-time errors were reported

Does this happen every time?
Yes.

Other information:
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200807/msg00057.html
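As a possible caller-side workaround, element and attribute names can be checked against the target encoding before the transform is run. The following is a minimal sketch in Python using only the standard library. It is not part of libxslt or libxml2, and the function name unencodable_names is hypothetical. It assumes the document fits in memory and that testing each name with str.encode is an acceptable stand-in for the serializer's own encoder.

```python
import xml.etree.ElementTree as ET


def unencodable_names(xml_text, encoding):
    """Return element and attribute names that cannot be written in `encoding`.

    Hypothetical helper for illustration only; it does not call libxml2.
    """
    bad = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Check the element name first, then each attribute name on it.
        for name in [elem.tag, *elem.attrib]:
            try:
                name.encode(encoding)
            except UnicodeEncodeError:
                bad.append(name)
    return bad


# The sample document from the report above.
doc = (
    "<Urmel>"
    "<Vorspeise>Süßkirschen mit Käsesoße</Vorspeise>"
    '<Übelkeit möglicherweise="beträchtlich"/>'
    "</Urmel>"
)

print(unencodable_names(doc, "us-ascii"))
```

An empty list means every name is representable, so character references can only appear in text and attribute values, where they are legal.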
Not a libxslt bug, a libxml2 one. Or rather an efficiency trade-off, as explained on-list:
-----------------------------------------------
You ask for something impossible. You get a non-XML document instead of getting an immediate failure. It's a trade-off, unrelated to libxslt; it's actually in libxml2.

The transcoding is done on a preserialized UTF-8 document (or document fragment). Detecting the error would mean that each time a character is not serializable in the target encoding, when issuing the escaped sequence, the encoder would have to do a rewind lookup and try to guess (it's guessing because at that point you're manipulating strings; there is no notion of document structure) whether you're within markup or within content. Basically it makes everybody pay a rather high cost for the few who asked for something impossible.

The current state has been there since the beginning of libxml2 (nearly a decade), so the bug is extremely uncommon. This makes me even less comfortable with the expansion of the cost.

Again, it's a trade-off, a conscious one. For more information see libxml2 encoding.c around line 2057; that's where the escaping is done.

If you see another way to handle this without heavily penalizing the normal process, I'm all for fixing this. But right now I don't see a solution.
-----------------------------------------------
Daniel
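To illustrate the point about string-level transcoding, here is a simplified sketch in Python. It is not the code in libxml2's encoding.c, and the function name transcode_with_charrefs is hypothetical. It only models the idea described above: the encoder sees a flat, already serialized string and replaces each unencodable character with a decimal character reference, with no knowledge of whether that character sits in a name or in content.

```python
def transcode_with_charrefs(serialized, encoding):
    """Encode an already serialized string, escaping what does not fit.

    Illustration only: the input is treated as plain characters, so markup
    and content are handled identically.
    """
    out = bytearray()
    for ch in serialized:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Fall back to a decimal character reference, wherever we are.
            out += b"&#%d;" % ord(ch)
    return bytes(out)


serialized = '<Übelkeit möglicherweise="beträchtlich"/>'
print(transcode_with_charrefs(serialized, "us-ascii").decode("ascii"))
```

In this sketch, telling names from content would require rescanning the text already emitted each time a fallback fires, which is the extra cost the comment above refers to. Python's built-in "xmlcharrefreplace" error handler behaves the same way for the same reason.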
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.