GNOME Bugzilla – Bug 362552
html entities in attribute values get corrupted
Last modified: 2006-10-17 15:56:41 UTC
Please describe the problem:
Some entities in attribute values get corrupted, for example &scaron;, &oelig; and &Yuml;. In normal text nodes everything is OK.

Steps to reproduce:
1. Let xmllint parse the document below in HTML mode

Actual results:
Entities in the "value" attribute get corrupted.

Expected results:

Does this happen every time?
Yes

Other information:
<html>
<body>
scaron: &scaron;, nbsp: &nbsp;, auml: &auml;, oelig: &oelig;, Yuml: &Yuml;, yuml: &yuml;, rarr: &rarr;
<input type="text" name="hae" value="scaron: &scaron; .... nbsp: &nbsp; auml: &auml; oelig: &oelig; Yuml: &Yuml; yuml: &yuml;"/>
</body>
</html>
Created attachment 74801 [details] html file that triggers the error
They are not corrupted: they are output as the UTF-8 encoding of their code point, so the value and content are exact. The output just doesn't have the form you expect, and in general that can't be guaranteed. Not a bug; at best a request for enhancement.

Daniel
I get:

<input type="text" name="hae" value="scaron: a .... nbsp: auml: ä oelig: S Yuml: x yuml: ÿ">

and

xmllint --debug --html ./x.html
HTML DOCUMENT
URL=./x.html
standalone=true
DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
ELEMENT html
  ELEMENT body
    TEXT
      content= scaron: #C5#A1, nbsp: #C2#A0, auml: #C3#A4, oelig:...
    ELEMENT input
      ATTRIBUTE type
        TEXT
          content=text
      ATTRIBUTE name
        TEXT
          content=hae
      ATTRIBUTE value
        TEXT
          content=scaron: a .... nbsp: #C2#A0 auml: #C3#A4 oelig:...

Some entities (with code > 255) are broken.
Can you be more specific: which entity? Broken how? Remember that:
1/ copy and paste of terminal output means *nothing*: it depends on what encoding the terminal expects its output in and how it handles anything different;
2/ --debug dumps the internal form, i.e. UTF-8, so one character is encoded with 2 bytes, sometimes 3 or 4, depending on the code point.

Daniel
Created attachment 74852 [details] simple testcase
When parsing this file, the "value" attribute of the "input" element gets corrupted.

&scaron; => a

<input type="text" name="test" value="&scaron;">
becomes:
<input type="text" name="test" value="a">

This seems to happen with all entities with a code above 255.
Created attachment 74853 [details] Output of /usr/bin/xmllint --html --output result.html ./simple.html
Okay, with the simple test case it was relatively easy to find and fix the problem:

paphio:~/XML -> xmllint --html ../74852.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<form>
<input type="text" name="test" value="š">
</form>
</body></html>

It was just a 'historical' cast to xmlChar truncating the character in the attribute value :-\
Thanks for the report, this should be fixed in CVS now! I also added the test to the regression suite.

Daniel