GNOME Bugzilla – Bug 682474
Wrong encoded character switches the encoding
Last modified: 2021-07-05 13:23:44 UTC
A single wrongly encoded character causes the parser to switch the encoding from UTF-8 to ISO-8859-1. After that, all correctly encoded UTF-8 non-ASCII characters are replaced with wrong characters. For example:

<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> </head> <body> <p>ö xE4 ö</p> </body> </html>

where xE4 is 'ä' encoded in ISO-8859-1, is read as:

<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> </head> <body> <p>ö xE4 ö</p> </body> </html>

(Reproduced with version 2.8.0.)
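To make the reported symptom concrete, here is a minimal Python sketch that mimics the described behaviour. It is not libxml2 code: the helper name `decode_with_latin1_fallback` is hypothetical, and the exact point at which the real parser switches encodings is an assumption (here, the first byte that fails UTF-8 decoding).

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Decode as UTF-8; on the first bad byte, decode the rest as ISO-8859-1.

    Hypothetical helper that only imitates the behaviour described in this
    report. It is not how libxml2 is implemented.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Everything before the offending byte was valid UTF-8.
        head = data[:err.start].decode("utf-8")
        # Assumption: from the bad byte onward the input is treated as Latin-1.
        tail = data[err.start:].decode("iso-8859-1")
        return head + tail


# UTF-8 'ö', then a stray ISO-8859-1 'ä' (0xE4), then UTF-8 'ö' again.
body = "ö ".encode("utf-8") + b"\xe4" + " ö".encode("utf-8")
result = decode_with_latin1_fallback(body)
print(result)  # ö ä Ã¶
```

The first 'ö' survives, the stray byte is shown as 'ä', and every later UTF-8 character turns into a two-character Latin-1 sequence.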
Created attachment 222155: Preliminary bug fix. Of course, the behavior (switching to ISO-8859-1) is a feature, not a bug. Maybe the HTML parser should get an option to disable it, but in the patch I simply removed the switch to ISO-8859-1.
That's a bit too brutal as a fix. Admittedly, switching to ISO-8859-1 is also a rather brutal behaviour, and we probably ought to do something more intelligent. One way to converge on a solution might be to look at the behaviour suggested for HTML5 (I assume they have described this kind of corner case) and then mimic that in the libxml2 HTML parser. That sounds like the best way forward. What do you think? Daniel
Review of attachment 222155: That's a bit brutal; let's see if there isn't a better way.
(In reply to comment #2) If I understand it correctly, HTML5 requires that each byte that cannot be decoded as UTF-8 be replaced with U+FFFD REPLACEMENT CHARACTER:

2.4 UTF-8
. . .
One byte in the range 80 to BF not preceded by a byte in the range 80 to FD

One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte

One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence

Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.

Igor
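For comparison, the replacement behaviour quoted above can be approximated in Python with the standard `errors="replace"` handler of `bytes.decode`. This is only a sketch of the idea, not a proposed libxml2 patch, and Python's handling of longer truncated sequences may not match the quoted draft byte for byte. For a single stray byte, as in this report, it substitutes exactly one U+FFFD and keeps decoding as UTF-8.

```python
# Same input as in the report: UTF-8 'ö', a stray 0xE4 byte, UTF-8 'ö'.
body = "ö ".encode("utf-8") + b"\xe4" + " ö".encode("utf-8")

# Replace undecodable bytes with U+FFFD instead of switching the encoding.
decoded = body.decode("utf-8", errors="replace")

print(decoded)                  # ö \ufffd ö (middle character is U+FFFD)
print(decoded.count("\ufffd"))  # 1
```

With this approach the correctly encoded 'ö' after the bad byte is preserved, which is the outcome the reporter is asking for.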
Yeah, we may also face transcoding errors, but at that point we should be able to assume the flow is UTF-8, since that's what the parser actually consumes. This won't make 2.9.0, which is imminent, but it would be something to add in one of the following releases. Thanks! Daniel
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.