Bug 682474 - Wrong encoded character switches the encoding
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Windows
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2012-08-22 13:53 UTC by Igor Ignatyuk
Modified: 2021-07-05 13:23 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Preliminary bug fix (332 bytes, patch)
2012-08-22 14:14 UTC, Igor Ignatyuk

Description Igor Ignatyuk 2012-08-22 13:53:07 UTC
A single wrongly encoded character causes the encoding to switch from UTF-8 to ISO-8859-1. After that, all correctly encoded UTF-8 non-ASCII characters are replaced with wrong characters, for example:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <p>ö xE4 ö</p>
  </body>
</html>

where xE4 is 'ä' encoded in ISO-8859-1,

is read as:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <p>ö ä Ã¶</p>
  </body>
</html>

(Reproduced with version 2.8.0.)
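
The effect described above can be imitated outside libxml2. The following is only a minimal sketch in Python, not the parser's actual code: the helper name `decode_with_latin1_fallback` is hypothetical, and it assumes the switch takes effect exactly at the first byte that is not valid UTF-8.

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Decode as UTF-8; at the first invalid byte, treat the rest as ISO-8859-1.

    Hypothetical helper that only imitates the reported behaviour.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Everything before the offending byte was valid UTF-8.
        good = data[:err.start].decode("utf-8")
        # Everything from the offending byte onwards is read as ISO-8859-1,
        # so later multi-byte UTF-8 sequences turn into two wrong characters.
        rest = data[err.start:].decode("iso-8859-1")
        return good + rest


# The paragraph from the report: UTF-8 'ö', a stray 0xE4 byte, UTF-8 'ö' again.
body = "<p>ö ".encode("utf-8") + b"\xe4" + " ö</p>".encode("utf-8")
result = decode_with_latin1_fallback(body)
print(result)  # the second 'ö' comes out as the two characters 'Ã¶'
```

The first 'ö' survives because it was decoded before the switch; only characters after the bad byte are damaged.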
Comment 1 Igor Ignatyuk 2012-08-22 14:14:44 UTC
Created attachment 222155
Preliminary bug fix

Of course, the behavior (switching to ISO-8859-1) is a feature and not a bug; maybe the HTML parser should get an option to disable it, but in the patch I simply removed the switch to ISO-8859-1.
Comment 2 Daniel Veillard 2012-09-07 13:30:37 UTC
That's a bit too brutal as a fix; admittedly, switching to ISO-8859-1
is also rather brutal behaviour, and we probably ought to do something
more intelligent.
One way to converge on a solution might be to look at the behaviour suggested
for HTML5. I assume they have described this kind of corner case, and we could
then mimic that in the libxml2 HTML parser. That sounds like the best way forward.

  What do you think?

Daniel
Comment 3 Daniel Veillard 2012-09-07 13:31:26 UTC
Review of attachment 222155:

That's a bit brutal; let's see if there isn't a better way.
Comment 4 Igor Ignatyuk 2012-09-10 07:23:57 UTC
(In reply to comment #2)

If I understand it correctly, HTML5 requires that each byte that cannot be decoded as UTF-8 be replaced with U+FFFD REPLACEMENT CHARACTER:

2.4 UTF-8
. . .
One byte in the range 80 to BF not preceded by a byte in the range 80 to FD
One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte
One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence
    Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.
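
To illustrate the replacement rule quoted above on the example from the description, here is a small Python sketch. It relies on Python's built-in `errors="replace"` handler, which is an assumption standing in for the HTML5 algorithm: it gives the expected result for this input, but it is not claimed to match the specification for every malformed sequence. The function name `decode_replacing_errors` is hypothetical.

```python
def decode_replacing_errors(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that cannot be decoded.

    Hypothetical helper; the encoding never switches, so valid UTF-8
    after the bad byte is still decoded correctly.
    """
    return data.decode("utf-8", errors="replace")


# UTF-8 'ö' (0xC3 0xB6), a stray ISO-8859-1 'ä' (0xE4), UTF-8 'ö' again.
body = b"<p>\xc3\xb6 \xe4 \xc3\xb6</p>"
decoded = decode_replacing_errors(body)
print(decoded)  # the stray byte becomes U+FFFD, both 'ö' stay intact
```

With this approach only the single wrong byte is lost, instead of every non-ASCII character that follows it.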

Igor
Comment 5 Daniel Veillard 2012-09-10 15:00:26 UTC
Yeah, we may also face transcoding errors, but at that point
we should be able to assume the flow is UTF-8, since that's what the
parser actually consumes. Not for 2.9.0, which is imminent, but
that would be something to add in one of the following releases.

  Thanks!

Daniel
Comment 6 GNOME Infrastructure Team 2021-07-05 13:23:44 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.