Bug 682474 – Wrong encoded character switches the encoding


Summary:	Wrong encoded character switches the encoding


Status:	RESOLVED OBSOLETE

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-08-22 13:53 UTC by Igor Ignatyuk
Modified:	2021-07-05 13:23 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Preliminary bug fix (332 bytes, patch), 2012-08-22 14:14 UTC, Igor Ignatyuk

Description Igor Ignatyuk 2012-08-22 13:53:07 UTC

A single wrongly encoded character causes the HTML parser to switch the encoding from UTF-8 to ISO-8859-1. After that, all correctly encoded UTF-8 non-ASCII characters are replaced with wrong characters, for example:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <p>ö xE4 ö</p>
  </body>
</html>

where xE4 is the single byte 0xE4, i.e. 'ä' encoded in ISO-8859-1,

is read as:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  </head>
  <body>
    <p>ö xE4 Ã¶</p>
  </body>
</html>

(Reproduced with version 2.8.0.)
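
The effect described above can be modelled outside libxml2. The following is a minimal sketch, assuming the parser decodes the input as UTF-8 up to the first invalid byte and treats everything from that byte onward as ISO-8859-1; the function name `decode_with_latin1_fallback` is hypothetical, and the code illustrates the reported symptom only, not libxml2's actual implementation.

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Hypothetical model of the reported behaviour; not libxml2 code."""
    try:
        # Happy path: the whole input is valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that is not valid UTF-8.
        head = data[:err.start].decode("utf-8")
        # Assumption: from the bad byte onward, every byte is read as ISO-8859-1.
        tail = data[err.start:].decode("iso-8859-1")
        return head + tail


# "ö" in UTF-8, one stray ISO-8859-1 byte 0xE4 ('ä'), then "ö" in UTF-8 again.
sample = "ö ".encode("utf-8") + b"\xe4" + " ö".encode("utf-8")
result = decode_with_latin1_fallback(sample)
# result == "ö ä Ã¶": the second, correctly encoded "ö" is turned into "Ã¶".
```

The second "ö" is the two bytes 0xC3 0xB6, which ISO-8859-1 maps to "Ã" and "¶", matching the output shown in the report.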

Comment 1 Igor Ignatyuk 2012-08-22 14:14:44 UTC

Created attachment 222155
Preliminary bug fix

Of course the behavior (switching to ISO-8859-1) is a feature, not a bug. Maybe the HTML parser should get an option to disable it, but in the patch I simply removed the switch to ISO-8859-1.

Comment 2 Daniel Veillard 2012-09-07 13:30:37 UTC

That's a bit too brutal as a fix. Admittedly, switching to ISO-8859-1
is also a rather brutal behaviour, and we probably ought to do something
more intelligent.
One way to converge on a solution might be to look at the behaviour suggested
for HTML-5. I assume they have described this kind of corner case, and we could
then mimic that in the libxml2 HTML parser. That sounds like the best way forward.

  What do you think?

Daniel

Comment 3 Daniel Veillard 2012-09-07 13:31:26 UTC

Review of attachment 222155:

That's a bit brutal; let's see if there is a better way.

Comment 4 Igor Ignatyuk 2012-09-10 07:23:57 UTC

(In reply to comment #2)

If I understand it correctly, HTML5 requires that each byte that cannot be decoded as UTF-8 be replaced with U+FFFD REPLACEMENT CHARACTER:

2.4 UTF-8
. . .
One byte in the range 80 to BF not preceded by a byte in the range 80 to FD
One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte
One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence
    Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.

Igor
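
As a rough illustration of the replacement behaviour quoted above, Python's built-in UTF-8 decoder with `errors="replace"` substitutes U+FFFD for undecodable bytes and then continues decoding as UTF-8. This is only a sketch of the idea: the helper name `decode_replacing_invalid` is hypothetical, and Python's handler is not guaranteed to emit exactly the same number of replacement characters as the HTML5 rules for every malformed sequence.

```python
def decode_replacing_invalid(data: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD (sketch)."""
    # errors="replace" keeps the decoder in UTF-8 mode after a bad byte,
    # instead of reinterpreting the rest of the input in another encoding.
    return data.decode("utf-8", errors="replace")


# Same input as in the report: valid "ö", stray byte 0xE4, valid "ö".
sample = "ö ".encode("utf-8") + b"\xe4" + " ö".encode("utf-8")
result = decode_replacing_invalid(sample)
# result == "ö \ufffd ö": only the stray byte is replaced, and the
# correctly encoded "ö" that follows it survives unchanged.
```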

Comment 5 Daniel Veillard 2012-09-10 15:00:26 UTC

Yeah, we may also face transcoding errors, but at that point
we should be able to assume the stream is UTF-8, since that's what the
parser actually consumes. This is not for 2.9.0, which is imminent, but
it would be something to add in one of the following releases.

  thanks !

Daniel

Comment 6 GNOME Infrastructure Team 2021-07-05 13:23:44 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.