Bug 534230 - HTMLparser truncates encoded document
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: htmlparser
Version: 2.6.x
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2008-05-21 18:06 UTC by Marius Konitzer
Modified: 2021-07-05 13:26 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
HTML file that causes the problem (4.40 KB, text/html)
2008-05-21 18:09 UTC, Marius Konitzer

Description Marius Konitzer 2008-05-21 18:06:39 UTC
Please describe the problem:
Parsing the attached (valid) HTML file with libxml2's HTMLparser produces a truncated document (try "xmllint --html utf8.html"). The same bug occurs with the corresponding iso-8859-* encoded document, so it isn't limited to a particular encoding. Looking at HTMLparser.c and friends didn't really help, but I tried to narrow it down a bit.

The problem seems to be that, when the first non-ASCII character is read, the following fallback (HTMLparser.c:386-391)

      /*
       * Humm this is bad, do an automatic flow conversion
       */
      xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_8859_1);
      ctxt->charset = XML_CHAR_ENCODING_UTF8;
      return(xmlCurrentChar(ctxt, len));

is executed _before_ the <meta> tag is parsed. The encoding check for the <meta> tag (HTMLparser.c:3404-3405)

      if ((http) && (content != NULL))
          htmlCheckEncoding(ctxt, content);

is only executed some characters later.

Accordingly each of the following makes the problem vanish:
* commenting out lines 3404-3405 of HTMLparser.c
* commenting out lines 389-390 of HTMLparser.c
* swapping lines 4 <-> 5 of utf8.html (which is effectively the same as the above)
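
The ordering condition described above can be checked on a file without involving libxml2 at all. The following is only a minimal sketch, assuming that a case-insensitive search for "charset=" is a good enough stand-in for the <meta> declaration. All function names are hypothetical and none of this is libxml2 code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first byte >= 0x80, or len if the buffer is pure ASCII. */
static size_t first_non_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return i;
    return len;
}

/* Offset of the first case-insensitive "charset=", or len if absent.
 * Assumption: this substring approximates the <meta> declaration. */
static size_t first_charset_decl(const unsigned char *buf, size_t len)
{
    static const char needle[] = "charset=";
    const size_t n = sizeof(needle) - 1;
    size_t i, j;

    for (i = 0; i + n <= len; i++) {
        for (j = 0; j < n; j++)
            if (tolower(buf[i + j]) != needle[j])
                break;
        if (j == n)
            return i;
    }
    return len;
}

/* Returns 1 if a non-ASCII byte occurs before any charset declaration
 * (including the case where there is no declaration at all), else 0. */
static int non_ascii_precedes_charset(const unsigned char *buf, size_t len)
{
    size_t na;

    assert(buf != NULL || len == 0);
    na = first_non_ascii(buf, len);
    return na < len && na < first_charset_decl(buf, len);
}

/* Convenience wrapper for NUL-terminated input. */
static int non_ascii_precedes_charset_str(const char *html)
{
    return non_ascii_precedes_charset((const unsigned char *)html,
                                      strlen(html));
}
```

If the analysis above holds, running this on the contents of utf8.html should return 1 for the original file and 0 after swapping lines 4 and 5.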

I don't know if this is the right track. Any hints or ideas?

Steps to reproduce:
xmllint --html utf8.html


Actual results:
xmllint prints HTML code with truncated text between <p> and </p> tags.

Expected results:
xmllint should print the document with complete text between <p> and </p> tags.

Does this happen every time?
yes

Other information:
Comment 1 Marius Konitzer 2008-05-21 18:09:29 UTC
Created attachment 111288
HTML file that causes the problem
Comment 2 GNOME Infrastructure Team 2021-07-05 13:26:24 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.