GNOME Bugzilla – Bug 789714
crash: xmlParserPrintFileContextInternal mangles utf8
Last modified: 2021-07-05 13:22:59 UTC
I've hit a crasher in itstool that I believe is libxml2's fault. The crash happens during error reporting with a custom error handler. I'm attaching a Python program that reliably crashes. It only crashes under Python 3, not Python 2, for some reason.

I've tracked this down to xmlParserPrintFileContextInternal, specifically these lines:

    while ((n++ < (sizeof(content)-1)) && (cur > base) &&
           (*(cur) != '\n') && (*(cur) != '\r'))
        cur--;

If the size of content is reached while in the middle of a multi-byte UTF-8 character, this will result in broken UTF-8 being passed around, and I think somewhere in Python's callback mechanisms that broken UTF-8 causes a segfault.

The solution, I think, is to add something like this:

    while (!is_a_valid_first_byte_for_a_utf8_character(*cur))
        cur++;

I've made up that function name. I'm hopeful such a function exists already in libxml2 or its deps. Maybe iconv?
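To make the proposal above concrete, here is a minimal sketch of what such a check could look like in plain C. The names utf8_is_continuation and utf8_skip_to_lead are made up for illustration and are not libxml2 API. The sketch also assumes a slightly weaker test than "valid first byte": it only skips continuation bytes (bit pattern 10xxxxxx), so an invalid byte such as 0xFF is treated as a boundary rather than skipped. Unlike the one-line loop in the report, it takes an end pointer so it cannot run past the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper (not part of libxml2): advance cur until it no
 * longer points into the middle of a multi-byte sequence, without ever
 * moving past end. Returns end if only continuation bytes remain.
 */
static const unsigned char *
utf8_skip_to_lead(const unsigned char *cur, const unsigned char *end)
{
    assert(cur != NULL && end != NULL && cur <= end);
    while (cur < end && utf8_is_continuation(*cur))
        cur++;
    return cur;
}
```

In the context of the loop quoted above, a call like this would presumably go right after the backwards scan, so that the copied context never begins with a partial character.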
Created attachment 362639 [details] Program that reliably crashes under py3
I think this corresponds to this downstream bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1065270

It carries a patch, which I have applied to my python3/libxml2 integration. With it, the attached TEST.py from comment #1 results in this:

    > python3 TEST.py
    Entity: line 1: parser error : Opening and ending tag mismatch: p line 1 and key
    вая клавишу key href="help:gnome-help/keyboard-key-super">Super</key>
    ^
    Entity: line 1: parser error : Extra content at the end of the document
    вая клавишу key href="help:gnome-help/keyboard-key-super">Super</key>
    ^

=> probably not perfect, but no crash
That's definitely the same bug. The output you pasted is what I would expect. Maybe not the best error output, but certainly the expected error output.

My proposal was to fix the UTF-8 mangling where it happens. That patch adjusts for malformed UTF-8 in another place. It has the advantage of catching other garbage input before handing it off to crash inside Python. But it does seem cleaner to me to just never create broken UTF-8.

Any chance we could get a patch landed and a release?
Yes, this should be fixed in xmlParserPrintFileContextInternal. This function has a couple of other issues regarding UTF-8:

- The end of the error message could be a truncated UTF-8 sequence as well.
- The contents beyond the current position in the stream could contain invalid UTF-8.
- The function should return up to 80 Unicode characters instead of bytes.
- The position of the caret indicator should be based on Unicode characters, not bytes.
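The points in the list above could be handled by one forward scan that counts characters instead of bytes. The following is only a sketch under stated assumptions, not the actual libxml2 fix: utf8_seq_len and utf8_prefix are hypothetical names, and the validation is deliberately shallow (it checks lead and continuation bit patterns only, not overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the sequence introduced by lead byte b, or 0 if b
 * cannot start a sequence (continuation byte or invalid lead byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Hypothetical helper (not part of libxml2): return the byte length of
 * the longest prefix of s[0..len) that consists of at most max_chars
 * complete sequences. The scan stops at the first truncated or
 * malformed sequence. If nchars is non-NULL, it receives the number of
 * characters in that prefix, which is also the column to use for a
 * caret placed right after it.
 */
static size_t utf8_prefix(const unsigned char *s, size_t len,
                          size_t max_chars, size_t *nchars)
{
    size_t pos = 0, count = 0;

    assert(s != NULL || len == 0);
    while (pos < len && count < max_chars) {
        size_t n = utf8_seq_len(s[pos]);
        size_t i;

        if (n == 0 || n > len - pos)
            break;                      /* invalid lead or truncated tail */
        for (i = 1; i < n; i++)
            if ((s[pos + i] & 0xC0) != 0x80)
                break;                  /* missing continuation byte */
        if (i < n)
            break;
        pos += n;
        count++;
    }
    if (nchars != NULL)
        *nchars = count;
    return pos;
}
```

With max_chars set to 80, the returned byte count bounds how much context is safe to copy, and the character count from the line start to the error position gives the caret offset in characters rather than bytes.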
*** Bug 791691 has been marked as a duplicate of this bug. ***
This is also tracked as Gitlab issue #64: https://gitlab.gnome.org/GNOME/libxml2/issues/64
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.