GNOME Bugzilla – Bug 668154
GTK+ Crashes on Non BMP utf-8 charpoints
Last modified: 2013-02-26 21:14:52 UTC
I noticed today that X-Chat on windows crashes on a certain string in my logfile. I have since been able to reproduce this issue thanks to bviktor on #xchat on other environments than X-Chat itself. For instance the problematic string can be pasted into almost any GTK application on windows and then crash the application. How to obtain such a string? Write a non BMP character (I assume that is the problem. I have tried with the following one and a few others) into a textfile, open it with notepad (make sure it's opening the file as utf-8) and copy the string into the clipboard. Then further paste the string in a input field in a GTK application (for instance the gimp filename field in the save dialog). The result is that the application crashes. Python code for generating the character: f = open('test.txt', 'w') f.write(u'\U000e005f\r\n'.encode('utf-8')) f.close() It's annoying since you can crash pidgin and x-chat on windows by copy pasting this string into a public IRC channel. Since GTK is also used for wireshark which is often deployed to analyze data transmitted from potentially untrustworthy sources I would consider this a critical bug.
I've failed to reproduce this using the described methods against gtk+-bundle_2.24.8-20111122_win32.zip With what combination of versions of libglib-2.0-0.dll and libgtk-win32-2.0-0.dll are you encountering this issue? And on what Windows versions?
It happens with the version of GTK bundled with the latest release versions of XChat WDK, Gimp and/or Wireshark both 32bit and 64bit. XChat WDK works around this bug by shipping a plugin that removes all non BMP characters but I think not all non BMP characters cause this issue. To reproduce just download the latest version of XCHat WDK, delete all plugins in the plugin folder and copy/paste in the unicode character that the above script generates.
I want to inform you that this bug is now frequently being exploited on various IRC networks to crash clients, and has been since at least late January. 2012-01-29 XX:11:26< [removed]> hello<U+E005F> 2012-01-29 XX:11:27-!- Nazbook [~Internet@35556308.1c2d962f.[removed].comcast.net] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:31< Blankflank> fuck 2012-01-29 XX:11:33-!- Juuza [~Juuza@fbb8a3c.2e1bf8a9.[removed].rr.com] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:35-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:38< Blankflank> did anyone else just get that message from the spambot? 2012-01-29 XX:11:39-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has joined #mylittlepony 2012-01-29 XX:11:40-!- mode/#mylittlepony [+h PK] by ChanServ 2012-01-29 XX:11:40-!- Hoppip [~Pyritie@34c8b132.29655c0f.[removed].co.uk] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:42-!- Ellaye [~Fgtbls@cc59707.231698ba.[removed].cox.net] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:42-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:46-!- Tenk [~Tenk@1a9e3c85.2b32ee69.[removed].clearwire-wmx.net] has quit [Read error: Connection reset by peer] 2012-01-29 XX:11:49-!- Juuza [~Juuza@fbb8a3c.2e1bf8a9.[removed].rr.com] has joined #mylittlepony 2012-01-29 XX:11:51< Blankflank> what 2012-01-29 XX:11:51-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has joined #mylittlepony 2012-01-29 XX:11:53-!- mode/#mylittlepony [+h PK] by ChanServ 2012-01-29 XX:11:53< Blankflank> just happened 2012-01-29 XX:11:54-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer]
if applications are shipping their own internal version of GTK+ then there is nothing that the GTK+ team can do: you have to ask the maintainers of those applications to update the version of GTK+ they use and ship. at most, developers like Dieter can check if the issue happens with the current version of GTK+.
(In reply to comment #4) > if applications are shipping their own internal version of GTK+ then there is > nothing that the GTK+ team can do: you have to ask the maintainers of those > applications to update the version of GTK+ they use and ship. at most, > developers like Dieter can check if the issue happens with the current version > of GTK+. In Wireshark's case we ship both 32-bit and 64-bit installers. According to http://www.gtk.org/download/win64.php the latest available version of the 64-bit bundle is gtk+-bundle_2.22.1-20101229_win64.zip. Are there plans to release a newer version any time soon?
I'm able to recreate this with the 2.24.10 bundle in Pidgin. It looks like this is actually a Pango bug that's triggering an assertion:
Weird - bugzilla ate the useful part of my text - hopefully it works better this time: (In reply to comment #6) > I'm able to recreate this with the 2.24.10 bundle in Pidgin. > > It looks like this is actually a Pango bug that's triggering an assertion: Pango:ERROR:basic-win32.c:445:convert_log_clusters_to_byte_offsets: assertion failed: (glyphs->log_clusters[glyphix] < n_chars) Bug 515484 had the same symptom, but this appears to be a different issue.
The issue can be recreated using gtk-demo.exe from the "gtk+-bundle_2.24.10-20120208_win32" bundle. Navigate to "Text Widget"->"Hypertext" Copy and paste the following text into the Hypertext text buffer (including "START" and "END") START
Created attachment 208791 [details] Testing input file Bugzilla ate the rest of my message :(. (In reply to comment #8) > The issue can be recreated using gtk-demo.exe from the > "gtk+-bundle_2.24.10-20120208_win32" bundle. > > Navigate to "Text Widget"->"Hypertext" > > Copy and paste the following text into the Hypertext text buffer (including > "START" and "END") > START I've attached "text.bin" which is actually a text file, but I wanted to make sure that it didn't get mangled by bugzilla, so I renamed it. * Download "text.bin" * Open the file as a UTF-8 encoded file (I've found that jEdit (http://www.jedit.org/) does a great job of handling encodings via the File->"Reload with Encoding" functionality) ** It should show up as "START END" (Depending on your font, there will be a unrecognized character font in the middle) * Copy the whole thing and paste it at the end of the Hypertext text buffer * Hit backspace 4 times - after the 4th hit (deleting the "E" in "END") the assertion will be triggered ** The crash can also be triggered by just copying and pasting the character between "START" and "END"
Created attachment 208829 [details] gdb session gdb session: 1) breakpoint set on convert_log_clusters_to_byte_offsets 2) run gtk-demo, Text Widget/Hypertext 3) paste the character between START and END from attachment #208791 [details] 4) breakpoint hits, get full backtrace 5) step through code, print some variables 6) finally hit assert... So, somehow glyphs->log_clusters[2]==2 and n_chars==2 but convert_log_clusters_to_byte_offsets wants glyphs->log_clusters[2] < n_chars
I don't have time to work on this, but if anyone has I can point them in the right direction.
Created attachment 236716 [details] [review] Patch Here's a patch to start with. I *think* it's complete, but have not tested it at all.
Review of attachment 236716 [details] [review]: I haven't had a chance to test it yet, I'll try to do so this evening (UTC-5). I did have one question/comment about the patch. ::: modules/basic/basic-win32.c @@ +434,3 @@ + if (g_utf8_get_char (p) > 0xFFFF) + { + byte_offset[charix] = p - text; Is it intentional that the additional entry in byte_offset would point to the first byte in the character, or should it be advanced?
(In reply to comment #13) > Review of attachment 236716 [details] [review]: > > I haven't had a chance to test it yet, I'll try to do so this evening (UTC-5). > I did have one question/comment about the patch. > > ::: modules/basic/basic-win32.c > @@ +434,3 @@ > + if (g_utf8_get_char (p) > 0xFFFF) > + { > + byte_offset[charix] = p - text; > > Is it intentional that the additional entry in byte_offset would point to the > first byte in the character, or should it be advanced? Yes, that is the desired behaviour. For both items in a UTF-16 surrogate pair we want to point to the same byte index, which is the beginning of the character. Though, in reality, the second item should never be accessed. It's more about correctly accounting for it, than the actual value we store there.
I've been able to test this patch and it appears to work well. With the unpatched Pango 1.29.4, I am able to consistently recreate the issue and when I re-built 1.29.4 with the patch applied, the text buffer can handle the non-BMP character without problems.
Bug 694534 has a different approach to a solution that avoids the extra memory allocation of attachment 236716 [details] [review].
I'm not at all certain, but I'm concerned that the patch here may not be the right approach. If I'm reading the patch correctly, it converts the `byte_offset' array to a mapping from UTF-16 code units to offsets, instead of a mapping from Unicode code points to offsets. I expect that glyphs->log_clusters[] to map to offsets for characters instead of UTF-16 code units. The patch with Bug 694534 addresses the UTF-16 vs. character mismatch closer to its source... I think.
*** Bug 694534 has been marked as a duplicate of this bug. ***
Ok, I've read the other bug now. Here's my assessment: Matthew's analysis is interesting. It does sound like a bug in Uniscribe, and one that will be addressed by his patch. Somehow I failed to notice the existence of unichar_index(), and assumed that the convert_log_clusters_to_byte_offsets() is receiving cluster values straight out of Uniscribe and hence has to deal with surrogates. Here's my proposed fix: remove the use of unichar_index(), and apply my patch. Currently, the existence of unichar_index() in there is making the code take quadratic time to compute cluster values. Without it, it would be linear. Not that it's a huge deal. Does that sound reasonable?
That would work as far as I can tell, and it sounds like the best fix.
I pushed out the following changes that fix mingw32 build for me, except for the stupid -Werror stuff in configure.ac [1]: commit 547221b486473ed9b7f85634ce162f937e5912b1 Author: Behdad Esfahbod <behdad@behdad.org> Date: Mon Feb 25 22:48:03 2013 -0500 [win32] Fix atomic ops on mingw* Bug 682896 - glib doesn't build on mingw32 commit e1ccae841658854a5db0d907edb2b1f2c0a68ef5 Author: Behdad Esfahbod <behdad@behdad.org> Date: Mon Feb 25 22:01:11 2013 -0500 [win32] Add fallback implementations for gatomic.c on mingw32 Bug 682896 - glib doesn't build on mingw32 [1] diff --git a/configure.ac b/configure.ac index 8745702..e133ce7 100644 --- a/configure.ac +++ b/configure.ac @@ -3614,15 +3614,6 @@ case "$host" in esac AC_SUBST(GLIB_HIDDEN_VISIBILITY_CFLAGS) -dnl Compiler flags; macro originates from systemd -dnl See https://bugzilla.gnome.org/show_bug.cgi?id=608953 -CC_CHECK_FLAGS_APPEND([with_cflags], [CFLAGS], [\ - -Wall -Wstrict-prototypes -Werror=declaration-after-statement \ - -Werror=missing-prototypes -Werror=implicit-function-declaration \ - -Werror=pointer-arith -Werror=init-self -Werror=format-security \ - -Werror=format=2 -Werror=missing-include-dirs]) -CFLAGS="$with_cflags $CFLAGS" - #
Err. Wrong bug. Ignore last comment.
Ok, pushed out a fix. Please test.
I applied the committed fix to vanilla 1.29.4 and it works as intended in regard to the test case outlined above.