Bug 668154 – GTK+ Crashes on Non BMP utf-8 charpoints

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 668154 - GTK+ Crashes on Non BMP utf-8 charpoints


Summary:	GTK+ Crashes on Non BMP utf-8 charpoints


Status:	RESOLVED FIXED

Product:	gtk+
Classification:	Platform
Component:	Backend: Win32
Version:	unspecified
Hardware:	Other Windows

Importance:	Normal critical
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-bugs

URL:
Whiteboard:

Duplicates:	694534 (view as bug list)
Depends on:
Blocks:

Reported:	2012-01-18 02:56 UTC by Armin Ronacher
Modified:	2013-02-26 21:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Testing input file (16 bytes, application/octet-stream) 2012-03-01 15:24 UTC, Daniel Atallah		Details
gdb session (16.01 KB, text/plain) 2012-03-01 22:50 UTC, Dieter Verfaillie		Details
Patch (1.16 KB, patch) 2013-02-19 03:53 UTC, Behdad Esfahbod	none	Details \| Review

Description Armin Ronacher 2012-01-18 02:56:41 UTC

I noticed today that X-Chat on windows crashes on a certain string in my logfile.  I have since been able to reproduce this issue thanks to bviktor on #xchat on other environments than X-Chat itself.

For instance the problematic string can be pasted into almost any GTK application on windows and then crash the application.

How to obtain such a string?  Write a non BMP character (I assume that is the problem.  I have tried with the following one and a few others) into a textfile, open it with notepad (make sure it's opening the file as utf-8) and copy the string into the clipboard.

Then further paste the string in a input field in a GTK application (for instance the gimp filename field in the save dialog).  The result is that the application crashes.

Python code for generating the character:

    f = open('test.txt', 'w')
    f.write(u'\U000e005f\r\n'.encode('utf-8'))
    f.close()

It's annoying since you can crash pidgin and x-chat on windows by copy pasting this string into a public IRC channel.  Since GTK is also used for wireshark which is often deployed to analyze data transmitted from potentially untrustworthy sources I would consider this a critical bug.

Comment 1 Dieter Verfaillie 2012-01-31 11:22:27 UTC

I've failed to reproduce this using the described methods
against gtk+-bundle_2.24.8-20111122_win32.zip

With what combination of versions of libglib-2.0-0.dll and
libgtk-win32-2.0-0.dll are you encountering this issue? And
on what Windows versions?

Comment 2 Armin Ronacher 2012-01-31 11:25:27 UTC

It happens with the version of GTK bundled with the latest release versions of XChat WDK, Gimp and/or Wireshark both 32bit and 64bit.  XChat WDK works around this bug by shipping a plugin that removes all non BMP characters but I think not all non BMP characters cause this issue.

To reproduce just download the latest version of XCHat WDK, delete all plugins in the plugin folder and copy/paste in the unicode character that the above script generates.

Comment 3 castledown 2012-02-14 17:13:02 UTC

I want to inform you that this bug is now frequently being exploited on various IRC networks to crash clients, and has been since at least late January.

2012-01-29 XX:11:26< [removed]> hello<U+E005F>
2012-01-29 XX:11:27-!- Nazbook [~Internet@35556308.1c2d962f.[removed].comcast.net] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:31< Blankflank> fuck
2012-01-29 XX:11:33-!- Juuza [~Juuza@fbb8a3c.2e1bf8a9.[removed].rr.com] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:35-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:38< Blankflank> did anyone else just get that message from the spambot?
2012-01-29 XX:11:39-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has joined #mylittlepony
2012-01-29 XX:11:40-!- mode/#mylittlepony [+h PK] by ChanServ
2012-01-29 XX:11:40-!- Hoppip [~Pyritie@34c8b132.29655c0f.[removed].co.uk] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:42-!- Ellaye [~Fgtbls@cc59707.231698ba.[removed].cox.net] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:42-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:46-!- Tenk [~Tenk@1a9e3c85.2b32ee69.[removed].clearwire-wmx.net] has quit [Read error: Connection reset by peer]
2012-01-29 XX:11:49-!- Juuza [~Juuza@fbb8a3c.2e1bf8a9.[removed].rr.com] has joined #mylittlepony
2012-01-29 XX:11:51< Blankflank> what
2012-01-29 XX:11:51-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has joined #mylittlepony
2012-01-29 XX:11:53-!- mode/#mylittlepony [+h PK] by ChanServ
2012-01-29 XX:11:53< Blankflank> just happened
2012-01-29 XX:11:54-!- PK [~nope.avi@1b5b45e4.2e01205d.[removed].qwest.net] has quit [Read error: Connection reset by peer]

Comment 4 Emmanuele Bassi (:ebassi) 2012-02-14 17:27:22 UTC

if applications are shipping their own internal version of GTK+ then there is nothing that the GTK+ team can do: you have to ask the maintainers of those applications to update the version of GTK+ they use and ship. at most, developers like Dieter can check if the issue happens with the current version of GTK+.

Comment 5 Gerald Combs 2012-02-14 17:40:27 UTC

(In reply to comment #4)
> if applications are shipping their own internal version of GTK+ then there is
> nothing that the GTK+ team can do: you have to ask the maintainers of those
> applications to update the version of GTK+ they use and ship. at most,
> developers like Dieter can check if the issue happens with the current version
> of GTK+.

In Wireshark's case we ship both 32-bit and 64-bit installers. According to http://www.gtk.org/download/win64.php the latest available version of the 64-bit bundle is gtk+-bundle_2.22.1-20101229_win64.zip. Are there plans to release a newer version any time soon?

Comment 6 Daniel Atallah 2012-03-01 00:18:30 UTC

I'm able to recreate this with the 2.24.10 bundle in Pidgin.

It looks like this is actually a Pango bug that's triggering an assertion:

Comment 7 Daniel Atallah 2012-03-01 00:21:42 UTC

Weird - bugzilla ate the useful part of my text - hopefully it works better this time:

(In reply to comment #6)
> I'm able to recreate this with the 2.24.10 bundle in Pidgin.
> 
> It looks like this is actually a Pango bug that's triggering an assertion:

 Pango:ERROR:basic-win32.c:445:convert_log_clusters_to_byte_offsets: assertion failed: (glyphs->log_clusters[glyphix] < n_chars)

 Bug 515484 had the same symptom, but this appears to be a different issue.

Comment 8 Daniel Atallah 2012-03-01 15:06:53 UTC

The issue can be recreated using gtk-demo.exe from the "gtk+-bundle_2.24.10-20120208_win32" bundle.

Navigate to "Text Widget"->"Hypertext"

Copy and paste the following text into the Hypertext text buffer (including "START" and "END")
START

Comment 9 Daniel Atallah 2012-03-01 15:24:09 UTC

Created attachment 208791 [details]
Testing input file

 Bugzilla ate the rest of my message :(.

(In reply to comment #8)
> The issue can be recreated using gtk-demo.exe from the
> "gtk+-bundle_2.24.10-20120208_win32" bundle.
> 
> Navigate to "Text Widget"->"Hypertext"
> 
> Copy and paste the following text into the Hypertext text buffer (including
> "START" and "END")
> START

 I've attached "text.bin" which is actually a text file, but I wanted to make sure that it didn't get mangled by bugzilla, so I renamed it.

  * Download "text.bin"
  * Open the file as a UTF-8 encoded file (I've found that jEdit (http://www.jedit.org/) does a great job of handling encodings via the File->"Reload with Encoding" functionality)
  ** It should show up as "START END" (Depending on your font, there will be a unrecognized character font in the middle)
  * Copy the whole thing and paste it at the end of the Hypertext text buffer
  * Hit backspace 4 times - after the 4th hit (deleting the "E" in "END") the assertion will be triggered
  ** The crash can also be triggered by just copying and pasting the character between "START" and "END"

Comment 10 Dieter Verfaillie 2012-03-01 22:50:30 UTC

Created attachment 208829 [details]
gdb session

gdb session:

1) breakpoint set on convert_log_clusters_to_byte_offsets
2) run gtk-demo, Text Widget/Hypertext
3) paste the character between START and END from attachment #208791 [details]
4) breakpoint hits, get full backtrace
5) step through code, print some variables
6) finally hit assert...

So, somehow glyphs->log_clusters[2]==2 and n_chars==2 but
convert_log_clusters_to_byte_offsets wants
glyphs->log_clusters[2] < n_chars

Comment 11 Behdad Esfahbod 2012-05-28 02:26:19 UTC

I don't have time to work on this, but if anyone has I can point them in the right direction.

Comment 12 Behdad Esfahbod 2013-02-19 03:53:06 UTC

Created attachment 236716 [details] [review]
Patch

Here's a patch to start with.  I *think* it's complete, but have not tested it at all.

Comment 13 Daniel Atallah 2013-02-19 16:50:18 UTC

Review of attachment 236716 [details] [review]:

I haven't had a chance to test it yet, I'll try to do so this evening (UTC-5).
I did have one question/comment about the patch.

::: modules/basic/basic-win32.c
@@ +434,3 @@
+      if (g_utf8_get_char (p) > 0xFFFF)
+      {
+	byte_offset[charix] = p - text;

Is it intentional that the additional entry in byte_offset would point to the first byte in the character, or should it be advanced?

Comment 14 Behdad Esfahbod 2013-02-19 17:12:10 UTC

(In reply to comment #13)
> Review of attachment 236716 [details] [review]:
> 
> I haven't had a chance to test it yet, I'll try to do so this evening (UTC-5).
> I did have one question/comment about the patch.
> 
> ::: modules/basic/basic-win32.c
> @@ +434,3 @@
> +      if (g_utf8_get_char (p) > 0xFFFF)
> +      {
> +    byte_offset[charix] = p - text;
> 
> Is it intentional that the additional entry in byte_offset would point to the
> first byte in the character, or should it be advanced?

Yes, that is the desired behaviour.  For both items in a UTF-16 surrogate pair we want to point to the same byte index, which is the beginning of the character.  Though, in reality, the second item should never be accessed.  It's more about correctly accounting for it, than the actual value we store there.

Comment 15 Daniel Atallah 2013-02-21 04:52:28 UTC

I've been able to test this patch and it appears to work well.

With the unpatched Pango 1.29.4, I am able to consistently recreate the issue and when I re-built 1.29.4 with the patch applied, the text buffer can handle the non-BMP character without problems.

Comment 16 Daniel Atallah 2013-02-24 15:18:34 UTC

Bug 694534 has a different approach to a solution that avoids the extra memory allocation of attachment 236716 [details] [review].

Comment 17 Matthew Flatt 2013-02-24 15:30:45 UTC

I'm not at all certain, but I'm concerned that the patch here may not be
the right approach. If I'm reading the patch correctly, it converts the
`byte_offset' array to a mapping from UTF-16 code units to offsets,
instead of a mapping from Unicode code points to offsets. I expect
that glyphs->log_clusters[] to map to offsets for characters instead
of UTF-16 code units.

The patch with Bug 694534 addresses the UTF-16 vs. character
mismatch closer to its source... I think.

Comment 18 Behdad Esfahbod 2013-02-24 23:43:01 UTC

*** Bug 694534 has been marked as a duplicate of this bug. ***

Comment 19 Behdad Esfahbod 2013-02-24 23:54:13 UTC

Ok, I've read the other bug now.  Here's my assessment:

Matthew's analysis is interesting.  It does sound like a bug in Uniscribe, and one that will be addressed by his patch.

Somehow I failed to notice the existence of unichar_index(), and assumed that the convert_log_clusters_to_byte_offsets() is receiving cluster values straight out of Uniscribe and hence has to deal with surrogates.

Here's my proposed fix: remove the use of unichar_index(), and apply my patch.  Currently, the existence of unichar_index() in there is making the code take quadratic time to compute cluster values.  Without it, it would be linear.  Not that it's a huge deal.

Does that sound reasonable?

Comment 20 Matthew Flatt 2013-02-25 14:29:05 UTC

That would work as far as I can tell, and it sounds like the best fix.

Comment 21 Behdad Esfahbod 2013-02-26 04:25:58 UTC

I pushed out the following changes that fix mingw32 build for me, except for the stupid -Werror stuff in configure.ac [1]:

commit 547221b486473ed9b7f85634ce162f937e5912b1
Author: Behdad Esfahbod <behdad@behdad.org>
Date:   Mon Feb 25 22:48:03 2013 -0500

    [win32] Fix atomic ops on mingw*
    
    Bug 682896 - glib doesn't build on mingw32

commit e1ccae841658854a5db0d907edb2b1f2c0a68ef5
Author: Behdad Esfahbod <behdad@behdad.org>
Date:   Mon Feb 25 22:01:11 2013 -0500

    [win32] Add fallback implementations for gatomic.c on mingw32
    
    Bug 682896 - glib doesn't build on mingw32



[1]

diff --git a/configure.ac b/configure.ac
index 8745702..e133ce7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -3614,15 +3614,6 @@ case "$host" in
 esac
 AC_SUBST(GLIB_HIDDEN_VISIBILITY_CFLAGS)
 
-dnl Compiler flags; macro originates from systemd
-dnl See https://bugzilla.gnome.org/show_bug.cgi?id=608953
-CC_CHECK_FLAGS_APPEND([with_cflags], [CFLAGS], [\
-   -Wall -Wstrict-prototypes -Werror=declaration-after-statement \
-   -Werror=missing-prototypes -Werror=implicit-function-declaration \
-   -Werror=pointer-arith -Werror=init-self -Werror=format-security \
-   -Werror=format=2 -Werror=missing-include-dirs])
-CFLAGS="$with_cflags $CFLAGS"
-
 #

Comment 22 Behdad Esfahbod 2013-02-26 04:27:18 UTC

Err.  Wrong bug.  Ignore last comment.

Comment 23 Behdad Esfahbod 2013-02-26 04:48:44 UTC

Ok, pushed out a fix.  Please test.

Comment 24 Daniel Atallah 2013-02-26 21:14:52 UTC

I applied the committed fix to vanilla 1.29.4 and it works as intended in regard to the test case outlined above.