GNOME Bugzilla – Bug 769126
Can't type astral plane characters into a GtkEntry using the Windows 10 touch keyboard
Last modified: 2016-07-28 16:16:32 UTC
In https://github.com/hexchat/hexchat/issues/1778 a HexChat user reported that the Windows 10 touch keyboard can't be used to type some emoji characters into a GtkEntry.

On debugging I see that for astral plane characters, Windows sends two WM_KEYDOWN wparam=VK_PACKET messages with the corresponding UTF-16 surrogates (compared to just one message for BMP characters). For example, for http://www.fileformat.info/info/unicode/char/1F60D/index.htm Windows sends one message with ToUnicode() == 0xD83D and another with ToUnicode() == 0xDE0D.

gdk_event_translate handles these messages separately, so it generates two key events in the GTK message queue with event->key.keyval == 0x0100D83D and 0x0100DE0D respectively (the result of gdk_unicode_to_keyval).

When the GtkEntry processes the event, the call stack is gtk_entry_key_press -> gtk_im_context_filter_keypress -> gtk_im_context_simple_filter_keypress -> no_sequence_matches -> gtk_im_context_simple_commit_char. This function asserts that the event's keyval is a valid gunichar (`g_return_if_fail (g_unichar_validate (ch))`), which it isn't. So the two messages are ignored and the character doesn't appear in the GtkEntry.

So there's a disconnect between gdk-win32 and GtkEntry. Either gdk_event_translate needs to aggregate VK_PACKET messages with surrogates and queue an event with a complete unicode codepoint, or GtkEntry needs to aggregate GdkKeyEvents with surrogate keyvals.
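To make the numbers in the report concrete, here is a minimal standalone sketch in plain C (no GLib) of how U+1F60D splits into the two surrogates, and why each half fails validation on its own. The helper names `utf16_split`, `is_valid_scalar` and `unicode_keyval` are made up for illustration; they only approximate what `g_unichar_validate` and `gdk_unicode_to_keyval` do for these inputs and are not the GTK sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split an astral-plane code point
 * (U+10000..U+10FFFF) into its UTF-16 surrogate pair. */
static void
utf16_split (uint32_t cp, uint16_t *high, uint16_t *low)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);

  cp -= 0x10000;                              /* 20 bits remain */
  *high = (uint16_t) (0xD800 + (cp >> 10));   /* top 10 bits */
  *low  = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
}

/* Hypothetical stand-in for the check that rejects the events:
 * a value is a valid character only if it is in range and is
 * not a surrogate. */
static int
is_valid_scalar (uint32_t ch)
{
  return ch <= 0x10FFFF && (ch < 0xD800 || ch > 0xDFFF);
}

/* Hypothetical stand-in for the keyval mapping of characters outside
 * Latin-1: the code point with the 0x01000000 flag set. */
static uint32_t
unicode_keyval (uint32_t ch)
{
  return ch | 0x01000000;
}
```

Splitting 0x1F60D this way gives 0xD83D and 0xDE0D, matching the two VK_PACKET messages, and neither half passes `is_valid_scalar`, which is consistent with both key events being dropped.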
(IMO the aggregation should only be done for VK_PACKET messages, so gdk should be the one to do it, not GtkEntry.)
How does GTK handle this on *nix, if gtk_im_context_simple_commit_char() does not support surrogates? Does it have, as you suggest, a buffer that sits somewhere before GtkImContext and accumulates surrogates until a full unicode character can be emitted?
I don't know how to enter astral plane characters on *nix. Let me look it up and try.
From cursory googling, it seems that no DE has a way to type individual surrogates and combine them into a codepoint. I see x11 has compose key shortcuts for some characters, and a more general "ALT + codepoint" method. Both of these can only be used to type a complete codepoint, not UTF-16 surrogates. So this might only be a problem for VK_PACKET messages.
So, the fix then is to look at the value brought to us by VK_PACKET. If it's a high surrogate (0xD800..0xDBFF), we stash it and wait for a next keydown, and on next keydown we either combine it with the second value from VK_PACKET (if that one is a low surrogate, 0xDC00..0xDFFF) or throw the stashed value away (otherwise). If it's not a surrogate, we pass it normally. Is the above correct?
Yeah, and for keyup too (probably separate from the stashed value for keydown).
Created attachment 332036
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

I can only say two things:
1) This code compiles.
2) This code probably joins surrogate pairs correctly (my small testcase program, which used the same logic for joining surrogate pairs, did).

However, i don't know how to test this, as i don't have Windows 10 readily available.
Created attachment 332040
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

v2:
* Fixed a few holes in the logic and a copypaste typo
No problem. I can test it for you.
Created attachment 332045
Port of attachment 332036 to gtk2

I ported your patch to gtk2 (attached) and it seems to work fine with the W10 touch keyboard. Both BMP and astral plane characters are getting typed into the GtkEntry correctly. Thanks!
Hi, I do have Windows 10, and it seems to me that this is some locale-dependent issue (as I couldn't reproduce it), but seems that the patch did not introduce any ill-effects for me, plus it built fine. I'd say, if the formatting looks okay to Nacho, we should indeed push the patch. With blessings, thank you!
Looks good to me, fwiw
Chun-wei: I don't see how it could be locale-dependent. (FWIW I tested this on en-US locale.) Are you sure you tried with the emoji that are in the astral plane? If yes, then assuming you tested with gtk3, maybe this has already been fixed in gdk3 / gtk3 in some other way?
Installed Windows 10 today. I can now reproduce this bug and verify that the fix works. Something that no one mentioned so far: you need to run GTK applications with PANGOCAIRO_BACKEND=fc, otherwise the GtkEntry font will not support exotic unicode characters (such as emoji) and they will look like placeholders (however, they actually are different codepoints and, for example, copying and pasting them into a browser will produce the expected result).
Yes, that is expected.
Hi LRN, I should have mentioned the PANGOCAIRO_BACKEND=fc too. Actually this means that the PangoWin32 backend needs some updating. I hope I can find time for it, but obviously this is going to be in another bug.

---

Hi Arnav, Yes, I did indeed test on GTK-3.x as I am working on something there lately, but I think there are good chances that the fix is needed there as well. The thing is I have a CJK version of Windows 10 and it seems to me that I was able to get the Japanese emoji to show (as I see in the animation that the user posted) without the patch (but with PANGOCAIRO_BACKEND=fc set, as LRN noted; even without it, I can still have the Japanese emoji displayed). Hope this clears it up a bit. With blessings, and cheers!
CJK must be some kind of wonder-locale then, because with en_US locale all i got with unpatched GTK3 were lots of warnings:
> (gtk3-demo.exe:1108): Gtk-CRITICAL **: gtk_im_context_simple_commit_char: assertion 'g_unichar_validate (ch)' failed
for 90% of all smileys and other exotic characters that touch keyboard is able to type. Maybe CJK uses 2-byte wchars with a different encoding scheme, one that doesn't rely on surrogates? I'm not really familiar with multibyte encodings other than unicode.

Oh-kay, i think at this point this patch should be good to go into gtk+ master, as long as it compiles and doesn't break anything. I'll try the gtk2 backport later today, or maybe tomorrow, and if it works i'll push both.
Created attachment 332285
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (gtk-2-24)

Port of attachment 332040 to gtk-2-24 branch. Attachment 332045, now obsolete, had screwed up indentation.
Comment on attachment 332285
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (gtk-2-24)

Attachment 332285 pushed into branch gtk-2-24 as b7c92fb - GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET
Attachment 332040 pushed as 2233566 - GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET
> Attachment 332045, now obsolete, had screwed up indentation.
For my info, both 332045 and 332285 mix tabs and spaces for indentation in a way that only works if tabs are 4 spaces. What makes one "screwed up" and the other not?
Because attachment 332285 mixes tabs and spaces for indentation in a way that only works if tabs are 8 spaces, not 4. Which is why i usually convert indentation to spaces-only - can't screw that up with wrong assumptions about tab size. In this case i didn't, for some reason.