Bug 769126 – Can't type astral plane characters into a GtkEntry using the Windows 10 touch keyboard

Summary:	Can't type astral plane characters into a GtkEntry using the Windows 10 touch...


Status:	RESOLVED FIXED

Product:	gtk+
Classification:	Platform
Component:	Backend: Win32
Version:	2.24.x
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-07-24 09:01 UTC by Arnav Singh
Modified:	2016-07-28 16:16 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (4.64 KB, patch) 2016-07-24 14:30 UTC, LRN	none
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (4.54 KB, patch) 2016-07-24 17:37 UTC, LRN	committed
Port of attachment 332036 to gtk2 (3.96 KB, patch) 2016-07-24 18:57 UTC, Arnav Singh	none
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (gtk-2-24) (4.33 KB, patch) 2016-07-28 15:50 UTC, LRN	committed

Description Arnav Singh 2016-07-24 09:01:25 UTC

In https://github.com/hexchat/hexchat/issues/1778 a HexChat user reported that the Windows 10 touch keyboard can't be used to type some emoji characters into a GtkEntry.

On debugging I see that for astral plane characters, Windows sends two WM_KEYDOWN wparam=VK_PACKET messages with the corresponding UTF-16 surrogates (compared to just one message for BMP characters). For example, for http://www.fileformat.info/info/unicode/char/1F60D/index.htm Windows sends one message with ToUnicode() == 0xD83D and another with ToUnicode() == 0xDE0D. gdk_event_translate handles these messages separately, so it generates two key events in the GTK message queue with event->key.keyval == 0x0100D83D and 0x0100DE0D respectively (the result of gdk_unicode_to_keyval).
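
To make the numbers above concrete, here is a minimal sketch of how U+1F60D splits into the two UTF-16 surrogates and how each one ends up as a keyval. Both helper names are hypothetical, and direct_unicode_keyval only models the "OR with 0x01000000" fallback that I assume gdk_unicode_to_keyval applies to code units it has no dedicated keysym for; it is not the real function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a code point above U+FFFF into its
 * UTF-16 high and low surrogates. */
static void utf16_encode_astral(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
}

/* Hypothetical stand-in for the direct-Unicode fallback assumed above:
 * the value is tagged with 0x01000000 to form a keyval. */
static uint32_t direct_unicode_keyval(uint32_t wc)
{
    return wc | 0x01000000;
}
```

Calling utf16_encode_astral(0x1F60D, &high, &low) gives 0xD83D and 0xDE0D, and passing those through direct_unicode_keyval gives the two keyvals quoted above.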

When the GtkEntry processes the event, the call stack is gtk_entry_key_press -> gtk_im_context_filter_keypress -> gtk_im_context_simple_filter_keypress -> no_sequence_matches -> gtk_im_context_simple_commit_char. This function asserts that the event's keyval is a valid gunichar (`g_return_if_fail (g_unichar_validate (ch))`), which it isn't. So the two messages are ignored and the character doesn't appear in the GtkEntry.
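
For reference, a rough sketch of why each half fails that check on its own. The helper name is hypothetical, and the test is my understanding of what g_unichar_validate amounts to (in range, and not in the surrogate block), not a copy of GLib's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if ch is a Unicode scalar value, i.e. at most
 * U+10FFFF and outside the surrogate block U+D800..U+DFFF. */
static bool is_valid_scalar(uint32_t ch)
{
    return ch < 0x110000 && (ch & 0xFFFFF800) != 0xD800;
}
```

Under this check 0xD83D and 0xDE0D are both rejected, while the combined code point 0x1F60D is accepted.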

So there's a disconnect between gdk-win32 and GtkEntry. Either gdk_event_translate needs to aggregate VK_PACKET messages with surrogates and queue an event with a complete unicode codepoint, or GtkEntry needs to aggregate GdkKeyEvents with surrogate keyvals.

Comment 1 Arnav Singh 2016-07-24 09:05:30 UTC

(IMO the aggregation should only be done for VK_PACKET messages, so gdk should be the one to do it, not GtkEntry.)

Comment 2 LRN 2016-07-24 09:08:04 UTC

How does GTK handle this on *nix, if gtk_im_context_simple_commit_char() does not support surrogates? Does it have, as you suggest, a buffer that sits somewhere before GtkIMContext and accumulates surrogates until a full unicode character can be emitted?

Comment 3 Arnav Singh 2016-07-24 09:22:10 UTC

I don't know how to enter astral plane characters on *nix. Let me look it up and try.

Comment 4 Arnav Singh 2016-07-24 09:31:56 UTC

From cursory googling, it seems that no DE has a way to type individual surrogates and combine them into a codepoint. I see X11 has compose key shortcuts for some characters, and a more general "ALT + codepoint" method. Both of these can only be used to type a complete codepoint, not UTF-16 surrogates. So this might only be a problem for VK_PACKET messages.

Comment 5 LRN 2016-07-24 10:13:12 UTC

So, the fix then is to look at the value brought to us by VK_PACKET. If it's a high surrogate (0xD800-0xDBFF), we stash it and wait for the next keydown, and on the next keydown we either combine it with the second value from VK_PACKET (if that one is a low surrogate, 0xDC00-0xDFFF) or throw the stashed value away (otherwise). If it's not a high surrogate, we pass it normally.

Is the above correct?

Comment 6 Arnav Singh 2016-07-24 10:44:13 UTC

Yeah, and for keyup too (probably separate from the stashed value for keydown).
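
A minimal sketch of the stash-and-combine logic discussed in the last two comments, assuming one UTF-16 code unit arrives per VK_PACKET message. SurrogateState and surrogate_feed are hypothetical names for illustration, not what the attached patches use, and dropping a lone low surrogate is my own assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-direction state: the stashed high surrogate, or 0. */
typedef struct {
    uint16_t pending_high;
} SurrogateState;

static bool is_high_surrogate(uint16_t u) { return (u & 0xFC00) == 0xD800; }
static bool is_low_surrogate(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

/* Feed one UTF-16 code unit. Returns true and stores a complete code
 * point in *out when one is ready; returns false when the unit was
 * stashed or dropped. */
static bool surrogate_feed(SurrogateState *st, uint16_t unit, uint32_t *out)
{
    if (is_high_surrogate(unit)) {
        st->pending_high = unit;      /* stash, replacing any stale value */
        return false;
    }
    if (is_low_surrogate(unit)) {
        if (st->pending_high == 0)
            return false;             /* lone low surrogate: drop it */
        *out = 0x10000
             + ((uint32_t)(st->pending_high - 0xD800) << 10)
             + (uint32_t)(unit - 0xDC00);
        st->pending_high = 0;
        return true;
    }
    st->pending_high = 0;             /* throw away any stashed value */
    *out = unit;                      /* ordinary BMP unit passes through */
    return true;
}
```

Feeding 0xD83D and then 0xDE0D yields 0x1F60D on the second call. Keeping one SurrogateState for keydown and a separate one for keyup matches the suggestion above.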

Comment 7 LRN 2016-07-24 14:30:32 UTC

Created attachment 332036
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

I can only say two things:
1) This code compiles.
2) This code probably joins surrogate pairs correctly (my small testcase program, which used the same logic for joining surrogate pairs, did).

However, i don't know how to test this, as i don't have Windows 10 readily available.

Comment 8 LRN 2016-07-24 17:37:59 UTC

Created attachment 332040
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

v2:
* Fixed a few holes in the logic and a copypaste typo

Comment 9 Arnav Singh 2016-07-24 18:15:53 UTC

No problem. I can test it for you.

Comment 10 Arnav Singh 2016-07-24 18:57:32 UTC

Created attachment 332045
Port of attachment 332036 to gtk2

I ported your patch to gtk2 (attached) and it seems to work fine with the W10 touch keyboard. Both BMP and astral plane characters are getting typed into the GtkEntry correctly. Thanks!

Comment 11 Fan, Chun-wei 2016-07-25 06:44:29 UTC

Hi,

I do have Windows 10, and it seems to me that this is some locale-dependent issue (as I couldn't reproduce it), but the patch did not seem to introduce any ill effects for me, plus it built fine.  I'd say, if the formatting looks okay to Nacho, we should indeed push the patch.

With blessings, thank you!

Comment 12 Matthias Clasen 2016-07-25 12:16:44 UTC

Looks good to me, fwiw

Comment 13 Arnav Singh 2016-07-26 17:53:51 UTC

Chun-wei: I don't see how it could be locale-dependent. (FWIW I tested this on en-US locale.) Are you sure you tried with the emoji that are in the astral plane?

If yes, then assuming you tested with gtk3, maybe this has already been fixed in gdk3 / gtk3 in some other way?

Comment 14 LRN 2016-07-27 00:05:04 UTC

Installed Windows 10 today.
I can now reproduce this bug and verify that the fix works.

Something that no one mentioned so far: you need to run GTK applications with PANGOCAIRO_BACKEND=fc, otherwise the GtkEntry font will not support exotic unicode characters (such as emoji) and they will look like placeholders (however, they actually are different codepoints and, for example, copying and pasting them into a browser will produce the expected result).

Comment 15 Arnav Singh 2016-07-27 02:39:11 UTC

Yes, that is expected.

Comment 16 Fan, Chun-wei 2016-07-27 05:32:29 UTC

Hi LRN,

I should have mentioned the PANGOCAIRO_BACKEND=fc too.  Actually this means that the PangoWin32 backend needs some updating, I hope I could find time for it, but obviously this is going to be in another bug.

---
Hi Arnav,

Yes, I did indeed test on GTK-3.x as I am working on something there lately, but I think there are good chances that the fix is needed there as well.

The thing is that I have a CJK version of Windows 10, and it seems to me that I was able to get the Japanese emoji to show (as seen in the animation that the user posted) without the patch (with PANGOCAIRO_BACKEND=fc set, as LRN noted, although even without it I can still have the Japanese emoji displayed).

Hope this clears it up a bit.

With blessings, and cheers!

Comment 17 LRN 2016-07-27 05:45:08 UTC

CJK must be some kind of wonder-locale then, because with en_US locale all i got with unpatched GTK3 were lots of warnings:
> (gtk3-demo.exe:1108): Gtk-CRITICAL **: gtk_im_context_simple_commit_char: assertion 'g_unichar_validate (ch)' failed
for 90% of all smileys and other exotic characters that the touch keyboard is able to type. Maybe CJK uses 2-byte wchars with a different encoding scheme, one that doesn't rely on surrogates? I'm not really familiar with multibyte encodings other than unicode.

Oh-kay, i think at this point this patch should be good to go into gtk+ master, as long as it compiles and doesn't break anything. I'll try the gtk2 backport later today, or maybe tomorrow, and if it works i'll push both.

Comment 18 LRN 2016-07-28 15:50:51 UTC

Created attachment 332285
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (gtk-2-24)

Port of attachment 332040 to gtk-2-24 branch.
Attachment 332045, now obsolete, had screwed up indentation.

Comment 19 LRN 2016-07-28 15:53:32 UTC

Comment on attachment 332285
GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET (gtk-2-24)

Attachment 332285 pushed into branch gtk-2-24 as b7c92fb - GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

Comment 20 LRN 2016-07-28 15:56:06 UTC

Attachment 332040 pushed as 2233566 - GDK W32: Support UTF-16 surrogate pairs passed via VK_PACKET

Comment 21 Arnav Singh 2016-07-28 16:10:51 UTC

>Attachment 332045, now obsolete, had screwed up indentation.

For my info, both 332045 and 332285 mix tabs and spaces for indentation in a way that only works if tabs are 4 spaces. What makes one "screwed up" and the other not?

Comment 22 LRN 2016-07-28 16:16:32 UTC

Because attachment 332285 mixes tabs and spaces for indentation in a way that only works if tabs are 8 spaces, not 4.

Which is why i usually convert indentation to spaces-only - can't screw that up with wrong assumptions about tab size. In this case i didn't, for some reason.