GNOME Bugzilla – Bug 537457
Support compose sequences that produce two+ codepoints
Last modified: 2018-04-17 08:18:26 UTC
Currently, GTK+ IM supports compose sequences that produce a single codepoint.
This is not sufficient for several keyboard layouts and languages.
There have been requests such as bug #341341 and bug #345254 which can be easily solved with the existing infrastructure (the compose sequence itself and the resulting codepoints correspond to each other, therefore we do not need to store additional codepoints per compose sequence).
However, there are compose sequences that cannot be optimised in the same way that Latin compose sequences can. For these, we would need an additional table, and the lookup would have to search that table as well when searching through the compose sequences.
An example of such compose sequences is:
# Khmer digraphs
# A keystroke has to generate several characters, so they are defined
# in this file
<U17fb> : "ុះ"
<U17fc> : "ុំ"
<U17fd> : "េះ"
<U17fe> : "ោះ"
<U17ff> : "ាំ"
At the moment, only these 5 Khmer compose sequences require this support.
A simple table that would hold these compose sequences should suffice for now.
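A minimal sketch of what such a simple table could look like, assuming one 16-bit input codepoint per row and exactly two output codepoints. The type `MultiComposeEntry` and the function `lookup_multi_compose()` are hypothetical names for illustration and are not part of GTK+; the values are the five Khmer sequences listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row: the codepoint sent by the keyboard layout, followed by
 * the two codepoints it should expand to. */
typedef struct {
    uint16_t key;
    uint16_t output[2];
} MultiComposeEntry;

/* The five Khmer digraphs from the Compose excerpt above. */
const MultiComposeEntry khmer_multi_table[] = {
    { 0x17fb, { 0x17bb, 0x17c7 } },
    { 0x17fc, { 0x17bb, 0x17c6 } },
    { 0x17fd, { 0x17c1, 0x17c7 } },
    { 0x17fe, { 0x17c4, 0x17c7 } },
    { 0x17ff, { 0x17b6, 0x17c6 } },
};

/* Return the matching row, or NULL if the key has no multi-codepoint output. */
const MultiComposeEntry *lookup_multi_compose(uint16_t key)
{
    size_t n = sizeof khmer_multi_table / sizeof khmer_multi_table[0];
    for (size_t i = 0; i < n; i++) {
        if (khmer_multi_table[i].key == key)
            return &khmer_multi_table[i];
    }
    return NULL;
}
```

A linear scan is enough for five rows; a larger table would presumably be kept sorted and searched with bsearch().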
I believe this is related to https://bugs.freedesktop.org/show_bug.cgi?id=8195
I get conflicting reports on whether XKB allows compose sequences that produce 2+ codepoints.
A related report by Danilo gave me the impression that compose sequences producing two or more codepoints are possible in XKB, and that it is GTK+ that lacks the support.
Implementing such a feature in GTK+ but not in XKB (X.org) would not be prudent (we would effectively fork the compose sequences from what we have in Xorg).
It would be great if someone could verify that compose sequences in XKB that produce 2+ codepoints really do work.
Well, I implemented a similar solution for Arabic and it does work. What I did was add some code points to the Arabic XKB layout and then map each of them to 2 code points in the X Compose file. When using the xim input method I get the expected result with GTK: one key produces two code points. I opened a bug report for those compose sequences, see https://bugs.freedesktop.org/show_bug.cgi?id=16426.
Simos, XKB has nothing to do with compose sequences. If you are talking about compose sequences on the X layer, you mean XIM.
Matthias, thanks for the clarification.
Khaled, one issue regarding these compose sequences. I am wondering whether the following format is supported:
<UFEFB> : "لا" U0644 U0612 # LAM WITH ALEF
<UFEF7> : "لأ" U0644 U0618 # LAM WITH ALEF WITH HAMZA ABOVE
(the Uxxxx examples are for demonstration only; I probably got the wrong codepoints).
If that syntax is supported, it would make it easier when parsing, reading, storing, etc. Could you please try it out?
It doesn't seem to be supported. I replaced the other sequences with this:
<UFEFB> : "ﻻ" U0644 U0627 # LAM WITH ALEF
<UFEF7> : "ﻷ" U0644 U0623 # LAM WITH ALEF WITH HAMZA ABOVE
<UFEF9> : "ﻹ" U0644 U0625 # LAM WITH ALEF WITH HAMZA BELOW
<UFEF5> : "ﻵ" U0644 U0623 # LAM WITH ALEF WITH MADDA ABOVE
But then they were ignored entirely, and I get the original code point.
Created attachment 118339
Updated gtkimcontextsimple.c, adds check_compose_multi(), for multiple codepoints.
Adds a function that checks a new custom table for compose sequences.
This table allows compose sequences to produce more than one codepoint.
Created attachment 118341
New file, autogenerated, with compose sequences from upstream (X.Org)
Autogenerated file from a script.
The script parses the X.Org Compose file and identifies sequences that produce 2+ characters. It then checks all of those sequences and sizes the table rows using the longest sequence and the longest codepoint output, so that every sequence fits.
Covers Khmer and Arabic.
+ for (i = 0; i < compose_multi_max_codepoint_len; i++ )
+ gtk_im_context_simple_commit_char (GTK_IM_CONTEXT (context_simple), seq[compose_multi_max_sequence_len + i]);
Does this mean every multi-char sequence must produce the exact same number of chars?
There is a weird empty comment in the patch:
That should be removed.
gtkimcontextsimpleseqs.h has a nice comment explaining how it was generated.
Is the multi-sequence table also generated that way? Would be nice to have a
comment in there.
See bug 114430 for an old bug and patch about the same thing.
The updated script compose-parse.py goes through the X.org Compose file and reads all compose sequences that produce more than one Unicode character.
Then, it finds the longest compose sequence and the longest Unicode string produced by any of the sequences.
Finally, it creates a custom table and sets two variables,
const gint compose_max_sequence_len = 1;
const gint compose_max_codepoint_len = 2;
which means that for the current set of compose sequences, the compose sequences have max length 1 (current situation with Khmer, Arabic), and produce Unicode strings of max length 2.
At a later date, when running the script again, the above values may change. For example, the sequence length may increase. The script produces the proper variables, and the code continues to work.
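As a rough sketch of the layout described above, assuming a flat array of 16-bit values in which each row holds the sequence followed by its output codepoints. The two constants use the names and values quoted above; `compose_multi_table`, `compose_row_stride()`, `compose_n_rows()` and `compose_find_output()` are hypothetical, and the real generated file may be laid out differently.

```c
#include <stddef.h>
#include <stdint.h>

/* The two values the script is described as emitting. */
const int compose_max_sequence_len = 1;
const int compose_max_codepoint_len = 2;

/* Hypothetical generated table: sequence columns first, then output columns. */
const uint16_t compose_multi_table[] = {
    0x17fb, 0x17bb, 0x17c7,   /* Khmer */
    0x17fc, 0x17bb, 0x17c6,   /* Khmer */
    0xfefb, 0x0644, 0x0627,   /* Arabic LAM WITH ALEF */
};

/* Number of 16-bit values per row. */
size_t compose_row_stride(void)
{
    return (size_t) (compose_max_sequence_len + compose_max_codepoint_len);
}

size_t compose_n_rows(void)
{
    size_t n_values = sizeof compose_multi_table / sizeof compose_multi_table[0];
    return n_values / compose_row_stride();
}

/* Return a pointer to the output columns of the row matching seq, or NULL.
 * Sequences shorter than the maximum are assumed to be zero-padded. */
const uint16_t *compose_find_output(const uint16_t *seq, size_t seq_len)
{
    size_t max_len = (size_t) compose_max_sequence_len;
    size_t stride = compose_row_stride();

    if (seq_len == 0 || seq_len > max_len)
        return NULL;

    for (size_t r = 0; r < compose_n_rows(); r++) {
        const uint16_t *row = compose_multi_table + r * stride;
        size_t i;
        for (i = 0; i < max_len; i++) {
            uint16_t want = i < seq_len ? seq[i] : 0;
            if (row[i] != want)
                break;
        }
        if (i == max_len)
            return row + max_len;
    }
    return NULL;
}
```

Because the lookup only reads the two constants, regenerating the table with a longer sequence or output length changes the stride without touching the search code, which is the property the comment above relies on.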
With bug 114430, the size of the compose sequence table increases by around 3832 compose sequences * 2 bytes = 7664 bytes (or 15328 bytes when we support Plane1). This is due to the NUL that is added to the strings.
With this patch, the table size increases by around 50 bytes (or 100 bytes with Plane1).
I see that bug 114430 provides a more elegant solution.
I am happy to work with either solution you suggest.
> which means that for the current set of compose sequences, the compose
> sequences has max length 1 (current situation with Khmer, Arabic), and produce
> Unicode characters of max length 2.
So is it true that all sequences in the multi compose table must produce codepoints of the same length? I kinda expected something like
+ for (i = 0; i < compose_multi_max_codepoint_len && seq[compose_multi_max_sequence_len + i] != 0; i++ )
+ gtk_im_context_simple_commit_char (GTK_IM_CONTEXT (context_simple), seq[compose_multi_max_sequence_len + i]);
to allow for shorter codepoints, padded with zeros.
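The zero-padding idea can be sketched in isolation as follows. This is only an illustration: `commit_char()` is a stand-in that records what would be passed to gtk_im_context_simple_commit_char(), and the row widths and example rows are assumptions chosen so that one row is shorter than the maximum.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed row shape for this sketch: 1 sequence column, up to 3 output columns. */
enum { MAX_SEQUENCE_LEN = 1, MAX_CODEPOINT_LEN = 3 };

/* Stand-in for gtk_im_context_simple_commit_char(): records each character. */
uint16_t committed[16];
size_t n_committed = 0;

void commit_char(uint16_t ch)
{
    if (n_committed < sizeof committed / sizeof committed[0])
        committed[n_committed++] = ch;
}

/* Commit the output columns of a row, stopping early at zero padding.
 * Returns the number of characters committed. */
size_t commit_row_output(const uint16_t *row)
{
    size_t i;
    for (i = 0; i < (size_t) MAX_CODEPOINT_LEN && row[MAX_SEQUENCE_LEN + i] != 0; i++)
        commit_char(row[MAX_SEQUENCE_LEN + i]);
    return i;
}

/* Two output codepoints followed by one column of zero padding. */
const uint16_t short_row[] = { 0x17fb, 0x17bb, 0x17c7, 0x0000 };
/* Hypothetical row that uses all three output columns. */
const uint16_t full_row[] = { 0x0061, 0x0062, 0x0063, 0x0064 };
```

With this loop, commit_row_output(short_row) commits two characters and commit_row_output(full_row) commits three, so rows of different output lengths can share one table.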
> I am happy to work with either solution you suggest.
I think I'll go with the more compact tables. But maybe we can steal some of the refactoring from the other patch (the various commit helpers).
(In reply to comment #12)
> > which means that for the current set of compose sequences, the compose
> > sequences has max length 1 (current situation with Khmer, Arabic), and produce
> > Unicode characters of max length 2.
> So it is true that all sequences in the multi compose table must produce
> codepoints of the same length ?
There could be sequences that produce three or more codepoints. It might be rare, but I expect that it could very well happen.
The number of codepoints is determined by the person who writes the keyboard layout, and the choices she makes in the design of the layout.
> I kinda expected something like
> + for (i = 0; i < compose_multi_max_codepoint_len &&
> seq[compose_multi_max_sequence_len + i] != 0; i++ )
> + gtk_im_context_simple_commit_char (GTK_IM_CONTEXT
> (context_simple), seq[compose_multi_max_sequence_len + i]);
> to allow for shorter codepoints, padded with zeros.
Indeed, I missed the case of shorter codepoint strings.
> > I am happy to work with either solution you suggest.
> I think I'll go with the more compact tables. But maybe we can steal some of
> the refactoring from the other patch (the various commit helpers).
I'll be looking into these in the following weeks.
*** Bug 114430 has been marked as a duplicate of this bug. ***
Simos, any update on this?
We could still get this into 2.16 if a new patch shows up quickly.
Getting this in is also a requirement for fixing some behaviour that annoys users on Windows. So please try to get this in soonish.
They expect to be able to type a dead accent key twice and then actually get two copies of the corresponding spacing accent. Mainly this seems to be used for frivolous purposes like silly nicknames in IRC and for emoticons, so it is not that serious, but many users of Pidgin and XChat seem to be annoyed.
Another more serious issue that can be fixed only if this bug is fixed is that at least on the "US International" keyboard, there is no separate plain apostrophe key, just a dead acute (which looks like an apostrophe, though) that has a weird expected behaviour: It should combine only with a small number of following characters, not 's' for instance, and if some other character follows, an ASCII apostrophe and that following character are expected to be input.
I guess in general, the expected behaviour on Windows is that if the key following a dead accent key cannot be combined with the accent, you should not get a beep with both keys discarded, but two characters: the corresponding spacing accent (or, in some cases, an apostrophe instead of a spacing acute), and the second key's character. My personal opinion is that it would be cooler if we instead, in true Unicode fashion, got the codepoint for the second key followed by the codepoint for a combining accent, but I guess we should match "native" behaviour.
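A small sketch of the fallback described above, for the dead acute on a "US International" style layout. Everything here is an assumption for illustration: the combination list is deliberately tiny, `DeadKeyResult` and `resolve_dead_acute()` are hypothetical names, and the real Windows behaviour covers more characters than shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: one composed codepoint, or two separate ones. */
typedef struct {
    size_t len;
    uint32_t chars[2];
} DeadKeyResult;

typedef struct {
    uint32_t base;
    uint32_t composed;
} AcuteCombo;

/* A deliberately small subset of letters that combine with the acute. */
const AcuteCombo acute_combos[] = {
    { 'a', 0x00e1 }, { 'e', 0x00e9 }, { 'i', 0x00ed },
    { 'o', 0x00f3 }, { 'u', 0x00fa },
};

/* Resolve "dead acute, then next". If next combines, return the single
 * precomposed codepoint; otherwise return an ASCII apostrophe followed by
 * next, instead of beeping and discarding both keys. */
DeadKeyResult resolve_dead_acute(uint32_t next)
{
    DeadKeyResult result = { 0, { 0, 0 } };
    size_t n = sizeof acute_combos / sizeof acute_combos[0];

    for (size_t i = 0; i < n; i++) {
        if (acute_combos[i].base == next) {
            result.len = 1;
            result.chars[0] = acute_combos[i].composed;
            return result;
        }
    }

    result.len = 2;
    result.chars[0] = 0x0027;  /* apostrophe */
    result.chars[1] = next;
    return result;
}
```

Pressing the dead key twice could be modelled by passing the apostrophe itself as next, which yields two apostrophes. In either fallback case the caller ends up committing two codepoints.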
Hmm, now that I look at the patch in comment #7 more closely: do I understand correctly that it is already fully possible to just call gtk_im_context_simple_commit_char() several times? If so, then the bugs I just marked as depending on this bug don't actually depend on it. I will experiment.
Yes indeed. Removing the dependency info.
Created attachment 127475
Adds support for multiple codepoints (updates check_table())
This is a reworked patch that expands check_table() so that it works for the new type of compose table, where the codepoint length can be bigger than 1.
- gint row_stride = table->max_seq_len + 2;
+ gint row_stride = table->max_seq_len + table->max_codepoint_len;
I am not sure why the row_stride used to be '+2' with the old table. Shouldn't it be +1 (one codepoint)?
2. I changed "struct _GtkComposeTable" so that it now contains an extra field, 'max_codepoint_len'. I suppose this is an API change that needs special handling before it can be included.
Created attachment 127476
Autogenerated file with compose sequences that produce >1 codepoints
This is a new file (generated with compose-parse.py, http://github.com/simos/compose-parse/)
Simos, I don't see why GtkComposeTable would have API relevance. It is not exposed in the headers.
However, there are some things that need more work here:
- To learn about the +2, look at the docs of gtk_im_context_simple_add_table: the two guint16 are interpreted as the high and low words of a gunicode value.
- The add_table function needs to set max_codepoint_len to 1 (or 2, depending on whether you want to count codepoints or guint16 words).
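The +2 layout mentioned in the first point can be sketched like this, assuming each row is max_seq_len key symbols followed by two 16-bit words that form one 32-bit value. The helper names and the example rows are hypothetical; in particular, the key symbols in the Plane 1 row are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Row width of the existing table format: the sequence plus two 16-bit words. */
size_t legacy_row_stride(size_t max_seq_len)
{
    return max_seq_len + 2;
}

/* Combine the high and low words at the end of a row into one codepoint. */
uint32_t legacy_row_value(const uint16_t *row, size_t max_seq_len)
{
    uint32_t high = row[max_seq_len];
    uint32_t low = row[max_seq_len + 1];
    return (high << 16) | low;
}

/* dead_acute (0xfe51), then 'a', yields U+00E1; the high word is zero. */
const uint16_t bmp_row[] = { 0xfe51, 0x0061, 0x0000, 0x00e1 };
/* Made-up key symbols yielding U+1D11E, which needs a non-zero high word. */
const uint16_t plane1_row[] = { 0x1000, 0x1001, 0x0001, 0xd11e };
```

Seen this way, a table added through add_table carries one codepoint per row stored as two words, which is why max_codepoint_len would be 1 when counting codepoints and 2 when counting guint16 words.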
Any chance of getting this fixed for gtk 3?
No, I don't think we have the time to finish this up for 3.0.
It can still happen for 3.2.
2008 -----> 2013
Any news ??
We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.
It is still here. A ten-year-old bug, and still not solved!
Mosaab Alzoubi: Age of a ticket is entirely irrelevant. If you want to see a bug solved, you have to provide a software patch. Open source projects do not have unlimited developers and developers are free to work on what they want. Thanks.
There is already a patch waiting for review since 2009, see comment 21.
As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will not be migrated, but closed out in bugzilla.
If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab:
I have posted https://gitlab.gnome.org/GNOME/gtk/issues/186