Bug 537457 – Support compose sequences that produce two+ codepoints




Summary:	Support compose sequences that produce two+ codepoints


Status:	RESOLVED OBSOLETE

Product:	gtk+
Classification:	Platform
Component:	Input Methods
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Simos Xenitellis
QA Contact:	gtk-bugs

URL:	https://gitlab.gnome.org/GNOME/gtk/is...
Whiteboard:

Duplicates:	114430 (view as bug list)
Depends on:
Blocks:

Reported:	2008-06-09 18:55 UTC by Simos Xenitellis
Modified:	2018-04-17 08:18 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Updated gtkimcontextsimple.c, adds check_compose_multi(), for multiple codepoints. (4.24 KB, patch) 2008-09-09 02:49 UTC, Simos Xenitellis	reviewed
New file, autogenerated, with compose sequences from upstream (X.Org) (1.47 KB, patch) 2008-09-09 02:52 UTC, Simos Xenitellis	reviewed
Adds support for multiple codepoints (updates check_table()) (1.75 KB, patch) 2009-01-29 18:11 UTC, Simos Xenitellis	needs-work
Autogenerated file with compose sequences that produce >1 codepoints (2.00 KB, patch) 2009-01-29 18:13 UTC, Simos Xenitellis	none

Description Simos Xenitellis 2008-06-09 18:55:11 UTC

Currently, GTK+ IM supports compose sequences that produce a single codepoint.
This is not sufficient for several keyboard layouts and languages.

There have been requests such as bug #341341 and bug #345254 which can be easily solved with the existing infrastructure (the compose sequence itself and the resulting codepoints correspond to each other, therefore we do not need to store additional codepoints per compose sequence).

However, there are compose sequences that cannot be optimised in the same way that Latin compose sequences can. For these, we would need an additional table, and we would have to search that table as well when looking up compose sequences.

An example of such compose sequences is

# Khmer digraphs
# A keystroke has to generate several characters, so they are defined
# in this file

<U17fb>    :   "ុះ"
<U17fc>    :   "ុំ"
<U17fd>    :   "េះ"
<U17fe>    :   "ោះ"
<U17ff>    :   "ាំ"

At the moment there are only these 5 compose sequences for Khmer that require this support.

A simple table that would hold these compose sequences should suffice for now.
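A minimal sketch of what such a table and its lookup could look like, in plain C. The names MultiComposeRow, khmer_multi_table and lookup_multi_compose() are hypothetical and are not taken from any GTK+ patch; the key column uses the Unicode value from the <Uxxxx> name, and the output codepoints are the two characters of each quoted string above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MULTI_MAX_CODEPOINTS 2  /* hypothetical: longest output in the table */

/* One row: the key that triggers the sequence, then its output,
 * padded with zeros if shorter than MULTI_MAX_CODEPOINTS. */
typedef struct {
    uint32_t key;
    uint32_t output[MULTI_MAX_CODEPOINTS];
} MultiComposeRow;

/* The five Khmer digraphs from the description, sorted by key. */
static const MultiComposeRow khmer_multi_table[] = {
    { 0x17fb, { 0x17bb, 0x17c7 } },
    { 0x17fc, { 0x17bb, 0x17c6 } },
    { 0x17fd, { 0x17c1, 0x17c7 } },
    { 0x17fe, { 0x17c4, 0x17c7 } },
    { 0x17ff, { 0x17b6, 0x17c6 } },
};

/* Looks up key, copies its codepoints into out and returns how many were
 * written, or 0 if the key has no multi-codepoint sequence. */
size_t
lookup_multi_compose (uint32_t key, uint32_t out[MULTI_MAX_CODEPOINTS])
{
    size_t n_rows = sizeof khmer_multi_table / sizeof khmer_multi_table[0];
    size_t r;

    assert (out != NULL);

    for (r = 0; r < n_rows; r++) {
        size_t n = 0;

        if (khmer_multi_table[r].key != key)
            continue;

        while (n < MULTI_MAX_CODEPOINTS && khmer_multi_table[r].output[n] != 0) {
            out[n] = khmer_multi_table[r].output[n];
            n++;
        }
        return n;
    }
    return 0;
}
```

With only five rows a linear scan is enough; a larger sorted table could use bsearch() instead.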

Comment 1 Abderrahim Kitouni 2008-06-12 16:53:57 UTC

I believe this is related to https://bugs.freedesktop.org/show_bug.cgi?id=8195

Comment 2 Simos Xenitellis 2008-06-19 11:40:14 UTC

I get conflicting reports on whether in XKB one is able to have compose sequences that produce 2+ codepoints.

A related report by Danilo gave me the impression that it is possible to have compose sequences that produce two or more codepoints in XKB, and that GTK+ was lacking this support.

Implementing such a feature in GTK+ but not in XKB (X.org) would not be prudent (we would effectively fork the compose sequences from what we have in Xorg). 

It would be great if someone could verify that compose sequences in XKB that produce 2+ codepoints really do work.

Comment 3 Khaled Hosny 2008-06-19 12:56:50 UTC

Well, I implemented a similar solution for Arabic and it does work. What I did was add some code points to the Arabic XKB layout and then map each of them to 2 code points in the X Compose file. When using the xim input method I get the expected result with GTK: one key produces two code points. I opened a bug report for those compose sequences, see https://bugs.freedesktop.org/show_bug.cgi?id=16426.

Comment 4 Matthias Clasen 2008-06-19 14:01:30 UTC

Simos, XKB has nothing to do with compose sequences. If you are talking about compose sequences on the X layer, you mean XIM.

Comment 5 Simos Xenitellis 2008-06-19 23:31:51 UTC

Matthias, thanks for the clarification.

Khaled, one issue regarding these compose sequences. I am wondering whether the following format is supported:

<UFEFB>	:	"لا" U0644 U0612 # LAM WITH ALEF
<UFEF7>	:	"لأ" U0644 U0618 # LAM WITH ALEF WITH HAMZA ABOVE

(the Uxxxx examples are for demonstration only; I probably got the wrong codepoints).

If that syntax is supported, it would make it easier when parsing, reading, storing, etc. Could you please try it out?

Comment 6 Khaled Hosny 2008-06-20 00:08:05 UTC

It doesn't seem to be supported, I replaced the other sequences with this:

<UFEFB> :       "ﻻ" U0644 U0627 # LAM WITH ALEF
<UFEF7> :       "ﻷ" U0644 U0623 # LAM WITH ALEF WITH HAMZA ABOVE
<UFEF9> :       "ﻹ" U0644 U0625 # LAM WITH ALEF WITH HAMZA BELOW
<UFEF5> :       "ﻵ" U0644 U0622 # LAM WITH ALEF WITH MADDA ABOVE

But then the sequences were ignored entirely and I got the original code point.

Comment 7 Simos Xenitellis 2008-09-09 02:49:15 UTC

Created attachment 118339
Updated gtkimcontextsimple.c, adds check_compose_multi(), for multiple codepoints.

Adds a function that checks a new custom table for compose sequences.
This table allows compose sequences to produce more than one codepoint.

Comment 8 Simos Xenitellis 2008-09-09 02:52:14 UTC

Created attachment 118341
New file, autogenerated, with compose sequences from upstream (X.Org)

Autogenerated file from a script.
The script parses the X.Org Compose file and identifies sequences that produce 2+ characters. It then determines the longest sequence and the longest resulting codepoint string, and produces a table sized to accommodate all sequences.

Covers Khmer and Arabic.
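The counting step the script performs can be sketched in C as follows. This is not the script itself (compose-parse.py is written in Python); count_result_codepoints() and max_codepoint_len() are hypothetical names, and the sketch assumes valid UTF-8 input and no escaped quotes inside the result string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the codepoints in the first double-quoted string on a Compose
 * line. Returns 0 for comment lines and lines without a quoted string. */
size_t
count_result_codepoints (const char *line)
{
    const char *start, *end, *p;
    size_t n = 0;

    assert (line != NULL);

    if (line[0] == '#')
        return 0;
    start = strchr (line, '"');
    if (start == NULL)
        return 0;
    end = strchr (start + 1, '"');
    if (end == NULL)
        return 0;

    for (p = start + 1; p < end; p++) {
        /* Every byte that is not a UTF-8 continuation byte (10xxxxxx)
         * starts a new codepoint. */
        if (((unsigned char) *p & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Returns the largest codepoint count over all lines, i.e. the value a
 * generator would use as the maximum codepoint length of the table. */
size_t
max_codepoint_len (const char *const *lines, size_t n_lines)
{
    size_t max = 0;
    size_t i;

    for (i = 0; i < n_lines; i++) {
        size_t n = count_result_codepoints (lines[i]);
        if (n > max)
            max = n;
    }
    return max;
}
```

A line whose count is 2 or more is one that needs the new table; lines with a count of 1 are already covered by the existing one.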

Comment 9 Matthias Clasen 2008-09-27 04:53:03 UTC

+          for (i = 0; i < compose_multi_max_codepoint_len; i++ )
+	  	gtk_im_context_simple_commit_char (GTK_IM_CONTEXT (context_simple), seq[compose_multi_max_sequence_len + i]);

Does this mean every multi-char sequence must produce the exact same number of chars ? 


There is a weird empty comment in the patch:

+/*
+ *
+ *
+ *
+ *
+ *
+ */

That should be removed.


gtkimcontextsimpleseqs.h has a nice comment explaining how it was generated.
Is the multi-sequence table also generated that way ? Would be nice to have a
comment in there.

Comment 10 Matthias Clasen 2008-09-28 03:36:40 UTC

See bug 114430 for an old bug and patch about the same thing.

Comment 11 Simos Xenitellis 2008-09-28 10:03:27 UTC

The updated script compose-parse.py goes through the X.org Compose file and reads all compose sequences that produce more than one Unicode character.
Then it finds the longest compose sequence and the longest Unicode string produced by any sequence.
Finally, it creates a custom table and sets two variables:

const gint compose_max_sequence_len = 1;
const gint compose_max_codepoint_len = 2;

which means that for the current set of compose sequences, the compose sequences have a maximum length of 1 (the current situation with Khmer and Arabic) and produce Unicode strings with a maximum length of 2.

At a later date, when running the script again, the above values may change. For example, the sequence length may increase. The script produces the proper variables, and the code continues to work.

With bug 114430, the size of the compose sequence table increases by around

3832 compose sequences * 2 bytes = 7664 bytes (or 15328 bytes when we support Plane1). This is due to the NUL that is added to the strings.

With this patch, the table size increases by around 50 (or 100, Plane1) bytes.
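The two estimates can be reproduced with a small C sketch. The function names are hypothetical. The row count of 9 for the separate table is an assumption (the five Khmer and four Arabic sequences quoted in this report); the real generated table may hold a different number of rows.

```c
#include <assert.h>
#include <stddef.h>

/* Extra bytes when every existing row grows by one element, which is the
 * cost attributed above to the bug 114430 approach. */
size_t
per_row_overhead_bytes (size_t n_rows, size_t element_size)
{
    assert (element_size > 0);
    return n_rows * element_size;
}

/* Bytes used by a separate table that stores only the multi-codepoint
 * sequences: each row holds the sequence followed by its output. */
size_t
separate_table_bytes (size_t n_rows, size_t max_seq_len,
                      size_t max_codepoint_len, size_t element_size)
{
    assert (element_size > 0);
    return n_rows * (max_seq_len + max_codepoint_len) * element_size;
}
```

With 2-byte elements, 3832 rows cost 7664 extra bytes, while an assumed 9 rows of 1 + 2 elements take 54 bytes. With 4-byte elements the figures are 15328 and 108 bytes.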

I see that bug 114430 provides a more elegant solution. 

I am happy to work with either solution you suggest.

Comment 12 Matthias Clasen 2008-09-29 03:44:22 UTC

> which means that for the current set of compose sequences, the compose
> sequences has max length 1 (current situation with Khmer, Arabic), and produce
> Unicode characters of max length 2.

So it is true that all sequences in the multi compose table must produce codepoints of the same length ? I kinda expected something like

+          for (i = 0; i < compose_multi_max_codepoint_len && seq[compose_multi_max_sequence_len + i] != 0; i++ )
+               gtk_im_context_simple_commit_char (GTK_IM_CONTEXT (context_simple), seq[compose_multi_max_sequence_len + i]);

to allow for shorter codepoints, padded with zeros.
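A self-contained sketch of that zero-padded loop, written without GLib so it can be compiled on its own. collect_row_output() is a hypothetical helper; it stores the codepoints in an array where the real code would call gtk_im_context_simple_commit_char() for each one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A row holds max_sequence_len key values followed by max_codepoint_len
 * output slots. Outputs shorter than max_codepoint_len are padded with 0.
 * Copies the used output slots into out and returns how many there were. */
size_t
collect_row_output (const uint16_t *row,
                    size_t          max_sequence_len,
                    size_t          max_codepoint_len,
                    uint32_t       *out)
{
    size_t i;

    assert (row != NULL && out != NULL);

    /* Stop at the first zero slot, so shorter outputs are allowed. */
    for (i = 0; i < max_codepoint_len && row[max_sequence_len + i] != 0; i++)
        out[i] = row[max_sequence_len + i];

    return i;
}
```

A row such as { 0x17fb, 0x17bb, 0x17c7 } yields two codepoints, while a padded row such as { 0x0041, 0x0042, 0x0000 } yields one.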


> I am happy to work with either solution you suggest.

I think I'll go with the more compact tables. But maybe we can steal some of the refactoring from the other patch (the various commit helpers).

Comment 13 Simos Xenitellis 2008-09-29 22:00:15 UTC

(In reply to comment #12)
> > which means that for the current set of compose sequences, the compose
> > sequences has max length 1 (current situation with Khmer, Arabic), and produce
> > Unicode characters of max length 2.
> 
> So it is true that all sequences in the multi compose table must produce
> codepoints of the same length ? 

There could be sequences that produce three or more codepoints. It might be rare, but I expect that it could very well happen.

The number of codepoints is determined by the person who writes the keyboard layout, and the choices she makes in the design of the layout.

> I kinda expected something like
> 
> +          for (i = 0; i < compose_multi_max_codepoint_len &&
> seq[compose_multi_max_sequence_len + i] != 0; i++ )
> +               gtk_im_context_simple_commit_char (GTK_IM_CONTEXT
> (context_simple), seq[compose_multi_max_sequence_len + i]);
> 
> to allow for shorter codepoints, padded with zeros.

Indeed I missed that part for shorter codepoints.

> 
> > I am happy to work with either solution you suggest.
> 
> I think I'll go with the more compact tables. But maybe we can steal some of
> the refactoring from the other patch (the various commit helpers).
> 

I'll be looking into these in the following weeks.

Comment 14 Matthias Clasen 2008-11-12 16:13:48 UTC

*** Bug 114430 has been marked as a duplicate of this bug. ***

Comment 15 Matthias Clasen 2008-11-12 16:14:53 UTC

Simos, any update on this ?

Comment 16 Matthias Clasen 2009-01-11 21:38:49 UTC

We could still get this in 2.16, if a new patch shows up quickly

Comment 17 Tor Lillqvist 2009-01-29 07:41:50 UTC

Getting this in also is a requirement for fixing some behaviour that annoys users on Windows. So please try to get this in soonish.

They expect to be able to type a dead accent key twice and then actually get two copies of the corresponding spacing accent. Mainly this seems to be used for frivolous purposes like silly nicknames in IRC and for emoticons, so it is not that serious, but many users of Pidgin and XChat seem to be annoyed.

Another more serious issue that can be fixed only if this bug is fixed is that at least on the "US International" keyboard, there is no separate plain apostrophe key, just a dead acute (which looks like an apostrophe, though) that has a weird expected behaviour: It should combine only with a small number of following characters, not 's' for instance, and if some other character follows, an ASCII apostrophe and that following character are expected to be input.

I guess in general, the expected behaviour on Windows is that if the following key after a dead accent key is not something that can be combined with the accent, you should not get a beep and both keys discarded, but two characters: the corresponding spacing accent (or, in some cases, apostrophe instead of spacing acute), and the second key's corresponding character. My personal opinion is that it would be more cool if we then would in true Unicode fashion get the codepoint for the second key and the codepoint for a combining accent, but I guess we should match "native" behaviour.
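The fallback behaviour described above could be sketched like this. It is an illustration only, not GTK+ or Windows code: resolve_dead_key() and both tables are hypothetical, only two dead keys and four combinations are listed, and the second key is assumed to map directly to a Unicode character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* X11 keysym values for two dead keys. */
#define KEY_DEAD_GRAVE 0xfe50u
#define KEY_DEAD_ACUTE 0xfe51u

typedef struct {
    uint32_t dead_key;  /* dead-key keysym */
    uint32_t spacing;   /* spacing accent emitted on fallback */
} DeadKeyInfo;

typedef struct {
    uint32_t dead_key;
    uint32_t base;
    uint32_t composed;
} DeadKeyCombo;

static const DeadKeyInfo dead_keys[] = {
    { KEY_DEAD_GRAVE, 0x0060 },  /* GRAVE ACCENT */
    { KEY_DEAD_ACUTE, 0x00b4 },  /* ACUTE ACCENT */
};

static const DeadKeyCombo combos[] = {
    { KEY_DEAD_GRAVE, 'a', 0x00e0 },
    { KEY_DEAD_GRAVE, 'e', 0x00e8 },
    { KEY_DEAD_ACUTE, 'a', 0x00e1 },
    { KEY_DEAD_ACUTE, 'e', 0x00e9 },
};

/* Resolves a dead key followed by a second key. Writes one codepoint if
 * the pair combines, otherwise two: the spacing accent and then either a
 * second spacing accent (dead key typed twice) or the second character.
 * Returns the number of codepoints written, or 0 for an unknown dead key. */
size_t
resolve_dead_key (uint32_t dead_key, uint32_t next, uint32_t out[2])
{
    uint32_t spacing = 0;
    size_t i;

    assert (out != NULL);

    for (i = 0; i < sizeof dead_keys / sizeof dead_keys[0]; i++)
        if (dead_keys[i].dead_key == dead_key)
            spacing = dead_keys[i].spacing;
    if (spacing == 0)
        return 0;

    for (i = 0; i < sizeof combos / sizeof combos[0]; i++) {
        if (combos[i].dead_key == dead_key && combos[i].base == next) {
            out[0] = combos[i].composed;
            return 1;
        }
    }

    out[0] = spacing;
    out[1] = (next == dead_key) ? spacing : next;
    return 2;
}
```

A layout such as "US International" could put an ASCII apostrophe (0x0027) in the spacing column instead of the spacing acute.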

Comment 18 Tor Lillqvist 2009-01-29 07:48:07 UTC

Hmm, now that I look at the patch in comment #7 more closely, do I understand correctly that it already *is* fully possible to just call gtk_im_context_simple_commit_char() several times? If so, then the bugs I just marked as depending on this bug don't actually depend on it. I will experiment.

Comment 19 Tor Lillqvist 2009-01-29 13:46:01 UTC

Yes indeed. Removing the dependency info.

Comment 20 Simos Xenitellis 2009-01-29 18:11:43 UTC

Created attachment 127475
Adds support for multiple codepoints (updates check_table())

This is a reworked patch, that expands check_table() so that it works for the new type of compose table where the codepoint length can be bigger than 1.

Two issues: 
1. 
-  gint row_stride = table->max_seq_len + 2; 
+  gint row_stride = table->max_seq_len + table->max_codepoint_len; 

I am not sure why the row_stride used to be '+2' with the old table. Shouldn't it be +1 (one codepoint)?

2. I changed "struct _GtkComposeTable" so that it now contains an extra field for 'max_codepoint_len'. I suppose this is an API change which requires special handling before it can be included.

Comment 21 Simos Xenitellis 2009-01-29 18:13:23 UTC

Created attachment 127476
Autogenerated file with compose sequences that produce >1 codepoints

This is a new file (generated with compose-parse.py, http://github.com/simos/compose-parse/)

Comment 22 Matthias Clasen 2009-01-31 04:47:13 UTC

Simos, I don't see why GtkComposeTable would have API relevance. It is not exposed in the headers.

However, there are some things that need more work here:

- To learn about the +2, look at the docs of gtk_im_context_simple_add_table: the two guint16 are interpreted as the high and low words of a gunicode value.

- The add_table function needs to set max_codepoint_len to 1 (or 2, depending if you want to count codepoints or guint16 words)
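To illustrate the +2, here is a sketch of how the last two 16-bit slots of a row combine into one Unicode value. It uses stdint types in place of guint16 and gunichar, and row_stride_for() and row_result() are hypothetical helpers rather than GTK+ functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each row holds max_seq_len key values followed by two 16-bit words:
 * the high word and the low word of the resulting Unicode value. */
size_t
row_stride_for (size_t max_seq_len)
{
    return max_seq_len + 2;
}

/* Decodes the result stored in the last two slots of a row. */
uint32_t
row_result (const uint16_t *row, size_t max_seq_len)
{
    size_t row_stride = row_stride_for (max_seq_len);

    assert (row != NULL);

    return ((uint32_t) row[row_stride - 2] << 16) | row[row_stride - 1];
}
```

For a Basic Multilingual Plane result the high word is simply 0, which is why the table looks as if it stored a single codepoint plus one spare slot.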

Comment 23 Khaled Hosny 2011-01-14 13:13:17 UTC

Any chance of getting this fixed for gtk 3?

Comment 24 Matthias Clasen 2011-01-14 15:08:35 UTC

No, I don't think we have the time to finish this up for 3.0.
It can still happen for 3.2.

Comment 25 Mosaab Alzoubi 2013-08-01 02:36:13 UTC

2008 -----> 2013 
Any news ??

Comment 26 Matthias Clasen 2018-02-10 05:14:39 UTC

We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.

Comment 27 Mosaab Alzoubi 2018-02-10 18:05:41 UTC

It is still here. A ten-year-old bug, and still not solved!

Comment 28 André Klapper 2018-02-10 18:38:19 UTC

Mosaab Alzoubi: Age of a ticket is entirely irrelevant. If you want to see a bug solved, you have to provide a software patch. Open source projects do not have unlimited developers and developers are free to work on what they want. Thanks.

Comment 29 Khaled Hosny 2018-02-10 20:37:12 UTC

There is already a patch waiting for review since 2009, see comment 21.

Comment 30 Matthias Clasen 2018-04-15 00:03:25 UTC

As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will not be migrated, but closed out in bugzilla.

If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab:

https://gitlab.gnome.org/GNOME/gtk/issues/new

Comment 31 Samuel Thibault 2018-04-16 20:54:58 UTC

I have posted https://gitlab.gnome.org/GNOME/gtk/issues/186