Bug 321896 – Synch gdkkeysyms.h/gtkimcontextsimple.c with X.org 6.9/7.0




Summary:	Synch gdkkeysyms.h/gtkimcontextsimple.c with X.org 6.9/7.0


Status:	RESOLVED FIXED

Product:	gtk+
Classification:	Platform
Component:	Input Methods
Version:	2.8.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	Medium feature
Assigned To:	Simos Xenitellis
QA Contact:	gtk-bugs

URL:
Whiteboard:

Duplicates:	88639 162845 167940 324021 333710 504383
Depends on:
Blocks:	334075

Reported:	2005-11-20 00:19 UTC by Simos Xenitellis
Modified:	2008-12-10 02:10 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Updates gdkkeysyms.h with keysymdef.h from X.org 6.9/7.0 (3.59 KB, text/plain) 2005-11-20 02:28 UTC, Simos Xenitellis
WORK In PROGRESS - Updates gtkimcontextsimple.c automagically (13.74 KB, text/plain) 2006-01-23 17:34 UTC, Simos Xenitellis
Patch that updates gtkimcontextsimple.c to the latest Compose file in Xorg 7.x (46.63 KB, application/x-compressed-tar) 2006-04-18 09:52 UTC, Simos Xenitellis
Script that automagically updates gtkimcontextsimple.c from Compose.pre in Xorg. (14.11 KB, text/plain) 2006-04-18 09:55 UTC, Simos Xenitellis
Generated gdkkeysyms.h from keysymdef.h in Xorg 7.x (52.19 KB, text/plain) 2006-04-18 09:57 UTC, Simos Xenitellis
Script that automagically updates gtk+/gdk/gdkkeysyms.h from keysymdef.h in Xorg. (3.78 KB, text/plain) 2006-04-18 09:59 UTC, Simos Xenitellis
Fragment of compose sequence table which shows what we want to convert from. (3.22 KB, text/plain) 2006-07-16 15:51 UTC, Simos Xenitellis
Converted version of the preview fragment to optimise on memory. (4.04 KB, text/plain) 2006-07-16 16:00 UTC, Simos Xenitellis
Script that automagically updates gtkimcontextsimple.c from Compose.pre in Xorg, for the memory-optimised version of the table. (18.38 KB, text/plain) 2006-07-17 02:05 UTC, Simos Xenitellis
gtkimcontextsimple.c with the latest upstream Compose data, arranged to save memory. (94.65 KB, application/octet-stream) 2006-07-17 10:18 UTC, Simos Xenitellis
Updated generation script, updated compose table, move compose table to separate file. (568.47 KB, patch) 2007-07-04 20:25 UTC, Simos Xenitellis	none
Updated version of script that converts Xorg Compose.pre to gtk+ optimised table (23.63 KB, text/plain) 2007-07-19 23:17 UTC, Simos Xenitellis
Optimised file (generated with above script) (72.04 KB, application/x-compressed-tar) 2007-07-19 23:25 UTC, Simos Xenitellis
Rough implementation of table-less handling of dead accents (21.36 KB, patch) 2007-07-24 15:04 UTC, Tor Lillqvist	needs-work
Patch on top of Tor's patch to handle compose sequences algorithmically (4.47 KB, patch) 2008-01-13 00:59 UTC, Simos Xenitellis	needs-work
Script to parse the Xorg compose file, calculate memory savings, verify algorithmic function, etc. (6.59 KB, text/plain) 2008-01-13 01:18 UTC, Simos Xenitellis
Updated Python script that parses the Xorg compose file, provides stats, verifies algo-function, etc (29.16 KB, text/plain) 2008-01-30 15:44 UTC, Simos Xenitellis
Patch for gtkimcontextsimple.c to enable optimized/algorithmic (38.60 KB, patch) 2008-01-30 16:11 UTC, Simos Xenitellis	needs-work
Patch to gtk+ (HEAD) to update compose table (315.19 KB, patch) 2008-03-03 14:34 UTC, Simos Xenitellis	committed
Patch to gtk+ (HEAD) to update compose table (fixes one error, typos) (315.18 KB, patch) 2008-03-15 01:21 UTC, Simos Xenitellis	committed

Description Simos Xenitellis 2005-11-20 00:19:58 UTC

In the gtk+ library, the files
gdk/gdkkeysyms.h and
gtk/gtkimcontextsimple.c
contain information which comes from the X server. 
This information should be in synch.

These two files are severely out of date when compared to the current X.org
(6.9/7.0).
Specifically, 
gdk/gdkkeysyms.h: Has 1341 keysyms, but now X.org defines 1708 keysyms.
gtk/gtkimcontextsimple.c: Has 842 compose sequences, but now X.org defines 5545
of them.

There should be a way to easily update these files and keep them in synch with
upstream, with X.org.

Comment 1 Simos Xenitellis 2005-11-20 02:28:19 UTC

Created attachment 54955 [details]
Updates gdkkeysyms.h with keysymdef.h from X.org 6.9/7.0

Updates http://cvs.gnome.org/viewcvs/gtk%2B/gdk/gdkkeysyms.h from upstream
(X.org 6.9/7.0),
from http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h

Author	: Simos Xenitellis <simos at gnome dot org>.
Version : 1.0

Input	: http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h
Output	: http://cvs.gnome.org/viewcvs/gtk%2B/gdk/gdkkeysyms.h

Notes	: It downloads keysymdef.h from the Internet if not found locally
Notes	: and creates an updated gdkkeysyms.h (checks not to overwrite).

Comment 2 Simos Xenitellis 2005-11-20 12:08:03 UTC

*** Bug 167940 has been marked as a duplicate of this bug. ***

Comment 3 Simos Xenitellis 2005-11-20 14:15:21 UTC

The en_US.UTF-8 Compose file
http://cvs.freedesktop.org/xorg/xc/nls/Compose/en_US.UTF-8

appears not to be in sync with the keysymdef.h file
http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h

A bug has been logged for this at the FreeDesktop Bugzilla,
https://bugs.freedesktop.org/show_bug.cgi?id=5107

Comment 4 Simos Xenitellis 2005-11-20 16:05:45 UTC

The Compose file, 
http://cvs.freedesktop.org/xorg/xc/nls/Compose/en_US.UTF-8
contains Unicode codepoints in addition to keysyms.
<U0313> <Greek_alpha>	: "ἀ" U1F00 # GREEK SMALL LETTER ALPHA WITH PSILI

U0313 is COMBINING COMMA ABOVE, so a comparison is possible with 0x0313.

However, 
http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h?view=markup
has keysyms with values that conflict with Unicode.
For example, in the URL above, search for "Latin 4". 
You will notice the Latin 4 keysym group conflicts with the Greek Unicode block.

Pending this issue, the script is ready to update gtk/gtkimcontextsimple.c.
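
For illustration, a Compose line such as the one quoted above can be split into its keys, result string, result name and comment. This is only a minimal sketch in Python; the function names and the regular expressions are assumptions made for this example and are not taken from the attached script.

```python
import re

# Right-hand side of a Compose line: "string" [result name] [# comment]
RHS_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*([^\s#]\S*)?\s*(?:#\s*(.*))?$')
# Keys written as a Unicode codepoint, e.g. U0313
UNICODE_KEY_RE = re.compile(r"^U([0-9A-Fa-f]{4,6})$")


def parse_compose_line(line):
    """Return (keys, string, result_name, comment), or None if not a sequence."""
    lhs, sep, rhs = line.partition(":")
    if not sep:
        return None
    keys = re.findall(r"<([^>]+)>", lhs)
    match = RHS_RE.match(rhs)
    if not keys or match is None:
        return None
    return keys, match.group(1), match.group(2), match.group(3)


def unicode_key_codepoint(name):
    """Return the codepoint for a key like 'U0313', or None for a named keysym."""
    match = UNICODE_KEY_RE.match(name)
    return int(match.group(1), 16) if match else None
```

With the line above, the keys come out as U0313 and Greek_alpha; the first can be compared numerically with 0x0313, while the second still has to be looked up by name in keysymdef.h.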

Comment 5 Simos Xenitellis 2006-01-23 17:34:58 UTC

Created attachment 57953 [details]
WORK In PROGRESS - Updates gtkimcontextsimple.c automagically

To update the main structure in gtkimcontextsimple.c requires access to several files and combining them together. This script does exactly that.

It is marked as work in progress as the Xorg Compose file contains some constructs that I do not know how to process.

Comment 6 Simos Xenitellis 2006-02-16 14:50:35 UTC

Changing status to NEEDINFO.

This bug report is almost ready to be fixed. Some issues, described above, need to be attended to and we are done! :)

Comment 7 Matthias Clasen 2006-02-17 00:58:03 UTC

the way in which i would like to see this addressed
is by keeping the generated files in cvs. therefore,
it is not the end of the world if the script output
needs some manual tweaking...

Comment 8 Simos Xenitellis 2006-02-17 01:21:24 UTC

(In reply to comment #4)
> The Compose file, 
> http://cvs.freedesktop.org/xorg/xc/nls/Compose/en_US.UTF-8
> contains Unicode codepoints in addition to keysyms.
> <U0313> <Greek_alpha>	: "ἀ" U1F00 # GREEK SMALL LETTER ALPHA WITH PSILI
> 
> U0313 is COMBINING COMMA ABOVE, so a comparison is possible with 0x0313.
> 
> However, 
> http://cvs.freedesktop.org/xorg/xc/include/keysymdef.h?view=markup
> has keysyms with values that conflict with Unicode.
> For example, in the URL above, search for "Latin 4". 
> You will notice the Latin 4 keysym group conflicts with the Greek Unicode block.
> 
> Pending this issue, the script is ready to update gtk/gtkimcontextsimple.c.

The main issue is that the affected keysyms ("Latin 4" group but some others as well) should have 0x1000000 added to their values so that they do not conflict with real Unicode codepoints that may exist.

I filed a bug report on this, at
https://bugs.freedesktop.org/show_bug.cgi?id=5129

Comment 9 Simos Xenitellis 2006-03-24 23:35:54 UTC

The new upstream location of the Compose files for X.org modular (compared to monolithic) is
http://webcvs.freedesktop.org/xorg/lib/X11/nls/

The exact file is
http://webcvs.freedesktop.org/xorg/lib/X11/nls/en_US.UTF-8/Compose.pre?view=markup

Comment 10 Simos Xenitellis 2006-04-09 03:17:59 UTC

Bug 155010 has a patch that makes the compose sequences table configurable.
That is, the user would be able to override the built-in compose sequences with a configuration file found in, let's say, /etc/gtk+/compose/.

I am not sure if there are any performance issues with such a configuration.

In any case, both bug 155010 and this bug require to bring from upstream the new Compose file.

Comment 11 Matthias Clasen 2006-04-17 20:42:27 UTC

Simos, any update on this ? 

If I understand Daniel's comment on the fd.o bug correctly, what your script needs
to do is use existing legacy keysyms where they exist, and otherwise use 
Unicode keysyms with the added 0x1000000
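
A minimal sketch of that rule in Python, for illustration only. The offset 0x1000000 is the one given in comment 8; the function name and the two-entry legacy table are assumptions for this example (a real table would be generated from keysymdef.h).

```python
UNICODE_KEYSYM_OFFSET = 0x1000000

# Assumption: a tiny illustrative subset of legacy keysyms, keyed by codepoint.
LEGACY_KEYSYMS = {
    0x20AC: 0x20AC,  # XK_EuroSign, legacy value equals the codepoint
    0x03B1: 0x07E1,  # XK_Greek_alpha, legacy value differs from the codepoint
}


def keysym_for_codepoint(codepoint):
    """Prefer a legacy keysym if one exists, else build a Unicode keysym."""
    legacy = LEGACY_KEYSYMS.get(codepoint)
    if legacy is not None:
        return legacy
    return UNICODE_KEYSYM_OFFSET + codepoint
```

For U+20A0, which has no entry in the table, this yields 0x10020a0.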

Comment 12 Simos Xenitellis 2006-04-17 22:44:42 UTC

Matthias, there are a couple of questions that are still pending.

1. The Compose file has some keysyms of the form "combining_*" that I could not find the value of. I do not know where they are defined so I cannot assign them a value. One option could be to ignore the compose sequences that have them in GTK+ IM. I filed an issue on this, at
https://bugs.freedesktop.org/show_bug.cgi?id=5107

2. The Compose file has legacy and Unicode keysyms. The Unicode keysyms do not have 0x1000000 added to them yet in the current Compose file. As far as I understand, GTK+ IM does not depend on the content of the Compose file. Is that correct? Therefore, are we not blocked by https://bugs.freedesktop.org/show_bug.cgi?id=5129 ?
I assume there is already code in Xorg that understands x+0x1000000 keysyms.

Once we have a view on the two issues above, it should be easy to get patches.

Comment 13 Daniel Stone 2006-04-17 23:02:25 UTC

simos:
1) yeah, ignore this issue for the time being: i'll fix it a bit later on.
2) the Compose file and GTK are independent, so yes, you can freely ignore that.  however, as I explained in #5129, some legacy keysyms have co-incidences with Unicode keysyms, and you just need to ignore that: 0x31B2 is not guaranteed to be U+31B2, or whatever.

Comment 14 Simos Xenitellis 2006-04-18 09:52:01 UTC

Created attachment 63781 [details]
Patch that updates gtkimcontextsimple.c to the latest Compose file in Xorg 7.x

The patch applies to HEAD.

Comment 15 Simos Xenitellis 2006-04-18 09:55:04 UTC

Created attachment 63782 [details]
Script that automagically updates gtkimcontextsimple.c from Compose.pre in Xorg.

Updates gtk+/gtk/gtkimcontextsimple.c from Compose.pre found at Xorg 7.0.

Comment 16 Simos Xenitellis 2006-04-18 09:57:39 UTC

Created attachment 63783 [details]
Generated gdkkeysyms.h from keysymdef.h in Xorg 7.x

We used the script that is shown below to autogenerate the header file.

Comment 17 Simos Xenitellis 2006-04-18 09:59:50 UTC

Created attachment 63784 [details]
Script that automagically updates gtk+/gdk/gdkkeysyms.h from keysymdef.h in Xorg.

This script uses the new location of keysymdef.h of modular Xorg (7.x).

Comment 18 Simos Xenitellis 2006-04-19 19:33:05 UTC

Tor, I am adding you to this report as it affects GTK+/Windows as well (hope that's ok).

This report tries to update the compose sequence table in GTK+ (gtk+/gtk/gtkimcontextsimple.c, gtk+/gdk/gdkkeysyms.h) from upstream, Xorg 7.0.

Checking the history of
http://cvs.gnome.org/viewcvs/gtk+/gtk/gtkimcontextsimple.c
I can see that at least two compose sequences specific to Windows were added, as shown in bug 164859.

Is there a bigger list that can be merged or is it just the following lines in bug 164859?

+  GDK_Greek_accentdieresis,	GDK_Greek_iota,	0,	0,	0,	0x0390,	/* GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS */
+  GDK_Greek_accentdieresis,	GDK_Greek_upsilon,	0,	0,	0,	0x03B0,	/* GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS */

Comment 19 Simos Xenitellis 2006-04-19 19:35:09 UTC

Tor, now I am adding you really. Please see message above.

Comment 20 Tor Lillqvist 2006-04-19 20:30:08 UTC

I only know I added those two entries on Daniel Atallah's request. I don't really have any clue about Greek keyboards otherwise ;) Are the key sequences mentioned in bug #164859 not usable on Greek X11 keyboards?

Comment 21 Simos Xenitellis 2006-04-19 21:11:07 UTC

(In reply to comment #20)
> I only know I added those two entries on Daniel Atallah's request. I don't
> really have any clue about Greek keyboards otherwise ;) Are the key sequences
> mentioned in bug #164859 not usable on Greek X11 keyboards?
> 

I see. Windows has a specific key to produce accentdiaeresis while Xorg is commonly configured to produce one at a time (accent or diaeresis), so the addition of accentdiaeresis makes sense.
There will be a need for similar work when Greek Polytonic is added with these patches.

Comment 22 Simos Xenitellis 2006-04-21 01:38:06 UTC

I just tried out the patches listed above using jhbuild and they work.
I used the released gtk+ (the HEAD version does not compile today).

I tested with Hungarian where you can put a dot on letters and now it works (previously it was not available).
I tested with Spanish and there was no regression.
I also tested with Ancient Greek and it worked well. A few compose sequences, though, were not available due to inconsistencies in the Xorg file, which we are working on.

Comment 23 Danilo Segan 2006-05-07 12:47:43 UTC

Note that we might need extending gtk+ to support decomposed characters as well (for accented Cyrillic), i.e. a key sequence to result in several unicode characters instead of one (just like Compose files allow).

Comment 24 Simos Xenitellis 2006-05-08 18:13:28 UTC

(In reply to comment #23)
> Note that we might need extending gtk+ to support decomposed characters as well
> (for accented Cyrillic), i.e. a key sequence to result in several unicode
> characters instead of one (just like Compose files allow).
> 

Danilo,
Could you please file a bug report about this?

I could not create a test case for this.
I tried picking the Unicode characters from
http://webcvs.freedesktop.org/xorg/lib/X11/nls/en_US.UTF-8/Compose.pre?view=markup
and placing them in 
http://people.w3.org/rishida/scripts/uniview/conversion
I tried manually and I could not find characters composed of more than one character. It looks like all are precomposed?

This discussion can continue at the new bug report.

Comment 25 Danilo Segan 2006-05-10 21:53:16 UTC

Simos, it's reported as bug #341341 (I thought I already discussed this with Owen back in 2003, but I may be lost altogether ;).

Comment 26 Matthias Clasen 2006-05-11 16:00:58 UTC

ok, in order to stop blocking on this and make progress on this,
I compared your gdkkeysyms.h with the current one, and things look
mostly fine (ie just additions). The one thing I stumbled over was
XK_CURRENCY, where I see

-#define GDK_EcuSign 0x20a0
-#define GDK_ColonSign 0x20a1
-#define GDK_CruzeiroSign 0x20a2
-#define GDK_FFrancSign 0x20a3
-#define GDK_LiraSign 0x20a4
-#define GDK_MillSign 0x20a5
-#define GDK_NairaSign 0x20a6
-#define GDK_PesetaSign 0x20a7
-#define GDK_RupeeSign 0x20a8
-#define GDK_WonSign 0x20a9
-#define GDK_NewSheqelSign 0x20aa
-#define GDK_DongSign 0x20ab

+#define GDK_EcuSign 0x10020a0
+#define GDK_ColonSign 0x10020a1
+#define GDK_CruzeiroSign 0x10020a2
+#define GDK_FFrancSign 0x10020a3
+#define GDK_LiraSign 0x10020a4
+#define GDK_MillSign 0x10020a5
+#define GDK_NairaSign 0x10020a6
+#define GDK_PesetaSign 0x10020a7
+#define GDK_RupeeSign 0x10020a8
+#define GDK_WonSign 0x10020a9
+#define GDK_NewSheqelSign 0x10020aa
+#define GDK_DongSign 0x10020ab

why is this ? have the legacy keysyms been replaced by unicode ones
for XK_CURRENCY ?

Comment 27 Matthias Clasen 2006-05-11 16:45:47 UTC

Looking at the imcontext simple compose sequences, there is a fairly
obvious problem: with non-bmp keysyms, we need to go from guint16 to
guint32, and we also seem to have a lot more sequences. The table
size grows from 10116 to 113520, which is clearly a problem. At this
size, we should probably look at going from the flat representation 
+ bsearch to a tree

Comment 28 Simos Xenitellis 2006-05-11 16:53:15 UTC

(In reply to comment #26)
> ok, in order to stop blocking on this and make progress on this,
> I compared your gdkkeysyms.h with the current one, and things look
> mostly fine (ie just additions). The one thing I stumbled over was
> XK_CURRENCY, where I see
> 
> -#define GDK_EcuSign 0x20a0
> -#define GDK_ColonSign 0x20a1
> -#define GDK_CruzeiroSign 0x20a2
> -#define GDK_FFrancSign 0x20a3
> -#define GDK_LiraSign 0x20a4
> -#define GDK_MillSign 0x20a5
> -#define GDK_NairaSign 0x20a6
> -#define GDK_PesetaSign 0x20a7
> -#define GDK_RupeeSign 0x20a8
> -#define GDK_WonSign 0x20a9
> -#define GDK_NewSheqelSign 0x20aa
> -#define GDK_DongSign 0x20ab
> 
> +#define GDK_EcuSign 0x10020a0
> +#define GDK_ColonSign 0x10020a1
> +#define GDK_CruzeiroSign 0x10020a2
> +#define GDK_FFrancSign 0x10020a3
> +#define GDK_LiraSign 0x10020a4
> +#define GDK_MillSign 0x10020a5
> +#define GDK_NairaSign 0x10020a6
> +#define GDK_PesetaSign 0x10020a7
> +#define GDK_RupeeSign 0x10020a8
> +#define GDK_WonSign 0x10020a9
> +#define GDK_NewSheqelSign 0x10020aa
> +#define GDK_DongSign 0x10020ab
> 
> why is this ? have the legacy keysyms been replaced by unicode ones
> for XK_CURRENCY ? 
> 

According to 
http://webcvs.freedesktop.org/xorg/proto/X11/keysymdef.h?view=markup
only "XK_EuroSign" is a legacy keysym. The rest are Unicode keysyms.

Markus Kuhn did this change 10 months ago:
http://webcvs.freedesktop.org/xorg/proto/X11/keysymdef.h?r1=1.2&r2=1.3

Also notice that (same source as above)
#define XK_EcuSign                    0x10020a0  /* U+20A0 EURO-CURRENCY SIGN */
#define XK_EuroSign                      0x20ac  /* U+20AC EURO SIGN */

Also, according to 
http://www.unicode.org/charts/PDF/U20A0.pdf
XK_EuroSign is favoured over XK_EcuSign ("U+20A0 EURO-CURRENCY SIGN").

Comment 29 Matthias Clasen 2006-05-11 17:05:18 UTC

ok, that sounds good enough to me for the keysyms. I'll commit that part.

Comment 30 Matthias Clasen 2006-05-11 17:15:59 UTC

2006-05-11  Matthias Clasen  <mclasen@redhat.com>

	* gdk/gdkkeysyms.h: Regenerated from Xorg 7.1 keysyms.h, using
	gdkkeysyms-update.pl.

	* gdk/gdkkeysyms-update.pl: Script to sync gdkkeysyms.h
	with Xorg.  (#321896, Simos Xenitellis)

	* gdk/Makefile.am (EXTRA_DIST): Add gdkkeysyms-update.pl

Comment 31 Simos Xenitellis 2006-05-11 17:29:28 UTC

(In reply to comment #27)
> Looking at the imcontext simple compose sequences, there is a fairly
> obvious problem: with non-bmp keysyms, we need to go from guint16 to
> guint32, and we also seem to have a lot more sequences. The table
> size grows from 10116 to 113520, which is clearly a problem. At this
> size, we should probably look at going from the flat representation 
> + bsearch to a tree 
> 

It might also be good to split the compose sequences in both upstream (Xorg) and GTK+ into groups based on language, and let the end user decide through the configuration which languages are actually supported.
In Ubuntu, for example, you can pick and choose the writing aids for each of the supported languages.

As it is now, languages that a user may never write in are potentially available. For example, Ancient Greek (Polytonic) currently takes about 35-40% of the compose sequences.

For GNOME to manage different languages, something like bug 155010 would be able to help.

Is the imcontext simple compose sequences table loaded just once and shared between GTK+ applications?

Comment 32 Matthias Clasen 2006-05-11 19:11:50 UTC

The table is compiled into GTK+ itself, as const data.
Thus it is shared between apps.

Comment 33 Matthias Clasen 2006-06-19 19:32:53 UTC

The api affecting part of this has been committed; somebody needs to 
devise a compact table format for the additional sequences.

Comment 34 Simos Xenitellis 2006-06-20 14:46:44 UTC

As mentioned above, it looks suitable to use a tree structure for the table.
Looking at Glib, the N-ary tree (http://developer.gnome.org/doc/API/2.0/glib/glib-N-ary-Trees.html) might be a good option.

Is there a good way to represent (serialise?) a tree as text so that it is included verbatim in the GTK+ source code? 

Should the table instead be saved as is, with GTK+ parsing it on startup to create the tree?

Comment 35 Matthias Clasen 2006-06-20 15:01:03 UTC

No, I don't think using a runtime-generated pointerized tree structure like that
is the right approach. It should still be a compiled in array of numbers, just
being interpreted as a tree structure instead of the current flat table. Not sure
about the best way of doing that.

Comment 36 Simos Xenitellis 2006-07-16 15:51:14 UTC

Created attachment 68995 [details]
Fragment of compose sequence table which shows what we want to convert from.

This is a fragment of the compose table (array) that shows what we already have.

Notice that the first column has lots of repeats. This is the first area of optimisation.

Also notice that there are several 0s. This is the second area of optimisation.

Comment 37 Simos Xenitellis 2006-07-16 16:00:31 UTC

Created attachment 68996 [details]
Converted version of the preview fragment to optimise on memory. 

This is the suggested format, that will be generated by a script taking as input the Compose file from Xorg.

We save space by reducing the repetitions in the first column.
We also save space by eliminating the superfluous zeros.

Some figures for the space we save:
=====>
Some stats for you. We have 4730 lines, with 6 guint32s per line, total 113520 bytes
From all keysyms, 14190 have the value of zero and take up 56760 bytes.
By optimising on the zeros, we end up occupying 56760 bytes.
Also, we optimise on the first column as from each of the 4730 lines,
there are less than about 30 different keysyms. So we save a further approx. 18800 bytes.
So, total savings are 75560 bytes, we occupy 37960 bytes.
Of course, take into account some memory overhead to support the optimisation.
=====|
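
The figures above can be re-derived with a few lines of Python. This is only a check of the arithmetic, assuming 4-byte guint32 cells and exactly 30 distinct first-column keysyms (the text says "less than about 30"); the variable names are made up for this example.

```python
ROWS = 4730          # compose sequences in the flat table
COLUMNS = 6          # guint32 cells per row
CELL_BYTES = 4       # sizeof(guint32)
ZERO_CELLS = 14190   # cells whose value is zero
FIRST_KEYS = 30      # assumed number of distinct first-column keysyms

flat_bytes = ROWS * COLUMNS * CELL_BYTES
zero_bytes = ZERO_CELLS * CELL_BYTES
# Storing each distinct first-column keysym once instead of once per row.
first_column_saving = (ROWS - FIRST_KEYS) * CELL_BYTES
total_saving = zero_bytes + first_column_saving
remaining_bytes = flat_bytes - total_saving

print(flat_bytes, zero_bytes, first_column_saving, total_saving, remaining_bytes)
```

The result excludes whatever bookkeeping the optimised layout itself needs, as noted above.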

What matters is that the data structure is shared among GTK+ applications. As static const, I believe we achieve this.

If there are any comments at this stage to enhance the format, please add here.

The next step is to write the conversion script (easy) and then plug the structure in gtkimcontextsimple.c.

Comment 38 Simos Xenitellis 2006-07-17 02:05:34 UTC

Created attachment 69014 [details]
Script that automagically updates gtkimcontextsimple.c from Compose.pre in Xorg, for the memory-optimised version of the table.

Script that automagically updates gtkimcontextsimple.c from Compose.pre in Xorg, for the memory-optimised version of the table.

The script will create a patched up version of gtkimcontextsimple.c with the new data structure.

It is not usable yet as the code that implements the searching has not been adapted yet. Will do once I get a buildable GNOME using jhbuild.

Comment 39 Simos Xenitellis 2006-07-17 10:18:30 UTC

Created attachment 69030 [details]
gtkimcontextsimple.c with the latest upstream Compose data, arranged to save memory.

We obsolete the previous unoptimised file, however there is still a bit of work to do to recode the search algorithm. That is, this file does not let us compile yet.

Comment 40 Matthias Clasen 2006-07-18 06:00:59 UTC

Thanks for this work.

In its current form, this table needs relocations, since it uses pointers 
to point to the subtables, and thus it won't be shared (unless you use prelink).
You need to replace the pointers by offsets to arrive at something that can
be used without relocations.

Comment 41 Simos Xenitellis 2007-07-04 20:25:59 UTC

Created attachment 91207 [details] [review]
Updated generation script, updated compose table, move compose table to separate file.

Applies to trunk.

This is an updated version of the initial script:
a. we take the compose table out by putting it in a separate file
b. we generate a fresh compose table based on upstream
   1. after running  1,$s/U1000/U/g  (we verified that we did not touch the U1000 character)
   2. after removing U1xxxx (Plane 1) sequences. It's guint16 anyway.
   3. after replacing the Greek section with the one from the el_GR.UTF-8/Compose.pre upstream file
c. we add some auxiliary files generated by the script into .cvsignore. (ok for SVN?)

Files 
Patch: /gtk/gtkimcontextsimple.c
Added: /gtk/gtkimcontextsimpleseqs.h
Added: /gtk/compose-sequence-update.pl
Patch: /gtk/.cvsignore

Space calculations:
a. Currently, the compose table takes up 10164 bytes, with 847 entries (847x6x2)
b. New compose table without space optimisations takes up 54120 bytes, with 4510 entries (4510x6x2)
c. The first column in the compose table has many repetitions. If we eliminate it, the table will take up ~45100 bytes (4510x[5]x2), a saving of about 9000 bytes.
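
As a quick check of the three figures, here is the same arithmetic in Python, assuming 2-byte guint16 cells and ignoring the small overhead of keeping the first-column values somewhere else. The variable names are made up for this example.

```python
CELL_BYTES = 2   # sizeof(guint16)
COLUMNS = 6      # cells per row in the flat table

current_bytes = 847 * COLUMNS * CELL_BYTES             # a. existing table
updated_bytes = 4510 * COLUMNS * CELL_BYTES            # b. new table, unoptimised
trimmed_bytes = 4510 * (COLUMNS - 1) * CELL_BYTES      # c. first column removed
saving_bytes = updated_bytes - trimmed_bytes

print(current_bytes, updated_bytes, trimmed_bytes, saving_bytes)
```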

Complexity when optimising the table
gtkimcontextsimple.c does four operations on the compose table:
1. run bsearch()
2. use a pointer to an item
3. get the next item
4. get the previous item

To avoid the repetitions of the first column, we can use separate arrays based on the value of the first column. We generate about 30 such arrays.
In order to bsearch() through those arrays, we use a script that implements (=generates C code) binary search through nested conditional statements (done).

Overall, I find it would make the code quite complicated at this stage to squeeze these extra bytes.

This patch has been tested for the Greek language (Ancient Greek now works) and Latin (US International w/ dead keys).

Comment 42 Matthias Clasen 2007-07-06 05:33:51 UTC

I don't think the "get previous" and "get next" operations are actually
necessary for the gtkimcontextsimple use case. What is required is
the information "does it not match, match a prefix, or match exactly ?" 

The patch does not actually work, since it still uses guint16, while some of 
the keysyms in the table are larger than that by now. 

Looking a bit closer, there are 44 rows containing non-BMP keysyms. I'd 
propose to put those into a separate guint32 table to avoid blowing up the 
data size needlessly. 

Looking at the remaining BMP keysyms, there are two things we could do to 
reduce the size: 

- split the table by length of the sequence, since a lot of the entries
  are just 2 or 3 keys long. 

- looking at the first column, there are only 31 different starting symbols,
  so it might be worthwhile to split the first column off

That would lead to a table roughly of the following form, with off2 in the
startkey1 line pointing to the remainder of the first sequence of length 2
starting with startkey1, and so on:


 { /* offsets */
   startkey1, off2, off3, off4, off5,
   startkey2, off2, off3, off4, off5,
  ...

   startkey31, off2, off3, off4, off5,
   0, 
   /* sequences of length 2 */  
   key2, value,
   ...
   /* sequences of length 3 */
   key2, key3, value, 
   ...

 }

From a quick run over your tables, it looks like this table layout
would reduce the size of the BMP table from ~55k to ~30k, with the
non-BMP table being at ~1k.

It should still be possible to use bsearch() to find seq[0] in the offsets
part, and then use bsearch repeatedly to find seq[1]...seq[n] in the tables
of the right length.
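
To make the layout more concrete, here is a rough Python model of it. It is a sketch under stated assumptions and not the GTK+ implementation: the names build_table and lookup are made up, each index row carries one extra end offset after off5 so that every block has an explicit end, the blocks are grouped per start key, and the sample sequences are only illustrative. The lookup answers the question from the top of this comment: no match, prefix match, or exact match.

```python
from bisect import bisect_left

MAX_LEN = 5  # longest sequence, matching the off2..off5 columns

# Illustrative data: dead_acute (0xFE51) and Multi_key (0xFF20) sequences.
SEQUENCES = {
    (0xFE51, ord("a")): 0x00E1,
    (0xFE51, ord("e")): 0x00E9,
    (0xFF20, ord("o"), ord("c")): 0x00A9,
}


def build_table(sequences):
    """Pack {(key1, key2, ...): value} into (index, data) lists of integers."""
    index, data = [], []
    for start in sorted({seq[0] for seq in sequences}):
        row = [start]
        for length in range(2, MAX_LEN + 1):
            row.append(len(data))  # offset of this start key's block for `length`
            block = sorted(
                seq[1:] + (value,)
                for seq, value in sequences.items()
                if seq[0] == start and len(seq) == length
            )
            for entry in block:
                data.extend(entry)
        row.append(len(data))  # assumption: extra end offset closing the last block
        index.append(row)
    return index, data


def lookup(index, data, typed):
    """Return ('exact', value), ('prefix', None) or ('none', None)."""
    starts = [row[0] for row in index]
    pos = bisect_left(starts, typed[0])
    if pos == len(starts) or starts[pos] != typed[0]:
        return ("none", None)
    row = index[pos]
    tail = tuple(typed[1:])
    for length in range(max(2, len(typed)), MAX_LEN + 1):
        begin, end = row[length - 1], row[length]
        # Each entry holds (length - 1) remaining keys followed by one value.
        keys = [tuple(data[i:i + length - 1]) for i in range(begin, end, length)]
        i = bisect_left(keys, tail)
        if i < len(keys) and keys[i][:len(tail)] == tail:
            if length == len(typed):
                return ("exact", data[begin + i * length + length - 1])
            return ("prefix", None)
    return ("none", None)
```

The model copies each block into a list before searching, which a C version would avoid by running bsearch() directly on the array with the right stride. Offsets rather than pointers keep the data free of relocations, as asked for in comment 40.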

Comment 43 Tor Lillqvist 2007-07-12 19:19:39 UTC

I think what should be done is to remove those sequences that are painfully self-evident from the table, and instead just add a small amount of code to do the logical thing: If we get one or two dead keysyms, and together with the following keysym they combine into a precomposed Unicode character, use that. Surely it is possible to deduce this without having explicit entries in the table for each sequence?

All the dead keysyms are between 0xFE50 and 0xFE62 (and no other keysyms are in that interval), so it is trivial to determine if a keysym is "dead". Then one just converts the dead keysym(s) into the corresponding Unicode combining mark(s), appends them after the letter that follows into a string, and checks if that string normalizes (using NFC) into a single, precomposed Unicode character.
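
A minimal sketch of this idea in Python, using unicodedata.normalize in place of g_utf8_normalize. The function names are made up for this example, and the mapping table is an assumption that covers only a handful of dead keysyms; a complete version would list every keysym in the 0xFE50..0xFE62 range.

```python
import unicodedata

# Assumption: partial mapping from dead keysyms to Unicode combining marks.
DEAD_TO_COMBINING = {
    0xFE50: "\u0300",  # dead_grave      -> COMBINING GRAVE ACCENT
    0xFE51: "\u0301",  # dead_acute      -> COMBINING ACUTE ACCENT
    0xFE52: "\u0302",  # dead_circumflex -> COMBINING CIRCUMFLEX ACCENT
    0xFE53: "\u0303",  # dead_tilde      -> COMBINING TILDE
    0xFE57: "\u0308",  # dead_diaeresis  -> COMBINING DIAERESIS
}


def is_dead(keysym):
    """True if the keysym lies in the dead-key interval named above."""
    return 0xFE50 <= keysym <= 0xFE62


def compose(dead_keysyms, base_char):
    """Return the precomposed character, or None if NFC gives more than one."""
    marks = "".join(DEAD_TO_COMBINING[k] for k in dead_keysyms)
    result = unicodedata.normalize("NFC", base_char + marks)
    return result if len(result) == 1 else None
```

For example, compose([0xFE51], "e") gives "é", while compose([0xFE51], "q") gives None because no precomposed q with acute exists.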

Comment 44 Simos Xenitellis 2007-07-19 23:17:59 UTC

Created attachment 92007 [details]
Updated version of script that converts Xorg Compose.pre to gtk+ optimised table

This script creates a table similar to the description that Matthias gives at comment 42.

Comment 45 Simos Xenitellis 2007-07-19 23:25:56 UTC

Created attachment 92008 [details]
Optimised file (generated with above script)

The file consists of two tables; a guint16 table with optimisations to reduce size and a guint32 table with the remaining sequences.

What is remaining is the glue code in GTK+ to use these tables.
Also, the optimisation that Tor describes at comment 43 is not reflected in this table.

Comment 46 Tor Lillqvist 2007-07-24 15:04:10 UTC

Created attachment 92279 [details] [review]
Rough implementation of table-less handling of dead accents

Here's a first version of a patch that removes the straightforward dead diacritic key sequences from the table, and instead handles them using code. Presumably most of the table entries Simos wants to add can be handled by code like this, without a need for explicit table entries? (The patch still contains debugging printfs.) Comments, please...

BTW, g_utf8_normalize(..., G_NORMALIZE_NFC) works a bit oddly in my opinion. It normalizes the sequence 03B9 0308 0301 to a single 0390. But not the sequence 03B9 0301 0308 (just swapping the order of the two combining diacritics) even if that should be equivalent? 

(03B9 = GREEK SMALL LETTER IOTA, 0308 = COMBINING DIAERESIS (Dialytika), 0301 = COMBINING ACUTE ACCENT (Oxia, Tonos), and 0390 = GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS)

Comment 47 Simos Xenitellis 2007-12-28 16:29:05 UTC

Thanks Tor.

I did some more work on the patch. 

g_utf8_normalize() appears to work ok; the function rearranges the sequence as long as the diacritic marks belong to different "canonical combining classes". In the case of GREEK SMALL LETTER IOTA with COMBINING DIAERESIS and COMBINING ACUTE ACCENT, both diacritic marks belong to the same canonical combining class (which has the value 230). In this case we need to try all combinations (n factorial) of diacritic marks in case we find a match.
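
The behaviour can be reproduced with Python's unicodedata module, which is used here as a stand-in for g_utf8_normalize(). The helper name compose_any_order is made up for this sketch; it simply tries every ordering of the marks, as described above.

```python
import itertools
import unicodedata


def compose_any_order(base, marks):
    """Try every ordering of the combining marks; return the first
    single precomposed character that NFC produces, or None."""
    for ordering in itertools.permutations(marks):
        result = unicodedata.normalize("NFC", base + "".join(ordering))
        if len(result) == 1:
            return result
    return None


IOTA, ACUTE, DIAERESIS = "\u03b9", "\u0301", "\u0308"

# Both marks share canonical combining class 230, so NFC will not reorder them.
print(unicodedata.combining(ACUTE), unicodedata.combining(DIAERESIS))
print(len(unicodedata.normalize("NFC", IOTA + DIAERESIS + ACUTE)))
print(len(unicodedata.normalize("NFC", IOTA + ACUTE + DIAERESIS)))
print(compose_any_order(IOTA, [ACUTE, DIAERESIS]) == "\u0390")
```

With at most two or three dead keys per sequence, the n factorial orderings stay small (2 or 6 attempts).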

Greek Polytonic as a keyboard layout in Xorg reuses dead_ogonek, etc., which are meant for other languages and refer to a different diacritic mark. We just got dead_psili and dead_dasia added to Xorg, which is good. This means we should also get dead_perispomeni added instead of the dead_tilde we currently use.

Taking both these issues into account, I managed to get my system to write Greek Polytonic with GTK+ IM. I'll make a usable patch shortly.

Comment 48 Matthias Clasen 2007-12-29 02:21:29 UTC

Pretty cool. 

Do you know how much this will reduce the size of the updated tables ?

Comment 49 Mathias Hasselmann (IRC: tbf) 2008-01-12 10:44:44 UTC

*** Bug 504383 has been marked as a duplicate of this bug. ***

Comment 50 Simos Xenitellis 2008-01-13 00:49:35 UTC

(In reply to comment #48)
> Pretty cool. 
> 
> Do you know how much this will reduce the size of the updated tables ?
> 

The algorithmic function will reduce the size of the updated tables by almost 30% (~21KB) compared to the "unoptimised" solution. Thus, the addition of Tor's algorithmic function and leaving the rest unoptimised would create a table of about 46KB (compared to the totally flat table of about 68KB).

If we also add the optimised solution you mentioned above for the sparse table, we shave off a further 17KB.

For now I do not take into account compose sequences with guint32 elements; they add complexity and no layouts use them yet. I prefer to add them at a later date.

Comment 51 Simos Xenitellis 2008-01-13 00:59:27 UTC

Created attachment 102700 [details] [review]
Patch on top of Tor's patch to handle compose sequences algorithmically

It has been modified to work well for Greek Polytonic; other scripts may vary.

Xorg has just added dead_dasia and dead_psili so the Polytonic compose sequences do not need to re-use dead_ogonek (Polish) and dead_horn; something that causes conflict in the algorithmic function between the scripts.

A patch has been submitted to include dead_perispomeni in Xorg (we abuse dead_tilde in the patch), which would let the algorithmic function cover all Greek Polytonic compose sequences:
https://bugs.freedesktop.org/show_bug.cgi?id=14013

One can use this code to test other keyboard layouts, as long as they do not require dead_ogonek, dead_horn and dead_tilde.

Comment 52 Simos Xenitellis 2008-01-13 01:18:13 UTC

Created attachment 102702 [details]
Script to parse the Xorg compose file, calculate memory savings, verify algorithmic function, etc.

This is a Python rewrite of the previous Perl script in this bug report.

Input:
a) Compose file en_US.UTF-8 from Xorg
b) keysym to Unicode mapping, now using Markus Kuhn's list instead of gdkkeysyms.h

Output:
a) For each compose sequence in the Xorg Compose file, apply the algorithmic function (create Unicode sequence, normalize, check if it creates precomposed character)
b) Put the remaining compose sequences in a list, sort them in the order described in comment #42, and roughly calculate the savings.
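
As a rough illustration of output step a), here is a minimal Python sketch that parses one Compose-file line and checks whether normalization alone reproduces its result. It is not taken from the attached script: the function names, the regular expression and the tiny keysym tables are all hypothetical, and the permutation step for marks of the same combining class is left out.

```python
import re
import unicodedata

# Hypothetical, tiny keysym tables; the real script reads a full
# keysym-to-Unicode list instead of hard-coding a handful of entries.
DEAD_KEYS = {
    "dead_acute": "\u0301",
    "dead_diaeresis": "\u0308",
    "dead_tilde": "\u0303",
}
PLAIN_KEYS = {"a": "a", "n": "n", "Greek_iota": "\u03b9"}

# Matches '<key1> <key2> ... : "result"' at the start of a line.
LINE_RE = re.compile(r'^\s*((?:<\w+>\s*)+):\s*"([^"]+)"')


def parse_compose_line(line):
    """Split '<k1> <k2> : "X" ...' into (['k1', 'k2'], 'X'), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    keys = re.findall(r"<(\w+)>", match.group(1))
    return keys, match.group(2)


def is_algorithmic(keys, result):
    """True if the dead keys plus one base key normalize to result."""
    *dead, base = keys
    if not dead or base not in PLAIN_KEYS:
        return False
    if any(key not in DEAD_KEYS for key in dead):
        return False
    # Marks follow the base character in Unicode, in the order typed.
    marks = "".join(DEAD_KEYS[key] for key in dead)
    text = unicodedata.normalize("NFC", PLAIN_KEYS[base] + marks)
    return text == result


dead_line = '<dead_tilde> <n> : "\u00f1" ntilde # LATIN SMALL LETTER N WITH TILDE'
multi_line = '<Multi_key> <asciitilde> <n> : "\u00f1" ntilde'

dead_keys, dead_result = parse_compose_line(dead_line)
multi_keys, multi_result = parse_compose_line(multi_line)
```

In this sketch the dead_tilde line counts as algorithmic, while the Multi_key line does not and would stay in the table.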

Comment 53 Simos Xenitellis 2008-01-30 15:44:39 UTC

Created attachment 104041 [details]
Updated Python script that parses the Xorg compose file, provides stats, verifies algo-function, etc

$ ./compose-parse.py
compose-parse available parameters:
        -h, --help              this craft
        -s, --statistics        show overall statistics (both algorithmic,
                                  non-algorithmic)
        -a, --algorithmic       show sequences saved with algorithmic
                                  optimisation
        -g, --gtk               show entries that go to GTK+
        -u, --unicodedatatxt    show compose sequences derived from 
                                  UnicodeData.txt (from unicode.org)
        -v, --verbose           show verbose output
        -p, --plane1            show plane1 compose sequences
        -n, --numeric           when used with --gtk, create file with 
                                  numeric values only
        -e, --gtk-expanded      when used with --gtk, create file that repeats
                                  first column; not usable in GTK+

        Default is to show statistics.

$ ./compose-parse.py

Total number of compose sequences (from file)              : 5020
  of which can be expressed algorithmically                : 1201
  of which cannot be expressed algorithmically             : 3819
    of which have Multi_key                                : 3381

Algorithmic (stats for Xorg Compose file)
Number of sequences off due to algo from file (len(array)) : 1201
Number of sequences off due to algo (uniq(sort(array)))    : 805
  of which are for Greek                                   : 176

Unicode statistics from UnicodeData.txt
Number of entries that can be algorithmically produced     : 925
  of which are for Greek                                   : 239
Number of compose sequence combinations requiring          : 1323
  of which are for Greek                                   : 521
Note: We do not include partial compositions, 
thus the slight discrepancy in the figures

Non-algorithmic (stats from Xorg Compose file)
Number of sequences left                                   : 3819
Flat array looks like                                      : 3819 rows of 6 integers (2 bytes per int, or 12 bytes per row)
Flat array would have taken up (in bytes)                  : 45828 bytes from the GTK+ library
Number of items (i.e. ints) in flat array                  : 22914
  of which are zeroes                                      : 9350 or 40%
Number of different first items                            : 22
Number of max bytes (if using flat array)                  : 45828
Number of savings                                          : 18480

Memory needs if both algorithmic+optimised table in latest Xorg compose file
                                                           : 27348

Existing (old) implementation in GTK+
Number of sequences in old gtkimcontextsimple.c            : 691
The existing (old) implementation in GTK+ takes up         : 16584 bytes
$ _
-----------------
This is the updated compose-parse.py file, which automates some of the processing of the compose file. It
a. provides statistics on the benefits of the algorithmic approach on the Xorg compose file,
b. uses UnicodeData.txt (from unicode.org) to calculate the full benefit of the algorithmic approach,
c. outputs the optimized table that GTK+ needs for non-algorithmic sequences.

The executive summary is that with the suggested implementation (the optimised table described by Matthias plus the algorithmic function described by Tor), the GTK+ compose sequence table grows by about 11KB (from 16KB to 27KB), yet it covers every sequence in the Xorg compose file.
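
The arithmetic behind these figures can be re-checked with a few lines of Python. All inputs are copied from the statistics output above; only the variable names are made up for this sketch.

```python
# Figures copied from the compose-parse.py statistics output.
rows = 3819            # non-algorithmic sequences left
ints_per_row = 6       # row width of the flat array
bytes_per_int = 2      # each value is stored in 2 bytes
table_savings = 18480  # bytes saved by the optimised table
old_table = 16584      # bytes used by the old GTK+ table

flat_bytes = rows * ints_per_row * bytes_per_int  # flat array size
new_table = flat_bytes - table_savings            # optimised table size
growth = new_table - old_table                    # increase over old table

print(flat_bytes, new_table, growth)
```

The growth comes out at 10764 bytes, which is where the rounded 11KB comes from.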

Comment 54 Simos Xenitellis 2008-01-30 16:11:37 UTC

Created attachment 104043 [details] [review]
Patch for gtkimcontextsimple.c to enable optimized/algorithmic

Patch applies to GTK+ HEAD.

Contains
a. patch to gtk/gtkimcontextsimple.c; 
b. new file gtk/gtkimcontextsimpleseqs.c; optimised table with compose sequences
c. patch to gdk/gdkkeysyms.h (required for the version of this table)

1. Tested on Ubuntu Linux, äãâáạȧṗ, ⒼⓃⓄⓂⒺ
2. Requires testing on Win32 (algorithmic, does ´ + ¨ + ι == ΐ ;)
3. Greek Polytonic works apart from dead_psili, dead_dasia (keysyms will be available in new Xorg; did not add them anyway in the GTK+ compose table at this stage). Greek perispomeni works though with entries in GTK+ compose table (normally conflicts with dead_tilde).
4. GTK+ has support (function: gtk_im_context_simple_add_table()) to append to the compose table; I did not fix this functionality at this stage.

Any comments would be greatly appreciated.

Comment 55 Simos Xenitellis 2008-01-31 01:45:56 UTC

*** Bug 333710 has been marked as a duplicate of this bug. ***

Comment 56 Simos Xenitellis 2008-01-31 01:52:33 UTC

*** Bug 162845 has been marked as a duplicate of this bug. ***

Comment 57 Matthias Clasen 2008-02-28 14:57:18 UTC

Simos, this looks very impressive indeed. 

Here is my take on what is needed to get this over the finish line:

1) reinstate the old check_table function, and use it on tables added
by gtk_im_context_simple_add_table(), move the code that works on the 
compact tables to some new check_compact_table function and use that on
gtk_compose_seqs.

2) remove all the debug printfs

3) coding style fixes: no // comments

Comment 58 Simos Xenitellis 2008-03-03 14:34:49 UTC

Created attachment 106472 [details] [review]
Patch to gtk+ (HEAD) to update compose table

Affects four files,
1) Updated gtk+/gdk/gdkkeysyms.h (using gdkkeysyms-update.pl found in same dir)
2) Updated gtk+/gtk/gtkimcontextsimple.c
3) New file gtk+/gtk/gtkimcontextsimpleseqs.h (sequences now go here)
4) New file gtk+/gtk/compose-parse.py (script that auto-updates sequences).

Tested with latin extended (¨~´^`˚¯˝ˇ˘, 12 dead keys), greek polytonic (10 dead keys).

I addressed all three comments above (though I did not test the functionality of adding custom compose tables).

Comment 59 Matthias Clasen 2008-03-04 05:57:08 UTC

+	""" Grabs and opens the keysyms.txt file that Markus Khun maintains """

His name is Markus Kuhn, I believe

My build runs into the following:

gtkimcontextsimple.c:62: error: 'gtk_compose_seqs_compact' undeclared here (not in a function)

it seems that should be gtk_compose_seqs_optimised

The few simple tests that I did seemed to work. I assume you have given it some more extensive testing.

Let's get this committed to trunk, and ask for more widespread testing on the mailing list. Does that sound like a good plan?

Comment 60 Simos Xenitellis 2008-03-04 11:38:51 UTC

Sounds great.

I updated the surname of Markus and fixed the table name (using gtk_compose_seqs_compact[]). 

"gtk_compose_seqs_optimised" was the previous name of the table which I think was not a good choice (optimised vs optimized). I used the old file when producing the patch.

I committed the patch with these changes.

I'll mail the mailing list for more testing.

Comment 61 Simos Xenitellis 2008-03-15 01:21:26 UTC

Created attachment 107326 [details] [review]
Patch to gtk+ (HEAD) to update compose table (fixes one error, typos)

Updated patch which corresponds to what was committed. There are four more misspellings of Markus's name (it's Markus Kuhn) that this patch fixes; I will commit those to SVN at the next opportunity.

I requested testing of this patch on the gtk-i18n-list, at http://blogs.gnome.org/simos/2008/03/05/testing-the-updated-im-support-in-gtk/ and
http://simos.info/blog/archives/661

Comment 62 Simos Xenitellis 2008-03-17 23:18:56 UTC

*** Bug 88639 has been marked as a duplicate of this bug. ***

Comment 63 Simos Xenitellis 2008-03-17 23:49:29 UTC

*** Bug 324021 has been marked as a duplicate of this bug. ***

Comment 64 Simos Xenitellis 2008-03-31 14:39:43 UTC

I am closing this report as the patch has been submitted.
I suppose that is ok.

To summarize, a call for testing has been sent to

a. Use JhBuild to create a custom 
http://blogs.gnome.org/simos/2008/03/05/testing-the-updated-im-support-in-gtk/
b. Creating patched .deb packages for Ubuntu
http://simos.info/blog/archives/661
c. Email at gtk-i18n-list.

Comment 65 Diego Escalante Urrelo (not reading bugmail) 2008-09-04 20:37:41 UTC

Heya!. I have a little peeve for the update, I can no longer use <compose><-><n> to produce an ñ. Now I have to use altgr+] which produces an ~, this is in an UK configured keyboard.
I guess it was removed from one of the sources used in the update script and hence the new version of the file does not have it. Can it be added? How? Should I bug someone else?

It's a big regression for me since the ~ key is far away from the typing area while the - key is just next to it. I can't speak for others, but there's a chance that other people use compose sequences like <-><n> for ñ.

Don't kill the cute ñ.

Comment 66 Henrique 2008-09-04 21:42:38 UTC

(In reply to comment #65)
> Heya!. I have a little peeve for the update, I can no longer use
> <compose><-><n> to produce an ñ. Now I have to use altgr+] which produces an
> ~, this is in an UK configured keyboard.
> I guess it was removed from one of the sources used in the update script and
> hence the new version of the file does not have it. Can it be added? How?
> Should I bug someone else?

I couldn't find the composition <compose><-><n> in Xorg. Since gtk keeps in sync with X, you probably need to track this bug both here and in freedesktop.

Comment 67 Simos Xenitellis 2008-09-04 22:09:25 UTC

(In reply to comment #66)
> (In reply to comment #65)
> > Heya!. I have a little peeve for the update, I can no longer use
> > <compose><-><n> to produce an ñ. Now I have to use altgr+] which produces an
> > ~, this is in an UK configured keyboard.
> > I guess it was removed from one of the sources used in the update script and
> > hence the new version of the file does not have it. Can it be added? How?
> > Should I bug someone else?
> 
>  I couldn't find the the composition <compose><-><n> in Xorg. Since gtk keeps
> in sync with X, you probably need to track this bug both here and in
> freedesktop.
> 

That is correct. You would need to file a bug report at freedesktop.org (bugs.freedesktop.org), product xorg, component Lib/Xlib. You may also CC: me.

My understanding is that "Compose + -" is more intuitive to be connected to sequences for the macron, as in āēūī which are already available.

You can get ñ with Compose + ~  as well, which I use often for the "gb" basic layout.

Comment 68 Henrique 2008-09-04 22:17:23 UTC

There is an N with a macron below in Unicode: Ṉ (U+1E48). The guy will probably need to argue why composing to Ñ is better.