GNOME Bugzilla – Bug 107200
TODO: Internationalize word completion
Last modified: 2004-12-22 21:47:04 UTC
The word completion system represents characters using the `char' (or `gchar') type. This will cause problems for character encodings that use more than one byte to store a character. The word completion system should be modified so that characters are represented by another type, such as `gunichar'.
simon: gchar is fine as long as the gchar* is passed to g_utf8_foo methods. UTF-8, the default GNOME encoding, uses byte-sized pieces but doesn't assume a 1:1 correspondence between bytes and characters. So I think you can probably make it work while retaining your existing API, or at least something very similar. It may be that the "input" parameter to the wordcompletion module should be a "gchar*" instead of a char, keysym, etc., since it could then point to a UTF-8 encoded unicode character or even a multi-character string.
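A minimal sketch of what that means in practice, iterating over a UTF-8 string one character (not one byte) at a time with the glib UTF-8 API (the word used is just an example):

    #include <glib.h>

    /* Walk a UTF-8 string one *character* at a time; each
     * gunichar may occupy 1-4 bytes in the underlying gchar*. */
    static void
    print_chars (const gchar *utf8_word)
    {
        const gchar *p;

        g_return_if_fail (g_utf8_validate (utf8_word, -1, NULL));

        for (p = utf8_word; *p; p = g_utf8_next_char (p))
          {
            gunichar c = g_utf8_get_char (p);
            g_print ("U+%04X\n", (guint) c);
          }
    }

    int
    main (void)
    {
        print_chars ("déjà");   /* 4 characters, 6 bytes */
        return 0;
    }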
this bug has been partially addressed by the recent patch for 109183 (to gok-word-complete.c). No work has been done yet in word-complete.c. There's still a problem in gok-word-complete.c where outputs are generated from the completion label: X keysyms are *not*, in general, characters, though they were cleverly chosen to map to ASCII for the basic qwerty printable characters. In the general case, passing a gchar in place of a KeySym will wreak havoc, and mapping unichars or chars to keysyms is not straightforward. We are making an effort to do this in at-spi, in support of the newly-operational "string synthesis" mode. I suggest that we move gok-word-complete.c to use this API instead of keysym synthesis; this is easier (for GOK ;-), simpler, and more internationalizable. That change would be a very small diff, but we'd need to update the version dependencies for GOK.
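For reference, a sketch of the difference using the cspi keyboard-synthesis call (SPI_generateKeyboardEvent with SPI_KEY_SYM / SPI_KEY_STRING is the at-spi cspi API; the wrapper functions here are illustrative only):

    #include <cspi/spi.h>

    /* Old way: per-keysym synthesis.  The keyval is an X keysym,
     * so this only works when the character actually exists on
     * the current keyboard map. */
    static void
    emit_keysym (long keysym)
    {
        SPI_generateKeyboardEvent (keysym, NULL, SPI_KEY_SYM);
    }

    /* New way: string synthesis.  at-spi receives a UTF-8 string
     * and handles the keymap itself, so text from any locale can
     * be sent without a keysym mapping step. */
    static void
    emit_string (const char *utf8_text)
    {
        SPI_generateKeyboardEvent (0, (char *) utf8_text, SPI_KEY_STRING);
    }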
Sold. Sounds like a must.
Created attachment 17489 [details] [review] and, another patch (for above issue).
Apologies for spam... marking as GNOMEVER2.3 so it appears on the official GNOME bug list :)
I am getting information that our ATs, though we won't expect them to be fully localized right away, will have to be internationalized soon. Bumping up priority and severity accordingly; this isn't "just" an RFE, since internationalization is expected of all GNOME apps. I am not saying this is feasible for 2.4, but we need to keep it very visible.
David: I'd be happy to fix word-complete.c to make it unicode/UTF-8-ready, but I need some background on the "trie" and how it's supposed to work. Indeed, perhaps we should re-examine whether exposing a "trie" as an implementation detail makes sense: should we in fact use some other kind of structure for building our completion table? The trie seems cumbersome to me. Though it's more performant than mechanisms involving more string comparison, the performance impact may in fact pale in comparison to other operations we're doing, and not be worth the additional obscurity; there's lots of string comparison happening in a GOK session anyway. It may be that the trie isn't the best choice from a performance or memory-footprint basis, since we have to read the dictionary stuff into it and construct a fairly elaborate structure.

If instead we just create a sorted string array, we could just move index pointers around when a new letter is added, i.e.:

    user has entered "w":          array pointers point to min "want", max "wrote"
    user enters "o", to form "wo": array pointers move to min "woman", max "worth"

This means dictionary.txt is ordered (easier to maintain!) and would also make integrating with system dictionaries and aspell/pspell much easier IMO. Keeping the trie would be considerably harder to implement and would introduce more problems regarding collation sequences in international locales, but I expect it could be done.
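A minimal sketch of the index-pointer idea (the function name find_range and the linear scan are illustrative, not GOK code; note that byte-wise strncmp is safe for UTF-8 prefix tests, since a UTF-8 byte prefix corresponds exactly to a character prefix):

    #include <glib.h>
    #include <string.h>

    /* Narrow [*min, *max) to the entries of a sorted word list
     * that start with `prefix'.  Linear scan for clarity; a real
     * version could binary-search, since the list is sorted and
     * the matching entries form one contiguous block. */
    static void
    find_range (gchar **words, gint n_words,
                const gchar *prefix, gint *min, gint *max)
    {
        gsize len = strlen (prefix);
        gint  i;

        *min = n_words;
        *max = n_words;
        for (i = 0; i < n_words; i++)
          {
            if (strncmp (words[i], prefix, len) == 0)
              {
                if (*min == n_words)
                    *min = i;
                *max = i + 1;
              }
            else if (*min != n_words)
                break;  /* sorted, so the matching block has ended */
          }
    }

As each keystroke extends the prefix, the search can start from the previous [min, max) range instead of the whole array, so the per-keypress cost shrinks as the prefix grows.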
Your suggestion seems KISS-appropriate. I need to think about the tradeoffs a bit. A trie structure, at least in some instances, has a smaller memory footprint. E.g. for the words

    the  then  there  they

The trie is:

    t -> h -> e
             /|\
            n r y
              |
              e

You'll notice that the 'e' is stored once instead of four times in this example. Also notice that there are a lot more words that start with 'e', so the tradeoff in storing pointers might be worthwhile. As for i18n's effect on this issue, I am not sure... It might be that for some locales the trie is even better, and for some, worse? The trie structure also makes lookup more efficient...
BTW, when I say "the trie is:" I mean the "trie subtree structure is:"
Sigh, also when I said "there are a lot more words that start with 'e'", I meant "there are a lot more words that start with 'the'" - sorry about that.
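For concreteness, a gunichar-keyed trie node might look like the sketch below; this is illustrative, not GOK's actual struct. The per-node pointer overhead is what the next comment weighs against UTF-8's 1-3 bytes per character:

    #include <glib.h>

    /* One trie node per character of each stored word.  Note the
     * cost: a 4-byte gunichar plus two pointers (8 bytes on a
     * 32-bit box) per node, vs. 1-3 bytes per character for a
     * plain UTF-8 string in a sorted array. */
    typedef struct _TrieNode TrieNode;
    struct _TrieNode
    {
        gunichar   ch;            /* the character at this node */
        gboolean   is_word_end;   /* TRUE if a word ends here */
        TrieNode  *first_child;   /* first continuation of the prefix */
        TrieNode  *next_sibling;  /* alternative character at this depth */
    };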
Well, if you have a _really large_ dictionary I agree that tries are possibly significantly more compact. However, for our purposes I doubt that this is true, since the pointers are 4 bytes and the chars are 1-3 bytes (most of them only one byte!) in UTF-8. I think what the trie really does is front-load the sorting effort so that lookup is quick, but in point of fact I don't think the CPU load of searching an ordered list one character at a time (in user time!) is significant. It may be that the trie is actually more compute-intensive in some scenarios. As an exercise in rough sizing, note that the whole dictionary.txt is about 36KB, and GOK's binary size (not including the shared libraries it pulls in, and other structs it creates at runtime) is >9MB.

I can't think of any locales/languages where the trie will be greatly more efficient than it is in English; in many cases I agree it'll be worse (in general it'll be better in locales with fewer characters, I believe). I think maintainability and flexibility of our dictionaries is a key aspect here, so even if we retain the trie I think we should rethink our dictionary.txt format so that the list is stored in sorted order. Also, if we store the dictionary in sorted order, it's possible (if dictionaries get really, really big) to use offsets into the file so that we don't have to read the whole string-set into memory; we only read in the "a" block or the "b" block, etc.

Note that if we do our sorting in-memory, we will need to use locale collation sequences and routines to do the sort: I suspect they are much more convenient to use on whole strings than on a per-character basis whilst building a trie of gunichars.
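glib already provides locale-aware collation on whole strings; a minimal sketch of sorting a word list with it (g_utf8_collate and g_ptr_array_sort are real glib calls; the helper names are mine):

    #include <glib.h>

    /* GCompareFunc for g_ptr_array_sort: note it receives pointers
     * to the array elements, hence the double indirection. */
    static gint
    compare_words (gconstpointer a, gconstpointer b)
    {
        const gchar *word_a = *(const gchar * const *) a;
        const gchar *word_b = *(const gchar * const *) b;

        /* locale-aware, whole-string comparison */
        return g_utf8_collate (word_a, word_b);
    }

    static void
    sort_dictionary (GPtrArray *words)
    {
        g_ptr_array_sort (words, compare_words);
    }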
Your argument regarding 4-byte pointers looks sound, so any memory saving for the trie seems unlikely. Creating the trie is indeed where all the sorting is done, and that is its beauty: using the trie structure (after it has been created) is lightning quick. But if, as you suggest, using sorted lists is not going to create any significant lag (if any), then I no longer see a reason for the trie implementation. I'm leaning towards your argument. Where are we on the system-dictionary issue?
David: I think the trie structure is very elegant, but it has some maintainability issues in my opinion. At least, the revision required to make it UTF-8-compatible might be harder than reimplementing using the string-manipulation API we already have available in glib. My expectation is that we can achieve acceptable performance with string comparisons if we aren't too brute-force (i.e. if we keep state pointers into a sorted list so that we don't have to search the whole 'haystack' for our needle on each keypress).

As for the system dictionary, the most expedient solution is to dump the system dict and import it. Not perhaps the most elegant solution - it might be more elegant to use the ispell/pspell/aspell API at runtime - but those spelling APIs are neither consistent nor optimized for prefix matching; they are designed specifically for detecting "similar spellings". Some of the APIs that are useful for on-the-fly spellchecking might work, but really the import method would be much easier to manage.

I am leaning towards this approach:

(1) refactor word-complete.c to use string methods and not tries;
(2) import the system dictionary, either in response to a user command or at first invocation of GOK* (see the sketch below).

* Note that we must either:
(a) import automatically (with user confirmation) only once, and hope that the user's "first" locale is their primary one; or
(b) import automatically on first invocation in a specific locale; requires a mechanism for determining whether GOK has been run in <locale> before or not; or
(c) import automatically on first invocation for _all available locales_; requires setting the locale/LANG and re-running the spell-dictionary dump routine.

(c) might be nicest _but_ might be slow the first time GOK is run. OTOH, users are probably accustomed to applications doing time-consuming config things when initially run, especially if the apps post dialogs telling the user what they are doing. David, if you want to write the dialogs for (c), which we can extend to a general-purpose gok-config-wizard later on, I can write the dictionary import code and possibly revise the word-complete.c APIs. I can also write the new word-completion code if you like, using the UTF-8 APIs from glib.
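A rough sketch of what the import could look like; "aspell dump master" is a real aspell command and g_spawn_command_line_sync is real glib, but the function name and error handling here are illustrative only:

    #include <glib.h>

    /* Dump the system aspell dictionary for the current locale and
     * capture its word list as one newline-separated string; the
     * caller would then collate-sort the words and write them out
     * as GOK's dictionary.txt. */
    static gchar *
    dump_system_dictionary (GError **error)
    {
        gchar *output = NULL;
        gint   status = 0;

        if (!g_spawn_command_line_sync ("aspell dump master",
                                        &output, NULL, &status, error))
            return NULL;

        /* output is one word per line, in dictionary-internal order;
         * it still needs locale-aware sorting before use. */
        return output;
    }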
Created attachment 20370 [details] [review] refactoring patch for word-completion, believed to be a prerequisite to a fix
Above patch is now in CVS.
The above patch broke string freeze.
Christian: I didn't think there was a problem with adding strings in a minor release; that's not documented in the release/freeze info on developer.gnome.org. Obviously new features aren't allowed in foo.x.y releases, but I thought there was no problem with minor UI tweaks.
I removed the offending strings from translation macros. However, this makes the patch useless for its original purpose, so we cannot address this aspect of GOK's internationalization woes without permission to reinstate the strings (which are actually the patch for bug 122117).
Created attachment 20410 [details] [review] next patch against HEAD as of 16:30 GMT Oct 1
Recent work on this bug is committed to the gok_i18n branch. Word completion of non-ASCII text is now possible for selected languages such as French; Irish doesn't work right yet, but that may be due to at-spi problems. Still investigating.
Created attachment 20618 [details] screen shot of word-completion in progress for French word including non-ascii chars
This is fixed (modulo at-spi bugs and testing with non-Latin locales) in CVS; I merged the gok_i18n branch into HEAD.