GNOME Bugzilla – Bug 107200
TODO: Internationalize word completion
Last modified: 2004-12-22 21:47:04 UTC
The word completion system represents characters using the `char' (or `gchar') type. This will cause problems for character encodings that use more than one byte to store a character. The word completion system should be modified so that characters are represented by another type, such as `gunichar'.
simon: gchar is fine as long as the gchar* is passed to g_utf8_foo methods. UTF-8, the default GNOME encoding, uses byte-sized pieces but doesn't assume a 1:1 correspondence between bytes and characters. So I think you can probably make it work while retaining your existing API, or at least something very similar. It may be that the "input" parameter to the wordcompletion module should be a "gchar*" instead of a char, keysym, etc., since it could then point to a UTF-8 encoded unicode character or even a multi-character string.
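A minimal sketch of what that means in practice, iterating over a UTF-8 string one character (not one byte) at a time with the glib UTF-8 API (the word used is just an example):

    #include <glib.h>

    /* Walk a UTF-8 string one *character* at a time; each
     * gunichar may occupy 1-4 bytes in the underlying gchar*. */
    static void
    print_chars (const gchar *utf8_word)
    {
        const gchar *p;

        g_return_if_fail (g_utf8_validate (utf8_word, -1, NULL));

        for (p = utf8_word; *p; p = g_utf8_next_char (p))
          {
            gunichar c = g_utf8_get_char (p);
            g_print ("U+%04X\n", (guint) c);
          }
    }

    int
    main (void)
    {
        print_chars ("déjà");   /* 4 characters, 6 bytes */
        return 0;
    }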
this bug has been partially addressed by the recent patch for 109183 (to gok-word-complete.c). No work has been done yet in word-complete.c. There's still a problem in gok-word-complete.c where outputs are generated from the completion label: X keysyms are *not*, in general, characters, though they were cleverly chosen to map to ASCII for the basic qwerty printable characters. In the general case, passing a gchar in place of a KeySym will wreak havoc, and mapping unichars or chars to keysyms is not straightforward. We are making an effort to do this in at-spi, in support of the newly-operational "string synthesis" mode. I suggest that we move gok-word-complete.c to use this API instead of keysym synthesis; this is easier (for GOK ;-), simpler, and more internationalizable. That change would be a very small diff, but we'd need to update the version dependencies for GOK.
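For reference, a sketch of the difference using the cspi keyboard-synthesis call (SPI_generateKeyboardEvent with SPI_KEY_SYM / SPI_KEY_STRING is the at-spi cspi API; the wrapper functions here are illustrative only):

    #include <cspi/spi.h>

    /* Old way: per-keysym synthesis.  The keyval is an X keysym,
     * so this only works when the character actually exists on
     * the current keyboard map. */
    static void
    emit_keysym (long keysym)
    {
        SPI_generateKeyboardEvent (keysym, NULL, SPI_KEY_SYM);
    }

    /* New way: string synthesis.  at-spi receives a UTF-8 string
     * and handles the keymap itself, so text from any locale can
     * be sent without a keysym mapping step. */
    static void
    emit_string (const char *utf8_text)
    {
        SPI_generateKeyboardEvent (0, (char *) utf8_text, SPI_KEY_STRING);
    }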
Sold. Sounds like a must.
Created attachment 17489 [details] [review] and, another patch (for above issue).
Apologies for spam... marking as GNOMEVER2.3 so it appears on the official GNOME bug list :)
I am getting information that our ATs, though we won't expect them to be fully localized right away, will have to be internationalized soon. Bumping up priority and severity accordingly; this isn't "just" an RFE, since internationalization is expected of all GNOME apps. I am not saying this is feasible for 2.4, but we need to keep it very visible.
David: I'd be happy to fix word-complete.c to make it unicode/UTF-8-ready, but I need some background on the "trie" and how it's supposed to work. Indeed, perhaps we should re-examine whether exposing a "trie" as an implementation detail makes sense: should we in fact use some other kind of structure for building our completion table? The trie seems cumbersome to me. Though it's more performant than mechanisms involving more string comparison, the performance impact may in fact pale in comparison to other operations we're doing, and not be worth the additional obscurity; there's lots of string comparison happening in a GOK session anyway. It may be that the trie isn't the best choice from a performance or memory-footprint basis, since we have to read the dictionary stuff into it and construct a fairly elaborate structure.

If instead we just create a sorted string array, we could just move index pointers around when a new letter is added, i.e.:

    user has entered "w":          array pointers point to min "want", max "wrote"
    user enters "o", to form "wo": array pointers move to min "woman", max "worth"

This means dictionary.txt is ordered (easier to maintain!) and would also make integrating with system dictionaries and aspell/pspell much easier IMO. Keeping the trie would be considerably harder to implement and would introduce more problems regarding collation sequences in international locales, but I expect it could be done.
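A minimal sketch of the index-pointer idea (the function name find_range and the linear scan are illustrative, not GOK code; note that byte-wise strncmp is safe for UTF-8 prefix tests, since a UTF-8 byte prefix corresponds exactly to a character prefix):

    #include <glib.h>
    #include <string.h>

    /* Narrow [*min, *max) to the entries of a sorted word list
     * that start with `prefix'.  Linear scan for clarity; a real
     * version could binary-search, since the list is sorted and
     * the matching entries form one contiguous block. */
    static void
    find_range (gchar **words, gint n_words,
                const gchar *prefix, gint *min, gint *max)
    {
        gsize len = strlen (prefix);
        gint  i;

        *min = n_words;
        *max = n_words;
        for (i = 0; i < n_words; i++)
          {
            if (strncmp (words[i], prefix, len) == 0)
              {
                if (*min == n_words)
                    *min = i;
                *max = i + 1;
              }
            else if (*min != n_words)
                break;  /* sorted, so the matching block has ended */
          }
    }

As each keystroke extends the prefix, the search can start from the previous [min, max) range instead of the whole array, so the per-keypress cost shrinks as the prefix grows.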
Your suggestion seems KISS-appropriate. I need to think about the tradeoffs a bit. A trie structure, at least in some instances, has a smaller memory footprint. E.g. for the words

    the  then  there  they

The trie is:

    t -> h -> e
             /|\
            n r y
              |
              e

You'll notice that the 'e' is stored once instead of four times in this example. Also notice that there are a lot more words that start with 'e', so the tradeoff in storing pointers might be worthwhile. As for i18n's effect on this issue, I am not sure... It might be that for some locales the trie is even better, and for some, worse? The trie structure also makes lookup more efficient...
BTW, when I say "the trie is:" I mean the "trie subtree structure is:"
Sigh, also when I said "there are a lot more words that start with 'e'", I meant "there are a lot more words that start with 'the'" - sorry about that.
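For concreteness, a gunichar-keyed trie node might look like the sketch below; this is illustrative, not GOK's actual struct. The per-node pointer overhead is what the next comment weighs against UTF-8's 1-3 bytes per character:

    #include <glib.h>

    /* One trie node per character of each stored word.  Note the
     * cost: a 4-byte gunichar plus two pointers (8 bytes on a
     * 32-bit box) per node, vs. 1-3 bytes per character for a
     * plain UTF-8 string in a sorted array. */
    typedef struct _TrieNode TrieNode;
    struct _TrieNode
    {
        gunichar   ch;            /* the character at this node */
        gboolean   is_word_end;   /* TRUE if a word ends here */
        TrieNode  *first_child;   /* first continuation of the prefix */
        TrieNode  *next_sibling;  /* alternative character at this depth */
    };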
Well, if you have a _really large_ dictionary I agree that tries are possibly significantly more compact. However, for our purposes I doubt that this is true, since the pointers are 4 bytes and the chars are 1-3 bytes (most of them only one byte!) in UTF-8. I think what the trie really does is front-load the sorting effort so that lookup is quick, but in point of fact I don't think the CPU load of searching an ordered list one character at a time (in user time!) is significant. It may be that the trie is actually more compute-intensive in some scenarios. As an exercise in rough sizing, note that the whole dictionary.txt is about 36KB, and GOK's binary size (not including the shared libraries it pulls in, and other structs it creates at runtime) is >9MB.

I can't think of any locales/languages where the trie will be greatly more efficient than it is in English; in many cases I agree it'll be worse (in general it'll be better in locales with fewer characters, I believe). I think maintainability and flexibility of our dictionaries is a key aspect here, so even if we retain the trie I think we should rethink our dictionary.txt format so that the list is stored in sorted order. Also, if we store the dictionary in sorted order, it's possible (if dictionaries get really, really big) to use offsets into the file so that we don't have to read the whole string-set into memory; we only read in the "a" block or the "b" block, etc.

Note that if we do our sorting in-memory, we will need to use locale collation sequences and routines to do the sort: I suspect they are much more convenient to use on whole strings than on a per-character basis whilst building a trie of gunichars.
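glib already provides locale-aware collation on whole strings; a minimal sketch of sorting a word list with it (g_utf8_collate and g_ptr_array_sort are real glib calls; the helper names are mine):

    #include <glib.h>

    /* GCompareFunc for g_ptr_array_sort: note it receives pointers
     * to the array elements, hence the double indirection. */
    static gint
    compare_words (gconstpointer a, gconstpointer b)
    {
        const gchar *word_a = *(const gchar * const *) a;
        const gchar *word_b = *(const gchar * const *) b;

        /* locale-aware, whole-string comparison */
        return g_utf8_collate (word_a, word_b);
    }

    static void
    sort_dictionary (GPtrArray *words)
    {
        g_ptr_array_sort (words, compare_words);
    }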
Your argument regarding 4-byte pointers looks sound, so any memory saving for the trie seems unlikely. Creating the trie is indeed where all the sorting is done, and that is its beauty: using the trie structure (after it has been created) is lightning quick. But if, as you suggest, using sorted lists is not going to create any significant lag (if any), then I no longer see a reason for the trie implementation. I'm leaning towards your argument. Where are we on the system-dictionary issue?
David: I think the trie structure is very elegant, but it has some maintainability issues in my opinion. At least, the revision required to make it UTF-8-compatible might be harder than reimplementing using the string-manipulation API we already have available in glib. My expectation is that we can achieve acceptable performance with string comparisons if we aren't too brute-force (i.e. if we keep state pointers into a sorted list so that we don't have to search the whole 'haystack' for our needle on each keypress).

As for the system dictionary, the most expedient solution is to dump the system dict and import it. Not perhaps the most elegant solution - it might be more elegant to use the ispell/pspell/aspell API at runtime - but those spelling APIs are neither consistent nor optimized for prefix matching; they are designed specifically for detecting "similar spellings". Some of the APIs that are useful for on-the-fly spellchecking might work, but really the import method would be much easier to manage.

I am leaning towards this approach:

(1) refactor word-complete.c to use string methods and not tries;
(2) import the system dictionary, either in response to a user command or at first invocation of GOK* (see the sketch below).

* Note that we must either:
(a) import automatically (with user confirmation) only once, and hope that the user's "first" locale is their primary one; or
(b) import automatically on first invocation in a specific locale; requires a mechanism for determining whether GOK has been run in <locale> before or not; or
(c) import automatically on first invocation for _all available locales_; requires setting the locale/LANG and re-running the spell-dictionary dump routine.

(c) might be nicest _but_ might be slow the first time GOK is run. OTOH, users are probably accustomed to applications doing time-consuming config things when initially run, especially if the apps post dialogs telling the user what they are doing. David, if you want to write the dialogs for (c), which we can extend to a general-purpose gok-config-wizard later on, I can write the dictionary import code and possibly revise the word-complete.c APIs. I can also write the new word-completion code if you like, using the UTF-8 APIs from glib.
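A rough sketch of what the import could look like; "aspell dump master" is a real aspell command and g_spawn_command_line_sync is real glib, but the function name and error handling here are illustrative only:

    #include <glib.h>

    /* Dump the system aspell dictionary for the current locale and
     * capture its word list as one newline-separated string; the
     * caller would then collate-sort the words and write them out
     * as GOK's dictionary.txt. */
    static gchar *
    dump_system_dictionary (GError **error)
    {
        gchar *output = NULL;
        gint   status = 0;

        if (!g_spawn_command_line_sync ("aspell dump master",
                                        &output, NULL, &status, error))
            return NULL;

        /* output is one word per line, in dictionary-internal order;
         * it still needs locale-aware sorting before use. */
        return output;
    }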
Created attachment 20370 [details] [review] refactoring patch for word-completion, believed to be a prerequisite to a fix
Above patch is now in CVS.
The above patch broke string freeze.
Christian: I didn't think there was a problem with adding strings in a minor release; that's not documented in the release/freeze info on developer.gnome.org. Obviously new features aren't allowed in foo.x.y releases, but I thought there was no problem with minor UI tweaks.
I removed the offending strings from translation macros. However, this makes the patch useless for its original purpose, so we cannot address this aspect of GOK's internationalization woes without permission to reinstate the strings (which are actually the patch for bug 122117).
Created attachment 20410 [details] [review] next patch against HEAD as of 16:30 GMT Oct 1
Recent work on this bug is committed to the gok_i18n branch. Word completion of non-ASCII text is now possible for selected languages such as French; Irish doesn't work right yet, but that may be due to at-spi problems. Still investigating.
Created attachment 20618 [details] screen shot of word-completion in progress for French word including non-ascii chars
This is fixed (modulo at-spi bugs and testing with non-Latin locales) in CVS; I merged the gok_i18n branch into HEAD.