Bug 383706 – Adding support for spellcheckers into the Gtk+ stack


Bug 383706 - (gspell) Adding support for spellcheckers into the Gtk+ stack

Summary:	Adding support for spellcheckers into the Gtk+ stack


Status:	RESOLVED OBSOLETE

Product:	gtk+
Classification:	Platform
Component:	.General
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtk-bugs
QA Contact:	gtk-bugs

URL:
Whiteboard:

Duplicates:	162414 167286 708807
Depends on:	719978
Blocks:	382205 586222

Reported:	2006-12-08 12:15 UTC by Paolo Maggi
Modified:	2018-05-02 14:26 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
gspellcheckerlanguage.h (1.77 KB, text/plain) 2006-12-08 15:01 UTC, Paolo Maggi
gspellchecker.h - The GSpellChecker class (7.90 KB, text/plain) 2006-12-08 15:02 UTC, Paolo Maggi

Description Paolo Maggi 2006-12-08 12:15:18 UTC

On the desktop-devel-list mailing list there has been a discussion about adding support for spellcheckers into the Gtk+ stack.

See thread starting at http://mail.gnome.org/archives/desktop-devel-list/2006-December/msg00028.html
for more information.

I'm creating this bug report to track progress.

Comment 1 Paolo Maggi 2006-12-08 14:59:32 UTC

As I said in http://mail.gnome.org/archives/desktop-devel-list/2006-December/msg00032.html I don't think the current Enchant API can be directly used to add support for spell checking in the gtk+ stack.

I think we need a more object oriented API.

I'm going to attach a first proposal based on the API I designed for the gedit spell checker plugin.
The implementation of the proposed API cannot be based on enchant since it requires some functionality that is not exported in the enchant public API.
Nevertheless, current enchant code can be heavily reused to implement the proposed API.
Probably dom can help us evaluate how much effort is required to port the current enchant code to the proposed API.

In http://mail.gnome.org/archives/desktop-devel-list/2006-December/msg00055.html I have described how we can use the proposed API to add support for spell checking in GtkEntry and GtkTextView.

Notes about the proposed API:
- I'm not sure we need to have a GError parameter in all the functions (maybe some of them cannot report errors)
- A GSpellChecker supports a single language + one personal word list. To perform spell checking with multiple languages you need to have multiple GSpellChecker objects
- A language tag is a string like "en_US", "it_IT", etc.
- g_spell_checker_language_to_string returns a translated string like "English (United States)", "Italian (Italy)", etc.
- The proposed API contains a few functions to manage a personal word list (pwl); these are not present in the public enchant API
- Probably we need to add other properties to GSpellChecker, for example ignore-numbers, ignore-all-caps, ignore-urls, etc.
- I have not tried to compile the .h files, so there could be syntax errors.

Comment 2 Paolo Maggi 2006-12-08 15:01:50 UTC

Created attachment 77967
gspellcheckerlanguage.h

Comment 3 Paolo Maggi 2006-12-08 15:02:51 UTC

Created attachment 77968
gspellchecker.h - The GSpellChecker class

Comment 4 Steve Frécinaux 2006-12-08 15:35:13 UTC

> /* TODO: do we need other properties like ignore-numbers, ignore-all-caps, ignore-urls, etc. ? */

Maybe, in a more general fashion, a way to ignore anything you want, like a filter function. For instance, it would be great to be able not to highlight keywords as misspelled words in gedit when editing, say, a LaTeX file.

Comment 5 Wouter Bolsterlee (uws) 2006-12-08 15:46:17 UTC

Perhaps a list of filter functions, with a default 'ignore words in this wordlist' function as a special case?
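
A minimal sketch of such a filter list, in plain C rather than GLib so that it stands alone. Every name here (WordFilter, FilterEntry, filter_wordlist, filter_latex_command, should_ignore) is hypothetical and not part of the proposed GSpellChecker headers or of Enchant; it only illustrates the idea of a chain with a word-list filter as the default entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter signature: return true if `word` should be skipped. */
typedef bool (*WordFilter)(const char *word, const void *user_data);

typedef struct {
    WordFilter func;
    const void *user_data;
} FilterEntry;

/* Default filter: ignore words found in a NULL-terminated word list. */
static bool filter_wordlist(const char *word, const void *user_data)
{
    const char *const *list = user_data;
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(list[i], word) == 0)
            return true;
    }
    return false;
}

/* Example custom filter: ignore LaTeX-style commands such as "\section". */
static bool filter_latex_command(const char *word, const void *user_data)
{
    (void)user_data;
    return word[0] == '\\';
}

/* A word is skipped as soon as any filter in the chain claims it. */
static bool should_ignore(const FilterEntry *filters, size_t n_filters,
                          const char *word)
{
    assert(word != NULL);
    for (size_t i = 0; i < n_filters; i++) {
        if (filters[i].func(word, filters[i].user_data))
            return true;
    }
    return false;
}
```

An application would build a FilterEntry array once (for example a nickname list plus a markup filter) and call should_ignore() before handing a word to the spell checker, so the checker itself never sees the filtered words.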

Comment 6 Paolo Maggi 2006-12-08 15:53:37 UTC

Speaking of filters, KSpell2 uses the Filter class to split text into words which will be spell checked.

See http://api.kde.org/3.5-api/kdelibs-apidocs/kspell2/html/index.html

Comment 7 Wouter Bolsterlee (uws) 2006-12-08 15:59:01 UTC

The "filter" concept I referred to is a function/signal/object that can tell whether a word should be ignored... the "filter" concept you mentioned is more like a tokenizer.

Btw, what about using signals in this API? The default would be to spellcheck using Enchant, but a "spell-check-word" signal (with a string parameter in the callback) can be connected to hook up custom logic. Convenience API to add such a signal handler with a "this is a list of words you should ignore" parameter would still be useful.

Comment 8 Havoc Pennington 2006-12-08 16:03:26 UTC

A signal on every word would be pretty expensive for a large document, something to consider.

Comment 9 Marco Barisione 2006-12-09 16:09:30 UTC

I like the proposed API but I would prefer to have language as a construct-only property for GSpellChecker.

I would like to have some multilanguage support in gspell and not only in the UI part, so it can be used by programs without a GUI.

One of the things to handle for multilanguage spell checkers is filtering duplicate suggestions for variants of the same language. If we use a menu like the one I proposed in http://www.barisione.org/blog-files/2006/12/menu-multi.png we should not show the "word" suggestion for both en and en_US.
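
A minimal sketch of that duplicate filtering, assuming each per-language checker has already produced its own suggestion array. The function name merge_unique and its signature are hypothetical; plain C arrays stand in for whatever list type the real API would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Append the entries of `src` to `merged` unless an equal string is already
 * there. Strings are borrowed, not copied. Entries that do not fit within
 * `capacity` are dropped. Returns the new number of entries in `merged`.
 */
static size_t merge_unique(const char **merged, size_t n_merged,
                           size_t capacity,
                           const char *const *src, size_t n_src)
{
    assert(merged != NULL && n_merged <= capacity);
    for (size_t i = 0; i < n_src; i++) {
        bool seen = false;
        for (size_t j = 0; j < n_merged; j++) {
            if (strcmp(merged[j], src[i]) == 0) {
                seen = true;
                break;
            }
        }
        if (!seen && n_merged < capacity)
            merged[n_merged++] = src[i];
    }
    return n_merged;
}
```

Calling merge_unique() once per language, in the order the languages should be prioritised, keeps the first occurrence of each suggestion, so a word suggested by both en and en_US appears only once in the menu.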

Comment 10 Marco Barisione 2006-12-10 00:46:28 UTC

See comment #27 to bug #97545.

If the same string can be a single word in English and two words in another language, then a multilanguage spell checker class cannot have a simple check_word function, as it would have to split the string differently for every spell checker.

Comment 11 Paolo Maggi 2006-12-10 09:16:08 UTC

Word splitting is an app-level or widget-level (in the case of TextView and Entry) operation.

Comment 12 Marco Barisione 2006-12-12 12:01:19 UTC

I copied the non-GUI code of the spell checking plugin used by gedit, creating a new separate library called gspell. It's available at http://techn.ocracy.org/gspell/

In the next few days I will work on implementing the API proposed by Paolo.

Currently the library uses libxml2 and iso-codes to obtain the language codes and their names. Is it ok to depend on iso-codes? If it's ok I will port gspell from libxml2 to GMarkup.

Comment 13 Matthias Clasen 2006-12-12 21:27:17 UTC

As a general comment, I like this idea. Looking into merging the good bits from libsexy has been on my list for a long time, so I am happy that someone beat me to it.

If we want the api to be GObject-ified, it may be better off inside
GTK+, since glib does not depend on gobject. There is some precedent for that
if you look at e.g. the input method support code in gtk.

Comment 14 Behdad Esfahbod 2006-12-12 22:27:42 UTC

(In reply to comment #13)
> As a general comment, I like this idea. Looking into merging the good bits from
> libsexy has been on my list for a long time, so I am happy that someone beat
> me to it.
> 
> If we want the api to be GObject-ified, it may be better off inside
> GTK+, since glib does not depend on gobject. There is some precedent for that
> if you look at e.g. the input method support code in gtk.

I had some feelings along this line too.  gregex for example is not using GObject either, but I confess I've not looked into the details of the proposed APIs.  But if we decide to GObjectify it, is that a problem, given that gspell will be a separate library?

Comment 15 David Trowbridge 2006-12-13 11:40:46 UTC

I don't have a lot of time to comment right now, but I've got a lot brewing. A few initial thoughts:

First off, this is happening really fast. I appreciate that people want to get started implementing and getting stuff working, but creating a good API takes time and careful review. I'd really hate for "prototype" code to slip into GTK+ because everyone's excited about it, and then continue needing to maintain SexySpellEntry because the GTK+ API doesn't suit my (or others') needs. I'll try to get around to speccing out and summarizing exactly what my needs are from this API in the next day or two. I imagine many others will have these same sorts of requirements.

Meanwhile, here are a few questions to ponder:

How can an application mark domain-specific words as OK? xchat-gnome shouldn't mark nicknames of the folks you're chatting with as misspelled.

How can word-splitting be done with this in a way that's both a reasonable default and easily modifiable for ? Pango's word-splitting does pretty horribly with things like URLs or "xyz123" IRC nicknames.

And some ones specifically related to the proposed API:

"Language" is a generic concept, and there's no reason why it should be exposed as a spell-checker only function. What are the implications of including an iso-codes dependency for this?

What is the value of having multiple personal word lists?

What is the value of session-only word lists? Are these shared between processes or separate instances of a single program? Are they valid for the entire X session?

Comment 16 Paolo Maggi 2006-12-13 12:35:14 UTC

Note that my current API proposal is only a partial solution since it does not solve the entire problem of adding an easy to use support for spell checking in gtk+.

Since the whole problem is complex, I'd prefer to split it in different easier sub-problems. The first one, i.e. the one my API proposal tries to solve, is:

- designing an API to spell check single words 

Then we will need to solve other problems:

- how to split a document into words (I think we need an interface applications will have to implement and for which we will provide a few implementations for the most important use cases, i.e. splitting a string into words, splitting the content of a GtkEditable into words and splitting the content of a GtkTextBuffer into words)

- how to use the two previous components to spell check an entire document

- design the UI parts of the solution

To reply to previous comments (by Matthias and David)

> If we want the api to be GObject-ified, it may be better off inside
> GTK+, since glib does not depend on gobject.

It is not clear to me why a library like gspell (distributed inside glib) cannot depend on gobject.

> How can an application mark domain-specific words as OK?  xchat-gnome 
> shouldn't mark nicknames of the folks you're chatting with as misspelled.

I see various possible solutions:

1. apps check for domain-specific words before using the spell checker (i.e. if xchat knows a word is a "nickname" there is no need to ask the spell checker to check it [1])
2. apps pre-populate the "session" of the GSpellChecker objects with domain-specific words

[1] You clearly need to have a way to mark as "must-not-be-checked" words inside a GtkEntry or GtkTextBuffer, but this is a widget specific problem

> How can word-splitting be done with this in a way that's both a reasonable
> default and easily modifiable for ?  Pango's word-splitting does pretty
> horribly with things like URLs or "xyz123" IRC nicknames.

As I said before, my current API proposal does not try to solve this problem.
To sketch a possible solution I think we need an interface, that applications and widgets can implement, similar to the interface of the KSpell2::Filter class in KSpell2 (see http://api.kde.org/3.5-api/kdelibs-apidocs/kspell2/html/classKSpell2_1_1Filter.html)

> "Language" is a generic concept, and there's no reason why it should be 
> exposed as a spell-checker only function.  What are the implications of 
> including an iso-codes dependency for this?

I agree, "language" is a generic concept, but in this case it tries to encapsulate the concept of "language for which we have an installed dictionary".

> What is the value of having multiple personal word lists?

Personal word lists can depend on the language you are using, e.g. you can have a PWL for English and a different PWL for Italian.

> What is the value of session-only word lists?  Are these shared between
> processes or separate instances of a single program?  Are they valid for the
> entire X session?

Session-only word lists are used to implement the "Ignore All" functionality you normally see in spell checkers. They live inside a single GSpellChecker object and so are not shared between processes.
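
A minimal sketch of a per-object session list, assuming a toy in-memory dictionary in place of a real backend. The Checker struct and the checker_* functions are hypothetical stand-ins for illustration, not the functions declared in the attached gspellchecker.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SESSION_MAX 64   /* arbitrary limits for the sketch */
#define WORD_MAX    48

typedef struct {
    const char *const *dictionary;        /* NULL-terminated, borrowed */
    char session[SESSION_MAX][WORD_MAX];  /* "Ignore All" words, owned */
    size_t n_session;
} Checker;

static void checker_init(Checker *c, const char *const *dictionary)
{
    assert(c != NULL && dictionary != NULL);
    c->dictionary = dictionary;
    c->n_session = 0;
}

/* Remember `word` for the lifetime of this object only. */
static bool checker_add_to_session(Checker *c, const char *word)
{
    assert(c != NULL && word != NULL);
    if (c->n_session >= SESSION_MAX || strlen(word) >= WORD_MAX)
        return false;
    strcpy(c->session[c->n_session++], word);
    return true;
}

/* A word is accepted if it is in the session list or in the dictionary. */
static bool checker_check_word(const Checker *c, const char *word)
{
    assert(c != NULL && word != NULL);
    for (size_t i = 0; i < c->n_session; i++) {
        if (strcmp(c->session[i], word) == 0)
            return true;
    }
    for (size_t i = 0; c->dictionary[i] != NULL; i++) {
        if (strcmp(c->dictionary[i], word) == 0)
            return true;
    }
    return false;
}
```

Because the session array is a member of the object, two Checker instances never see each other's ignored words, which matches the "not shared" behaviour described above; nothing is written to disk, unlike a personal word list.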

Comment 17 Marco Barisione 2006-12-13 14:25:56 UTC

> It is not clear to me why a library like gspell (distributed inside glib)
> cannot depend on gobject.

Probably mclasen wants gspell inside libglib, like gregex. But GRegex works well without being a GObject, whereas GSpellChecker needs to be a GObject.

> I agree, "language" is a generic concept, but in this case it tries to
> encapsulate the concept of "language for which we have an installed
> dictionary".

If language codes and names are needed by other programs we can move them to glib. Only g_spell_checker_language_get_available_language() is specific to spell checking and can become a static method of the GSpellChecker class.

> Personal word lists can depend on the language you are using, e.g. you can have
> a PWL for English and a different PWL for Italian.

I would prefer to have a single pwl without language distinction.

> Session-only word lists are used to implement the "Ignore All" functionality
> you normally see in spell checkers. They live inside a single GSpellChecker
> object and so are not shared between processes.

Maybe there is a better name than session.


I'm implementing the API proposed by Paolo; the problem is that enchant does not offer all the needed features.

dom: What should I do?
1 - Keep enchant and depend on it. Needed features are added to enchant.
2 - Fork enchant and include its code in gspell. Enchant will continue to be a separate library.
3 - Include the code from enchant in gspell. New versions of enchant will depend on gspell.

I prefer the third solution, but it may be impossible if GSpell is not going to offer all the features in Enchant.

Comment 18 Wouter Bolsterlee (uws) 2006-12-13 14:35:56 UTC

Perhaps a 'system-wide word list' in addition to a 'personal word list' is a good idea. I can imagine companies/organizations adding their own name and product/department/marketing/jargon words to the 'system-wide word list'.

Comment 19 Dominic Lachowicz 2006-12-13 15:06:22 UTC

(In reply to comment #17)
> dom: What should I do?
> 1 - Keep enchant and depend on it. Needed features are added to enchant.
> 2 - Fork enchant and include its code in gspell. Enchant will continue to be a
> separate library.
> 3 - Include the code from enchant in gspell. New versions of enchant will
> depend on gspell.

I was hoping for #3, and deprecating enchant altogether, or at most, having enchant's API wrap gspell's.

Comment 20 Matthias Clasen 2006-12-13 17:54:19 UTC

>> It is not clear to me why a library like gspell (distributed inside glib)
>> cannot depend on gobject.
>
>Probably mclasen wants gspell inside libglib, as gregex. But GRegex works well
>without being a GObject, GSpellChecker needs to be a GObject.

I don't think we need to shoehorn everything into GLib that we want to be available in GTK+. The particular complication with GObject is that we don't 
want libglib to depend on libgobject, so any GObject-based APIs need to live 
outside libglib.

I don't think it will be a problem for gtk to conditionally depend on a 
spell-checking library. Of course, the gui bits should live directly in gtk.

Comment 21 Behdad Esfahbod 2006-12-13 20:43:16 UTC

Well, gspell will most probably depend on gmodule.  So I think it has to live as a separate .so file in glib.

Comment 22 Paolo Maggi 2006-12-15 15:15:10 UTC

> 1 - Keep enchant and depend on it. Needed features are added to enchant.
> 2 - Fork enchant and include its code in gspell. Enchant will continue to be a
> separate library.
> 3 - Include the code from enchant in gspell. New versions of enchant will
> depend on gspell.

As dom said, I think the best solution is to move the enchant code into gspell, changing its public interface.

Comment 23 Marco Barisione 2006-12-16 15:05:23 UTC

(In response to comment #21)
> Well, gspell will most probably depend on gmodule.  So I think it has to live
> as a separate .so file in glib.

Yes, enchant uses gmodule to load the spell checking providers.

I'm merging enchant and gspell. I have some problems and I need suggestions.

C++
===
Some providers are written in C++; configure disables them if a C++ compiler is not available. Is this acceptable?

Win 32 stuff
============
There is some Windows stuff that I don't know how to handle, for instance the ENCHANT_PLUGIN_DECLARE() macro generates a DllMain entry point. I'm a Windows programmer but I don't know how to handle this in an acceptable way for glib.

MacOS X stuff
=============
There is some MacOS stuff but I'm not a MacOS programmer. For instance in enchant_get_module_dir():
#ifdef XP_TARGET_COCOA
  return g_strdup ([[EnchantResourceProvider instance] moduleFolder]);
#endif

MySpell/HunSpell
================
Why is enchant using an internal copy of MySpell/HunSpell? Note that it's written in C++. If we keep the internal copy of HunSpell we should remove its internal tables for Unicode and use only the functions in glib, as I did for GRegex.

USpell
======
Uspell from CVS (it's in the AbiWord CVS) doesn't compile:
  uspell.h:140: error: extra qualification 'uSpell::' on member 'acceptGoodWord'
Removing "uSpell::" resolves the problem.

Thread-safety
=============
What is the status of thread-safety in enchant?

License
=======
Enchant is under LGPL with the additional permission to use non-LGPL provider libraries. Some providers use different licenses. The spell checking libraries use other licenses:

           provider         library
aspell     Enchant[1]       LGPL
applespell GPL              proprietary
hspell     Enchant          GPL
ispell     Enchant          I don't know[2]
myspell    Enchant          MPL/GPL/LGPL[3]
uspell     Enchant          GPL
voikko     Enchant          GPL
zemberek   GPL              BSD

[1] The same modified LGPL used by Enchant
[2] It's a modified BSD but I don't know if it's GPL-compatible
[3] The code is compiled in the provider, it isn't a separate library.

My questions are:
1 - What happens when enchant uses a provider under GPL?
2 - What happens when enchant uses a spell checking library with a different license?
3 - What happens when a program under GPL or a proprietary license is using spell checking?

Writing providers
=================
Currently providers have to export an init_enchant_provider() function, which fills an EnchantProvider struct with the needed function pointers.

Since we are GObject-ifying the library I could use a GInterface, so providers would have to implement a GSpellProviderInterface. However, this would make writing a provider more difficult (or rather, more boring). What do you think?

Comment 24 Paolo Maggi 2007-01-09 17:21:05 UTC

(In reply to comment #23)
What about including only a couple of providers, for example aspell and myspell, and allowing the development and distribution of third-party providers?
Wouldn't this also resolve the problem with C++?

Comment 25 Dominic Lachowicz 2007-01-09 17:27:16 UTC

This may have gotten a little bit more interesting, as the KDE guys have expressed an interest in Enchant again. CC'ing jrideout.

http://www.abisource.com/mailinglists/abiword-dev/2006/Dec/0108.html

Comment 26 Dominic Lachowicz 2007-01-09 17:37:33 UTC

(In reply to comment #23)
> MySpell/HunSpell
> ================
> Why is enchant using an internal copy of MySpell/HunSpell? Note that it's
> written in C++. If we keep the internal copy of HunSpell we should remove its
> internal tables for Unicode and use only the functions in glib, as I did for
> GRegex.

MySpell is (or at least was) very badly packaged, and wasn't present on most users' systems. Since it was such a popular request, I included it in Enchant.

Since Hunspell supports more languages than MySpell, while retaining dictionary compatibility, it was a natural choice to incorporate that instead.
 
> USpell
> ======
> Uspell from CVS (it's in the AbiWord CVS) doesn't compile:
>   uspell.h:140: error: extra qualification 'uSpell::' on member
> 'acceptGoodWord'
> Removing "uSpell::" resolves the problem.

I'll fix that ASAP.
 
> Thread-safety
> =============
> What is the status of thread-safety in enchant?

Iffy, and that's one thing that I'd like to improve. The FOO_get_error() methods refer to an internal error string. I'd prefer if the functions used an [out] GError argument, or we didn't return error descriptions.

Adding new words to word lists should be protected by file locking where available, but this is no guarantee.

The broker may have some MT issues when requesting dictionaries that could be worked around using a GStaticMutex or similar.

Finally, any one of the backends may have MT issues. I'm not aware of any problems, but any backend may be affected.

> License
> =======
> Enchant is under LGPL with the additional permission to use non-LGPL provider
> libraries. Some providers use different licenses. The spell checking libraries
> use other licenses:

I'm not aware of any providers that use different licenses. If so, I'll relicense them as applicable.
 
>            provider         library
> aspell     Enchant[1]       LGPL
> applespell GPL              proprietary
> hspell     Enchant          GPL
> ispell     Enchant          I don't know[2]
> myspell    Enchant          MPL/GPL/LGPL[3]
> uspell     Enchant          GPL
> voikko     Enchant          GPL
> zemberek   GPL              BSD
> 
> [1] The same modified LGPL used by Enchant

> [2] It's a modified BSD but I don't know if it's GPL-compatible

I largely copied this code from AbiWord. IIRC, we had asked the ISpell folks if we could remove the advertising clause, and they amended the license so that it could be used in AbiWord.

> [3] The code is compiled in the provider, it isn't a separate library.

I don't think that makes a difference.
 
> Writing providers
> =================
> Currently providers have to export an init_enchant_provider() function, which
> fills an EnchantProvider struct with the needed function pointers.
> 
> Since we are GObject-ifying the library I could use a GInterface, so
> providers would have to implement a GSpellProviderInterface. However, this would
> make writing a provider more difficult (or rather, more boring). What do you think?

GdkPixbuf has a concept of pluggable image loaders. It uses a "fill_vtable()" like method, rather than defining a GTypeInterface.
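
A minimal sketch of the "fill a struct of function pointers" pattern, assuming a toy provider with a one-word dictionary. ProviderVTable, toy_provider_fill_vtable and provider_load are hypothetical names chosen for illustration; they are neither Enchant's EnchantProvider layout nor GdkPixbuf's module interface, and a real loader would obtain the fill function from a shared object via gmodule instead of receiving it as an argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of entry points that each provider fills in. */
typedef struct {
    const char *name;
    bool   (*check)(const char *word);
    size_t (*suggest)(const char *word, const char **out, size_t max);
} ProviderVTable;

/* --- toy provider: knows exactly one word --- */

static bool toy_check(const char *word)
{
    return strcmp(word, "hello") == 0;
}

static size_t toy_suggest(const char *word, const char **out, size_t max)
{
    /* Suggest "hello" for any other word that starts with 'h'. */
    if (max > 0 && word[0] == 'h' && strcmp(word, "hello") != 0) {
        out[0] = "hello";
        return 1;
    }
    return 0;
}

/* The single symbol a provider module would have to export. */
static void toy_provider_fill_vtable(ProviderVTable *vtable)
{
    vtable->name = "toy";
    vtable->check = toy_check;
    vtable->suggest = toy_suggest;
}

/* Loader side: clear the table, let the provider fill it, then validate. */
static bool provider_load(ProviderVTable *vtable,
                          void (*fill)(ProviderVTable *))
{
    assert(vtable != NULL && fill != NULL);
    *vtable = (ProviderVTable){ 0 };
    fill(vtable);
    return vtable->name != NULL && vtable->check != NULL;
}
```

Compared with a GInterface, the provider author writes only plain functions plus one fill routine and needs no type registration, which is the "less boring" trade-off discussed above; the cost is that optional entry points (here suggest) must be checked for NULL by the caller.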

Comment 27 Dominic Lachowicz 2007-01-09 19:32:17 UTC

Worth checking out, just FYI:

http://jrideout.blogspot.com/2006/12/how-is-sonnet-stacking-up.html

Comment 28 Jacob R Rideout 2007-01-09 20:13:42 UTC

There are several things that have been confused so far.

To perform spellcheck you first must parse a text and determine what words should be checked. This can be easy or very complex. It depends on several factors:

* Purpose of the document - should we ignore non-content markup information?

* Language used - what constitutes a word in a way that is useful to the spell checker? Compound words in German, no space between words in Tibetan and Thai, etc...

* Layout issues - how is hyphenation handled? Does the software keep track of where hyphens are inserted?

* User preference - Should you assume capitalized words are proper nouns and thus ignore them?

These issues aren't trivial. They are also often specific to the application's purpose, and these questions are answered differently in different contexts. For example, should we use simpler heuristics to get better real-time feedback? Is this an IDE or a word processor? Should we support very complex scripts?

The best way to solve these problems is modular design and configurable properties. We should keep Enchant separate. Its purpose is to provide a consistent interface to the various spelling engines available. It also maintains a user preference of which spell engine to use by default for each language. Both Gnome and KDE can use Enchant and have consistent preferences and PWLs saved. Additional classes such as GSpellChecker should provide a simple interface for application developers to use. But there are still several different use cases for which different, but similar, spellchecking classes might be used.

Bottom line: being everything to everyone is a recipe for disaster.

Comment 29 Paolo Maggi 2007-01-10 08:52:11 UTC

Hi Jacob,

> There are several things that have been confused so far.

I agree with you. It seems most people are thinking of a solution that only allows adding spell checking capabilities to a couple of widgets in gtk+. What I'm thinking of is a more generic framework to implement spell checker functionality in applications.

> 
> To perform spellcheck you first must parse a text and determine what words
> should be checked. 

Right, this is what I tried to explain in comment #16. It is also nice to see that the framework I'm thinking of is very similar to what you explained in your blog (comment #27), but without parts like language guessing, grammar checking and text breaking (this one is already in Pango).

> The best way to solve these problems is modular design and configurable
> properties. We should keep Enchant separate. Its purpose is to provide
> consistent interface to the various spelling engines available. It also
> maintains a user preference of which spell engine to use by default for each
> language. 

What we proposed is to move Enchant to glib with a more glib-like interface.
I don't see why this could be a problem. Isn't KDE depending on glib too?
BTW, Enchant depends on glib, so moving it into glib will only reduce the dependency chain.

Comment 30 Marco Barisione 2007-01-10 09:55:04 UTC

(In reply to comment #26):
> MySpell is (or at least was) very badly packaged, and wasn't present on most
> users' systems. Since it was such a popular request, I included it in Enchant.
> Since Hunspell supports more languages than MySpell, while retaining dictionary
> compatibility, it was a natural choice to incorporate that instead.

Do you know if something changed? I would prefer to have it as a separate library.

> I'm not aware of any providers that use different licenses. If so, I'll
> relicense them as applicable.

The applespell and zemberek providers are under GPL.

(In reply to comment #28):
> To perform spellcheck you first must parse a text and determine what words
> should be checked. This can be easy or very complex. It depends on several
> factors:

For now I'm only working on the low-level part, i.e. a class with a check_word method. For word breaking we are going to use Pango, but it's not sufficient.

> * Purpose of the document - should we ignore non-content markup information?

In GTK+ this should be done by the widget, for instance GtkSourceView could use syntax information to pass only real text to the spell checker.

> * Language used - what constitutes a word as is useful to the spellcheck?
> Compound words in German, no space between words in Tibetan and Thai, etc...

Other examples are apostrophes and URLs; a string like http://www.gnome.org/ should not be split.
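
A minimal sketch of a splitter that keeps URLs in one piece, assuming plain ASCII whitespace as the only separator. This is a deliberately naive stand-in, not Pango's word breaking, and the function name next_token is hypothetical; it only shows where a "do not split, do not check" decision for URLs could be made.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the next whitespace-delimited token starting at *cursor into `buf`
 * (truncating if it does not fit) and advance *cursor past it.
 * Tokens containing "://" are flagged as URLs so the caller can skip them
 * instead of breaking them into "http", "www", "gnome", ...
 * Returns false when no token is left.
 */
static bool next_token(const char **cursor, char *buf, size_t buf_len,
                       bool *is_url)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(buf != NULL && buf_len > 0 && is_url != NULL);

    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return false;
    }
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < buf_len)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *is_url = strstr(buf, "://") != NULL;
    *cursor = p;
    return true;
}
```

A widget would loop over next_token(), pass only the tokens with is_url == false to the checker, and apply further language-aware splitting (apostrophes, punctuation) to those tokens afterwards.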

> The best way to solve these problems is modular design and configurable
> properties. We should keep Enchant separate. Its purpose is to provide
> consistent interface to the various spelling engines available. It also
> maintains a user preference of which spell engine to use by default for each
> language. Both Gnome and KDE can use Enchant and have consistent preferences
> and PWLs saved. Additional classes such as GSpellChecker should provide a
> simple interface for application developers to use. But there are still
> several different use cases for which different, but similar, spellchecking
> classes might be used.

We were going to kill Enchant and put its code in GSpell as KDE was not using Enchant. So what should I do? Enchant depends on glib and gmodule but GSpell depends on GObject too.

(In reply to comment #29):
> What we proposed is to move Enchant to glib with a more glib-like interface.
> I don't see why this could be a problem. Isn't KDE depending on glib too?
> BTW, Enchant depends on glib so moving it on glib will only reduce the
> dependency chain.

As I said, GSpell will add a dependency on GObject.


Do we plan to do spell checking as you type or in the background? Sonnet is going to use a thread, but background spell checking seems to be patented: http://www.delphion.com/details?pn=US05787451__.

Comment 31 Jacob R Rideout 2007-01-10 16:54:29 UTC

>> What we proposed is to move Enchant to glib with a more glib-like interface.
>> I don't see why this could be a problem. Isn't KDE depending on glib too?
>> BTW, Enchant depends on glib so moving it on glib will only reduce the
>> dependency chain.

>As I said, GSpell will add a dependency on GObject.

KDE can include glib, but we cannot use GObjects.

> Do we plan to do spell checking as you type or in background? Sonnet is going
> to use a thread but the background spell checking seems to be patented:
> http://www.delphion.com/details?pn=US05787451__.

I'll check that out. But we've checked in the background for years with loops and callbacks rather than threads.

I just don't see the need to combine Enchant and GSpell, but it really isn't an issue: you can use composition and have GSpell use Enchant internally.
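
A minimal sketch of what checking "with loops and callbacks rather than threads" could look like: a job that checks a bounded number of words per call and reports whether work remains, so a main loop can call it repeatedly from an idle handler. CheckJob, check_job_step and the toy callbacks are hypothetical names for illustration and do not describe how KDE's code is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *const *words;   /* words to check (borrowed) */
    size_t n_words;
    size_t next;                /* index of the next unchecked word */
    bool (*check)(const char *word);
    void (*on_misspelled)(size_t index, const char *word, void *user_data);
    void *user_data;
} CheckJob;

/*
 * Check at most `budget` words, invoking the callback for each misspelling.
 * Returns true while words remain, mirroring an idle callback that asks to
 * be scheduled again.
 */
static bool check_job_step(CheckJob *job, size_t budget)
{
    assert(job != NULL && job->check != NULL);
    while (budget > 0 && job->next < job->n_words) {
        const char *word = job->words[job->next];
        if (!job->check(word) && job->on_misspelled != NULL)
            job->on_misspelled(job->next, word, job->user_data);
        job->next++;
        budget--;
    }
    return job->next < job->n_words;
}

/* Toy stand-ins so the sketch can be exercised without a real backend. */
static bool toy_check(const char *word)
{
    return strcmp(word, "teh") != 0;   /* only "teh" counts as misspelled */
}

static void count_misspelled(size_t index, const char *word, void *user_data)
{
    (void)index;
    (void)word;
    (*(size_t *)user_data)++;
}
```

The budget bounds how long each call can block the UI; everything runs on the main thread, so the highlighting callback can touch widgets directly without locking.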

If your point is that the need for an external library should be eliminated, I understand. In that case, we should maintain a common interface to spell checking engines, so that the plugins are available to all spec conforming spellcheck classes. That said, we don't have to keep the current interface as it exists for Enchant. We could develop a new one.

I've created a page on the fdo wiki so we can flesh this out:
http://freedesktop.org/wiki/Standards_2fdesktop_2dlanguage_2dchecking_2dspec

Remember, also, to ensure enough code reuse for grammar and style checking. However, the separate facility of text-breaking in pango is likely sufficient.

Comment 32 Marco Barisione 2007-02-06 15:08:32 UTC

So what should I do? My preferred solution is to kill enchant moving its code to gspell, but keeping the compatibility at provider level.

I would like to have separate modules for each provider (maybe in the freedesktop SVN), so distros can easily distribute separate packages for each provider.

Dom, what's your opinion?

Comment 33 Dominic Lachowicz 2007-02-12 12:39:30 UTC

(In reply to comment #32)
> So what should I do? My preferred solution is to kill enchant moving its code
> to gspell, but keeping the compatibility at provider level.
> 
> I would like to have separate modules for each provider (maybe in the
> freedesktop SVN), so distros can easily distribute separate packages for each
> provider.
> 
> Dom, what's your opinion?
> 

Hi Marco,

I initially agreed to this because KDE was so reluctant to use Enchant. Now that they apparently want to use it in Sonnet, I'm not sure how I feel about this. Since I'm not doing any of the work on Gspell or Sonnet, I think that it would be more productive for you to talk with Jrideout than me about how he wants to move forward.

Comment 34 Jacob R Rideout 2007-02-12 13:22:55 UTC

Marco,

> So what should I do? My preferred solution is to kill enchant moving its code
> to gspell, but keeping the compatibility at provider level.

I am fine with that as long as compatibility is truly maintained.

> I would like to have separate modules for each provider (maybe in the
> freedesktop SVN), so distros can easily distribute separate packages for each
> provider.

I agree entirely, and urge that this occurs regardless of enchant's fate. This can be done by adding independent targets to the makefile. Each provider already has a make Makefile.am, so all that is need its build system tweaks.

There is one thing I'd like to add if we go in this direction. We need to ensure not only compatible providers, but consistent behavior among applications. Configuration files like enchant.ordering need to be honored. We can rename the file, but it should have a known name, location, format and global/user override rules. Rules for language tags should follow known standards like RFC 4646.
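To make the override rules concrete, here is a minimal sketch in Python (chosen only for brevity). The `tag:provider,provider` line format, the `#` comments, the `*` wildcard entry and the fallback from `en_GB` to `en` are assumptions for illustration, not a description of what enchant actually does; `parse_ordering` and `providers_for` are hypothetical names.

```python
def parse_ordering(text):
    """Parse 'tag:provider1,provider2' lines into a dict; '#' starts a comment."""
    ordering = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue  # skip blank lines, comments and malformed entries
        tag, providers = line.split(":", 1)
        names = [p.strip() for p in providers.split(",") if p.strip()]
        if names:
            ordering[tag.strip()] = names
    return ordering


def providers_for(tag, global_text, user_text=""):
    """Return the provider order for a language tag.

    Assumed rules: user entries replace global entries for the same tag,
    and lookup falls back from the full tag ("en_GB") to the bare
    language ("en") and finally to the "*" entry.
    """
    merged = parse_ordering(global_text)
    merged.update(parse_ordering(user_text))
    normalized = tag.replace("-", "_")
    for candidate in (normalized, normalized.split("_", 1)[0], "*"):
        if candidate in merged:
            return merged[candidate]
    return []
```

With a global file containing `*:aspell,hunspell` and `en:hunspell,aspell`, and a user file containing `en_GB:aspell`, a lookup for `en-GB` would use only aspell, while `en_US` would fall back to the global `en` entry.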

I would still like a common implementation/library to share the burden of maintenance. But, don't let that desire stop you.

Comment 35 Mathias Hasselmann (IRC: tbf) 2007-08-28 13:32:26 UTC

Completely unaware of this thread, I've created a tiny library[1] to attach spell checking capabilities to all kinds of widgets by implementing this interface:

namespace Gtk {
    public interface SpellCheckCluster {
        public abstract string! text { get; construct; }
        public abstract long length { get; construct; }
    }

    public interface SpellCheckClient {
        public abstract List<SpellCheckCluster> get_clusters ();

        public abstract void reset_highlighting ();
        public abstract void highlight_word (SpellCheckCluster! word, 
                                             int start, int end);
        public abstract void replace_word (SpellCheckCluster! word, 
                                           string! replacement);

        public signal void changed ();
        public signal void populate_popup (SpellCheckCluster word, 
                                           Menu! menu);
    }
}

This interface is consumed by a SpellCheckManager, which splits each cluster (maybe better: "section") into words. It uses enchant to check the words, and words not found by any dictionary are passed to the highlight_word method. The spell check client can emit the changed signal if it wants the manager to restart spell checking; maybe this should be more granular. The populate_popup signal is emitted to tell the manager to fill a menu with suggestions and such.
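A rough sketch of one manager pass, in Python for brevity. This is not the gtkspellcheck code: `ListClient`, `check_client` and `known_words` are hypothetical names, a plain set stands in for enchant, and a simple regular expression stands in for real word splitting.

```python
import re

# Letters only, with optional internal apostrophes ("it's"); a stand-in
# for proper language-aware word splitting.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


class ListClient:
    """Toy client: exposes a list of strings as clusters and records highlights."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.highlights = []

    def get_clusters(self):
        return self.clusters

    def reset_highlighting(self):
        self.highlights.clear()

    def highlight_word(self, cluster, start, end):
        # A real widget would underline the range; a client can also
        # post-filter here by simply ignoring the call.
        self.highlights.append((cluster, start, end))


def check_client(client, known_words):
    """Run one spell-checking pass over every cluster of the client."""
    client.reset_highlighting()
    for cluster in client.get_clusters():
        for match in WORD_RE.finditer(cluster):
            if match.group().lower() not in known_words:
                client.highlight_word(cluster, match.start(), match.end())
```

A client would call `check_client` again whenever it would otherwise emit `changed`.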

It's nice to see that spell checking support shall be added to GTK+. The only problem now is that my code uses Vala... :-/

[1] http://taschenorakel.de/mathias/2007/08/28/spell-checking-masses/

Comment 36 Mathias Hasselmann (IRC: tbf) 2007-08-28 17:06:44 UTC

(In reply to comment #11)
> Word splitting is an app-level or widget level (in the case of TextView and
> Entry) operation.
>

Comment 37 Mathias Hasselmann (IRC: tbf) 2007-08-28 17:22:56 UTC

(In reply to comment #11)
> Word splitting is an app-level or widget level (in the case of TextView and
> Entry) operation.

From my little experience with my gtkspellcheck library, I'd say splitting words belongs in the library, to avoid pointless and boring code duplication. The concept of clusters (or sections) in my library allows the application/widget code to do inexpensive pre-filtering of the text by just starting a new cluster when hitting text to ignore. It also avoids needless concatenation of strings which already exist in chunks in the widget, like for instance the rows of a tree model. Also notice how the API of gtkspellcheck allows post-filtering by selectively ignoring highlight_word calls.
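To illustrate the pre-filtering idea, here is a small Python sketch (not taken from gtkspellcheck). It assumes URLs are the text to ignore; `URL_RE`, `split_into_clusters` and `clusters_from_rows` are hypothetical names.

```python
import re

# Assumed definition of "text to ignore": anything that looks like a URL.
URL_RE = re.compile(r"\w+://\S+")


def split_into_clusters(text, ignore_re=URL_RE):
    """Yield (offset, chunk) pairs covering everything except ignored spans."""
    position = 0
    for match in ignore_re.finditer(text):
        if match.start() > position:
            yield position, text[position:match.start()]
        position = match.end()  # a new cluster starts after the ignored text
    if position < len(text):
        yield position, text[position:]


def clusters_from_rows(rows):
    """Treat each row of a model as its own cluster, without joining strings."""
    for row in rows:
        for _, chunk in split_into_clusters(row):
            yield chunk
```

The offsets let the widget map a highlighted range back to its own text without the checker ever seeing the ignored spans.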

(In reply to comment #12)
> Currently the library uses libxml2 and iso-codes to obtain the language codes
> and their names. Is it ok to depend on iso-codes? If it's ok I will port gspell
> from libxml2 to GMarkup.

Just want to note that my library contains a GMarkup-based iso-codes parser that transfers the relevant portions of the iso-codes XML files into memory-mappable blobs: http://taschenorakel.de/gitweb/?p=gtkspellcheck;a=blob;f=isocodes/isocodes.c

Sucks, that I didn't find this thread before.

Comment 38 Marco Barisione 2007-08-28 17:32:33 UTC

I stopped working on gspell months ago because I had to work on more important (i.e. university related :) stuff.

A Mercurial repository with some code is available at http://techn.ocracy.org/gspell/ but I probably have some more code on my hard disk that was never pushed to the server. If you want to discuss this, you can find me on IRC; my nick is barisione.

Comment 39 Wouter Bolsterlee (uws) 2007-09-03 11:15:24 UTC

(In reply to comment #11)
> Word splitting is an app-level or widget level (in the case of TextView and
> Entry) operation.

Actually, word splitting can be extremely hard, depending on the language. Splitting on whitespace is definitely not going to suffice for some of the more "exotic" languages out there. Therefore tokenizing/word splitting should not be implemented in application code, but in a reusable library.

Comment 40 Dominic Lachowicz 2007-09-03 12:27:13 UTC

(In reply to comment #39)
> (In reply to comment #11)
> > Word splitting is an app-level or widget level (in the case of TextView and
> > Entry) operation.
> 
> Actually, word splitting can be extremely hard, depending on the language.
> Splitting on whitespace is definitely not going to suffice for some of the more
> "exotic" languages out there. Therefore tokenizing/word splitting should not be
> implemented in application code, but in a reusable library.
> 

That's fine. We have something like ICU for that already.

Comment 41 Mathias Hasselmann (IRC: tbf) 2007-09-03 13:58:30 UTC

(In reply to comment #39)
> (In reply to comment #11)
> > Word splitting is an app-level or widget level (in the case of TextView and
> > Entry) operation.
> 
> Actually, word splitting can be extremely hard, depending on the language.
> Splitting on whitespace is definitely not going to suffice for some of the more
> "exotic" languages out there. Therefore tokenizing/word splitting should not be
> implemented in application code, but in a reusable library.
> 

Pango also has quite usable word-breaking capabilities in pango_get_log_attrs.

Comment 42 Bastien Nocera 2008-03-13 15:37:53 UTC

*** Bug 162414 has been marked as a duplicate of this bug. ***

Comment 43 Matthias Clasen 2008-09-08 05:44:16 UTC

*** Bug 167286 has been marked as a duplicate of this bug. ***

Comment 44 Leonardo Ferreira Fontenelle 2009-06-08 18:23:44 UTC

(In reply to comment #39)
> Actually, word splitting can be extremely hard, depending on the language.

The Aspell dictionaries, for instance, can define whether hyphenated words are to be checked together or separately.

Comment 45 Dominic Lachowicz 2009-06-08 19:09:22 UTC

(In reply to comment #44)
> (In reply to comment #39)
> > Actually, word splitting can be extremely hard, depending on the language.
> 
> The Aspell dictionaries, for instance, can define if hyphenated words are to be
> checked together or separated.
> 

That's not really a core competency of a spell checker. We'd be better off leaving that to something with substantial knowledge of language rules, like ICU. It already implements word and line breaking iterators that take things like hyphenation into account.

Comment 46 Javier Jardón (IRC: jjardon) 2010-01-13 07:01:27 UTC

So, what is the status of this? Is someone working (or interested in working) on gspell or on tbf's approach?
Maybe today the current Enchant API can be directly used to add support for spell checking in the gtk+ stack ...

Comment 47 Leonardo Ferreira Fontenelle 2010-11-02 21:58:19 UTC

(In reply to comment #46)
> So, what is the status of this? Is someone working (or interested to work) in
> gspell or in tbf approach?
> Maybe today current Enchant API can be directly used to add support for spell
> checking in the gtk+ stack ...

What would be needed for the Enchant API to be used directly by Gtk+?

Comment 48 Serkan Kaba 2010-11-03 09:01:20 UTC

(In reply to comment #47)
> (In reply to comment #46)
> > So, what is the status of this? Is someone working (or interested to work) in
> > gspell or in tbf approach?
> > Maybe today current Enchant API can be directly used to add support for spell
> > checking in the gtk+ stack ...
> 
> What would be needed, for the Enchant API to be directly used by Gtk+?

Incorporate Gtkspell?

Comment 49 Leonardo Ferreira Fontenelle 2011-03-16 02:47:57 UTC

(In reply to comment #45)
> (In reply to comment #44)
> > (In reply to comment #39)
> > > Actually, word splitting can be extremely hard, depending on the language.
> > 
> > The Aspell dictionaries, for instance, can define if hyphenated words are to be
> > checked together or separated.
> > 
> 
> That's not really a core competency of a spell checker. We'd be better off
> leaving that to something with substantial knowledge of language rules, like
> ICU. It already implements word and line breaking iterators that take things
> like hyphenation into account.

ICU defines word boundaries, and we could use it, but I would recommend reading this:

The correct interpretation of hyphens in the context of word boundaries is challenging. It is quite common for separate words to be connected with a hyphen: “out-of-the-box,” “under-the-table,” “Italian-American,” and so on. A significant number are hyphenated names, such as “Smith-Hawkins.” When doing a Whole Word Search or query, users expect to find the word within those hyphens. While there are some cases where they are separate words (usually to resolve some ambiguity such as “re-sort” as opposed to “resort”), it is better overall to keep the hyphen out of the default definition. Hyphens include U+002D hyphen-minus, U+2010 hyphen, possibly also U+058A ( ֊ ) armenian hyphen, and U+30A0 katakana-hiragana double hyphen.

Implementations may build on the information supplied by word boundaries. For example, a spell-checker would first check that each word was valid according to the above definition, checking the four words in “out-of-the-box.” If any of the words failed, it could build the compound word and check if it as a whole sequence was in the dictionary (even if all the components were not in the dictionary), such as with “re-iterate.” Of course, spell-checkers for highly inflected or agglutinative languages will need much more sophisticated algorithms.

The use of the apostrophe is ambiguous. It is usually considered part of one word (“can’t” or “aujourd’hui”) but it may also be considered as part of two words (“l’objectif”). A further complication is the use of the same character as an apostrophe and as a quotation mark. Therefore leading or trailing apostrophes are best excluded from the default definition of a word. In some languages, such as French and Italian, tailoring to break words when the character after the apostrophe is a vowel may yield better results in more cases. This can be done by adding a rule WB5a.

From: http://www.unicode.org/reports/tr29/#Word_Boundaries
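The spell-checker strategy from the quoted text (check each component, then fall back to the whole compound) could be sketched like this in Python. The function name `check_compound`, the set-based `dictionary` and the restriction to two hyphen characters are assumptions for illustration only.

```python
import re

# Only two of the hyphens listed in the quoted text: U+002D and U+2010.
HYPHENS = "\u002d\u2010"
SPLIT_RE = re.compile("[" + re.escape(HYPHENS) + "]")


def check_compound(token, dictionary):
    """Return True if a possibly hyphenated token is spelled correctly.

    First every component is looked up on its own; only if one of them
    fails is the whole compound looked up as a single dictionary entry.
    """
    parts = [part for part in SPLIT_RE.split(token) if part]
    if parts and all(part.lower() in dictionary for part in parts):
        return True
    return token.lower() in dictionary
```

With a dictionary containing "out", "of", "the", "box" and "re-iterate", both "out-of-the-box" and "re-iterate" pass, while "out-of-teh-box" fails. As the quoted text says, highly inflected or agglutinative languages need far more than this.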

Comment 50 Sébastien Wilmet 2012-07-31 12:37:07 UTC

The link for gspell [1] is now broken. But gtkhtml has a modified version, used for Evolution [2] [3].

Enchant is still used by KDE and many other applications or libraries, so does the project of GObjectifying enchant still make sense?

An easier solution would be to use enchant as a dependency, and wrap its features in GObject classes, as is done in gtkhtml or in the gedit spell plugin.

Fetching the name of a language from the iso-codes (for example, the name of fr_BE is "French (Belgium)") is useful for other purposes than spell checking. So I've filed a bug (bug #680876) to include this feature in PangoLanguage directly. If it is accepted and done, the code for the spell checking can be simplified to use PangoLanguage.
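For illustration, the name lookup amounts to something like the following Python sketch. The two tiny tables are hand-written stand-ins for the iso-codes data (ISO 639 languages and ISO 3166 countries), and `display_name` is a hypothetical helper, not a PangoLanguage function.

```python
# Hand-written subsets standing in for the iso-codes language and country tables.
LANGUAGES = {"fr": "French", "en": "English", "pt": "Portuguese"}
COUNTRIES = {"BE": "Belgium", "GB": "United Kingdom", "BR": "Brazil"}


def display_name(code):
    """Return e.g. "French (Belgium)" for "fr_BE"; fall back to the raw code."""
    language, _, country = code.replace("-", "_").partition("_")
    name = LANGUAGES.get(language.lower())
    if name is None:
        return code  # unknown language: show the code unchanged
    region = COUNTRIES.get(country.upper())
    return f"{name} ({region})" if region else name
```

A real implementation would also have to return translated names, which the iso-codes package provides through gettext catalogs.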

Also, as said on the mailing list [2], GIO is a better place these days for wrapping enchant, provided GIO can depend on enchant, obviously.

[1] http://techn.ocracy.org/gspell/
[2] https://mail.gnome.org/archives/gtk-devel-list/2012-January/msg00050.html
[3] http://git.gnome.org/browse/gtkhtml/tree/components/editor

Comment 51 Alexandre Franke 2013-09-20 14:24:33 UTC

(In reply to comment #48)
> Incorporate Gtkspell?

GtkSpell could certainly serve as a testbed for a future integration.

Currently some important modules like Empathy or gedit use Enchant directly and have a lot of code in their tree to handle things that are not handled at the Enchant level and may be missing from GtkSpell. It would be nice to have these features implemented in GtkSpell and these modules moving to GtkSpell.

Comment 52 Sandro Mani 2013-09-20 14:45:45 UTC

As the current maintainer of gtkspell, I'm certainly open to incorporating the missing features. If corresponding tickets are opened, I'll look at implementing them.

Comment 53 Allison Karlitskaya (desrt) 2013-09-26 11:54:22 UTC

*** Bug 708807 has been marked as a duplicate of this bug. ***

Comment 54 Pander 2015-01-15 12:38:53 UTC

There is a bounty for this issue at https://www.bountysource.com/issues/2679534-adding-support-for-spellcheckers-into-the-gtk-stack. Please make a donation too.

Comment 55 Daniel Korostil 2015-01-15 23:14:30 UTC

Sébastien, what is the status? Are you still working on it, or can you delegate the work to someone else?

Comment 56 Sébastien Wilmet 2015-01-16 10:21:23 UTC

No, I don't plan to work on this any time soon. Some links:

https://mail.gnome.org/archives/gtk-devel-list/2013-October/msg00025.html
https://wiki.gnome.org/Initiatives/SpellChecking

Comment 57 Pander 2016-03-16 09:14:03 UTC

Regarding Comment #32 on Linux

Please omit:
* Ispell
* Pspell
* MySpell

Please support at least:
* Hunspell via libhunspell
* Enchant via libenchant

Comment 58 Sébastien Wilmet 2016-03-16 10:09:08 UTC

There is now gspell:
https://wiki.gnome.org/Projects/gspell

I'm not sure that we want to keep this bug open for another decade.

Comment 59 Matthias Clasen 2016-03-16 11:54:02 UTC

People still want spell checking available by default in every entry/textview.

Comment 60 Matthias Clasen 2018-02-10 05:10:43 UTC

We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.

Comment 61 Bastien Nocera 2018-02-13 10:42:47 UTC

As per comment 59.

Comment 62 GNOME Infrastructure Team 2018-05-02 14:26:15 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/gtk/issues/274.