Bug 377891 - Automatic Language Detection
Status: RESOLVED OBSOLETE
Product: tracker
Classification: Core
Component: General
Version: unspecified
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Edward Duffy
QA Contact: Tracker maintainers
Depends on:
Blocks:
 
 
Reported: 2006-11-21 19:51 UTC by Florian Steinel
Modified: 2010-03-11 15:01 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch using libtextcat-2.2 (9.78 KB, patch)
2007-03-05 23:14 UTC, Edward Duffy
Updated patch (9.72 KB, patch)
2007-03-07 16:43 UTC, Edward Duffy

Comment 1 Edward Duffy 2007-03-05 23:14:58 UTC
Created attachment 84020
Patch using libtextcat-2.2

OK, here's a patch that uses a system-installed libtextcat.  It's in the Feisty (Ubuntu 7.04) repository, so I used that one.  A couple of remarks:

 1. Language detection will only work for plain text files and mime types we have text filters for.

 2. libtextcat seems to require full paths. I've included a configuration file for tracker to use, so we know where that is installed. However, the paths to the language models have also been hardcoded to /usr/share/libtextcat.

 3. Instead of full words for the languages ("english", "french", etc.), I've replaced them with standard two-letter language codes ("en", "fr", etc.).

 4. I've removed support for most of the languages supported by libtextcat, and only left in the ones supported by the stemmer (see your tracker.cfg for the list).

 5. configure.ac is hardcoded to use libtextcat.  Could one of our autotools wizards patch my patch so that the check for libtextcat is optional?  All my changes are wrapped in HAVE_LIBTEXTCAT macros, and src/trackerd/Makefile.am also needs updating.

 6. And finally, this does create a new metadata type (File:Language), so you'll need to rebuild your tracker database to use this.
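
Remarks 3 and 4 amount to a small mapping step between whatever the classifier reports and the codes the stemmer understands. A minimal sketch in Python, assuming the classifier returns its candidates as bracketed names such as "[english][french]" (an assumption about the result format, not taken from the patch). The table contents and the function name are hypothetical; the real list of supported languages is the one in tracker.cfg.

```python
import re

# Hypothetical subset for illustration; the authoritative list is in tracker.cfg.
NAME_TO_CODE = {
    "english": "en",
    "french": "fr",
    "german": "de",
    "spanish": "es",
}


def to_language_codes(result):
    """Turn a result like "[english][french]" into ["en", "fr"].

    Names with no entry in NAME_TO_CODE are dropped, mirroring the idea
    of keeping only the languages the stemmer supports.
    """
    names = re.findall(r"\[([^\]]+)\]", result)
    return [NAME_TO_CODE[name] for name in names if name in NAME_TO_CODE]
```

A result with no bracketed names at all simply yields an empty list, which a caller could treat as "language unknown".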
Comment 2 Edward Duffy 2007-03-07 16:43:50 UTC
Created attachment 84182
Updated patch

This update pulls the language detection out of trackerd.  Some files I tested with Asian languages were causing segfaults.

This is also UTF-8 friendly.
Comment 3 era+gnome 2007-12-15 18:18:35 UTC
You should probably take care to create new language models which are appropriate for the domain and language sets you wish to process.  The language models which ship with Gertjan's original TextCat were proof of concept ones, nothing more.  Dunno if libtextcat ships with different ones, but I would not be surprised if they simply recycled Gertjan's.

The mguesser library comes with larger models which were also specifically developed for a document indexing system (mnogosearch).  They are almost compatible with the TextCat format; you need a couple of minor tweaks but it's literally a sed one-liner.
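
For readers unfamiliar with what a "language model" means here: TextCat-style models are, as far as I understand them, ranked lists of the most frequent character n-grams in a training text, and a document is assigned to the model whose ranking is closest to its own (the "out-of-place" measure). The following is a rough Python sketch of that idea only. It is not libtextcat's or mguesser's code, and the function names, the padding character and the cutoff of 300 n-grams are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams absent from the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, models):
    """Pick the language whose model profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

Here `models` would be a dict from language code to a profile built with `ngram_profile` from training text. The quality of the guess depends heavily on that training text, which is why models built for the target domain matter.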

Do you have more information about how to reproduce those segfaults?
Comment 4 era+gnome 2007-12-15 18:24:30 UTC
I looked at the patch; it's pretty obvious that these language models are the original ones by Gertjan van Noord (a comment even states this).

If you want to investigate mguesser, you could also try libmguesser, though I have no idea how it behaves when compiled as a library.  You can build the library straight from the mguesser sources.

Sorry for following up on my own comment /-:
Comment 5 Ivan Frade 2008-10-29 13:39:14 UTC
Marked the patch as obsolete. Text indexing is now done in the indexer.
Comment 6 Duncan Lithgow 2010-02-03 09:12:56 UTC
Is there a status update for this? It looks like some work went into getting started - but I can't see if anything came of the initial work.

Also, I disagree that this is an enhancement. I think that the lack of this feature is really a minor bug. What is the use of stemming if it can't reliably work out the document's language?
Comment 7 Ivan Frade 2010-02-03 11:48:26 UTC
No work planned on this in the short term.

Locale gives us the main language in the system and that is good enough for a regular user.
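
As an illustration of the locale-based approach, here is a minimal Python sketch that derives a language code from the usual POSIX environment variables. It is not Tracker's implementation. The function name is hypothetical, and consulting LC_MESSAGES (rather than another category) is an assumption. The precedence LC_ALL, then the category, then LANG follows POSIX.

```python
import os


def language_from_locale(env=None):
    """Return the language part of LC_ALL, LC_MESSAGES or LANG, or None."""
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(var, "")
        if value:
            # "de_DE.UTF-8@euro" -> "de"
            lang = value.split(".")[0].split("@")[0].split("_")[0].lower()
            # The "C"/"POSIX" locale carries no language information.
            return None if lang in ("c", "posix") else lang
    return None
```

Note that this yields one language for the whole system, so every document is treated as being in that language regardless of its content.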
Comment 8 Duncan Lithgow 2010-02-03 13:05:39 UTC
I think you'll find that your idea of 'regular user' is rather ameri-centric (a word I made up). Here in Europe, Latin America, the Middle East, and most places actually, people have their system configured for one language but work in several. The number of people who use their local language plus English must be rather large.

But of course this doesn't change the fact that no-one is motivated to work on this just now, but it is something to remember.
Comment 9 Martyn Russell 2010-03-11 15:01:18 UTC
Thanks for taking the time to report this bug.
However, you are using a version that is too old and not supported anymore. GNOME developers are no longer working on that version, so unfortunately there will not be any bug fixes for the version that you use.

By upgrading to a newer version of GNOME you could receive bug fixes and new functionality. You may need to upgrade your Linux distribution to obtain a newer version of GNOME.
Please feel free to reopen this bug if the problem still occurs with a newer version of GNOME.