GNOME Bugzilla – Bug 354742
Automatic Language Detection
Last modified: 2018-07-03 09:56:00 UTC
It would be sweet if we could automatically detect the language of a document so that we could perform correct text analysis. One way to do it would be to port TextCat to C#, a nice little project for somebody who wants to get involved with Beagle. A Java implementation of TextCat is available at http://textcat.sourceforge.net/ and is distributed under the LGPL.
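For reference, the TextCat technique is small enough to sketch: build a rank-ordered character n-gram profile per language from a sample text, then score a document by the "out-of-place" distance between its profile and each language fingerprint, picking the closest. A rough Python sketch of the idea (illustrative only, not the proposed C# port; the 300-gram cutoff and underscore padding are conventional TextCat choices):

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Build a frequency-ranked character n-gram profile of a text sample."""
    text = "_" + text.replace(" ", "_") + "_"  # mark word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the most frequent n-grams, recording each one's rank.
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """TextCat's 'out-of-place' distance: sum of rank differences,
    with a maximum penalty for n-grams missing from the fingerprint."""
    max_penalty = len(lang_profile)
    dist = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            dist += abs(rank - lang_profile[gram])
        else:
            dist += max_penalty
    return dist

def detect(text, fingerprints):
    """Return the language whose fingerprint is closest to the text."""
    doc = ngram_profile(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc, fingerprints[lang]))
```

In practice the fingerprints would be precomputed from long sample texts and stored on disk, which is essentially what the patches below do with the lang_fp files.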
I think I'm going to work on this; let me know if people are no longer interested in it or someone else is on it. Thanks!
I've got the language categorizer ported to C# and working, but I was wondering if anyone had any insight as to where it should go in beagled. I'm having trouble finding the point where it interprets the text and sends it off to Lucene. Thoughts? Comments? Phrased differently: if you had a function called GetTextLanguage(string s), where would you call it?
Right now we have a BeagleAnalyzer, which derives from Lucene's StandardAnalyzer in LuceneCommon.cs. What we probably want to do is use Snowball analyzers instead, using whichever one is appropriate for the language of the document. See https://svn.apache.org/repos/asf/incubator/lucene.net//trunk/C%23/contrib/Snowball.Net/ for the svn repo of the Snowball analyzers in C#.
Created attachment 76877 [details] [review] 0/2 - Patch to add text categorization support to Beagle CVS

This patch adds three main things:

* Two classes in Beagle.Util that use statistics to discover the language a text sample is written in
* An (incomplete) library of sample texts categorized by locale name
* A program in ./tools called beagle-get-language that reads a text file and prints its locale and friendly language name (make sure to run make install before using it; the program won't find the language fingerprints otherwise)

Most of the samples I got from libtextcat, the original implementation of this concept; however, a lot of them were in encodings that TextReader mangles when read, so I've been going through them and replacing them as needed. All you have to do to add support for a language is find a significantly long piece of text written in it; I've been using Wikipedia for this purpose (the topic of the text doesn't matter, just pick anything!). I'll work on improving the samples, but I wanted to get the code out as soon as possible.
Created attachment 76880 [details] [review] 0/2 - Patch to add text categorization support to Beagle CVS

This patch adds three main things:

* Two classes in Beagle.Util that use statistics to discover the language a text sample is written in
* An (incomplete) library of sample texts categorized by locale name
* A program in ./tools called beagle-get-language that reads a text file and prints its locale and friendly language name (make sure to run make install before using it; the program won't find the language fingerprints otherwise)

Most of the samples I got from libtextcat, the original implementation of this concept; however, a lot of them were in encodings that TextReader mangles when read, so I've been going through them and replacing them as needed. All you have to do to add support for a language is find a significantly long piece of text written in it; I've been using Wikipedia for this purpose (the topic of the text doesn't matter, just pick anything!). I'll work on improving the samples, but I wanted to get the code out as soon as possible.

Apply with patch -p1
Please ignore the first patch; it's not right and generally f'd up. I hit submit accidentally and it got sent early.
Created attachment 77309 [details] [review] Update of previous patch + integration with LuceneCommon

This patch updates the previous one, making the results _much_ more reliable. It also fixes the texts that were bad in the previous patch and adds new ones. Finally, it adds a new keyword, "Language", to all files indexed by beagled where a language can be determined.

The text categorization is almost 100% accurate on my machine (which is admittedly mostly English files, though I have tested it on Farsi, Japanese, German, and French text samples); it even gets the 2nd-closest language right (Chinese for Japanese texts, Dutch for German texts, etc.).

However, the biggest problem right now is performance (as with any addition to the indexer); I suspect beagled performs slower with this patch and uses more memory. Some of the Beagle perf experts can definitely help here. We could also do some more clever things wrt character sets (if we see Chinese characters, don't bother testing for Finnish; this has some problems with multi-language documents, or something that's 99% English with one kanji, for example). In this regard I've tried to make the detector bail early if it doesn't have enough data, but I suspect some more aggressive optimization could be done as well.
Another important note: when you add new languages to sample_texts, you _must_ make sure the text is long enough, otherwise detection will be incorrect for every language. The easiest way to check is to sudo make install, then run "wc {prefix}/share/beagle/lang_fp/*". If the counts aren't all 1000, you need to add more text to that sample file.
Just confirming the bug, since it's getting close to complete.
Created attachment 78579 [details] [review] Newest iteration of patch

This patch adds the language:<some_lang> keyword to query parsing.
Some details about the patch: you can specify the query using either the language code (en, de, fr) or the name of the language in your current locale (for example, language:Français works if LANG="fr" and language:French works if I'm in LANG="en"). I tried playing around with optimizing analyze() but didn't get very far; I suspect it could definitely be done, since it calls Substring() a lot. Some memory analysis using heap-buddy would probably be good as well, although I think a lot of the memory usage is temporary (lots of string objects that get created and then discarded).
D'oh! I just realized that Bugzilla hoses up my gzip'd patch, so I'm uploading the uncompressed one (it was starting to get big, so I wanted to save space). I also checked this patch against CVS HEAD and it applies cleanly, but I'm not sure whether it works, since I need Mono 1.1.18 and I only have 1.1.17 on this machine (working on getting the new one). What are the next steps for getting this patch committed? I'd like to continue the work in this area, but I want to make sure that what I'm doing is helping towards getting this merged.
Created attachment 79961 [details] [review] Update of beagle patch, un-gzip'd Apply with patch -p1 (older patches used -p0, sorry about that)
I had a brief look at the code. Some comments (it's only a brief look, not a rigorous review :) follow:

* TextCategorizer is opening a lot of (sample) files, and it doesn't look like the StreamReaders are being closed. Make sure opened files are properly closed.
* In LuceneCommon, when you call string lang = lang_table.GetTextLanguage(reader); doesn't that consume all the text from 'reader'? In that case, when Lucene later tries to fetch data from 'reader' for adding to the index, it won't get any. The reader is of type Beagle.Util.PullingReader; I don't remember if it allows resetting. If what I am guessing is correct, some more work is needed to ensure no data is lost while determining the language from the reader.
* Is there a Dublin Core metadata element with the name "lang"? Could you check, and if there is, please use "dc:lang" for the language property.
* You are merely setting an attribute - how is that going to help? I think using the correct analyzer based on the language should be the prime goal.

I did not follow this bug and the related mailing list discussions, so no technical comments as of now.
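The reader-consumption point is easy to reproduce in miniature: any detector that reads the stream to sample its text leaves nothing behind for the indexer. A tiny Python illustration of the general pitfall (the get_text_language stand-in is hypothetical; this is not Beagle's PullingReader code):

```python
import io

def get_text_language(reader):
    """Pretend language detector: note that it consumes the whole stream."""
    sample = reader.read()
    return "en" if " the " in f" {sample} " else "unknown"

reader = io.StringIO("the quick brown fox")
lang = get_text_language(reader)

# The reader is now exhausted: anything that later pulls from it
# for indexing (as Lucene would) sees an empty stream.
leftover = reader.read()

# Opening a fresh reader over the same source (what the later patch
# does via a second GetTextReader call) avoids losing the data.
fresh = io.StringIO("the quick brown fox")
```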
Hi Paul,

I sincerely apologize for not reviewing the code yet, and I appreciate that you're keeping it up to date with the trunk. I've been busy with other things (the unified indexes branch, the move to SVN, and the holidays) and haven't given the code the testing it deserves. If you want, it may make sense to check this code in on a branch, give it some testing, and solidify it there. It's a large chunk of code to have to review and increment in patches.

I agree with Bera: determining the language of the document is the right first step, but ultimately we need to use analyzers for that language to actually store it in the index. (Right now we're only using English.) There is a set of analyzers called Snowball that we can use. The Lucene.Net guys have even done a port of the Java ones to C# for use in Lucene. We should use those.

Another thing to nitpick about is that a lot of the code doesn't follow the Beagle style guidelines, which are generally a requirement for inclusion. They can be found in the HACKING file, but the big one I've noticed is having a space between functions/keywords and the opening parenthesis, i.e. "DoStuff (foo)" or "if (bar)" rather than "DoStuff(foo)" and "if(bar)". I'd like to see these fixed, but they won't ultimately cause me to say "no" to the code's inclusion.
Bera: I've added the Close () calls; what a n00b mistake. I think I assumed it'd be handled by the finalizer, which it would be, but not necessarily as fast as it could be. I fixed the metadata too (it's 'dc:language'). As for setting the attribute, it allows you to do searches based on the language of the content; I've added the appropriate plumbing to QueryStringParser (see comment #11). I also fixed the reader issue by using a new reader for the language detection and then closing it. It's not terribly elegant, but after looking at the underlying code I think it will work all right; a new Stream gets allocated when you call GetTextReader.

Joe: That's all right about the time, don't worry about it. I can't commit the code because I don't have SVN write access, but what I will do is split the samples and the code patches apart; that'll make it much easier to review. I've checked out the Snowball analyzers, and the problem is that the "port" is actually a port of generated code; what I'm trying to do instead (and I'm almost there) is port the _generator_, using the Java generator as a base. Unfortunately, the code is... unfriendly, and his Makefile is really awful; I may eschew contributing the code directly back to him and just use it for Beagle (and email him the generator, of course).

Another problem is the handling of non-UTF-8 character sets in the indexables. Basically, there are two ways to go: we can either make an extra sample text for each encoding ("ja-SHIFTJIS" for example) or convert all the incoming text to UTF-8. I'm in favor of the latter, because we can take advantage of Mono/.NET's text support and it uses less memory. It may be best to jump off that bridge later, though.
Created attachment 80053 [details] [review] [0/1] Update of patch incorporating fixes suggested
Created attachment 80054 [details] [review] [1/1] Separated sample texts
Created attachment 80055 [details] [review] [0/1] Update of patch incorporating fixes suggested - de'dumbed The 's' in S-Expression stands for 'stupid'.
Created attachment 80056 [details] [review] [0/1] Update of patch incorporating fixes suggested - de'dumbed 2 This is not my day.
It appears that Jacob Rideout from KDE is doing similar work on language identification with Sonnet; he blogs about it here: http://blog.jacobrideout.net/ They use a variant of this method that appears to be less accurate but also uses less memory, and they optimize by using the character set as an indicator of the language. This may be the way to go, but I'd like to get the numbers on it first.
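The character-set optimization mentioned here (and the "Chinese characters vs. Finnish" idea from the earlier comment) amounts to a cheap prefilter: classify the text's dominant Unicode script and only run the expensive n-gram comparison against languages written in that script. A hedged Python sketch, with an illustrative (and deliberately tiny) script-to-language table; using a majority vote over characters also sidesteps the one-stray-kanji-in-English problem:

```python
import unicodedata

# Hypothetical mapping from dominant script to candidate language codes.
SCRIPT_LANGS = {
    "LATIN": ["en", "de", "fr", "fi", "nl"],
    "CJK": ["ja", "zh"],
    "ARABIC": ["ar", "fa"],
}

def dominant_script(text):
    """Bucket letters by Unicode name prefix and return the majority script."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
            script = "CJK"
        elif name.startswith("LATIN"):
            script = "LATIN"
        elif name.startswith("ARABIC"):
            script = "ARABIC"
        else:
            script = "OTHER"
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "OTHER"

def candidate_languages(text):
    """Prune the fingerprint set to languages plausible for the text's script."""
    return SCRIPT_LANGS.get(dominant_script(text), [])
```

The win is that a CJK document never pays for comparisons against the dozens of Latin-script fingerprints, which is where most of the Substring/profile work would otherwise go.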
Alright, I've finished porting the Snowball compiler to C#. I haven't integrated the stemmers into Beagle proper yet, but that shouldn't be too bad.

What this patch does that's new:

* Adds the ./Util/Snowball directory, which contains a Snowball compiler that converts .sbl files to C#, as well as stemmer algorithms for Dutch, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, and Swedish.
* Invokes this compiler and adds the resulting files to Util.dll.
* Also includes all the TextCat stuff (I couldn't split them apart due to the way the patches are created, sorry).

Some notes:

* The classes are all of the form Snowball.Stemmer_<langcode> (Snowball.Stemmer_en, for example). The generated code is pretty ugly, but the generator code I based it on is awful; it can't be helped.
* All of the classes are derived from SnowballProgram, so it should be pretty easy to write a Reflection routine that finds and instantiates all the SnowballPrograms and reads the class names.

Also, sorry about all the Bugzilla mess; I'm not that great at making patches yet (today I discovered splitdiff and was amazed). Perhaps this belongs in its own branch, because the patches are getting pretty difficult to keep track of. Basically, here's how to apply it:

svn co http://svn.gnome.org/svn/beagle/trunk/beagle .
cd beagle
patch -p1 < beagle-textcat-sources.patch
patch -p0 < beagle-textcat-in-beagled-v3.patch
patch -p0 < beagle-snowball.patch
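The reflection routine suggested above is the usual discover-by-naming-convention pattern. A Python analog for illustration only (the stand-in classes mirror the Stemmer_<langcode> scheme; the real code would enumerate SnowballProgram subclasses in Util.dll via System.Reflection):

```python
class SnowballProgram:
    """Stand-in for the common Snowball base class."""
    pass

# Stand-ins for generated stemmers; in Beagle these would be the
# classes the Snowball compiler emits into Util.dll.
class Stemmer_en(SnowballProgram):
    pass

class Stemmer_de(SnowballProgram):
    pass

def discover_stemmers():
    """Find every SnowballProgram subclass and key it by the language
    code parsed from its class name."""
    stemmers = {}
    prefix = "Stemmer_"
    for cls in SnowballProgram.__subclasses__():
        name = cls.__name__
        if name.startswith(prefix):
            stemmers[name[len(prefix):]] = cls
    return stemmers
```

With a table like this in hand, the indexer could instantiate the right stemmer from the language code the TextCat detector returns, falling back to the English analyzer when no match exists.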
Created attachment 81223 [details] [review] Snowball compiler for Beagle
Created attachment 81224 [details] [review] [0/1] Textcat patch + Snowball build patch
Alright, as of two days or so ago I officially have SVN access, so should I create this on a separate branch now?
It's a lot of code. A branch would be good.
Of interest may be an upstream Lucene issue in their tracker: https://issues.apache.org/jira/browse/LUCENE-826
Paul, making some attempt to wake up from hibernation :) As a first step, I am trying to get Snowball into Beagle. I am a bit lost - are these the two different ways of doing this?

- Use Lucene.Net's Snowball.Net: they port the generated Java files to C#.
- Use your generator, which generates managed code directly from Snowball source.

Also, if the Snowball guys make changes to their source (and data), then how easy is it to get the corresponding C# changes?
And are you by any chance using this: http://thread.gmane.org/gmane.comp.search.snowball/916/focus=916
I felt it safe to pull in dotLucene's Snowball.Net - they are sort of released and hence maintained. r3960 contains the last related commit. With the trunk now armed with the Snowball stemmer, the next goal is to recognize non-English data and get non-English stemmers activated. I will look into this later. The existing work by Paul, as attached above, is directly relevant and should be looked into.
Beagle is not under active development anymore and had its last code changes in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the responsibility for active development again.