GNOME Bugzilla – Bug 168189
Strip diacritics and split ligature characters when searching
Last modified: 2018-07-03 09:53:08 UTC
If you search for "aeiou" in Google, it will also match pages with "áèïôu". This is useful because it lets you find pages with bad spelling, but that nonetheless contain the information you want. Beagle should do the same. This page describes a quick-n-dirty way to strip diacritics: http://weblogs.asp.net/michkap/archive/2005/02/19/376617.aspx
Unfortunately, this example requires .NET 2.0 (or a compatible API), which Mono does not support yet.
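For illustration, a common way to strip diacritics is to decompose the text (Unicode normalization form D) and drop the combining marks. The sketch below is in Python rather than .NET and is not taken from the linked post or from Beagle; the helper name `strip_diacritics` is an assumption made for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed character such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category "Mn"); keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("áèïôü"))
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged, so this approach alone would not cover every case.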
Created attachment 70798: Strip Diacritics when searching
OK, the attached patch does that, but at the same time, doesn't this ruin any chance at real international support? Maybe I just did it the wrong way. Regardless, I only did this in the beagle-search frontend, since other frontends may choose differently, and it's not hard to implement.
Scratch that, this causes XMLSerialization to die horribly.
The right way to do this is to add an additional filter to the analyzer that makes these modifications for you. Changing things like é to e is easy, but for ü you probably want to support both u and ue. That's a little trickier.
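A rough sketch of what such a filter could do, written in Python rather than against the Lucene.Net API: for each token it yields the original, the accent-stripped form, and a form with umlauts expanded to two letters. The names `folding_filter`, `strip_marks`, `expand_umlauts` and the table `UMLAUT_EXPANSIONS` are all hypothetical and exist only for this example.

```python
import unicodedata
from typing import Iterable, Iterator

# Hypothetical expansion table: German umlauts and their two-letter spellings.
UMLAUT_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def strip_marks(token: str) -> str:
    """Decompose the token and drop combining marks ("ü" becomes "u")."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def expand_umlauts(token: str) -> str:
    """Replace each umlaut with its two-letter spelling ("ü" becomes "ue")."""
    composed = unicodedata.normalize("NFC", token)
    return "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in composed)


def folding_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each token followed by its distinct folded variants."""
    for token in tokens:
        seen = set()
        for variant in (token, strip_marks(token), expand_umlauts(token)):
            if variant not in seen:  # skip variants identical to an earlier one
                seen.add(variant)
                yield variant


if __name__ == "__main__":
    print(list(folding_filter(["Müller", "café"])))
```

In a real Lucene token filter the variants would normally be emitted at the same position as the original token, the way synonyms are, so that phrase queries keep working. That detail is left out here.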
Lucene.Net 1.9.1 has support for this; we should add it once it's checked in.
Marking NEW, since Lucene.Net 1.9.1 is now checked in and merged...
Just adding some bug-intertwine, we need some sort of language detection for this to work.
I don't think this bug depends on 354742. We probably want to strip diacritics regardless of language. There is a Lucene filter (new in 2.0, I believe) for stripping them; we should look into using it: http://cvs.gnome.org/viewcvs/*checkout*/beagle/beagled/Lucene.Net/Analysis/ISOLatin1AccentFilter.cs
FYI, there is another one in the making, a StripLatinDiacriticsFilter https://issues.apache.org/jira/browse/LUCENENET-38
*** Bug 482567 has been marked as a duplicate of this bug. ***
*** Bug 525911 has been marked as a duplicate of this bug. ***
Bug 525911, which I just marked as a dup, is a slight variation on the diacritic problem. It has ligature characters like a combined "fi" that should be searchable with the regular, individual "fi" characters. Attached to that bug is a PDF which illustrates the behavior.
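One possible way to handle the ligature case, sketched in Python under the assumption that the text contains Unicode presentation-form ligatures such as U+FB01: compatibility normalization (NFKC) replaces them with their individual letters. The helper name `split_ligatures` is hypothetical, and this is not what Beagle does.

```python
import unicodedata


def split_ligatures(text: str) -> str:
    """Replace compatibility ligatures with their letters (hypothetical helper)."""
    # NFKC applies compatibility decompositions, so U+FB01 becomes "f" + "i",
    # then recomposes canonically, so accented letters such as "é" are kept.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # "\ufb01" is the single-character "fi" ligature.
    print(split_ligatures("\ufb01nd"))
```

NFKC also folds other compatibility characters (full-width forms and superscripts, for example), and it leaves letters such as "æ" untouched because they are not compatibility ligatures, so a real filter might need an explicit mapping as well.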
Beagle is not under active development anymore and had its last code changes in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the responsibility for active development again.