GNOME Bugzilla – Bug 713191
email body search corpus improvements
Last modified: 2021-07-05 13:26:15 UTC
---- Reported by chaz@yorba.org 2013-05-21 17:15:00 -0700 ----

Original Redmine bug id: 6980
Original URL: http://redmine.yorba.org/issues/6980
Searchable id: yorba-bug-6980
Original author: Charles Lindsay
Original description:

There are a few improvements we can make to how we add email bodies to the search table:

* We may need to be smarter about stripping HTML in general. The algorithm used currently may just need to be tweaked to split words where it's not, or not split where it is.
* <del>We should be adding the text of any attachments a client might display inline, e.g. email attachments.</del> (Became #7069.)
* When we strip HTML, we could add in things like alt-text for images. Not sure if this is desirable.

These can be split up into separate tickets as necessary.

Related issues:
related to geary - 6837: Normalize message bodies for full-text search table (Fixed)
related to geary - 7069: index attachments, especially ones we display inline (Fixed)

---- Additional Comments From geary-maint@gnome.bugs 2013-09-04 16:28:00 -0700 ----

### History

#### #1 Updated by Jim Nelson 6 months ago

* **Target version** set to _0.4.0_

#### #2 Updated by Jim Nelson 5 months ago

* **Assignee** set to _Charles Lindsay_
* **Priority** changed from _Normal_ to _High_

#### #3 Updated by Charles Lindsay 5 months ago

* **Description** updated (diff)

Split out the second bullet point to #7069.

#### #4 Updated by Charles Lindsay 5 months ago

* **Assignee** deleted (<strike>_Charles Lindsay_</strike>)
* **Priority** changed from _High_ to _Low_

The rest of these improvements seem low priority.

#### #5 Updated by Jim Nelson 3 months ago

* **Target version** changed from _0.4.0_ to _0.5.0_

--- Bug imported by chaz@yorba.org 2013-11-21 20:19 UTC ---

This bug was previously known as _bug_ 6980 at http://redmine.yorba.org/show_bug.cgi?id=6980
Unknown milestone "unknown in product geary. Setting to default milestone for this product, "---".
Setting qa contact to the default for this product. This bug either had no qa contact or an invalid one. Resolution set on an open status. Dropping resolution
A number of these have been addressed in Bug 714317. We'd need to do an FTS rebuild to get them to apply to old messages, though.
Questions about the following improvement:

* We may need to be smarter about stripping HTML in general. The algorithm used currently may just need to be tweaked to split words where it's not, or not split where it is.

Is the algorithm you are talking about the function html_to_text(...) in src/engine/util/util-html.vala line 123?

Can you provide an example of a string containing a word which should not be split and an example of a string containing a word which should be split?
(In reply to Michael Hochleitner from comment #2)
> Questions about the following improvement:
>
> * We may need to be smarter about stripping HTML in general. The algorithm
> used currently may just need to be tweaked to split words where it's not, or
> not split where it is.
>
> Is the algorithm you are talking about the function html_to_text(...) in
> src/engine/util/util-html.vala line 123?

That's right - that gets called by Geary.RFC822.Message.get_searchable_body() to convert HTML bodies to text for searching.

> Can you provide an example of a string containing a word which should not be
> split and an example of a string containing a word which should be split?

These mostly arise as a result of the semantics of HTML. Typically block elements should cause a word (and line) break, but inline elements shouldn't. E.g. the HTML string "<UL><LI>Break</LI></UL><P>Fast" should be converted to "Break\n\nFast", not "BreakFast", and similarly "in<EM>line</EM>" should be converted to "inline", not "in line".

As I mention above, this may already be working well enough, so all that might be needed here is adding some unit tests to Geary.HTML.UtilTest to ensure that the plain text produced by that method is doing the right thing, and improving cases where it's a bit suboptimal (coalescing adjacent spaces, for example). There are already a couple of test cases, but they only cover a limited number of cases.
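To make the block-versus-inline rule concrete, here is a minimal sketch in Python. It is not Geary's html_to_text() (which is Vala and is not reproduced here); the names html_to_text_sketch, _TextExtractor and BLOCK_TAGS are hypothetical, and the list of block elements is an assumption and far from complete. It only illustrates the two example strings above plus coalescing of adjacent spaces.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of elements treated as block-level.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "br", "blockquote",
              "table", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text, emitting a break marker around block elements only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <UL> arrives as "ul".
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Inline elements such as <em> add nothing, so "in<EM>line</EM>"
        # stays a single word.
        self.parts.append(data)


def html_to_text_sketch(html):
    """Hypothetical helper: HTML string to plain text for search indexing."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)         # coalesce adjacent spaces
    text = re.sub(r" ?\n[ \n]*", "\n\n", text)  # any run of breaks becomes one blank line
    return text.strip()


if __name__ == "__main__":
    print(repr(html_to_text_sketch("<UL><LI>Break</LI></UL><P>Fast")))
    print(repr(html_to_text_sketch("in<EM>line</EM>")))
```

Note that this sketch collapses every run of breaks, including a lone <br>, into a single blank line. That is probably acceptable for a search corpus, but it is a simplification, and the same input/output pairs could serve as test cases for the real method.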
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/geary/-/issues/ Thank you for your understanding and your help.