Bug 713191 - email body search corpus improvements
Status: RESOLVED OBSOLETE
Product: geary
Classification: Other
Component: engine
Version: master
Hardware: Other All
Importance: Low normal
Target Milestone: ---
Assigned To: Geary Maintainers
QA Contact: Geary Maintainers
Depends on:
Blocks: 776330
 
 
Reported: 2013-05-22 12:15 UTC by Charles Lindsay
Modified: 2021-07-05 13:26 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Charles Lindsay 2013-11-21 20:18:59 UTC


---- Reported by chaz@yorba.org 2013-05-21 17:15:00 -0700 ----

Original Redmine bug id: 6980
Original URL: http://redmine.yorba.org/issues/6980
Searchable id: yorba-bug-6980
Original author: Charles Lindsay
Original description:

There are a few improvements we can make to how we add email bodies to the
search table:

  * We may need to be smarter about stripping HTML in general. The algorithm used currently may just need to be tweaked to split words where it's not, or not split where it is.
  * <del>We should be adding the text of any attachments a client might display inline, e.g. email attachments.</del> (Became #7069.)
  * When we strip HTML, we could add in things like alt-text for images. Not sure if this is desirable.

These can be split up into separate tickets as necessary.

Related issues:
related to geary - 6837: Normalize message bodies for full-text search table (Fixed)
related to geary - 7069: index attachments, especially ones we display inline (Fixed)



---- Additional Comments From geary-maint@gnome.bugs 2013-09-04 16:28:00 -0700 ----

### History

#1

Updated by Jim Nelson 6 months ago

  * **Target version** set to _0.4.0_

#2

Updated by Jim Nelson 5 months ago

  * **Assignee** set to _Charles Lindsay_
  * **Priority** changed from _Normal_ to _High_

#3

Updated by Charles Lindsay 5 months ago

  * **Description** updated (diff)

Split out the second bullet point to #7069.

#4

Updated by Charles Lindsay 5 months ago

  * **Assignee** deleted (<strike>_Charles Lindsay_</strike>)
  * **Priority** changed from _High_ to _Low_

The rest of these improvements seem low priority.

#5

Updated by Jim Nelson 3 months ago

  * **Target version** changed from _0.4.0_ to _0.5.0_



--- Bug imported by chaz@yorba.org 2013-11-21 20:19 UTC  ---

This bug was previously known as _bug_ 6980 at http://redmine.yorba.org/show_bug.cgi?id=6980

Unknown milestone "unknown" in product geary.
   Setting to default milestone for this product, "---".
Setting qa contact to the default for this product.
   This bug either had no qa contact or an invalid one.
Resolution set on an open status.
   Dropping resolution 

Comment 1 Michael Gratton 2016-12-20 13:56:08 UTC
A number of these have been addressed in Bug 714317. We'd need to do an FTS rebuild to get them to apply to old messages though.
Comment 2 Michael Hochleitner 2018-07-27 09:32:04 UTC
Questions about the following improvement:

  * We may need to be smarter about stripping HTML in general. The algorithm used currently may just need to be tweaked to split words where it's not, or not split where it is.

Is the algorithm you are talking about the function html_to_text(...) in src/engine/util/util-html.vala line 123?

Can you provide an example of a string containing a word which should not be split and an example of a string containing a word which should be split?
Comment 3 Michael Gratton 2018-07-28 05:39:52 UTC
(In reply to Michael Hochleitner from comment #2)
> Questions about the following improvement:
> 
>   * We may need to be smarter about stripping HTML in general. The algorithm
> used currently may just need to be tweaked to split words where it's not, or
> not split where it is.
> 
> Is the algorithm you are talking about the function html_to_text(...) in
> src/engine/util/util-html.vala line 123?

That's right - that gets called by Geary.RFC822.Message.get_searchable_body() to convert HTML bodies to text for searching.

> Can you provide an example of a string containing a word which should not be
> split and an example of a string containing a word which should be split?

These mostly arise as a result of the semantics of HTML. Typically block elements should cause a word (and line) break, but inline elements shouldn't.

E.g. the HTML string "<UL><LI>Break</LI></UL><P>Fast" should be converted to "Break\n\nFast" not "BreakFast", and similarly "in<EM>line</EM>" should be converted to "inline", not "in line".
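
To make the block-versus-inline rule concrete, here is a minimal sketch in Python (not Geary's Vala code, and not how html_to_text() is actually implemented). The names `BLOCK_TAGS`, `TextExtractor` and `html_to_text_sketch` are hypothetical, and the set of block elements is an assumed, non-exhaustive list.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of block-level elements.
BLOCK_TAGS = {
    "p", "div", "ul", "ol", "li", "br", "blockquote", "pre",
    "table", "tr", "h1", "h2", "h3", "h4", "h5", "h6",
}

BREAK = "\n\n"


class TextExtractor(HTMLParser):
    """Collects text, inserting one break at block element boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _break(self):
        # Emit at most one break between runs of text, and none at the start.
        if self.parts and self.parts[-1] != BREAK:
            self.parts.append(BREAK)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <UL> arrives as "ul".
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        # Skip whitespace-only text that sits between two block elements.
        if not data.strip() and (not self.parts or self.parts[-1] == BREAK):
            return
        self.parts.append(data)


def html_to_text_sketch(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text buffered at the end of the input
    return "".join(parser.parts).strip()
```

With this sketch, inline elements such as `<EM>` contribute no separator at all, while any number of adjacent block boundaries collapse into a single break, which matches the two examples given above.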

As I mention above, this may already be working well enough, so all that might be needed here is adding some unit tests to Geary.HTML.UtilTest to ensure that the plain text produced by that method is doing the right thing, and improving cases where it's a bit suboptimal (coalescing adjacent spaces, for example). There are already a couple of test cases, but they only cover a limited number of cases.
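
As an illustration of the space coalescing mentioned above, a post-processing step could look roughly like the following Python sketch. The function name `coalesce_spaces` is hypothetical and this is not taken from Geary's code; it assumes that line breaks produced by block elements should be preserved and only horizontal whitespace collapsed.

```python
import re


def coalesce_spaces(text):
    # Collapse runs of spaces, tabs and non-breaking spaces within each line,
    # and trim each line, while leaving the line breaks themselves intact.
    lines = [
        re.sub(r"[ \t\u00a0]+", " ", line).strip()
        for line in text.split("\n")
    ]
    return "\n".join(lines)
```

A unit test for the real method could then assert on inputs such as "Break  \n\n  Fast   food", where the expected result keeps the paragraph break but has single spaces between words.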
Comment 4 GNOME Infrastructure Team 2021-07-05 13:26:15 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/geary/-/issues/

Thank you for your understanding and your help.