GNOME Bugzilla – Bug 655383
Tracker does not index text files with other encodings than UTF8
Last modified: 2011-12-14 16:25:47 UTC
I receive many text files from Windows users in the Windows encoding ISO8859-15, and they are not indexed. They open fine in gedit, so I guess gedit is able to detect the encoding. I think it would be nice if Tracker also indexed Windows-encoded text files.
Guessing an arbitrary encoding properly is a huge task, and we don't really want that for every text file extracted: it would be much slower and would easily produce lots of false positives. But we could do best-effort guessing using the same set of encodings as GEdit (UTF-8, the current locale, ISO8859-15 and UTF-16). An approach would be:
* Validate the input as UTF-8; if it is not valid UTF-8, continue.
* Try to convert from the locale encoding to UTF-8; if not possible, continue.
* Try to convert from ISO8859-15 to UTF-8; if not possible, continue.
* Try to convert from UTF-16 to UTF-8; if not possible, fully skip the file.
For UTF-8 encoded files we shouldn't see any performance degradation during indexing, as we already check for UTF-8 validity; and we would also end up indexing files which we didn't index before, as most text files coming from Windows machines are in Latin-9 or UTF-16.
Fixed here: 9b8010c68d0a31b107c71f9a799964ce3c0b1d51
The new logic is:
* Look for UTF-16 BOMs (BE & LE) at the beginning of the string; if found, decode as UTF-16.
* Otherwise, validate the input as UTF-8. If not valid:
** If embedded NUL bytes are found, try to decode as UTF-16 (in host endianness).
** Otherwise, try the locale encoding (if it is not UTF-8).
** Otherwise, try windows-1252.