GNOME Bugzilla – Bug 655383
Tracker does not index text files with other encodings than UTF8
Last modified: 2011-12-14 16:25:47 UTC
I receive many text files from Windows users in the Windows encoding ISO8859-15, and they are not indexed. They open fine in gedit, so I guess gedit is able to detect the encoding. I think it would be nice if Tracker also indexed Windows-encoded text files.
Guessing an arbitrary encoding properly is a huge task, and we don't really want that for every text file extracted: it would be much slower and would easily produce lots of false positives. But we could do best-effort guessing using the same set of encodings as GEdit (UTF-8, the current locale, ISO8859-15 and UTF-16). An approach would be:
* Validate the input as UTF-8; if it is not valid UTF-8, continue.
* Try to convert from the locale encoding to UTF-8; if not possible, continue.
* Try to convert from ISO8859-15 to UTF-8; if not possible, continue.
* Try to convert from UTF-16 to UTF-8; if not possible, fully skip the file.
For UTF-8 encoded files we shouldn't see any performance degradation during indexing, as we already check for UTF-8 validity; and we would also end up indexing files which we didn't index before, as most text files coming from Windows machines are in Latin-9 or UTF-16.
Fixed here: 9b8010c68d0a31b107c71f9a799964ce3c0b1d51
The new logic is:
* Look for UTF-16 BOMs (BE & LE) at the beginning of the string; if found, decode as UTF-16.
* Otherwise, validate the input as UTF-8. If not valid:
** If embedded NUL bytes are found, try to decode as UTF-16 (in host endianness).
** Otherwise, try the locale encoding (if it is not UTF-8).
** Otherwise, try windows-1252.