GNOME Bugzilla – Bug 525911
PDF files containing "linked" pairs of letters badly indexed/non retrievable
Last modified: 2008-04-10 14:09:12 UTC
Dear All,

I am a heavy user of Beagle, which I use to index my whole scientific documentation archive. I recently found that some articles fail to be retrieved even though they contain extractable text. Namely, they are PDF files containing particular words. One example for all: "transesterification", where the "fi" is a single linked character (a ligature, as typographers say it should be). Unfortunately, the "fi" is not converted into separate letters when beagle-extract-content is run on the file, and consequently when I search for "transesterification" in beagle-search this file is not in the list of hits. Please note that if I search for "transesterification" inside Evince, the word is found. I think something should be done about this!

Marco

PS: I put a file which exhibits this problem at http://ingchim.ing.uniroma1.it/users/mbravi/liu2007bpd.pdf
This is basically a dup of bug #168189, which specifically mentions diacritics whereas yours are typographical ligatures, but the idea is the same: they should map to individual, non-accented Latin characters.

*** This bug has been marked as a duplicate of 168189 ***
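
For illustration, here is a minimal sketch of the kind of folding described above, assuming the extracted text is available as a Unicode string. It is not Beagle's code (Beagle itself is not written in Python), and the function name `fold_for_index` is hypothetical. It uses Unicode compatibility decomposition (NFKD), which splits the "fi" ligature (U+FB01) into "f" and "i" and separates accents from their base letters, and then drops the combining marks.

```python
import unicodedata


def fold_for_index(text: str) -> str:
    """Fold ligatures and diacritics to plain base letters (hypothetical helper)."""
    # NFKD splits compatibility characters such as U+FB01 (fi ligature)
    # into "f" + "i", and splits accented letters into base + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_for_index("transesteri\ufb01cation"))  # transesterification
print(fold_for_index("na\u00efve caf\u00e9"))     # naive cafe
```

An indexer would need to apply the same folding to both the indexed text and the search query. Note that NFKD also folds other compatibility characters (for example superscript digits), which may or may not be wanted.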