Bug 525911 - PDF files containing "linked" pairs of letters badly indexed/non retrievable
Status: RESOLVED DUPLICATE of bug 168189
Product: beagle
Classification: Other
Component: General
Version: 0.3.3
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Beagle Bugs
QA Contact: Beagle Bugs
Depends on:
Blocks:
 
 
Reported: 2008-04-03 07:52 UTC by Marco Bravi
Modified: 2008-04-10 14:09 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Marco Bravi 2008-04-03 07:52:05 UTC
Dear All,

I am a heavy user of Beagle, which I use to index my entire archive of scientific documentation.

I recently found that some articles fail to be retrieved even though they contain extractable text. Specifically, they are PDF files containing particular words.

One example for all: "transesterification", where the "fi" is a single linked character, a ligature (as typographers say it should be). Unfortunately, though, the "fi" is not converted into separate letters when beagle-extract-content is run on the file, and consequently when I search for "transesterification" in beagle-search this file is not in the "hits" list.

Please note that if I search for "transesterification" inside evince, the word is found.
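
The mismatch described above can be reproduced outside Beagle. The following is a minimal Python sketch, not Beagle's code: it assumes the extracted text keeps "fi" as the single code point U+FB01 (LATIN SMALL LIGATURE FI), and the sample string and the helper name `find_ligatures` are hypothetical. It shows that a plain substring search for the ASCII spelling misses such a word, and it lists the ligature code points responsible.

```python
import unicodedata

# Assumed extractor output: "fi" stored as the single code point U+FB01.
extracted = "transesteri\ufb01cation of vegetable oils"
query = "transesterification"

def find_ligatures(text):
    """Return (index, character, Unicode name) for each Latin ligature
    in the Alphabetic Presentation Forms range U+FB00..U+FB06."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if "\ufb00" <= ch <= "\ufb06"
    ]

print(query in extracted)         # False: the ASCII query does not match
print(find_ligatures(extracted))  # one entry, for U+FB01 at index 11
```

Running such a check over the output of beagle-extract-content would be one way to confirm which words in a given PDF are affected.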

I think something should be done about this!

Marco

PS I put the file which exhibits this problem at http://ingchim.ing.uniroma1.it/users/mbravi/liu2007bpd.pdf
Comment 1 Joe Shaw 2008-04-10 14:09:12 UTC
This is basically a dup of bug #168189, which specifically mentions diacritics whereas yours are typographical ligatures, but the idea is the same: they should map to individual, non-accented Latin characters.
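
The mapping described in this comment can be sketched with Unicode normalization. The Python snippet below is an illustration under assumptions, not Beagle's implementation, and the function name `fold_to_plain_latin` is hypothetical. It applies NFKD compatibility decomposition, which splits ligatures such as U+FB01 into separate letters and separates base letters from their accents, and then drops the combining marks.

```python
import unicodedata

def fold_to_plain_latin(text):
    """Decompose ligatures and accented letters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_to_plain_latin("transesteri\ufb01cation"))  # transesterification
print(fold_to_plain_latin("caf\u00e9"))                # cafe
```

Applying the same folding to both the indexed text and the search terms would let either spelling match. Letters that have no Unicode decomposition (for example "ø") are left unchanged by this approach and would need a separate mapping.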

*** This bug has been marked as a duplicate of 168189 ***