Bug 525911 – PDF files containing "linked" pairs of letters badly indexed/non retrievable

Summary:	PDF files containing "linked" pairs of letters badly indexed/non retrievable


Status:	RESOLVED DUPLICATE of bug 168189

Product:	beagle
Classification:	Other
Component:	General
Version:	0.3.3
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Beagle Bugs
QA Contact:	Beagle Bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-04-03 07:52 UTC by Marco Bravi
Modified:	2008-04-10 14:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Marco Bravi 2008-04-03 07:52:05 UTC

Dear All,

I am a heavy user of beagle, which I use to index my entire scientific documentation archive.

I recently found that some articles fail to be retrieved even though they contain extractable text. Specifically, they are PDF files containing particular words.

One word for all: "transesterification", where the "fi" is a single linked character (as typographers say it should be). Unfortunately, the "fi" is not converted into separate letters when beagle-extract-content is run on the file, and consequently, when I search for "transesterification" in beagle-search, this file is not in the "hits" list.

Please note that if I search for "transesterification" inside evince, the word is found.

I think something should be done about this!

Marco

PS I put a file which exhibits this problem at http://ingchim.ing.uniroma1.it/users/mbravi/liu2007bpd.pdf

Comment 1 Joe Shaw 2008-04-10 14:09:12 UTC

This is basically a dup of bug #168189, which specifically mentions diacritics whereas yours are typographical ligatures, but the idea is the same: they should map to individual, non-accented Latin characters.

*** This bug has been marked as a duplicate of 168189 ***
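
For illustration only, the mapping described in comment 1 (ligatures and accented letters folded to individual, non-accented Latin characters) can be sketched with Unicode compatibility decomposition. This is not Beagle's code: the function name fold_for_index is hypothetical, and the sketch assumes the PDF extractor emits the single code point U+FB01 for the "fi" ligature, which would explain why a search for the plain spelling misses the file.

```python
import unicodedata


def fold_for_index(text: str) -> str:
    """Fold ligatures and diacritics to plain letters (hypothetical helper)."""
    # NFKD compatibility decomposition splits U+FB01 (the "fi" ligature)
    # into "f" + "i", and splits accented letters into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Assumed extractor output: the word as it might come out of the PDF,
# with the ligature stored as one code point.
extracted = "transesteri\ufb01cation"

# The plain query does not match the raw extracted text...
assert "transesterification" not in extracted
# ...but it does match once the text is folded before indexing.
assert fold_for_index(extracted) == "transesterification"
# The same folding covers the diacritics case from bug 168189.
assert fold_for_index("caf\u00e9") == "cafe"
```

Applying the same folding to both the indexed text and the search query would keep the two consistent. Note that NFKD also rewrites other compatibility characters (for example superscript digits), which may or may not be desirable in an index.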