GNOME Bugzilla – Bug 166285
Disable text search for PDFs without searchable text
Last modified: 2009-11-14 10:26:29 UTC
Search does not work for a few PDFs that I have, most of them work-related. I'm trying to find a suitable PDF that I could attach as an example. I found that search also fails on the first attachment of bug 112506 (attachment #16336), but I don't know if it's for the same reason (the PDFs I have where search doesn't work do not use Type 3 fonts).
This seems to be a pattern with documents generated by "dvips(k) 5.86 Copyright 1999 Radical Eye Software"
Search and text selection don't work at all in documents with type3 fonts. Even acroread is not able to search/select text. It is not a bug, it's a feature ;-). Sorry if I'm saying something obvious.
The PDF I'm trying to search has been created (in MS Word on windows) by PScript5.dll and produced by GNU Ghostscript 7.06 (PDF v.1.3). Fonts are built-in TrueType. Note that I cannot search the PDF with acroread either, so it might indeed be a feature ;) I cannot attach it on the website, but I could send it to someone if it helps.
Ghostscript versions prior to 8 do embed both TrueType and Type1 fonts, but text set in TrueType fonts carries no text information. It works as expected and it is also a nice feature ;-) To avoid this, you should use either Ghostscript version 8.50 (AFPL) or 8.15 (GNU) to get searchable text with embedded TrueType fonts. Integration work for GNU Ghostscript 8.15 in the ESP code (http://www.cups.org/espgs/index.php) has begun, and I guess the release could be ready by the end of this month (although this is probably only a personal wish).
Evince guys: would it be possible to detect such a situation and disable the search? It could be annoying for the user, but it's better than letting him believe the search is working when it is not.
Pablo: thanks for the info!
Vincent: I'm glad to read that it helped. Your report raises a very interesting point, which I hope to state clearly. Many PDF documents out there do have type3 fonts, and it is not always easy or even possible to get the TeX or DVI source. As far as I remember, acroread rendered type3 fonts horribly before version 7 (or 6, I'm not sure). Type1 fonts are interesting not only for better display, but mainly for searching and selecting text from the PDF document. And this is sometimes essential for some PDF documents. gpdf has implemented a display mode for some PDF documents that use the Computer Modern type3 font, which renders the text using the standard TrueType font. This would be great to implement in evince (although Martin will know the problem better), as would the possibility of generating a copy of the file using CM type1 fonts instead of type3 fonts.
I'm changing the bug title, as it appears this is due to the way a PDF is created.
Hmm, so if I get this correctly, some PDFs don't have text information within. I'm not sure disabling the find menu/control would be clearer than what we have now. Also, I think checking this would be equivalent to doing a search, so it could slow things down a bit. What about displaying "The document has no text" or something like that in the search status bar when trying to search?
[Sorry for repeating what I have already written] The problem with Ghostscript versions prior to 8 was that they had problems handling text information when using TrueType fonts (this was fixed in version 8). So what you get when you copy/paste text is garbage. Adobe Reader has search/copy text enabled for these documents (and it doesn't seem to be problematic). What seems more interesting to me is a feature that (I think) I saw in previous versions of gpdf (Martin surely knows about this) that rendered a type3 document using type1 or truetype fonts. This would be very interesting, not for displaying the characters but for handling the text information more properly.
I like marco's idea better than disabling the item.
> What about displaying "The document has no text" or something like that in
> the search status bar when trying to search?
I'd go with something like "The document text cannot be searched". Of course it would be better if it worked instead of not working :)
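For illustration only, the status-bar idea above could be sketched roughly as follows. This is not Evince code: `search_status` and its `page_texts` argument (one already-extracted text string per page) are hypothetical names, and the "found" / "Not found" strings are placeholders. Only the "cannot be searched" wording comes from the comment above.

```python
from typing import List

# Wording proposed in this thread for documents without a text layer.
NO_TEXT_MESSAGE = "The document text cannot be searched"


def search_status(page_texts: List[str], query: str) -> str:
    """Return a status-bar message for a search over extracted page text.

    page_texts is a hypothetical input: one string per page, as produced
    by whatever text extraction the viewer already performs.
    """
    # If every page yields empty or whitespace-only text, report that the
    # document cannot be searched instead of a misleading "Not found".
    if not any(text.strip() for text in page_texts):
        return NO_TEXT_MESSAGE

    # An empty query matches nothing useful.
    if not query:
        return "Not found"

    # Simple case-insensitive substring count across all pages.
    needle = query.lower()
    hits = sum(text.lower().count(needle) for text in page_texts)
    return f"{hits} found" if hits else "Not found"
```

Note that a check like this only catches documents that yield no text at all. Documents whose extracted text is garbage (as described for Ghostscript versions prior to 8) would still pass it.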
*** Bug 321177 has been marked as a duplicate of this bug. ***
Note that broken documents that evince can't search properly should also display this message. See the attachment to bug 321177 for an example of such a document.
Can we identify documents produced by a buggy Ghostscript? In that case, is there some magic we can do to make the documents searchable? Would documents produced by later Ghostscripts be searchable? Why are some documents not searchable - do they store glyph IDs within a font and there's no way to get back the characters? If this all sounds very naive, it's because I don't know how PDF works :)
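As a rough sketch of the first question, one heuristic would be to look at the PDF's Producer string (for example "GNU Ghostscript 7.06", as reported earlier in this bug) and flag Ghostscript versions below 8. The function names below are hypothetical, and the sketch assumes the Producer string has already been read from the document. It is only a hint: other comments in this bug report unsearchable output from other producers too.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "GNU Ghostscript 7.06" or "ESP Ghostscript 8.15.1";
# only the major and minor numbers are captured.
_GS_VERSION = re.compile(r"Ghostscript\s+(\d+)\.(\d+)", re.IGNORECASE)


def ghostscript_version(producer: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a Producer string, or None if absent."""
    match = _GS_VERSION.search(producer)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pre8_ghostscript(producer: str) -> bool:
    """True if the Producer string names a Ghostscript version below 8."""
    version = ghostscript_version(producer)
    return version is not None and version[0] < 8
```

A viewer could use such a check to decide when to warn that search results may be unreliable, rather than to disable search outright.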
Created attachment 59956 [details] libgnomeprint generated PDF
Federico, I don't know what is wrong with GS < 8 and the embedding of non-standard TrueType fonts. But GS > 8 produces PDF documents with searchable text only on Windows. In GNOME (with ESP Ghostscript 8.15.1), using the “Create PDF document” option from the Print dialog of gedit generates a PDF document with no searchable text (see attached gedit-output.pdf). I thought it was an ESP GS bug and I filed a bug (http://www.cups.org/espgs/str.php?L1325+P0+S-2+C0+I10+E0+Q). I was asked to provide the commands to generate the PDF file. Since it is libgnomeprint that invokes GS, I can't provide them. Is there any way to find out how libgnomeprint invokes GS, in order to report this to the ESP GS developers and check whether ESP GS or libgnomeprint is the buggy application? Thanks, Pablo
*** This bug has been marked as a duplicate of bug 596888 ***