GNOME Bugzilla – Bug 741008
Copy/paste to text has lines out of order
Last modified: 2018-05-22 15:59:22 UTC
The PDF has two columns, alphabetical, with column one on a page followed by column two. Select All, Copy, then Paste into a text file. Each column line results in a left-justified line in the text file, but to some extent the left and right columns are intermingled. (I'm using Ubuntu 14.04 with Evince Document Viewer 3.10.3.) A friend tried it with Acrobat Reader on the same file and his text file came out with column 1 followed by column 2. The PDF cannot be released (it is private).
Unless you provide an example PDF, we can't help you. Most probably this is a problem in poppler, the underlying library used to read PDFs.
A similar problem is exhibited when copy/pasting to a text file from: www.dyerchamber.com/images/Membership_Directory.pdf Though in this case it also appears (from modest examination) that the first line of some left-side records (~10%?) is on the same line in the text file as the last line of a record on the right side (i.e. a missing newline).
Does Acrobat also read this correctly? In general, PDF files used to NOT have any textual information other than "this glyph goes here". So we use a layout heuristic borrowed from OCR to reconstruct the text from the glyphs. Since it's a heuristic, it will never be 100% accurate (but any examples like this where the heuristic is clearly doing a poor job are welcome). Newer PDF files use structural information and do contain the text. I wonder whether your file has this information or not (although structural information is not yet supported by Evince).
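To illustrate why the reading order chosen by a layout heuristic matters on a two-column page, here is a minimal sketch. It is not poppler's actual algorithm: the `Word` tuples, the coordinates, and the fixed `split_x` column boundary are all invented for illustration.

```python
from typing import List, Tuple

# Each word is (x, y, text): x is the left edge, y is the baseline measured
# from the top of the page. These coordinates are invented for illustration.
Word = Tuple[float, float, str]

def row_major(words: List[Word]) -> List[str]:
    """Group words by baseline, then read each line left to right."""
    lines = {}
    for x, y, text in words:
        lines.setdefault(y, []).append((x, text))
    return [" ".join(t for _, t in sorted(lines[y])) for y in sorted(lines)]

def column_major(words: List[Word], split_x: float) -> List[str]:
    """Read everything left of split_x first, then everything right of it."""
    left = [w for w in words if w[0] < split_x]
    right = [w for w in words if w[0] >= split_x]
    return row_major(left) + row_major(right)

# A hypothetical two-column page with two entries per column.
page = [
    (50, 100, "Acme"), (300, 100, "Mills"),
    (50, 120, "Baker"), (300, 120, "Nolan"),
]
print(row_major(page))          # left and right columns share each line
print(column_major(page, 200))  # column one first, then column two
```

A real heuristic has to discover the column boundary from the glyph positions instead of being handed `split_x`, which is where it can go wrong on pages like the ones reported here.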
The copy/paste from Acrobat reader does seem to be "correct" (regrettably no blank line between records). I put the result here: http://www.justcomm.org/temp/dyer-acro.txt I looked at our pdf file with a text editor and it was mostly not text. None viewable with a pdf reader that I could see. I searched for a few strings I knew were in the content and did not find them. Our rendered PDF has 30 pages. Viewed as text it had 45 lines that start like this one: <</Filter/FlateDecode/Length 752>>stream
(In reply to comment #4)
> The copy/paste from Acrobat reader does seem to be "correct"
> (regrettably no blank line between records). I put the result here:
> http://www.justcomm.org/temp/dyer-acro.txt

Semantically speaking, the main difference is:
* Acrobat selects the text by column
* Evince/Poppler selects the text by row

Another difference is that Acrobat adds a '\n' after a cell and Evince/Poppler does not, so words from two different lines (rows) end up joined together.

> I looked at our pdf file with a text editor and it was mostly
> not text. None viewable with a pdf reader that I could see.
> I searched for a few strings I knew were in the content and
> did not find them. Our rendered PDF has 30 pages.
> Viewed as text it had 45 lines that start like this one:
> <</Filter/FlateDecode/Length 752>>stream

You can use pdftk to uncompress a PDF, something like:

$ pdftk input.pdf output uncompressed-input.pdf uncompress
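If pdftk is not at hand, the FlateDecode streams quoted above can also be inflated with a few lines of Python. This is a rough sketch, not a PDF parser: the `inflate_streams` helper is hypothetical, it finds streams by a naive keyword search, and it silently skips anything that zlib cannot decode.

```python
import re
import zlib

def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed bodies of streams that zlib can inflate."""
    out = []
    # A stream body sits between the "stream" and "endstream" keywords.
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj ignores the newline that trails the compressed data.
            out.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            pass  # not Flate data, or a filter this sketch does not handle
    return out

# A made-up fragment shaped like the dictionary line quoted in the report.
body = zlib.compress(b"BT (Hello) Tj ET")
sample = b"<</Filter/FlateDecode/Length %d>>stream\n" % len(body) + body + b"\nendstream"
print(inflate_streams(sample))
```

Even after inflating, searching for known strings may still fail, because text in a content stream can be written in a font-specific encoding instead of plain ASCII.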
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/evince/issues/531.