Bug 741008 – Copy/paste to text has lines out of order



Summary:	Copy/paste to text has lines out of order


Status:	RESOLVED OBSOLETE

Product:	evince
Classification:	Core
Component:	PDF
Version:	3.10.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Evince Maintainers
QA Contact:	Evince Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-12-02 04:50 UTC by Fred H Olson
Modified:	2018-05-22 15:59 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Fred H Olson 2014-12-02 04:50:04 UTC

The PDF has two columns. Entries are alphabetical down column one of a
page, then continue in column two.

Select All, Copy then Paste into a text file.                                   

Each column-line results in a left justified line in                            
the text file but to some extent the left and right columns                     
are intermingled. (I'm using Ubuntu 14.04 with Evince Document                  
Viewer 3.10.3 )                                                                 

A friend tried it with Acrobat Reader on the same file and                      
his text file came out with column 1 followed by column 2.

The PDF cannot be released (it is private).

Comment 1 José Aliste 2014-12-03 01:47:55 UTC

Unless you provide an example PDF, we can't help you. Most probably this is a problem in poppler, the underlying library used to read PDFs.

Comment 2 Fred H Olson 2014-12-03 03:34:51 UTC

A similar problem is exhibited copy/pasting to a text file from:

www.dyerchamber.com/images/Membership_Directory.pdf

Though in this case it also appears (from modest examination)
that the first line of some left side records (~10%?)
is on the same line in the text file as the last line
of a record on the right side (i.e. a missing newline).

Comment 3 José Aliste 2014-12-03 12:48:43 UTC

Does Acrobat also read this correctly? In general, PDF files used to NOT have any textual information other than "this glyph goes here". So we use a layout heuristic borrowed from OCR to reconstruct the text from the glyphs. Since it's a heuristic, it will never be 100% accurate (but any examples like this where the heuristic is clearly doing a poor job are welcome).


Newer PDF files use structural information and do contain the text. I wonder whether your file has this information or not (although structural information is not yet supported by Evince).
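
One rough way to check whether a file carries structural information is to look for the /StructTreeRoot key, which the document catalog of a tagged PDF uses to point at its structure tree. The sketch below is only a first approximation: the function name is made up for illustration, and a False result proves nothing, because in newer files the catalog can sit inside a compressed object stream where a raw byte search will not see it.

```python
def mentions_structure_tree(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain a /StructTreeRoot key.

    Hypothetical helper: a plain byte search, not a PDF parser. It can
    miss the key when the catalog is stored in a compressed object stream.
    """
    return b"/StructTreeRoot" in pdf_bytes


# Tiny hand-written catalog fragments, not real PDF files.
tagged = b"<< /Type /Catalog /Pages 2 0 R /StructTreeRoot 5 0 R >>"
untagged = b"<< /Type /Catalog /Pages 2 0 R >>"
print(mentions_structure_tree(tagged), mentions_structure_tree(untagged))
```

To try it on a real file, read the file in binary mode and pass the bytes in; uncompressing the PDF first makes the check somewhat more reliable.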

Comment 4 Fred H Olson 2014-12-04 00:22:38 UTC

The copy/paste from Acrobat reader does seem to be "correct" 
(regrettably no blank line between records). I put the result here:
http://www.justcomm.org/temp/dyer-acro.txt

I looked at our pdf file with a text editor and it was mostly
not text. None viewable with a pdf reader that I could see.  
I searched for a few strings I knew were in the content and 
did not find them. Our rendered PDF has 30 pages.
Viewed as text it had 45 lines that start like this one:
<</Filter/FlateDecode/Length 752>>stream

Comment 5 Germán Poo-Caamaño 2014-12-04 03:07:44 UTC

(In reply to comment #4)
> The copy/paste from Acrobat reader does seem to be "correct" 
> (regrettably no blank line between records). I put the result here:
> http://www.justcomm.org/temp/dyer-acro.txt

Semantically speaking, the main difference is:
* Acrobat selects the text by column
* Evince/Poppler selects the text by row

Another difference is: Acrobat adds a '\n' after a cell, Evince/Poppler does not, which ends up joining words from two different lines (rows).
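
To make the by-row versus by-column difference concrete, here is a minimal sketch. It is not how Poppler or Acrobat actually work internally. It assumes each piece of text is known only by its position on the page, and the Fragment type, the coordinates, and the split_x column boundary are all invented for illustration.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment on the page
    y: float   # distance from the top of the page
    text: str


def row_order(frags: List[Fragment]) -> List[str]:
    """Read top to bottom, left to right: the columns get interleaved."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def column_order(frags: List[Fragment], split_x: float) -> List[str]:
    """Read the whole left column first, then the whole right column."""
    left = sorted((f for f in frags if f.x < split_x), key=lambda f: f.y)
    right = sorted((f for f in frags if f.x >= split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


# A made-up two-column page: left column A-B, right column M-N.
page = [
    Fragment(50, 100, "Adams"),
    Fragment(300, 100, "Miller"),
    Fragment(50, 120, "Baker"),
    Fragment(300, 120, "Nelson"),
]

print(row_order(page))          # ['Adams', 'Miller', 'Baker', 'Nelson']
print(column_order(page, 200))  # ['Adams', 'Baker', 'Miller', 'Nelson']
```

The hard part in practice is deciding where the column boundary is, which is what the layout heuristic has to guess; here it is simply passed in.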

> I looked at our pdf file with a text editor and it was mostly
> not text. None viewable with a pdf reader that I could see.  
> I searched for a few strings I knew were in the content and 
> did not find them. Our rendered PDF has 30 pages.
> Viewed as text it had 45 lines that start like this one:
> <</Filter/FlateDecode/Length 752>>stream

You can use pdftk to uncompress a PDF, something like:

$ pdftk input.pdf output uncompressed-input.pdf uncompress
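
If pdftk is not at hand, the FlateDecode streams can also be inflated with a few lines of Python, since FlateDecode is zlib/deflate compression. This is only a rough sketch: it scans for stream ... endstream with a regular expression instead of parsing the file, tries zlib on every stream regardless of its declared filter, and the function name is made up. Even after inflating, the text may not be searchable, because strings in a content stream can be written in font-specific glyph codes.

```python
import re
import zlib
from typing import List

# Naive scan for "stream ... endstream" bodies; a real parser would
# use the /Length entry of each stream dictionary instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the zlib-inflated body of every stream that inflates cleanly."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and the endstream keyword.
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not deflate data (e.g. an image in another format)
    return results


# Hand-built fragment that imitates one compressed object.
sample = (
    b"1 0 obj\n<</Filter/FlateDecode/Length 24>>stream\n"
    + zlib.compress(b"BT (Hello) Tj ET")
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(sample))
```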

Comment 6 GNOME Infrastructure Team 2018-05-22 15:59:22 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/evince/issues/531.