GNOME Bugzilla – Bug 325189
text selection doesn't follow columns
Last modified: 2015-06-23 22:36:51 UTC
Please describe the problem: When selecting text in a PDF document with columns, text is selected in both columns simultaneously.
Steps to reproduce:
1. Open a PDF document which contains columns (like http://www.csr-asia.com/upload/csrasiaweeklyvol1week48a.pdf)
2. Try to select a piece of text from the left column
Actual results: Text is selected in the left _and_ the right column
Expected results: Only text in the left column is selected
Does this happen every time? yes
Other information: example pdf: http://www.csr-asia.com/upload/csrasiaweeklyvol1week48a.pdf
Thanks, really a problem. It's certainly not easy to fix, but let's hope someone will work it out.
*** Bug 333967 has been marked as a duplicate of this bug. ***
*** Bug 372908 has been marked as a duplicate of this bug. ***
*** Bug 325457 has been marked as a duplicate of this bug. ***
*** Bug 360722 has been marked as a duplicate of this bug. ***
*** Bug 481825 has been marked as a duplicate of this bug. ***
*** Bug 494078 has been marked as a duplicate of this bug. ***
*** Bug 500352 has been marked as a duplicate of this bug. ***
*** Bug 507523 has been marked as a duplicate of this bug. ***
*** Bug 514150 has been marked as a duplicate of this bug. ***
Which is the upstream bug? This one? https://bugs.freedesktop.org/show_bug.cgi?id=3188 and depending on bug #165155 ?
*** Bug 526379 has been marked as a duplicate of this bug. ***
I have the same problem, for example with this PDF file: http://www.dehn.de/www_DE/PDF/blitzplaner08_e/Chapters/BBP_E_Chapter_07.pdf ... but it is almost the rule: in almost every PDF file that is divided into multiple columns, text is not selected properly :(
Does this bug still exist?
Sure, and it's pretty annoying, it's a poppler bug though.
Ah. Should we be relogging it somewhere else?
I think it's already in poppler bugzilla. https://bugs.freedesktop.org/show_bug.cgi?id=4006
Still occurs with 2.24.1.
I've posted a patch for this upstream in https://bugs.freedesktop.org/show_bug.cgi?id=3188 - I can't get the jhbuilt-evince to open pdfs though (even before the patch) and had to test with epdfview. It'd be great if someone could test if the bug fix works in evince too?
*** Bug 582415 has been marked as a duplicate of this bug. ***
*** Bug 588476 has been marked as a duplicate of this bug. ***
Brian, I just tested your patch against poppler 0.12. Selecting text in evince and pasting into a text editor more or less works, but the actual selection in evince sometimes behaves a little weird. The effect is hard to describe, but when I move the mouse, the highlight showing the selected text sometimes behaves in ways that are unexpected to me at least. I tested with the HE-News-Winter-2009.pdf document attached at https://launchpad.net/ubuntu/+source/poppler/+bug/33288.
The patch makes the selection follow reading order. What's happened with that document is that the reading order has been badly misidentified (so /any/ text extraction from that document with poppler will look odd). The reading order it has inferred for page 1 is this:
Col 1: para 1, 2, 3
Col 2: para 1
Col 3: whole column, in correct order
Col 4: para 6 5 3 4 1 2 (!)
Col 2: para 2
Col 1: para 4
Remainder of Col 2, Remainder of Col 1.
You can confirm this by starting a selection in the first para and moving the mouse into each of those paragraphs: you'll see previous paras in that list remain selected. This is obviously nonsense, but it is a separate bug from not being able to select in reading order at all. IIRC the issue with this particular document is the very ragged right justification: poppler is attempting to identify columns line by line, and the varying column gap triggers this bad behaviour. The suggestion is that we take a hint from ocropus and identify gutters first; it's a more robust way of finding columns. NB there will always be pathological document examples. If we attempt to use rectangular gutters, documents that flow text around circular inclusions will not work well, for example.
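To illustrate the "gutters first" idea, here is a minimal Python sketch. It is not poppler's or ocropus's actual code. It assumes word bounding boxes are available as (x0, y0, x1, y1) tuples in page units; the names find_gutters and column_of and the min_width parameter are hypothetical. Coverage is accumulated over the whole page rather than line by line, so a ragged right edge only narrows the detected gutter instead of moving it from line to line.

```python
from typing import List, Tuple

# Hypothetical word box: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def find_gutters(boxes: List[Box], page_width: int,
                 min_width: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges that no word box overlaps.

    Only ranges at least min_width wide are kept, and empty ranges that
    touch the page edges (the margins) are ignored.
    """
    covered = [False] * page_width
    for x0, _, x1, _ in boxes:
        # Mark every integer x position the word spans, over all lines.
        for x in range(max(0, int(x0)), min(page_width, int(x1) + 1)):
            covered[x] = True

    gutters = []
    run_start = None
    for x in range(page_width + 1):
        empty = x < page_width and not covered[x]
        if empty and run_start is None:
            run_start = x
        elif not empty and run_start is not None:
            is_margin = run_start == 0 or x == page_width
            if x - run_start >= min_width and not is_margin:
                gutters.append((run_start, x))
            run_start = None
    return gutters


def column_of(box: Box, gutters: List[Tuple[int, int]]) -> int:
    """Column index of a word: the number of gutters left of its centre."""
    centre = (box[0] + box[2]) / 2
    return sum(1 for _, end in gutters if centre >= end)
```

With columns assigned up front, selection could be restricted to words whose column index matches the one where the drag started. As noted above, rectangular gutters like these would still fail on text flowing around a circular inclusion.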
Oh, I see. Another thing I just noticed is that the patch makes Evince segfault when loading pdf files with no text.
Thanks for spotting that. Code was missing an 'if (!flows) return;' at the start of TextPage::visitSelection. I'll follow up with a replacement patch at freedesktop.org
I've uploaded an updated patch series to https://bugs.freedesktop.org/show_bug.cgi?id=3188 , with corrections to selection and reading order. Those of you who can apply these and rebuild evince might want to give this a go? Comments over there please! For me it fixes up selection for most (but not all) of the documents on the various dupes of this bug (including the one Johan mentions above). Caveats: doesn't cover RTL or documents with rotated text. BTW is there a pool of test documents for evince? Poppler doesn't seem to have any unit testing going on at all, I could do with seeing some RTL docs.
I've built and tested the patches on poppler_0.12.0-0ubuntu2.1 and it is definitely an improvement. Deb packages for Ubuntu 9.10 (nil safety included) are available from https://launchpad.net/~arand/+archive/poppler. For more info see the freedesktop bug linked by Brian Ewins above.
Created attachment 206874 [details] A Pdf with Hebrew on Two Columns. The text is Free and it can be added to the test Pool
Please make this bug high priority. It has been here for a long time, and I am afraid that anyone working with scientific articles who bumps into this bug will be frustrated with GNOME. This little issue has prevented me from delivering a Live CD to fellows of mine, because once they get a really bad impression of GNOME, it becomes hard to convince them that Free Software can do a good job. Cheers, Oz
Ok, it seems like with evince 3.2 the situation is a bit better (libpoppler 0.16.7 on Debian). However, there are still some old PDFs where text selection is not done correctly (I can do the selection properly with Okular, Foxitreader and Acroread on the same system). See for example the article I attached.
This is a poppler bug, and you can see it in Okular too (just checked... you need to use the text selection tool). Unfortunately, there is not much we can do, as the code in question is just a heuristic (the PDF spec does not properly address text copying and text selection).
Created attachment 206916 [details] Article with 2 columns Here is an article which is currently not working correctly with Evince (v.3.2).
@Jose, thanks for the reply. Yep, you are right. The text selection tool in Okular is as dumb as Evince's. However, until now I didn't even know it existed. I always use the "Area select tool", which does not have this buggy behavior, so I can select the text properly in Okular (Okular Version 0.12.5 Using KDE Development Platform 4.6.5 (4.6.5)). Yes, I already realized that this is a bug in poppler. However, the selection mechanism in Okular is different. And it seems that the guys behind Foxit and Adobe know something we don't, because their text selection tool is working. Until this bug is fixed in poppler, can we have an area selection tool in evince like in Okular?
(In reply to comment #31)
> this is a poppler bug, and you can see it in Okular too (just checked... you
> need to use the text selection tool). Unfortunately, there is not so much we
> can do as the code in question is just an heuristic (as the pdf spec does not
> involve text copy and text selection properly)
How does the current heuristic work? For a simple method that should work in most cases, I would suggest building a histogram of inter-word spacings, and using the Otsu binary thresholding method (extremely fast and parameter-free) to separate the inter-word spacing into two peaks -- then look to see if the between-peak variance is significantly higher than the sum of within-peak variances, and if it is, then you have N columns with approximately equal spacing between them, and where the spacing is much greater than the word size.
Another heuristic to apply as a last resort is to simply only select words within the user's dragged box, and nothing outside it. That would at least let a user select one column at a time without words from adjacent columns being pulled in. I don't think it's unreasonable to expect a user to drag a box over the entire text region they want to select.
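Both suggestions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything poppler implements. It assumes the horizontal gaps between neighbouring words on each line have already been measured, and it runs Otsu's method over the sorted raw gap values instead of a binned histogram. The names otsu_split, has_column_gaps and words_in_box are hypothetical, and so is the ratio cutoff: Otsu's threshold itself needs no parameter, but deciding what counts as "significantly higher" does.

```python
from typing import List, Tuple

# Hypothetical word record: (text, x0, y0, x1, y1) in page units.
Word = Tuple[str, float, float, float, float]


def otsu_split(gaps: List[float]) -> Tuple[float, float, float]:
    """Best two-class split of the gaps by Otsu's criterion.

    Returns (threshold, between_class_variance, within_class_variance),
    where the within-class variance is the weighted sum over both classes.
    """
    values = sorted(gaps)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two gaps")
    total = sum(values)
    best = (values[0], 0.0, 0.0)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # only split between distinct values
        w0 = i / n
        m0 = left_sum / i
        m1 = (total - left_sum) / (n - i)
        between = w0 * (1 - w0) * (m0 - m1) ** 2
        if between > best[1]:
            within = (sum((v - m0) ** 2 for v in values[:i])
                      + sum((v - m1) ** 2 for v in values[i:])) / n
            best = ((values[i - 1] + values[i]) / 2, between, within)
    return best


def has_column_gaps(gaps: List[float], ratio: float = 10.0) -> bool:
    """True if the gaps form two well-separated groups (words vs. columns).

    ratio is an assumed cutoff for "significantly higher".
    """
    _, between, within = otsu_split(gaps)
    return between > ratio * within


def words_in_box(words: List[Word],
                 box: Tuple[float, float, float, float]) -> List[Word]:
    """Last-resort selection: keep only words lying fully inside the box."""
    bx0, by0, bx1, by1 = box
    return [w for w in words
            if w[1] >= bx0 and w[2] >= by0 and w[3] <= bx1 and w[4] <= by1]
```

If has_column_gaps returns True, any gap wider than the returned threshold would be treated as a column break instead of a word space. The box fallback needs no layout analysis at all, at the cost of requiring the user to drag over the whole region.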
I confirm this issue is still present with the document provided in comment 32, in evince 3.10 on Ubuntu 14.04. When one tries to select a piece of text in the first column, the corresponding lines in the second column are also selected. Indeed, Foxit Reader performs far better on this file, so there should be a better solution than the current one.
As has been mentioned before, this is a bug in poppler. Closing this one as NOTGNOME.