GNOME Bugzilla – Bug 341947
Compatibility characters should be normalized in copied text
Last modified: 2007-01-27 17:31:56 UTC
When a document contains ligatures, e.g. ﬁ (fi), this messes up copying in two ways:

1. The text cannot be pasted into clients with restricted character sets (e.g. xterm, non-GTK apps).
2. In Unicode-aware apps, ligatures are not readily editable. Typically, positioning the cursor to the right of an fi ligature and hitting delete will delete both alphabetic characters, which is unexpected.

Documents generated by TeX often contain fi ligatures. The simplest way to fix this would be to convert copied text to normalization form NFKC. This is an easy patch.
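For illustration, the effect of NFKC on compatibility ligatures can be checked with Python's standard `unicodedata` module. This is only a sketch of the normalization step, not the attached patch (which targets evince itself); the helper name `normalize_copied_text` and the sample string are made up for this example.

```python
import unicodedata


def normalize_copied_text(text: str) -> str:
    """Return text in normalization form NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Sample words written with the compatibility ligatures
# U+FB01 (fi), U+FB03 (ffi), U+FB04 (ffl) and U+FB00 (ff).
sample = "classi\ufb01catie, ko\ufb03epauze, sou\ufb04eren, tre\ufb00en"

print(normalize_copied_text(sample))
```

After normalization each ligature is replaced by its plain ASCII letters, so the result can be pasted into a restricted client and edited one character at a time.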
Created attachment 65575: evince-copy-normalized.patch
See also https://bugs.freedesktop.org/show_bug.cgi?id=7002 for an upstream approach.
Should it be resolved as notgnome then?
That's poppler only; evince has other backends which might not normalise their text.
I would love to see this bug fixed. All my LaTeX files use a beautiful font with several ligatures. Copy/pasting text results in a mess. Could someone knowledgeable review the patch?
Created attachment 76318: Sample PDF file with ligatures

Ed, I've tried your patch and it does not work when copy/pasting text from the document. The attached PDF file shows some ligatures.
Created attachment 76386: Recreated sample PDF file

Wouter, do you know how your PDF was created? I tried to reproduce it using standard LaTeX and got this file, which works fine.

LaTeX:

    \documentclass{article}
    \begin{document}
    The list below shows some ligatures in Dutch words:
    \begin{itemize}
    \item classificatie: fi
    \item koffiepauze: ffi
    \item souffleren: ffl
    \item treffen: ff
    \end{itemize}
    The same list typeset in \emph{italics}:
    \begin{itemize}
    \item \emph{classificatie: fi}
    \item \emph{koffiepauze: ffi}
    \item \emph{souffleren: ffl}
    \item \emph{treffen: ff}
    \end{itemize}
    \end{document}

Command:

    $ pdflatex ligature-repro.tex
Wouter, I don't think that poppler is even seeing the ligatures as actual text in your file. e.g. if I search for souffleren I get no hits.
Hm. Looking at the TextBlock struct for the problematic PDF, the first bulleted line has:

    (gdb) p blocks[1]->lines[0]->len
    $16 = 15
    (gdb) x/15 blocks[1]->lines[0]->text
    0x82b53c0: 0x00002022 0x00000020 0x00000063 0x0000006c
    0x82b53d0: 0x00000061 0x00000073 0x00000073 0x00000069
    0x82b53e0: 0x00000020 0x00000063 0x00000061 0x00000074
    0x82b53f0: 0x00000069 0x00000065 0x0000003a

That's: "• classi catie:", i.e. poppler is not seeing the ligatured glyphs at all. This must be a poppler problem, no?
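To double-check the reading of that dump, the 15 code points can be decoded by hand. The snippet below is just a convenience sketch in Python; the values are copied from the gdb output, and the variable names are made up.

```python
# Code points copied from the gdb dump of blocks[1]->lines[0]->text.
code_points = [
    0x2022, 0x0020, 0x0063, 0x006C,
    0x0061, 0x0073, 0x0073, 0x0069,
    0x0020, 0x0063, 0x0061, 0x0074,
    0x0069, 0x0065, 0x003A,
]

# Each entry is one Unicode code point, so chr() is enough to decode it.
decoded = "".join(chr(cp) for cp in code_points)

print(len(code_points), repr(decoded))
```

The slot where the fi ligature should be holds U+0020 (a space), which matches the "classi catie" seen in the copied text.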
Incidentally, I think you have "Th" ligatured as well in your PDF.
Indeed:

    $ pdftotext ligature-test.pdf -
    e list below shows some ligatures in Dutch words:
    • classi catie:
    • ko epauze:
    • sou eren:
    • tre en:
    e same list typeset in italics:
    • classi catie:
    • ko epauze:
    • sou eren:
    • tre en:

    $ pdftotext ligature-repro.pdf -
    The list below shows some ligatures in Dutch words:
    • classificatie: fi
    • koffiepauze: ffi
    • souffleren: ffl
    • treffen: ff
    The same list typeset in italics:
    • classificatie: fi
    • koffiepauze: ffi
    • souffleren: ffl
    • treffen: ff
OK, you're using Minion (the typeface), which is I think where the problem lies. I don't have Minion, so I can't test this, but it seems that the encoding table for Minion uses character names that poppler doesn't understand:

    /F13_0 /GTLZUG+MinionPro-Regular 1 1
    [ /grave/acute/circumflex/tilde/dieresis/hungarumlaut/ring/caron
      /breve/macron/dotaccent/cedilla/ogonek/quotesinglbase/guilsinglleft/guilsinglright
      /quotedblleft/quotedblright/quotedblbase/guillemotleft/guillemotright/endash/emdash/f_f_t
      /f_j/dotlessi/f_f_j/f_f/f_i/f_l/f_f_i/f_f_l
      /f_t/exclam/quotedbl/numbersign.oldstyle/dollar.oldstyle/uniF642/ampersand/quoteright
      ...
      /rcaron/sacute/scaron/uni015F/tcaron/tcommaaccent/uhungarumlaut/uring
      /ydieresis/zacute/zcaron/zdotaccent/ij/exclamdown/questiondown/T_h
      ...
      /oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/germandbls]
    pdfMakeFont

Note the use of e.g. f_f to indicate ff (poppler expects ff, see NameToUnicodeTable.h) and also e.g. uni015F for ş (poppler doesn't understand this). It's obvious how to fix the uniXXXX entries, but the ligatures may be problematic as some of them (T_h, f_j, etc.) don't exist in Unicode and gfxFont mapping tables are single-character. Hmm.

Anyway, this is probably a poppler bug now.
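As a rough sketch of what decoding these names could look like, the Python below loosely follows the usual glyph-naming convention: drop any suffix after the first period, split the rest on underscores, and read `uniXXXX` components as hexadecimal code points. It is not poppler's code; `glyph_name_to_text` and the tiny `BASE_NAMES` table are hypothetical stand-ins for a real name-to-Unicode table.

```python
import re
from typing import Optional

# Tiny stand-in for a full name-to-Unicode table (hypothetical, for this example only).
BASE_NAMES = {
    "f": "f", "i": "i", "j": "j", "l": "l", "t": "t",
    "T": "T", "h": "h", "dollar": "$",
}

# "uni" followed by one or more groups of four uppercase hex digits.
UNI_NAME = re.compile(r"uni((?:[0-9A-F]{4})+)")


def glyph_name_to_text(name: str) -> Optional[str]:
    """Map a glyph name to a string, or None if a component is unknown."""
    base = name.split(".", 1)[0]  # "dollar.oldstyle" -> "dollar"
    pieces = []
    for component in base.split("_"):  # "f_f_i" -> ["f", "f", "i"]
        match = UNI_NAME.fullmatch(component)
        if match:
            digits = match.group(1)
            pieces.append("".join(
                chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
            ))
        elif component in BASE_NAMES:
            pieces.append(BASE_NAMES[component])
        else:
            return None
    return "".join(pieces)


for name in ("f_f", "f_f_i", "T_h", "uni015F", "dollar.oldstyle", "germandbls"):
    print(name, "->", glyph_name_to_text(name))
```

Note that the function returns a string rather than a single code point: names such as T_h and f_j decode to two characters, which is exactly the case a single-character mapping table cannot represent.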
Incidentally, that last is from a pdftops of the ligature-test.pdf file; the internal dump is:

    (gdb) p enc
    $98 = {0x806fdd8 "grave", 0x806fde8 "acute", 0x806fdf8 "circumflex", 0x806fe08 "tilde",
      0x806fe18 "dieresis", 0x806fe28 "hungarumlaut", 0x806fe40 "ring", 0x806fe50 "caron",
      0x806fe60 "breve", 0x806fe70 "macron", 0x806fe80 "dotaccent", 0x806fe90 "cedilla",
      0x806fea0 "ogonek", 0x806feb0 "quotesinglbase", 0x806fec8 "guilsinglleft", 0x806fee0 "guilsinglright",
      0x806fef8 "quotedblleft", 0x806ff10 "quotedblright", 0x806ff28 "quotedblbase", 0x806ff40 "guillemotleft",
      0x806ff58 "guillemotright", 0x806ff70 "endash", 0x806ff80 "emdash", 0x806ff90 "f_f_t",
      0x806ffa0 "f_j", 0x806ffb0 "dotlessi", 0x806ffc0 "f_f_j", 0x806ffd0 "f_f",
      0x806ffe0 "f_i", 0x806fff0 "f_l", 0x8070000 "f_f_i", 0x8070010 "f_f_l",
      0x806e690 "f_t", ...
    (gdb) p/x globalParams->mapNameToUnicode("f_i")
    $107 = 0
    (gdb) p/x globalParams->mapNameToUnicode("fi")
    $108 = 0xfb01
Final proof: dumping the following in /usr/share/poppler/nameToUnicode/Minion:

    fb00 f_f
    fb01 f_i
    fb02 f_l
    fb03 f_f_i
    fb04 f_f_l
    00de T_h

gives:

    Þe list below shows some ligatures in Dutch words:
    • classificatie: fi
    • koffiepauze: ffi
    • souffleren: ffl
    • treffen: ff
    Þe same list typeset in italics:
    • classificatie: fi
    • koffiepauze: ffi
    • souffleren: ffl
    • treffen: ff
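For reference, the mapping file used in that experiment can be read as a simple table. The sketch below assumes one "hex code point, glyph name" pair per line, as in the entries quoted here; the real nameToUnicode format may have further rules, and `parse_name_to_unicode` is a made-up helper, not a poppler function.

```python
from typing import Dict

# The six entries from the experiment, one "hex-codepoint glyph-name" pair per line.
MINION_TABLE = """\
fb00 f_f
fb01 f_i
fb02 f_l
fb03 f_f_i
fb04 f_f_l
00de T_h
"""


def parse_name_to_unicode(text: str) -> Dict[str, int]:
    """Parse 'hex name' lines into {glyph name: code point} (assumed format)."""
    table: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        code_hex, glyph_name = fields
        try:
            table[glyph_name] = int(code_hex, 16)
        except ValueError:
            continue  # first field was not a hex number
    return table


table = parse_name_to_unicode(MINION_TABLE)
for glyph_name, code_point in table.items():
    print(f"{glyph_name} -> U+{code_point:04X} {chr(code_point)!r}")
```

Each name maps to exactly one code point, which is why T_h has to borrow U+00DE (Þ) in this experiment: there is no single Unicode character for a Th ligature.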
See https://bugs.freedesktop.org/show_bug.cgi?id=8985 and https://bugs.freedesktop.org/show_bug.cgi?id=8986

Note that the above patch is still valid; it's just that poppler has a problem with Wouter's PDFs. Other PDFs will work fine with the patch.
Patch attached to https://bugs.freedesktop.org/show_bug.cgi?id=8986

Wouter, would you mind testing it, along with the above patch?
Also see https://bugs.freedesktop.org/show_bug.cgi?id=9001 (also with patch attached).

In all, you should be applying 3 patches:

    to evince: evince-copy-normalized.patch (attached here)
    to poppler: mapping_tables.patch (at f.d.o bug 8986)
    to poppler: ligature-selection-paint.patch (at f.d.o bug 9001)

However, the latter two are necessary only for PDFs with weird fonts; PDFs that use Computer Modern just require evince with evince-copy-normalized.patch (though the other patches won't hurt).
Thanks for looking into this issue. I'll look into it soon.
Incidentally, Wouter, do you have a PDF reader that your PDF *does* work on? I tried opening it in Adobe Reader 7.0.8 on Windows and it was unable to copy the ligatures properly:

    ¿e list below shows some ligatures in Dutch words:
    • classicatie:
    • koepauze:
    • soueren:
    • treen:

Not, of course, that we shouldn't be aiming to be better than Adobe...
I only use Evince :) I use the MinionPro helper files for LaTeX from http://developer.berlios.de/projects/minionpro/
OK, thanks. I haven't got around to installing that yet, but I have looked at MinionPro.pdf and we should also handle glyph variants; see update to fdo bug 8986.
*** Bug 399255 has been marked as a duplicate of this bug. ***
Hm, indeed. It would be nice to commit the normalization patch.
OK, the patch is applied, so the only remaining issue is in poppler.