Bug 341947 – Compatibility characters should be normalized in copied text



Summary:	Compatibility characters should be normalized in copied text


Status:	RESOLVED FIXED

Product:	evince
Classification:	Core
Component:	general
Version:	0.5.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Evince Maintainers
QA Contact:	Evince Maintainers

URL:
Whiteboard:

Duplicates:	399255
Depends on:
Blocks:	370721

Reported:	2006-05-16 07:03 UTC by Ed Catmur
Modified:	2007-01-27 17:31 UTC

See Also:
GNOME target:	---
GNOME version:	2.15/2.16

Attachments
evince-copy-normalized.patch (489 bytes, patch) 2006-05-16 07:04 UTC, Ed Catmur	committed
Sample PDF file with ligatures (73.62 KB, application/pdf) 2006-11-10 09:46 UTC, Wouter Bolsterlee (uws)
Recreated sample PDF file (18.86 KB, application/pdf) 2006-11-11 13:00 UTC, Ed Catmur

Description Ed Catmur 2006-05-16 07:03:50 UTC

When a document contains ligatures, e.g. ﬁ (fi), copying text from it goes wrong in two ways:

1. The text cannot be pasted into clients with restricted character sets (e.g. xterm, non-GTK apps).
2. In Unicode-aware apps, ligatures are not readily editable. Typically, positioning the cursor to the right of a fi ligature and hitting delete will delete both alphabetic characters, which is unexpected.

Documents generated by TeX often contain fi ligatures.

The simplest way to fix this would be to convert copied text to normalization form NFKC. This is an easy patch.
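
As a rough illustration of what NFKC normalization does to copied text, here is a minimal sketch using Python's standard unicodedata module. It only demonstrates the effect on ligature code points; it is not the attached patch, and the helper name normalize_copied_text is hypothetical.

```python
import unicodedata


def normalize_copied_text(text: str) -> str:
    """Return text in normalization form NFKC (hypothetical helper name)."""
    return unicodedata.normalize("NFKC", text)


# U+FB01 (fi), U+FB03 (ffi), U+FB04 (ffl) and U+FB00 (ff) are compatibility
# characters; NFKC replaces each of them with its plain-letter sequence.
sample = "classi\ufb01catie, ko\ufb03epauze, sou\ufb04eren, tre\ufb00en"
print(normalize_copied_text(sample))
```

After normalization the string contains only ordinary letters, so it can be pasted into clients with restricted character sets and edited one character at a time.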

Comment 1 Ed Catmur 2006-05-16 07:04:23 UTC

Created attachment 65575
evince-copy-normalized.patch

Comment 2 Ed Catmur 2006-05-24 11:38:00 UTC

See also https://bugs.freedesktop.org/show_bug.cgi?id=7002 for an upstream approach.

Comment 3 Nickolay V. Shmyrev 2006-05-24 14:29:14 UTC

Should it be resolved as NOTGNOME then?

Comment 4 Ed Catmur 2006-05-24 17:44:38 UTC

That's poppler only; evince has other backends which might not normalise their text.

Comment 5 Wouter Bolsterlee (uws) 2006-11-04 19:48:46 UTC

I would love to see this bug fixed. All my LaTeX files use a beautiful font with several ligatures. Copy/pasting text results in a mess.

Could someone knowledgeable review the patch?

Comment 6 Wouter Bolsterlee (uws) 2006-11-10 09:46:24 UTC

Created attachment 76318
Sample PDF file with ligatures

Ed, I've tried your patch and it does not work when copy/pasting text from the document. The attached PDF file shows some ligatures.

Comment 7 Ed Catmur 2006-11-11 13:00:16 UTC

Created attachment 76386
Recreated sample PDF file

Wouter, do you know how your PDF was created? I tried to reproduce it using standard LaTeX and got this file, which works fine.

LaTeX:
\documentclass{article}
\begin{document}
The list below shows some ligatures in Dutch words:
\begin{itemize}
\item classificatie: fi
\item koffiepauze: ffi
\item souffleren: ffl
\item treffen: ff
\end{itemize}

The same list typeset in \emph{italics}:
\begin{itemize}
\item \emph{classificatie: fi}
\item \emph{koffiepauze: ffi}
\item \emph{souffleren: ffl}
\item \emph{treffen: ff}
\end{itemize}
\end{document}

Command:
$ pdflatex ligature-repro.tex

Comment 8 Ed Catmur 2006-11-11 13:03:39 UTC

Wouter, I don't think that poppler is even seeing the ligatures as actual text in your file. For example, if I search for souﬄeren I get no hits.

Comment 9 Ed Catmur 2006-11-11 13:27:27 UTC

Hm. Looking at the TextBlock struct for the problematic PDF, the first bulleted line has:

(gdb) p blocks[1]->lines[0]->len
$16 = 15
(gdb) x/15 blocks[1]->lines[0]->text
0x82b53c0:      0x00002022      0x00000020      0x00000063      0x0000006c
0x82b53d0:      0x00000061      0x00000073      0x00000073      0x00000069
0x82b53e0:      0x00000020      0x00000063      0x00000061      0x00000074
0x82b53f0:      0x00000069      0x00000065      0x0000003a

That's: "• classi catie:"

i.e. poppler is not seeing the ligatured glyphs at all. This must be a poppler problem, no?
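
For anyone who wants to check the decoding, the 15 words from the gdb dump above can be turned back into text with a short Python sketch. The code points are copied from the dump; the helper name decode_words and the ligature range check are illustrative additions, not part of poppler.

```python
# Code points copied from the gdb dump above (one 32-bit word per character).
words = [
    0x2022, 0x20, 0x63, 0x6C,
    0x61, 0x73, 0x73, 0x69,
    0x20, 0x63, 0x61, 0x74,
    0x69, 0x65, 0x3A,
]


def decode_words(code_points):
    """Join a sequence of Unicode code points into a string."""
    return "".join(chr(cp) for cp in code_points)


line = decode_words(words)
print(repr(line))

# U+FB00..U+FB06 is the block of Latin ligatures (ff, fi, fl, ffi, ffl, ...).
has_ligature = any(0xFB00 <= cp <= 0xFB06 for cp in words)
print("contains a Latin ligature code point:", has_ligature)
```

The space at index 8 sits where the fi ligature should be, and no word in the line falls in the ligature range.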

Comment 10 Ed Catmur 2006-11-11 13:30:42 UTC

Incidentally, I think you have "Th" ligatured as well in your PDF.

Comment 11 Ed Catmur 2006-11-11 13:46:35 UTC

Indeed:

$ pdftotext ligature-test.pdf -
e list below shows some ligatures in Dutch words: • classi catie: • ko epauze: • sou eren: • tre en: e same list typeset in italics: • classi catie: • ko epauze: • sou eren: • tre en:

$ pdftotext ligature-repro.pdf -
The list below shows some ligatures in Dutch words: • classiﬁcatie: ﬁ • koﬃepauze: ﬃ • souﬄeren: ﬄ • treﬀen: ﬀ The same list typeset in italics: • classiﬁcatie: ﬁ • koﬃepauze: ﬃ • souﬄeren: ﬄ • treﬀen: ﬀ

Comment 12 Ed Catmur 2006-11-11 15:59:00 UTC

OK, you're using Minion (the typeface), which I think is where the problem lies. I don't have Minion, so I can't test this, but it seems that the encoding table for Minion uses character names that poppler doesn't understand:

/F13_0 /GTLZUG+MinionPro-Regular 1 1
[ /grave/acute/circumflex/tilde/dieresis/hungarumlaut/ring/caron
  /breve/macron/dotaccent/cedilla/ogonek/quotesinglbase/guilsinglleft/guilsinglright
  /quotedblleft/quotedblright/quotedblbase/guillemotleft/guillemotright/endash/emdash/f_f_t
  /f_j/dotlessi/f_f_j/f_f/f_i/f_l/f_f_i/f_f_l
  /f_t/exclam/quotedbl/numbersign.oldstyle/dollar.oldstyle/uniF642/ampersand/quoteright
...
  /rcaron/sacute/scaron/uni015F/tcaron/tcommaaccent/uhungarumlaut/uring
  /ydieresis/zacute/zcaron/zdotaccent/ij/exclamdown/questiondown/T_h
...
  /oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/germandbls]
pdfMakeFont

Note the use of e.g. f_f to indicate ﬀ (poppler expects ff, see NameToUnicodeTable.h) and also e.g. uni015F for ş (poppler doesn't understand this).

It's obvious how to fix the uniXXXX entries, but the ligatures may be problematic as some of them (T_h, f_j, etc.) don't exist in Unicode and gfxFont mapping tables are single-character. Hmm.
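
One possible way to handle such names, sketched in Python under the assumption that the font follows Adobe's glyph naming convention (components joined by underscores, uniXXXX for explicit code points, and a .suffix marking a variant). The tiny BASE_NAMES table and the function name glyph_name_to_text are hypothetical stand-ins, not poppler code. The result is a string rather than a single character, which is exactly what a single-character mapping table cannot hold.

```python
import re

# Hypothetical stand-in for a full glyph name table.
BASE_NAMES = {
    "f": "f", "i": "i", "j": "j", "l": "l", "t": "t",
    "T": "T", "h": "h", "dollar": "$", "numbersign": "#",
}

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)\Z")


def glyph_name_to_text(name):
    """Map a glyph name to text, or return None if any component is unknown."""
    base = name.split(".", 1)[0]      # drop a variant suffix such as ".oldstyle"
    pieces = []
    for part in base.split("_"):      # ligature components, e.g. f_f_i
        match = _UNI.match(part)
        if match:
            digits = match.group(1)
            pieces.append("".join(
                chr(int(digits[i:i + 4], 16))
                for i in range(0, len(digits), 4)
            ))
        elif part in BASE_NAMES:
            pieces.append(BASE_NAMES[part])
        else:
            return None
    return "".join(pieces)


for glyph in ("f_f_i", "T_h", "f_j", "uni015F", "dollar.oldstyle"):
    print(glyph, "->", glyph_name_to_text(glyph))
```

Decomposing into components sidesteps the problem that T_h and f_j have no Unicode code point of their own, at the cost of needing a multi-character mapping.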

Anyway, this is probably a poppler bug now.

Comment 13 Ed Catmur 2006-11-11 16:03:30 UTC

Incidentally, that last listing comes from running pdftops on the ligature-test.pdf file; the internal dump is:

(gdb) p enc
$98 = {0x806fdd8 "grave", 0x806fde8 "acute", 0x806fdf8 "circumflex", 
  0x806fe08 "tilde", 0x806fe18 "dieresis", 0x806fe28 "hungarumlaut", 
  0x806fe40 "ring", 0x806fe50 "caron", 0x806fe60 "breve", 0x806fe70 "macron", 
  0x806fe80 "dotaccent", 0x806fe90 "cedilla", 0x806fea0 "ogonek", 
  0x806feb0 "quotesinglbase", 0x806fec8 "guilsinglleft", 
  0x806fee0 "guilsinglright", 0x806fef8 "quotedblleft", 
  0x806ff10 "quotedblright", 0x806ff28 "quotedblbase", 
  0x806ff40 "guillemotleft", 0x806ff58 "guillemotright", 0x806ff70 "endash", 
  0x806ff80 "emdash", 0x806ff90 "f_f_t", 0x806ffa0 "f_j", 
  0x806ffb0 "dotlessi", 0x806ffc0 "f_f_j", 0x806ffd0 "f_f", 0x806ffe0 "f_i", 
  0x806fff0 "f_l", 0x8070000 "f_f_i", 0x8070010 "f_f_l", 0x806e690 "f_t", 
...
(gdb) p/x globalParams->mapNameToUnicode("f_i")
$107 = 0
(gdb) p/x globalParams->mapNameToUnicode("fi")
$108 = 0xfb01

Comment 14 Ed Catmur 2006-11-11 16:16:03 UTC

Final proof: putting the following in /usr/share/poppler/nameToUnicode/Minion:
fb00 f_f
fb01 f_i
fb02 f_l
fb03 f_f_i
fb04 f_f_l
00de T_h

gives:

Þe list below shows some ligatures in Dutch words:
   • classificatie: fi
   • koffiepauze: ffi
   • souffleren: ffl
   • treffen: ff
Þe same list typeset in italics:
   • classificatie: fi
   • koffiepauze: ffi
   • souffleren: ffl
   • treffen: ff
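
To make the mechanism concrete, here is a minimal Python sketch that reads entries in the form shown above, assuming each line is a hexadecimal code point followed by a glyph name, and then applies NFKC to the mapped character. The function name parse_name_to_unicode is hypothetical; this is not how poppler itself loads the file.

```python
import unicodedata

# The entries from this comment, one "hex-code-point glyph-name" pair per line.
MAPPING_TEXT = """\
fb00 f_f
fb01 f_i
fb02 f_l
fb03 f_f_i
fb04 f_f_l
00de T_h
"""


def parse_name_to_unicode(text):
    """Build a glyph-name -> character table, skipping malformed lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        code, name = fields
        table[name] = chr(int(code, 16))
    return table


table = parse_name_to_unicode(MAPPING_TEXT)

# The mapping yields the ligature code point; NFKC then splits it into letters.
print(unicodedata.normalize("NFKC", table["f_f_i"]))
# U+00DE (thorn) has no compatibility decomposition, so it stays as it is.
print(unicodedata.normalize("NFKC", table["T_h"]))
```

This also shows why mapping T_h to U+00DE is only a stopgap: normalization cannot turn a thorn back into "Th".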

Comment 15 Ed Catmur 2006-11-11 16:29:54 UTC

See https://bugs.freedesktop.org/show_bug.cgi?id=8985 and https://bugs.freedesktop.org/show_bug.cgi?id=8986

Note that the above patch is still valid. Poppler just has a problem with Wouter's PDFs; other PDFs will work fine with the patch.

Comment 16 Ed Catmur 2006-11-12 00:59:28 UTC

Patch attached to https://bugs.freedesktop.org/show_bug.cgi?id=8986

Wouter, would you mind testing it, along with the above patch?

Comment 17 Ed Catmur 2006-11-15 09:26:51 UTC

Also see https://bugs.freedesktop.org/show_bug.cgi?id=9001 (also with patch attached).

In all, you should be applying 3 patches:
to evince: evince-copy-normalized.patch (attached here)
to poppler: mapping_tables.patch (at f.d.o bug 8986)
to poppler: ligature-selection-paint.patch (at f.d.o bug 9001)

However, the latter two are necessary only for PDFs with unusual fonts; PDFs that use Computer Modern just require evince with evince-copy-normalized.patch (though the other patches won't hurt).

Comment 18 Wouter Bolsterlee (uws) 2006-11-15 16:25:15 UTC

Thanks for looking into this issue. I'll look into it soon.

Comment 19 Ed Catmur 2006-11-20 16:02:24 UTC

Incidentally, Wouter, do you have a PDF reader that your PDF *does* work on?  I tried opening it in Adobe Reader 7.0.8 on Windows and it was unable to copy the ligatures properly:

¿e list below shows some ligatures in Dutch words:
• classicatie: 
• koepauze: 
• soueren: 
• treen: 

Not, of course, that we shouldn't be aiming to be better than Adobe...

Comment 20 Wouter Bolsterlee (uws) 2006-11-20 16:04:14 UTC

I only use Evince :) I use the MinionPro helper files for LaTeX from http://developer.berlios.de/projects/minionpro/

Comment 21 Ed Catmur 2006-11-22 23:10:35 UTC

OK, thanks.  I haven't got around to installing that yet, but I have looked at MinionPro.pdf and we should also handle glyph variants; see update to fdo bug 8986.

Comment 22 Nickolay V. Shmyrev 2007-01-22 07:41:51 UTC

*** Bug 399255 has been marked as a duplicate of this bug. ***

Comment 23 Nickolay V. Shmyrev 2007-01-23 07:42:32 UTC

Hm, indeed. It would be nice to commit the normalization patch.

Comment 24 Nickolay V. Shmyrev 2007-01-27 17:31:56 UTC

OK, the patch is applied, so the only remaining issue is in poppler.