Bug 689255 - parser error : Input is not proper UTF-8, indicate encoding
Status: RESOLVED OBSOLETE
Product: evince
Classification: Core
Component: PDF
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Evince Maintainers
QA Contact: Evince Maintainers
Depends on:
Blocks:
 
 
Reported: 2012-11-29 06:23 UTC by Germán Poo-Caamaño
Modified: 2018-05-22 14:52 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
PDF Test case (19.08 KB, application/pdf)
2012-11-29 06:23 UTC, Germán Poo-Caamaño

Description Germán Poo-Caamaño 2012-11-29 06:23:49 UTC
Created attachment 230156
PDF Test case

This bug was reported on the evince mailing list. I can reproduce it with evince master, but not with poppler-glib-demo, so the bug seems to belong to evince. Here is the original report:

The attached pdf triggers an error while opened with evince 3.4.0
(poppler/cairo 0.18.4) on fedora 17:
Entity: line 10: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x72
='x-default'>Untitled</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>

       ^

The PDF was created with LilyPond 2.16.0 (a music typesetting tool) on
Fedora 17. According to the LilyPond developers, the "ä" character
that triggers the error is correctly encoded and evince should not
report the error:
00004830  74 3e 3c 2f 64 63 3a 74  69 74 6c 65 3e 3c 64 63  |t></dc:title><dc|
00004840  3a 63 72 65 61 74 6f 72  3e 3c 72 64 66 3a 53 65  |:creator><rdf:Se|
00004850  71 3e 3c 72 64 66 3a 6c  69 3e e4 3c 2f 72 64 66  |q><rdf:li>.</rdf|

On this last line, we can see that the "ä" is encoded as e4, which is
the right Unicode code point.
Other readers like xpdf, okular and pdfinfo do not report any error.
Here is the issue link at lilypond:
http://code.google.com/p/lilypond/issues/detail?id=2985
Comment 1 Christian Persch 2012-11-29 12:33:22 UTC
The pdf contains this xml:

000043d0  3c 3f 78 70 61 63 6b 65  74 20 62 65 67 69 6e 3d  |<?xpacket begin=|
000043e0  27 ef bb bf 27 20 69 64  3d 27 57 35 4d 30 4d 70  |'...' id='W5M0Mp|
000043f0  43 65 68 69 48 7a 72 65  53 7a 4e 54 63 7a 6b 63  |CehiHzreSzNTczkc|
00004400  39 64 27 3f 3e 0a 3c 3f  61 64 6f 62 65 2d 78 61  |9d'?>.<?adobe-xa|
00004410  70 2d 66 69 6c 74 65 72  73 20 65 73 63 3d 22 43  |p-filters esc="C|
00004420  52 4c 46 22 3f 3e 0a 3c  78 3a 78 6d 70 6d 65 74  |RLF"?>.<x:xmpmet|
00004430  61 20 78 6d 6c 6e 73 3a  78 3d 27 61 64 6f 62 65  |a xmlns:x='adobe|
00004440  3a 6e 73 3a 6d 65 74 61  2f 27 20 78 3a 78 6d 70  |:ns:meta/' x:xmp|
00004450  74 6b 3d 27 58 4d 50 20  74 6f 6f 6c 6b 69 74 20  |tk='XMP toolkit |
00004460  32 2e 39 2e 31 2d 31 33  2c 20 66 72 61 6d 65 77  |2.9.1-13, framew|
00004470  6f 72 6b 20 31 2e 36 27  3e 0a 3c 72 64 66 3a 52  |ork 1.6'>.<rdf:R|
00004480  44 46 20 78 6d 6c 6e 73  3a 72 64 66 3d 27 68 74  |DF xmlns:rdf='ht|
00004490  74 70 3a 2f 2f 77 77 77  2e 77 33 2e 6f 72 67 2f  |tp://www.w3.org/|
000044a0  31 39 39 39 2f 30 32 2f  32 32 2d 72 64 66 2d 73  |1999/02/22-rdf-s|
000044b0  79 6e 74 61 78 2d 6e 73  23 27 20 78 6d 6c 6e 73  |yntax-ns#' xmlns|
000044c0  3a 69 58 3d 27 68 74 74  70 3a 2f 2f 6e 73 2e 61  |:iX='http://ns.a|
000044d0  64 6f 62 65 2e 63 6f 6d  2f 69 58 2f 31 2e 30 2f  |dobe.com/iX/1.0/|
000044e0  27 3e 0a 3c 72 64 66 3a  44 65 73 63 72 69 70 74  |'>.<rdf:Descript|
000044f0  69 6f 6e 20 72 64 66 3a  61 62 6f 75 74 3d 27 75  |ion rdf:about='u|
00004500  75 69 64 3a 66 31 36 33  35 31 62 39 2d 37 32 30  |uid:f16351b9-720|
00004510  36 2d 31 31 65 64 2d 30  30 30 30 2d 34 62 31 39  |6-11ed-0000-4b19|
00004520  38 39 66 61 63 36 36 30  27 20 78 6d 6c 6e 73 3a  |89fac660' xmlns:|
00004530  70 64 66 3d 27 68 74 74  70 3a 2f 2f 6e 73 2e 61  |pdf='http://ns.a|
00004540  64 6f 62 65 2e 63 6f 6d  2f 70 64 66 2f 31 2e 33  |dobe.com/pdf/1.3|
00004550  2f 27 20 70 64 66 3a 50  72 6f 64 75 63 65 72 3d  |/' pdf:Producer=|
00004560  27 47 50 4c 20 47 68 6f  73 74 73 63 72 69 70 74  |'GPL Ghostscript|
00004570  20 39 2e 30 35 27 2f 3e  0a 3c 72 64 66 3a 44 65  | 9.05'/>.<rdf:De|
00004580  73 63 72 69 70 74 69 6f  6e 20 72 64 66 3a 61 62  |scription rdf:ab|
00004590  6f 75 74 3d 27 75 75 69  64 3a 66 31 36 33 35 31  |out='uuid:f16351|
000045a0  62 39 2d 37 32 30 36 2d  31 31 65 64 2d 30 30 30  |b9-7206-11ed-000|
000045b0  30 2d 34 62 31 39 38 39  66 61 63 36 36 30 27 20  |0-4b1989fac660' |
000045c0  78 6d 6c 6e 73 3a 78 6d  70 3d 27 68 74 74 70 3a  |xmlns:xmp='http:|
000045d0  2f 2f 6e 73 2e 61 64 6f  62 65 2e 63 6f 6d 2f 78  |//ns.adobe.com/x|
000045e0  61 70 2f 31 2e 30 2f 27  3e 3c 78 6d 70 3a 4d 6f  |ap/1.0/'><xmp:Mo|
000045f0  64 69 66 79 44 61 74 65  3e 32 30 31 32 2d 31 31  |difyDate>2012-11|
00004600  2d 32 39 54 30 37 3a 30  32 3a 34 35 2b 30 31 3a  |-29T07:02:45+01:|
00004610  30 30 3c 2f 78 6d 70 3a  4d 6f 64 69 66 79 44 61  |00</xmp:ModifyDa|
00004620  74 65 3e 0a 3c 78 6d 70  3a 43 72 65 61 74 65 44  |te>.<xmp:CreateD|
00004630  61 74 65 3e 32 30 31 32  2d 31 31 2d 32 39 54 30  |ate>2012-11-29T0|
00004640  37 3a 30 32 3a 34 35 2b  30 31 3a 30 30 3c 2f 78  |7:02:45+01:00</x|
00004650  6d 70 3a 43 72 65 61 74  65 44 61 74 65 3e 0a 3c  |mp:CreateDate>.<|
00004660  78 6d 70 3a 43 72 65 61  74 6f 72 54 6f 6f 6c 3e  |xmp:CreatorTool>|
00004670  4c 69 6c 79 50 6f 6e 64  20 32 2e 31 36 2e 30 3c  |LilyPond 2.16.0<|
00004680  2f 78 6d 70 3a 43 72 65  61 74 6f 72 54 6f 6f 6c  |/xmp:CreatorTool|
00004690  3e 3c 2f 72 64 66 3a 44  65 73 63 72 69 70 74 69  |></rdf:Descripti|
000046a0  6f 6e 3e 0a 3c 72 64 66  3a 44 65 73 63 72 69 70  |on>.<rdf:Descrip|
000046b0  74 69 6f 6e 20 72 64 66  3a 61 62 6f 75 74 3d 27  |tion rdf:about='|
000046c0  75 75 69 64 3a 66 31 36  33 35 31 62 39 2d 37 32  |uuid:f16351b9-72|
000046d0  30 36 2d 31 31 65 64 2d  30 30 30 30 2d 34 62 31  |06-11ed-0000-4b1|
000046e0  39 38 39 66 61 63 36 36  30 27 20 78 6d 6c 6e 73  |989fac660' xmlns|
000046f0  3a 78 61 70 4d 4d 3d 27  68 74 74 70 3a 2f 2f 6e  |:xapMM='http://n|
00004700  73 2e 61 64 6f 62 65 2e  63 6f 6d 2f 78 61 70 2f  |s.adobe.com/xap/|
00004710  31 2e 30 2f 6d 6d 2f 27  20 78 61 70 4d 4d 3a 44  |1.0/mm/' xapMM:D|
00004720  6f 63 75 6d 65 6e 74 49  44 3d 27 75 75 69 64 3a  |ocumentID='uuid:|
00004730  66 31 36 33 35 31 62 39  2d 37 32 30 36 2d 31 31  |f16351b9-7206-11|
00004740  65 64 2d 30 30 30 30 2d  34 62 31 39 38 39 66 61  |ed-0000-4b1989fa|
00004750  63 36 36 30 27 2f 3e 0a  3c 72 64 66 3a 44 65 73  |c660'/>.<rdf:Des|
00004760  63 72 69 70 74 69 6f 6e  20 72 64 66 3a 61 62 6f  |cription rdf:abo|
00004770  75 74 3d 27 75 75 69 64  3a 66 31 36 33 35 31 62  |ut='uuid:f16351b|
00004780  39 2d 37 32 30 36 2d 31  31 65 64 2d 30 30 30 30  |9-7206-11ed-0000|
00004790  2d 34 62 31 39 38 39 66  61 63 36 36 30 27 20 78  |-4b1989fac660' x|
000047a0  6d 6c 6e 73 3a 64 63 3d  27 68 74 74 70 3a 2f 2f  |mlns:dc='http://|
000047b0  70 75 72 6c 2e 6f 72 67  2f 64 63 2f 65 6c 65 6d  |purl.org/dc/elem|
000047c0  65 6e 74 73 2f 31 2e 31  2f 27 20 64 63 3a 66 6f  |ents/1.1/' dc:fo|
000047d0  72 6d 61 74 3d 27 61 70  70 6c 69 63 61 74 69 6f  |rmat='applicatio|
000047e0  6e 2f 70 64 66 27 3e 3c  64 63 3a 74 69 74 6c 65  |n/pdf'><dc:title|
000047f0  3e 3c 72 64 66 3a 41 6c  74 3e 3c 72 64 66 3a 6c  |><rdf:Alt><rdf:l|
00004800  69 20 78 6d 6c 3a 6c 61  6e 67 3d 27 78 2d 64 65  |i xml:lang='x-de|
00004810  66 61 75 6c 74 27 3e 55  6e 74 69 74 6c 65 64 3c  |fault'>Untitled<|
00004820  2f 72 64 66 3a 6c 69 3e  3c 2f 72 64 66 3a 41 6c  |/rdf:li></rdf:Al|
00004830  74 3e 3c 2f 64 63 3a 74  69 74 6c 65 3e 3c 64 63  |t></dc:title><dc|
00004840  3a 63 72 65 61 74 6f 72  3e 3c 72 64 66 3a 53 65  |:creator><rdf:Se|
00004850  71 3e 3c 72 64 66 3a 6c  69 3e e4 3c 2f 72 64 66  |q><rdf:li>.</rdf|
00004860  3a 6c 69 3e 3c 2f 72 64  66 3a 53 65 71 3e 3c 2f  |:li></rdf:Seq></|
00004870  64 63 3a 63 72 65 61 74  6f 72 3e 3c 2f 72 64 66  |dc:creator></rdf|
00004880  3a 44 65 73 63 72 69 70  74 69 6f 6e 3e 0a 3c 2f  |:Description>.</|
00004890  72 64 66 3a 52 44 46 3e  0a 3c 2f 78 3a 78 6d 70  |rdf:RDF>.</x:xmp|
000048a0  6d 65 74 61 3e 0a 20 20  20 20 20 20 20 20 20 20  |meta>.          |
000048b0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
000048c0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
000048d0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
000048e0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 0a 20  |              . |
000048f0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00004900  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00004910  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00004920  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00004930  20 20 20 20 20 20 20 0a  3c 3f 78 70 61 63 6b 65  |       .<?xpacke|
00004940  74 20 65 6e 64 3d 27 77  27 3f 3e 0a 65 6e 64 73  |t end='w'?>.ends|

Now first observe this:

000043e0  27 ef bb bf 27 20 69 64  3d 27 57 35 4d 30 4d 70  |'...' id='W5M0Mp|
             ^^^^^^^^
which is the byte order mark (BOM) in UTF-8. So clearly we're supposed to interpret this XML data as UTF-8.
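The BOM observation can be checked with a few lines of Python. This is only an illustrative sketch: the three bytes are copied by hand from offset 0x43e1 of the dump above, and the variable names are made up for the example.

```python
import codecs

# Bytes copied from the dump: the value of the xpacket "begin" attribute.
begin_value = bytes.fromhex("ef bb bf")

# codecs.BOM_UTF8 is the standard library constant for the UTF-8 BOM.
has_utf8_bom = begin_value.startswith(codecs.BOM_UTF8)

# Decoded as UTF-8, the three bytes form the single code point U+FEFF.
bom_char = begin_value.decode("utf-8")

print(has_utf8_bom, hex(ord(bom_char)))
```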

Now let's look at the data:

00004850  71 3e 3c 72 64 66 3a 6c  69 3e e4 3c 2f 72 64 66  |q><rdf:li>.</rdf|
                                         ^^^^^
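"ä" is U+00E4, but since this data is supposed to be UTF-8, not ISO-8859-1, there should be a "c3 a4" sequence here, not a literal "e4". In UTF-8, e4 starts a three-byte sequence and must be followed by two continuation bytes in the range 80-bf, so "e4 3c" is not a valid UTF-8 sequence and this XML data is not correctly encoded in UTF-8.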

=> bug in the document creator.
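The same conclusion can be reproduced outside evince. The following Python sketch (not part of evince or libxml2; the byte string is copied by hand from offset 0x4850 of the dump, and the variable names are illustrative) shows the two encodings of U+00E4 and where a strict UTF-8 decoder gives up:

```python
# The 16 bytes at offset 0x4850 of the dump above.
line = bytes.fromhex("71 3e 3c 72 64 66 3a 6c 69 3e e4 3c 2f 72 64 66")

# U+00E4 takes two bytes in UTF-8 but a single byte in ISO-8859-1 (Latin-1).
utf8_form = "\u00e4".encode("utf-8")
latin1_form = "\u00e4".encode("latin-1")

# A strict UTF-8 decode fails at the e4 byte, because the byte that
# follows it (3c) is not a continuation byte.
try:
    line.decode("utf-8")
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start

# Read as Latin-1, the same bytes decode without error.
as_latin1 = line.decode("latin-1")

print(utf8_form.hex(), latin1_form.hex(), error_offset, as_latin1)
```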

However, I guess evince should have better error recovery in this case, too.
Comment 2 Christian Persch 2012-11-29 13:39:16 UTC
So I think we should replace the simple xmlParseMemory call with a parser context that we create ourselves (and set up so that it doesn't spew errors to the console!), then parse with that: first as UTF-8 and, if we get an encoding error, retry as Latin-1?
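To illustrate the retry idea only, here is a rough Python sketch using the standard library's ElementTree rather than libxml2. It is not what evince does or would do: the function name parse_xmp_with_fallback and the sample inputs are hypothetical, and suppressing the parser's console output is not modelled.

```python
import xml.etree.ElementTree as ET


def parse_xmp_with_fallback(data: bytes):
    """Parse XML bytes as UTF-8; on an encoding error, retry as Latin-1.

    Hypothetical helper for illustration only.
    """
    try:
        text = data.decode("utf-8")
        encoding_used = "utf-8"
    except UnicodeDecodeError:
        # Every byte value is valid in Latin-1, so this decode cannot fail.
        text = data.decode("latin-1")
        encoding_used = "latin-1"
    return ET.fromstring(text), encoding_used


# Made-up samples: a bare e4 byte (as in the attachment) and proper UTF-8.
broken = b"<creator><li>\xe4</li></creator>"
proper = b"<creator><li>\xc3\xa4</li></creator>"

broken_root, broken_enc = parse_xmp_with_fallback(broken)
proper_root, proper_enc = parse_xmp_with_fallback(proper)

print(broken_enc, broken_root.find("li").text)
print(proper_enc, proper_root.find("li").text)
```

Trying UTF-8 first matters here: because Latin-1 accepts every byte value, trying it first would silently misread correctly encoded files.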
Comment 3 Germán Poo-Caamaño 2012-11-30 23:26:28 UTC
It turned out to be a bug in Ghostscript, as described on the evince mailing list:
https://mail.gnome.org/archives/evince-list/2012-November/msg00028.html
Comment 4 Christian Persch 2012-12-07 17:44:37 UTC
I still think evince should do more error checking as detailed in comment 2.
Comment 5 GNOME Infrastructure Team 2018-05-22 14:52:08 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/evince/issues/318.