GNOME Bugzilla – Bug 689255
parser error : Input is not proper UTF-8, indicate encoding
Last modified: 2018-05-22 14:52:08 UTC
Created attachment 230156
PDF Test case

This bug was reported on the evince mailing list. I can reproduce it with evince master, but not with poppler-glib-demo, so it seems to be a bug in evince.

Here is the original report:

The attached PDF triggers an error when opened with evince 3.4.0 (poppler/cairo 0.18.4) on Fedora 17:

Entity: line 10: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x3C 0x2F 0x72
='x-default'>Untitled</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>
^

The PDF was created with LilyPond 2.16.0 (a music typesetting tool) on Fedora 17. According to the LilyPond developers, the "ä" character that triggers the error is correctly encoded and evince should not report an error:

00004830 74 3e 3c 2f 64 63 3a 74 69 74 6c 65 3e 3c 64 63 |t></dc:title><dc|
00004840 3a 63 72 65 61 74 6f 72 3e 3c 72 64 66 3a 53 65 |:creator><rdf:Se|
00004850 71 3e 3c 72 64 66 3a 6c 69 3e e4 3c 2f 72 64 66 |q><rdf:li>.</rdf|

On the last line, we can see that the "ä" is stored as e4, which is the right Unicode code point. Other readers such as xpdf, okular and pdfinfo do not report any error.

Here is the corresponding LilyPond issue: http://code.google.com/p/lilypond/issues/detail?id=2985
The pdf contains this xml:

000043d0 3c 3f 78 70 61 63 6b 65 74 20 62 65 67 69 6e 3d |<?xpacket begin=|
000043e0 27 ef bb bf 27 20 69 64 3d 27 57 35 4d 30 4d 70 |'...' id='W5M0Mp|
000043f0 43 65 68 69 48 7a 72 65 53 7a 4e 54 63 7a 6b 63 |CehiHzreSzNTczkc|
00004400 39 64 27 3f 3e 0a 3c 3f 61 64 6f 62 65 2d 78 61 |9d'?>.<?adobe-xa|
00004410 70 2d 66 69 6c 74 65 72 73 20 65 73 63 3d 22 43 |p-filters esc="C|
00004420 52 4c 46 22 3f 3e 0a 3c 78 3a 78 6d 70 6d 65 74 |RLF"?>.<x:xmpmet|
00004430 61 20 78 6d 6c 6e 73 3a 78 3d 27 61 64 6f 62 65 |a xmlns:x='adobe|
00004440 3a 6e 73 3a 6d 65 74 61 2f 27 20 78 3a 78 6d 70 |:ns:meta/' x:xmp|
00004450 74 6b 3d 27 58 4d 50 20 74 6f 6f 6c 6b 69 74 20 |tk='XMP toolkit |
00004460 32 2e 39 2e 31 2d 31 33 2c 20 66 72 61 6d 65 77 |2.9.1-13, framew|
00004470 6f 72 6b 20 31 2e 36 27 3e 0a 3c 72 64 66 3a 52 |ork 1.6'>.<rdf:R|
00004480 44 46 20 78 6d 6c 6e 73 3a 72 64 66 3d 27 68 74 |DF xmlns:rdf='ht|
00004490 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f |tp://www.w3.org/|
000044a0 31 39 39 39 2f 30 32 2f 32 32 2d 72 64 66 2d 73 |1999/02/22-rdf-s|
000044b0 79 6e 74 61 78 2d 6e 73 23 27 20 78 6d 6c 6e 73 |yntax-ns#' xmlns|
000044c0 3a 69 58 3d 27 68 74 74 70 3a 2f 2f 6e 73 2e 61 |:iX='http://ns.a|
000044d0 64 6f 62 65 2e 63 6f 6d 2f 69 58 2f 31 2e 30 2f |dobe.com/iX/1.0/|
000044e0 27 3e 0a 3c 72 64 66 3a 44 65 73 63 72 69 70 74 |'>.<rdf:Descript|
000044f0 69 6f 6e 20 72 64 66 3a 61 62 6f 75 74 3d 27 75 |ion rdf:about='u|
00004500 75 69 64 3a 66 31 36 33 35 31 62 39 2d 37 32 30 |uid:f16351b9-720|
00004510 36 2d 31 31 65 64 2d 30 30 30 30 2d 34 62 31 39 |6-11ed-0000-4b19|
00004520 38 39 66 61 63 36 36 30 27 20 78 6d 6c 6e 73 3a |89fac660' xmlns:|
00004530 70 64 66 3d 27 68 74 74 70 3a 2f 2f 6e 73 2e 61 |pdf='http://ns.a|
00004540 64 6f 62 65 2e 63 6f 6d 2f 70 64 66 2f 31 2e 33 |dobe.com/pdf/1.3|
00004550 2f 27 20 70 64 66 3a 50 72 6f 64 75 63 65 72 3d |/' pdf:Producer=|
00004560 27 47 50 4c 20 47 68 6f 73 74 73 63 72 69 70 74 |'GPL Ghostscript|
00004570 20 39 2e 30 35 27 2f 3e 0a 3c 72 64 66 3a 44 65 | 9.05'/>.<rdf:De|
00004580 73 63 72 69 70 74 69 6f 6e 20 72 64 66 3a 61 62 |scription rdf:ab|
00004590 6f 75 74 3d 27 75 75 69 64 3a 66 31 36 33 35 31 |out='uuid:f16351|
000045a0 62 39 2d 37 32 30 36 2d 31 31 65 64 2d 30 30 30 |b9-7206-11ed-000|
000045b0 30 2d 34 62 31 39 38 39 66 61 63 36 36 30 27 20 |0-4b1989fac660' |
000045c0 78 6d 6c 6e 73 3a 78 6d 70 3d 27 68 74 74 70 3a |xmlns:xmp='http:|
000045d0 2f 2f 6e 73 2e 61 64 6f 62 65 2e 63 6f 6d 2f 78 |//ns.adobe.com/x|
000045e0 61 70 2f 31 2e 30 2f 27 3e 3c 78 6d 70 3a 4d 6f |ap/1.0/'><xmp:Mo|
000045f0 64 69 66 79 44 61 74 65 3e 32 30 31 32 2d 31 31 |difyDate>2012-11|
00004600 2d 32 39 54 30 37 3a 30 32 3a 34 35 2b 30 31 3a |-29T07:02:45+01:|
00004610 30 30 3c 2f 78 6d 70 3a 4d 6f 64 69 66 79 44 61 |00</xmp:ModifyDa|
00004620 74 65 3e 0a 3c 78 6d 70 3a 43 72 65 61 74 65 44 |te>.<xmp:CreateD|
00004630 61 74 65 3e 32 30 31 32 2d 31 31 2d 32 39 54 30 |ate>2012-11-29T0|
00004640 37 3a 30 32 3a 34 35 2b 30 31 3a 30 30 3c 2f 78 |7:02:45+01:00</x|
00004650 6d 70 3a 43 72 65 61 74 65 44 61 74 65 3e 0a 3c |mp:CreateDate>.<|
00004660 78 6d 70 3a 43 72 65 61 74 6f 72 54 6f 6f 6c 3e |xmp:CreatorTool>|
00004670 4c 69 6c 79 50 6f 6e 64 20 32 2e 31 36 2e 30 3c |LilyPond 2.16.0<|
00004680 2f 78 6d 70 3a 43 72 65 61 74 6f 72 54 6f 6f 6c |/xmp:CreatorTool|
00004690 3e 3c 2f 72 64 66 3a 44 65 73 63 72 69 70 74 69 |></rdf:Descripti|
000046a0 6f 6e 3e 0a 3c 72 64 66 3a 44 65 73 63 72 69 70 |on>.<rdf:Descrip|
000046b0 74 69 6f 6e 20 72 64 66 3a 61 62 6f 75 74 3d 27 |tion rdf:about='|
000046c0 75 75 69 64 3a 66 31 36 33 35 31 62 39 2d 37 32 |uuid:f16351b9-72|
000046d0 30 36 2d 31 31 65 64 2d 30 30 30 30 2d 34 62 31 |06-11ed-0000-4b1|
000046e0 39 38 39 66 61 63 36 36 30 27 20 78 6d 6c 6e 73 |989fac660' xmlns|
000046f0 3a 78 61 70 4d 4d 3d 27 68 74 74 70 3a 2f 2f 6e |:xapMM='http://n|
00004700 73 2e 61 64 6f 62 65 2e 63 6f 6d 2f 78 61 70 2f |s.adobe.com/xap/|
00004710 31 2e 30 2f 6d 6d 2f 27 20 78 61 70 4d 4d 3a 44 |1.0/mm/' xapMM:D|
00004720 6f 63 75 6d 65 6e 74 49 44 3d 27 75 75 69 64 3a |ocumentID='uuid:|
00004730 66 31 36 33 35 31 62 39 2d 37 32 30 36 2d 31 31 |f16351b9-7206-11|
00004740 65 64 2d 30 30 30 30 2d 34 62 31 39 38 39 66 61 |ed-0000-4b1989fa|
00004750 63 36 36 30 27 2f 3e 0a 3c 72 64 66 3a 44 65 73 |c660'/>.<rdf:Des|
00004760 63 72 69 70 74 69 6f 6e 20 72 64 66 3a 61 62 6f |cription rdf:abo|
00004770 75 74 3d 27 75 75 69 64 3a 66 31 36 33 35 31 62 |ut='uuid:f16351b|
00004780 39 2d 37 32 30 36 2d 31 31 65 64 2d 30 30 30 30 |9-7206-11ed-0000|
00004790 2d 34 62 31 39 38 39 66 61 63 36 36 30 27 20 78 |-4b1989fac660' x|
000047a0 6d 6c 6e 73 3a 64 63 3d 27 68 74 74 70 3a 2f 2f |mlns:dc='http://|
000047b0 70 75 72 6c 2e 6f 72 67 2f 64 63 2f 65 6c 65 6d |purl.org/dc/elem|
000047c0 65 6e 74 73 2f 31 2e 31 2f 27 20 64 63 3a 66 6f |ents/1.1/' dc:fo|
000047d0 72 6d 61 74 3d 27 61 70 70 6c 69 63 61 74 69 6f |rmat='applicatio|
000047e0 6e 2f 70 64 66 27 3e 3c 64 63 3a 74 69 74 6c 65 |n/pdf'><dc:title|
000047f0 3e 3c 72 64 66 3a 41 6c 74 3e 3c 72 64 66 3a 6c |><rdf:Alt><rdf:l|
00004800 69 20 78 6d 6c 3a 6c 61 6e 67 3d 27 78 2d 64 65 |i xml:lang='x-de|
00004810 66 61 75 6c 74 27 3e 55 6e 74 69 74 6c 65 64 3c |fault'>Untitled<|
00004820 2f 72 64 66 3a 6c 69 3e 3c 2f 72 64 66 3a 41 6c |/rdf:li></rdf:Al|
00004830 74 3e 3c 2f 64 63 3a 74 69 74 6c 65 3e 3c 64 63 |t></dc:title><dc|
00004840 3a 63 72 65 61 74 6f 72 3e 3c 72 64 66 3a 53 65 |:creator><rdf:Se|
00004850 71 3e 3c 72 64 66 3a 6c 69 3e e4 3c 2f 72 64 66 |q><rdf:li>.</rdf|
00004860 3a 6c 69 3e 3c 2f 72 64 66 3a 53 65 71 3e 3c 2f |:li></rdf:Seq></|
00004870 64 63 3a 63 72 65 61 74 6f 72 3e 3c 2f 72 64 66 |dc:creator></rdf|
00004880 3a 44 65 73 63 72 69 70 74 69 6f 6e 3e 0a 3c 2f |:Description>.</|
00004890 72 64 66 3a 52 44 46 3e 0a 3c 2f 78 3a 78 6d 70 |rdf:RDF>.</x:xmp|
000048a0 6d 65 74 61 3e 0a 20 20 20 20 20 20 20 20 20 20 |meta>.          |
000048b0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
000048c0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
000048d0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
000048e0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 0a 20 |              . |
000048f0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
00004900 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
00004910 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
00004920 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |                |
00004930 20 20 20 20 20 20 20 0a 3c 3f 78 70 61 63 6b 65 |       .<?xpacke|
00004940 74 20 65 6e 64 3d 27 77 27 3f 3e 0a 65 6e 64 73 |t end='w'?>.ends|

Now first observe this:

000043e0 27 ef bb bf 27 20 69 64 3d 27 57 35 4d 30 4d 70 |'...' id='W5M0Mp|
            ^^^^^^^^

which is the byte order mark (BOM) in UTF-8. So clearly we're supposed to interpret this XML data as UTF-8.

Now let's look at the data:

00004850 71 3e 3c 72 64 66 3a 6c 69 3e e4 3c 2f 72 64 66 |q><rdf:li>.</rdf|
                                       ^^^^^

"ä" is U+00E4, but since this is supposed to be UTF-8 not ISO-8859-1, there should be a "c3 a4" sequence here, not a literal "e4"; "e4 3c" is not a valid UTF-8 sequence, so this XML data is not correctly encoded in UTF-8.

=> bug in the document creator.

However, I guess evince should have better error recovery in this case, too.
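The encoding argument above can be checked mechanically. The following is a minimal Python sketch, not code from evince or Ghostscript; the helper name `is_valid_utf8` is hypothetical, and the four bytes are the ones quoted in the parser error (offset 0x485a in the dump).

```python
# U+00E4 ("ä") encoded in the two encodings under discussion.
char = "\u00e4"
utf8_bytes = char.encode("utf-8")          # two bytes: c3 a4
latin1_bytes = char.encode("iso-8859-1")   # one byte: e4

# The bytes reported by the parser: e4 3c 2f 72.
fragment = bytes([0xE4, 0x3C, 0x2F, 0x72])


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xE4 is 1110 0100, a lead byte announcing a three-byte sequence, so the
# next two bytes must look like 10xx xxxx. 0x3C is 0011 1100, so decoding fails.
print(is_valid_utf8(fragment))
print(is_valid_utf8(utf8_bytes + b"</r"))
print(fragment.decode("iso-8859-1"))
```

Read as ISO-8859-1 the same bytes decode without error, which is consistent with the metadata having been written in Latin-1 despite the UTF-8 BOM.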
So I think we should replace the simple xmlParseMemory call with a parser context that we create ourselves (set up so that it doesn't spew errors to the console!) and parse with that: first as UTF-8, and then, if we get an encoding error, retry as Latin-1?
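The retry strategy proposed here can be sketched as follows. This is only an illustration of the control flow, written with Python's standard `xml.etree.ElementTree` rather than libxml2, so it does not show the parser-context setup evince would need; the function name `parse_metadata` and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_metadata(data: bytes):
    """Hypothetical helper: parse XML bytes as UTF-8, retrying as Latin-1.

    Returns (root_element, encoding_used). The error is raised as an
    exception instead of being printed, so nothing reaches the console.
    """
    try:
        # Without an encoding declaration the parser assumes UTF-8.
        return ET.fromstring(data), "utf-8"
    except ET.ParseError:
        # Assumption: any parse error triggers the retry. A real
        # implementation would retry only on an encoding error. If the
        # input is broken for another reason, the second attempt raises
        # ParseError again and the caller sees it.
        parser = ET.XMLParser(encoding="iso-8859-1")
        return ET.fromstring(data, parser=parser), "iso-8859-1"


# A cut-down stand-in for the dc:creator element, with the raw e4 byte.
broken = b"<creator><li>\xe4</li></creator>"
root, used = parse_metadata(broken)
print(used, root.find("li").text)
```

Every byte sequence is valid Latin-1, so the second pass cannot fail on encoding grounds; the cost is that text in some other legacy encoding would be silently misread.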
It turned out to be a bug in Ghostscript, as described on the evince mailing list: https://mail.gnome.org/archives/evince-list/2012-November/msg00028.html
I still think evince should do more error checking as detailed in comment 2.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/evince/issues/318.