GNOME Bugzilla – Bug 582206
PDF/A detection
Last modified: 2009-05-20 16:07:33 UTC
Evince does not correctly identify PDF/A documents and reports PDF-1.4 (in File / Properties). Since PDF/A is an important international standard it would be useful to correctly identify a conforming document, as Adobe Reader does for example. See also https://bugs.launchpad.net/ubuntu/+source/evince/+bug/373445
Created attachment 134423 [details] [review] PDF/A detection patch Implementation of automatic PDF/A detection; XML metadata is parsed and searched for correct namespaces and tags, following specification at www.pdfa.org. Correct version is then shown is File/Properties
Thanks for the patch Davide I wonder, should this go into poppler (http://poppler.freedesktop.org)?
(In reply to comment #2) > Thanks for the patch Davide > > I wonder, should this go into poppler (http://poppler.freedesktop.org)? > I already looked into poppler source code. Currently poppler reports PDF version as a double, looking only in the header; then poppler-glib formats it as "PDF-%.2g" string and returns it as a property. PDF/A detection could then be correctly done in poppler-glib through libxml2, but in this way the "short" decimal PDF version will be lost (anyway, PDF/A is a PDF-1.4). The optimal way could be adding a different property in poppler-glib, but what about popper-qt etc. then?? So I think this could be done in evince outside the shared libraries. What do you think about it? Maybe someone could also talk to poppler developers.
+ "metadata", &metadata, NULL); + if (metadata != NULL) + { + char* format = pdf_document_get_format_from_metadata(metadata); + if (format != NULL) + info->format = format; + } This leaks the |metadata| string. And you can just always assign format to info->format. And info->format is leaked, since it's never deallocated. + // makes conf lowercase + for(int i = 0; conf[i]; i++) + conf[i] = tolower(conf[i]); + + // return buffer + result = new char[20]; + snprintf(result, 20, "PDF/A - %s%s", part, conf); String should be allocated using glib allocator here, IMHO; use g_strdup_printf(). And g_ascii_tolower(), not tolower(). - $(DISABLE_DEPRECATED) + $(DISABLE_DEPRECATED) \ + `pkg-config --cflags libxml-2.0` This needs to be moved to configure of course using PKG_CHECK_MODULES.
(In reply to comment #4) > + "metadata", &metadata, > NULL); > > + if (metadata != NULL) > + { > + char* format = pdf_document_get_format_from_metadata(metadata); > + if (format != NULL) > + info->format = format; > + } > > This leaks the |metadata| string. And you can just always assign format to > info->format. And info->format is leaked, since it's never deallocated. Ok got the point... sorry but I'm not an expert GNOME programmer :) I initially thought that returned strings were references to internal properties, not copies. > > + // makes conf lowercase > + for(int i = 0; conf[i]; i++) > + conf[i] = tolower(conf[i]); > + > + // return buffer > + result = new char[20]; > + snprintf(result, 20, "PDF/A - %s%s", part, conf); > > String should be allocated using glib allocator here, IMHO; use > g_strdup_printf(). And g_ascii_tolower(), not tolower(). Ok! > > - $(DISABLE_DEPRECATED) > + $(DISABLE_DEPRECATED) \ > + `pkg-config --cflags libxml-2.0` > > This needs to be moved to configure of course using PKG_CHECK_MODULES. libXML is already checked in the main configure.ac PKG_CHECK_MODULES(SHELL_CORE, libxml-2.0 >= $LIBXML_REQUIRED ....... is there a better way the get its cflags here? Bye Davide
Created attachment 134519 [details] [review] PDF/A detection patch (2) I modified my patch following your various advices. Comments appreciated :)
I would still prefer to see this patch submitted to poppler. We can safely live with the API change if new function will be introduced. Something like get_version_string.
I'm not sure, I don't think we are going to parse the metadata in poppler. We'll need to parse the metadata in evince to fix bug #349173 anyway.
ok then, let's have it
Any news about this? I see it is still UNCONFIRMED
Comment on attachment 134519 [details] [review] PDF/A detection patch (2) Thanks for the patch, some comments below. >--- backend/pdf/ev-poppler.cc.orig 2009-04-15 22:10:52.000000000 +0200 >+++ backend/pdf/ev-poppler.cc 2009-05-12 19:38:18.000000000 +0200 >@@ -19,6 +19,7 @@ > > #include "config.h" > >+#include <ctype.h> What do you need ctype.h for? > #include <math.h> > #include <string.h> > #include <gtk/gtk.h> >@@ -50,6 +51,11 @@ > #include "ev-attachment.h" > #include "ev-image.h" > >+#include <libxml/tree.h> >+#include <libxml/parser.h> >+#include <libxml/xpath.h> >+#include <libxml/xpathInternals.h> >+ > #if (defined (HAVE_POPPLER_PAGE_RENDER)) && (defined (HAVE_CAIRO_PDF) || defined (HAVE_CAIRO_PS)) > #define HAVE_CAIRO_PRINT > #endif >@@ -128,6 +134,7 @@ > PopplerAction *action); > static void pdf_document_search_free (PdfDocumentSearch *search); > static void pdf_print_context_free (PdfPrintContext *ctx); >+static char* pdf_document_get_format_from_metadata (char* metadata); No need for prototype, just make sure the function is before pdf_document_get_info(). > EV_BACKEND_REGISTER_WITH_CODE (PdfDocument, pdf_document, > { >@@ -570,6 +577,7 @@ > PopplerViewerPreferences view_prefs; > PopplerPermissions permissions; > EvPage *page; >+ char* metadata; > > info = g_new0 (EvDocumentInfo, 1); > >@@ -606,8 +614,20 @@ > "creation-date", &(info->creation_date), > "mod-date", &(info->modified_date), > "linearized", &(info->linearized), >+ "metadata", &metadata, > NULL); > >+ if (metadata != NULL) >+ { >+ char* fmt = pdf_document_get_format_from_metadata(metadata); >+ if (fmt != NULL) >+ { >+ g_free(info->format); >+ info->format = fmt; >+ } >+ g_free(metadata); >+ } >+ Please, follow the codying style of this file, { in the same line that the if and space between functions and ( > info->n_pages = ev_document_get_n_pages (document); > > if (info->n_pages > 0) { >@@ -2484,3 +2504,91 @@ > iface->hide_layer = pdf_document_layers_hide_layer; > iface->layer_is_visible = pdf_document_layers_layer_is_visible; > } >+ >+// reference: >+// http://www.pdfa.org/lib/exe/fetch.php?id=pdfa%3Aen%3Atechdoc&cache=cache&media=pdfa:techdoc:tn0001_pdfa-1_and_namespaces_2008-03-18.pdf Even though it's a C++ file we don't usually use C++ comments in evince, use /* */ instead, please. The same for variable declarations, place all of them at the beginning of the block. >+static char* >+pdf_document_get_format_from_metadata (char* metadata) >+{ use const char *metadata instead of just char *metadata. The patch looks good to me, please fix the codying style issues and we'll apply it. Thank you very much.
Created attachment 134888 [details] [review] PDF/A detection patch (3) Ok I fixed the patch following your notes! So I suppose this will be in GNOME 2.28?
(In reply to comment #12) > Created an attachment (id=134888) [edit] > PDF/A detection patch (3) > > Ok I fixed the patch following your notes! I've just applied the path with the following modifications: - Move the libxml2 dependency in configure.ac from backends to poppler since it's only required by the pdf backend - Add xmlFree(part) and xmlFree(conf) since we were leaking thouse strings > So I suppose this will be in GNOME 2.28? > sure, it's fixed in git master now. Thanks!