After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 582206 - PDF/A detection
PDF/A detection
Status: RESOLVED FIXED
Product: evince
Classification: Core
Component: PDF
2.26.x
Other All
: Normal enhancement
: ---
Assigned To: Evince Maintainers
Evince Maintainers
Depends on:
Blocks:
 
 
Reported: 2009-05-11 17:23 UTC by Davide Capodaglio
Modified: 2009-05-20 16:07 UTC
See Also:
GNOME target: ---
GNOME version: Unversioned Enhancement


Attachments
PDF/A detection patch (4.75 KB, patch)
2009-05-11 17:26 UTC, Davide Capodaglio
reviewed Details | Review
PDF/A detection patch (2) (5.13 KB, patch)
2009-05-12 18:04 UTC, Davide Capodaglio
needs-work Details | Review
PDF/A detection patch (3) (4.73 KB, patch)
2009-05-18 18:15 UTC, Davide Capodaglio
committed Details | Review

Description Davide Capodaglio 2009-05-11 17:23:09 UTC
Evince does not correctly identify PDF/A documents and reports PDF-1.4 (in File / Properties).
Since PDF/A is an important international standard it would be useful to correctly identify a conforming document, as Adobe Reader does for example.
See also https://bugs.launchpad.net/ubuntu/+source/evince/+bug/373445
Comment 1 Davide Capodaglio 2009-05-11 17:26:37 UTC
Created attachment 134423 [details] [review]
PDF/A detection patch

Implementation of automatic PDF/A detection; XML metadata is parsed and searched for correct namespaces and tags, following specification at www.pdfa.org.
Correct version is then shown is File/Properties
Comment 2 Nickolay V. Shmyrev 2009-05-11 17:28:32 UTC
Thanks for the patch Davide

I wonder, should this go into poppler (http://poppler.freedesktop.org)?
Comment 3 Davide Capodaglio 2009-05-11 17:48:01 UTC
(In reply to comment #2)
> Thanks for the patch Davide
> 
> I wonder, should this go into poppler (http://poppler.freedesktop.org)?
> 

I already looked into poppler source code.
Currently poppler reports PDF version as a double, looking only in the header; then poppler-glib formats it as "PDF-%.2g" string and returns it as a property.
PDF/A detection could then be correctly done in poppler-glib through libxml2, but in this way the "short" decimal PDF version will be lost (anyway, PDF/A is a PDF-1.4).
The optimal way could be adding a different property in poppler-glib, but what about popper-qt etc. then??
So I think this could be done in evince outside the shared libraries.
What do you think about it? Maybe someone could also talk to poppler developers.
Comment 4 Christian Persch 2009-05-11 21:16:23 UTC
+			  "metadata", &metadata,
 		      NULL);
 
+	if (metadata != NULL)
+	{
+		char* format = pdf_document_get_format_from_metadata(metadata);
+		if (format != NULL)
+			info->format = format;
+	}

This leaks the |metadata| string. And you can just always assign format to info->format. And info->format is leaked, since it's never deallocated.

+		// makes conf lowercase
+		for(int i = 0; conf[i]; i++)
+			conf[i] = tolower(conf[i]);
+		
+		// return buffer
+		result = new char[20];
+		snprintf(result, 20, "PDF/A - %s%s", part, conf);

String should be allocated using glib allocator here, IMHO; use g_strdup_printf(). And g_ascii_tolower(), not tolower().

-	$(DISABLE_DEPRECATED)
+	$(DISABLE_DEPRECATED)		\
+	`pkg-config --cflags libxml-2.0` 

This needs to be moved to configure of course using PKG_CHECK_MODULES.
Comment 5 Davide Capodaglio 2009-05-12 07:41:42 UTC
(In reply to comment #4)
> +                         "metadata", &metadata,
>                       NULL);
> 
> +       if (metadata != NULL)
> +       {
> +               char* format = pdf_document_get_format_from_metadata(metadata);
> +               if (format != NULL)
> +                       info->format = format;
> +       }
> 
> This leaks the |metadata| string. And you can just always assign format to
> info->format. And info->format is leaked, since it's never deallocated.

Ok got the point... sorry but I'm not an expert GNOME programmer :)
I initially thought that returned strings were references to internal properties, not copies.

> 
> +               // makes conf lowercase
> +               for(int i = 0; conf[i]; i++)
> +                       conf[i] = tolower(conf[i]);
> +               
> +               // return buffer
> +               result = new char[20];
> +               snprintf(result, 20, "PDF/A - %s%s", part, conf);
> 
> String should be allocated using glib allocator here, IMHO; use
> g_strdup_printf(). And g_ascii_tolower(), not tolower().

Ok!

> 
> -       $(DISABLE_DEPRECATED)
> +       $(DISABLE_DEPRECATED)           \
> +       `pkg-config --cflags libxml-2.0` 
> 
> This needs to be moved to configure of course using PKG_CHECK_MODULES.

libXML is already checked in the main configure.ac
    PKG_CHECK_MODULES(SHELL_CORE, libxml-2.0 >= $LIBXML_REQUIRED .......
is there a better way the get its cflags here?

Bye
Davide
Comment 6 Davide Capodaglio 2009-05-12 18:04:35 UTC
Created attachment 134519 [details] [review]
PDF/A detection patch (2)

I modified my patch following your various advices.
Comments appreciated :)
Comment 7 Nickolay V. Shmyrev 2009-05-13 06:49:40 UTC
I would still prefer to see this patch submitted to poppler. We can safely live with the API change if new function will be introduced. Something like get_version_string.
Comment 8 Carlos Garcia Campos 2009-05-13 07:47:38 UTC
I'm not sure, I don't think we are going to parse the metadata in poppler. We'll need to parse the metadata in evince to fix bug #349173 anyway. 
Comment 9 Nickolay V. Shmyrev 2009-05-13 07:56:35 UTC
ok then, let's have it
Comment 10 Davide Capodaglio 2009-05-16 10:30:22 UTC
Any news about this? I see it is still UNCONFIRMED
Comment 11 Carlos Garcia Campos 2009-05-18 12:39:25 UTC
Comment on attachment 134519 [details] [review]
PDF/A detection patch (2)

Thanks for the patch, some comments below.

>--- backend/pdf/ev-poppler.cc.orig	2009-04-15 22:10:52.000000000 +0200
>+++ backend/pdf/ev-poppler.cc	2009-05-12 19:38:18.000000000 +0200
>@@ -19,6 +19,7 @@
> 
> #include "config.h"
> 
>+#include <ctype.h>

What do you need ctype.h for? 

> #include <math.h>
> #include <string.h>
> #include <gtk/gtk.h>
>@@ -50,6 +51,11 @@
> #include "ev-attachment.h"
> #include "ev-image.h"
> 
>+#include <libxml/tree.h>
>+#include <libxml/parser.h>
>+#include <libxml/xpath.h>
>+#include <libxml/xpathInternals.h>
>+
> #if (defined (HAVE_POPPLER_PAGE_RENDER)) && (defined (HAVE_CAIRO_PDF) || defined (HAVE_CAIRO_PS))
> #define HAVE_CAIRO_PRINT
> #endif
>@@ -128,6 +134,7 @@
> 					     PopplerAction     *action);
> static void        pdf_document_search_free (PdfDocumentSearch *search);
> static void        pdf_print_context_free   (PdfPrintContext   *ctx);
>+static char* pdf_document_get_format_from_metadata (char* metadata);

No need for prototype, just make sure the function is before pdf_document_get_info().

> EV_BACKEND_REGISTER_WITH_CODE (PdfDocument, pdf_document,
> 			 {
>@@ -570,6 +577,7 @@
> 	PopplerViewerPreferences view_prefs;
> 	PopplerPermissions permissions;
> 	EvPage *page;
>+	char* metadata;
> 
> 	info = g_new0 (EvDocumentInfo, 1);
> 
>@@ -606,8 +614,20 @@
> 		      "creation-date", &(info->creation_date),
> 		      "mod-date", &(info->modified_date),
> 		      "linearized", &(info->linearized),
>+			  "metadata", &metadata,
> 		      NULL);
> 
>+	if (metadata != NULL)
>+	{
>+		char* fmt = pdf_document_get_format_from_metadata(metadata);
>+		if (fmt != NULL)
>+		{
>+			g_free(info->format);
>+			info->format = fmt;
>+		}
>+		g_free(metadata);
>+	}
>+

Please, follow the codying style of this file, { in the same line that the if and space between functions and (

> 	info->n_pages = ev_document_get_n_pages (document);
> 
> 	if (info->n_pages > 0) {
>@@ -2484,3 +2504,91 @@
> 	iface->hide_layer = pdf_document_layers_hide_layer;
> 	iface->layer_is_visible = pdf_document_layers_layer_is_visible;
> }
>+
>+// reference:
>+// http://www.pdfa.org/lib/exe/fetch.php?id=pdfa%3Aen%3Atechdoc&cache=cache&media=pdfa:techdoc:tn0001_pdfa-1_and_namespaces_2008-03-18.pdf

Even though it's a C++ file we don't usually use C++ comments in evince, use /* */ instead, please. The same for variable declarations, place all of them at the beginning of the block. 

>+static char* 
>+pdf_document_get_format_from_metadata (char* metadata)
>+{

use const char *metadata instead of just char *metadata. 

The patch looks good to me, please fix the codying style issues and we'll apply it. 

Thank you very much.
Comment 12 Davide Capodaglio 2009-05-18 18:15:58 UTC
Created attachment 134888 [details] [review]
PDF/A detection patch (3)

Ok I fixed the patch following your notes!
So I suppose this will be in GNOME 2.28?
Comment 13 Carlos Garcia Campos 2009-05-20 16:07:33 UTC
(In reply to comment #12)
> Created an attachment (id=134888) [edit]
> PDF/A detection patch (3)
> 
> Ok I fixed the patch following your notes!

I've just applied the path with the following modifications:

 - Move the libxml2 dependency in configure.ac from backends to poppler since it's only required by the pdf backend
 - Add xmlFree(part) and xmlFree(conf) since we were leaking thouse strings

> So I suppose this will be in GNOME 2.28?
> 

sure, it's fixed in git master now. Thanks!