Bug 609075 – Adding support for pdf extractor to extract the index data from the pdf files.

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 609075 - Adding support for pdf extractor to extract the index data from the pdf files.


Summary:	Adding support for pdf extractor to extract the index data from the pdf files.


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-02-05 12:19 UTC by Amit
Modified:	2010-03-05 10:28 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch (6.29 KB, text/plain) 2010-02-05 12:20 UTC, Amit		Details
Fixes above issues (6.80 KB, patch) 2010-02-07 16:00 UTC, Martyn Russell	committed	Details \| Review
test (557.87 KB, application/pdf) 2010-02-08 05:52 UTC, Amit		Details
test1 (94.06 KB, application/pdf) 2010-02-08 06:01 UTC, Amit		Details
patch (6.12 KB, patch) 2010-02-08 06:14 UTC, Amit	none	Details \| Review

Description Amit 2010-02-05 12:19:23 UTC

This patch is adding the support to extract the index data from the pdf and adding to the metadata.

Comment 1 Amit 2010-02-05 12:20:12 UTC

Created attachment 153079 [details]
patch

Comment 2 Martyn Russell 2010-02-07 15:50:00 UTC

Comment on attachment 153079 [details]
patch

>diff --git a/src/tracker-extract/tracker-extract-pdf.c b/src/tracker-extract/tracker-extract-pdf.c
>index ff2c6a6..0636e6c 100644
>--- a/src/tracker-extract/tracker-extract-pdf.c
>+++ b/src/tracker-extract/tracker-extract-pdf.c
>@@ -1,5 +1,6 @@
> /*
>  * Copyright (C) 2006, Mr Jamie McCracken (jamiemcc@gnome.org)
>+ * Copyright (C) 2010, Amit Aggarwal (amitcs06@gmail.com)

Please put more recent statements AFTER the last.

>  * Copyright (C) 2008-2009, Nokia
>  *
>  * This program is free software; you can redistribute it and/or
>@@ -44,6 +45,15 @@ typedef struct {
> 	gchar *keywords;
> } PDFData;
> 
>+typedef struct {
>+    GString *title;
>+    GString *uri;
>+    GString *file_name;
>+    GString *params;
>+    GString *named_dest;
>+       

No extra spaces please :)

>+}PDFIndexData;
>+
> static void extract_pdf (const gchar          *uri,
>                          TrackerSparqlBuilder *preupdate,
>                          TrackerSparqlBuilder *metadata);
>@@ -53,6 +63,119 @@ static TrackerExtractData data[] = {
> 	{ NULL, NULL }
> };
> 
>+static void readOutline(PopplerDocument *document, TrackerSparqlBuilder *metadata) 

Coding style here is not correct, should be:

static void
... (foo *f,
     bar *b)

>+{
>+    PopplerIndexIter *index = poppler_index_iter_new (document);
>+    if (index) {
>+        PDFIndexData id= { 0 };

No need to initiate the structure here, you set all the values right below:

>+        id.title = g_string_new ("");
>+        id.uri = g_string_new ("");
>+        id.file_name = g_string_new ("");
>+        id.params = g_string_new ("");
>+        id.named_dest = g_string_new ("");
>+        readIndex(index, &id);
>+   
>+        if(id.title->len != 0){
>+            tracker_sparql_builder_predicate (metadata, "nie:plainTextContent");
>+            tracker_sparql_builder_object_unvalidated (metadata, id.title->str);
>+        }
>+        if(id.uri->len != 0){
>+            tracker_sparql_builder_predicate (metadata, "nie:plainTextContent");
>+            tracker_sparql_builder_object_unvalidated (metadata, id.uri->str);
>+        }
>+        if(id.file_name->len != 0){
>+            tracker_sparql_builder_predicate (metadata, "nie:plainTextContent");
>+            tracker_sparql_builder_object_unvalidated (metadata, id.file_name->str);
>+        }
>+        if(id.params->len != 0){
>+            tracker_sparql_builder_predicate (metadata, "nie:plainTextContent");
>+            tracker_sparql_builder_object_unvalidated (metadata, id.params->str);
>+        }
>+        if(id.named_dest->len != 0){
>+            tracker_sparql_builder_predicate (metadata, "nie:plainTextContent");
>+            tracker_sparql_builder_object_unvalidated (metadata, id.named_dest->str);
>+        }

Please use a space between these if statements, it doesn't look so cramped.

>+void readIndex(PopplerIndexIter *index, PDFIndexData *id)

Internal (i.e. non public functions) should be created with static and our naming policy is not someUpperCaseWords, but some_upper_case_words).

>+{
>+    if ( NULL != index)

Please don't use the value we are checking against as the first argument in the logic. We don't do that in the code base.

>+    {
>+        do
>+        {
>+            PopplerAction *action = poppler_index_iter_get_action (index);
>+            if(action) {

Use if (!action) { return; } instead in places like this to avoid the extra indentation unnecessarily. Also, the indentation is incorrect here too.

>+                if ( POPPLER_ACTION_GOTO_DEST == action->type )

For these it makes more sense to use a switch() statement.

>+                {
>+                    PopplerActionGotoDest *actionGoTo = (PopplerActionGotoDest *)action;
>+                    if(actionGoTo && actionGoTo->title) {

Checking actionGoto again, is not needed, we already checked action. This is the same in the rest of this code block.

...

>+                    if(destination && destination->named_dest) {

Should be "if (" not "if(", again, coding style.

>+                    PopplerActionLaunch *actionLaunch = (PopplerActionLaunch *)action;

Variables should be named action_launch not actionLaunch. Or something smaller to avoid all the extra typing is usually better.

>+            PopplerIndexIter *childIter = poppler_index_iter_get_child (index);
>+            readIndex (childIter, id);

Generally, we try to stick to C89, so variable definitions are done at the top of the code block.

>+        }
>+        while ( poppler_index_iter_next (index) );

Please no spaces round conditions like above.

Comment 3 Martyn Russell 2010-02-07 16:00:25 UTC

Created attachment 153201 [details] [review]
Fixes above issues

Amit, so the issue I have here, is that the patch does get extra data, but I don't really know if we are using it correctly.

This difference in a sample PDF I downloaded was that a bunch of extra text got concatenated into one long string and added as extra plainTextContent.

so my main issue with this patch is WHY do we need it and if we have this why do we need the other method for getting plainTextContent?

The poppler documentation is pretty bad, so all those actions we switch on don't really tell me much about the data itself. The URL is clear and some of the other types are, but how we should be interpreting them is not.

Comment 4 Amit 2010-02-08 05:52:58 UTC

Created attachment 153246 [details]
test

Comment 5 Amit 2010-02-08 06:01:06 UTC

Created attachment 153247 [details]
test1

Comment 6 Amit 2010-02-08 06:10:14 UTC

Hello Martyn Russell,

Thanks for your valuable comments. I have incoported those changes and uploaded the new diff file.


>>Amit, so the issue I have here, is that the patch does get extra data, but I
>>don't really know if we are using it correctly.
>>This difference in a sample PDF I downloaded was that a bunch of extra text got
>>concatenated into one long string and added as extra plainTextContent.
>>so my main issue with this patch is WHY do we need it and if we have this why
>>do we need the other method for getting plainTextContent?
 
I have attached some sample files also. Yes you are correct I am adding all this as a plain text but currently there is no ontology for TOC or index data. I come to know from Ottela in future relase it will going to support it.Thats why I currently added this add as plainText. Once that ontology will be supported I will make the corresponding changes. 

>>The poppler documentation is pretty bad, so all those actions we switch on
>>don't really tell me much about the data itself. The URL is clear and some of
>>the other types are, but how we should be interpreting them is not.

True I also searched also lot of document also not able to find any other action except the ActionGoTo but I have implemented the other actions as I may be present in some pdf file index data. 

Thanks
Amit

Comment 7 Amit 2010-02-08 06:14:12 UTC

Created attachment 153248 [details] [review]
patch

Opps !!! I missed out you also have attached the fixes. Please ignore this patch

Comment 8 Martyn Russell 2010-03-05 10:28:03 UTC

This wasn't marked as fixed, but it was in 0.7.22.