Bug 623005 – Remove odt2txt dependency in the OASIS extractor

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 623005 - Remove odt2txt dependency in the OASIS extractor


Summary:	Remove odt2txt dependency in the OASIS extractor


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-06-28 06:21 UTC by Murugappan
Modified:	2010-08-17 14:55 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch for tracker plugin of odt, odp, ods (12.57 KB, patch) 2010-06-28 06:22 UTC, Murugappan	needs-work	Details \| Review
Made review corrections in tracker plugin for odt, odp, ods file formats (15.57 KB, patch) 2010-06-28 12:00 UTC, Murugappan	none	Details \| Review
Small style changes in the previous patch (16.08 KB, patch) 2010-06-30 07:06 UTC, Aleksander Morgado	none	Details \| Review
Attaching the ODT file I use for testing (22.51 KB, application/vnd.oasis.opendocument.text) 2010-06-30 07:08 UTC, Aleksander Morgado		Details
Patch for tracker extractor of odt, odp, ods files. (16.21 KB, patch) 2010-07-13 08:13 UTC, Murugappan	none	Details \| Review
Tracker plugin for odt, odp, ods files (16.61 KB, patch) 2010-07-29 14:46 UTC, Murugappan	none	Details \| Review
Tracker plugin for odt, odp, ods files (17.00 KB, patch) 2010-08-04 14:37 UTC, Murugappan	none	Details \| Review
Tracker plugin for odt, odp, ods files (2.75 KB, patch) 2010-08-09 03:51 UTC, Murugappan	none	Details \| Review
Tracker plugin for odt, odp, ods files (17.58 KB, patch) 2010-08-09 05:04 UTC, Murugappan	none	Details \| Review
patch to be applied on top of attachment 167396 (6.43 KB, patch) 2010-08-16 14:56 UTC, Carlos Garnacho	none	Details \| Review
Tracker plugin for odt, odp, ods files (22.10 KB, patch) 2010-08-17 13:37 UTC, Murugappan	committed	Details \| Review

Description Murugappan 2010-06-28 06:21:02 UTC

Tracker plugin for odt, odp, ods files. "odt2txt" functionality is replaced with the xml file extraction functionality.

Comment 1 Murugappan 2010-06-28 06:22:34 UTC

Created attachment 164783 [details] [review]
Patch for tracker plugin of odt, odp, ods

Comment 2 Aleksander Morgado 2010-06-28 08:30:47 UTC

Review of attachment 164783 [details] [review]:

I didn't really test the patch yet, but find below some comments on it.

::: src/tracker-extract/tracker-extract-oasis.c
@@ +60,1 @@
 } ODTParseInfo;

See comments below, but maybe it makes sense to have another independent struct for content xml parsing, instead of re-using the same one. The main reason would be that the variables to be used in the different parsings will be different (for example, metadata variable only used during metadata parsing and content variable only used during content parsing. This would mean also using new methods for content parsing in the markup parser.

@@ +86,3 @@
+static void extract_oasis_content             (const gchar *uri,
+                                               ODTFileType file_type, 
+                                               TrackerSparqlBuilder *metadata);

Style fix needed here, parameters should be aligned to the same column.

@@ +96,2 @@
 extract_oasis_content (const gchar *uri,
+                       ODTFileType file_type, TrackerSparqlBuilder *metadata)

Style fix needed here, each parameter should go in a new line, and parameters aligned to the same column.

@@ +109,2 @@
+	/* Create parse info */
+	info.metadata = metadata;

info.metadata won't be used during content extraction.

@@ +122,3 @@
+	tracker_gsf_parse_xml_in_zip (uri, "content.xml", context);
+
+	if (info.content) {

This check won't be needed, as info.content will always be not NULL, as g_string_new ("") is called when creating the string.

@@ +126,3 @@
+
+		content = g_string_free (info.content, FALSE);
+		info.content = NULL;

No need to re-set info.content to NULL here, as will not be used any more.

@@ +130,1 @@
+		if (content) {

Same as before, if content is taken from info.content, it will never be NULL, so no need to recheck it here.

@@ +210,2 @@
 	info.uri = uri;
+	info.content = NULL;

So the same ODTParseInfo is being used for both metadata XML parsing and contents XML parsing. It would be good to put a comment somewhere here to specify that info.styles_present, info.file_type and info.content variables are not used during metadata extraction. Or use a new different struct for content extraction better...

@@ +291,3 @@
+		break;
+	}
+

I wonder if it's not just better not to re-use the same methods when parsing metadata or contents. I would really use new methods for content xml parsing.

@@ +374,3 @@
+		tracker_text_validate_utf8 (text, -1, &data->content, NULL);
+		g_string_append_c (data->content, ' ');
+		break;

Same here, I would use new methods for content xml parsing.

Comment 3 Murugappan 2010-06-28 12:00:24 UTC

Created attachment 164808 [details] [review]
Made review corrections in tracker plugin for odt, odp, ods file formats

Comment 4 Murugappan 2010-06-28 12:01:10 UTC

Thanks for the review Aleksander Morgado. Made review corrections and attached the patch.

Comment 5 Aleksander Morgado 2010-06-30 06:35:07 UTC

Comment on attachment 164783 [details] [review]
Patch for tracker plugin of odt, odp, ods

Setting previous patch as obsolete

Comment 6 Aleksander Morgado 2010-06-30 07:06:00 UTC

Created attachment 164946 [details] [review]
Small style changes in the previous patch

Some minor style changes done: removing trailing whitespaces, some comment added, and declaration of new variables always at the beginning of the context. Please consider them for future updates of the patch.

Now, regarding the output of the new OASIS extractor, I made one simple test with one of the ODT files I usually use for testing, and the nie:plainTextContent happens to be completely different. The new one without odt2txt just contains some lines of text, while odt2txt dumps a lot of content. Could you review that? Maybe we are losing some tags during the XML parsing which also are text content, or something...

Comment 7 Aleksander Morgado 2010-06-30 07:08:23 UTC

Created attachment 164947 [details]
Attaching the ODT file I use for testing

So, with your patch, the text content I get seems to be just the titles in the document, with no real content from the different sections; while odt2txt fully dumps all contents. Could you try to fix that?

Thanks for the patch, btw!

Comment 8 Murugappan 2010-07-13 08:13:35 UTC

Created attachment 165778 [details] [review]
Patch for tracker extractor of odt, odp, ods files.

Comment 9 Murugappan 2010-07-13 08:16:15 UTC

Hello Aleksander Morgado, sorry for the delayed response got up in another task for the last sprint. Made the review correction, added tags to get the real content.

Can you please have a look at it and please let me know whether need to get more content. Attached the patch. Thanks.

Comment 10 Aleksander Morgado 2010-07-19 10:19:00 UTC

Hi again,

(In reply to comment #9)
> Hello Aleksander Morgado, sorry for the delayed response got up in another task
> for the last sprint. Made the review correction, added tags to get the real
> content.
> 
> Can you please have a look at it and please let me know whether need to get
> more content. Attached the patch. Thanks.

I tested the new patch with the same ODT file as I did before (the one attached in c#7), and still, the contents don't seem to be extracted properly. Some XML tags still missing in extraction?

Extraction for ODS and ODP files seem ok, though.

Comment 11 Murugappan 2010-07-19 11:22:59 UTC

Hi Aleksander Morgado,

I tried creating DOCX file from the attached ODT file and used the 'tracker-extract-msoffice' extractor to see what contents are extracted, I tried adding  tags to get the contents similar to that.

Can you please let me know is that ok or whether to add more tags.

Comment 12 Aleksander Morgado 2010-07-19 13:58:40 UTC

(In reply to comment #11)
> Hi Aleksander Morgado,
> 
> I tried creating DOCX file from the attached ODT file and used the
> 'tracker-extract-msoffice' extractor to see what contents are extracted, I
> tried adding  tags to get the contents similar to that.
> 
> Can you please let me know is that ok or whether to add more tags.

You don't really need that. Try running /usr/libexec/tracker-extract -f FILE.odt with and without your patch, and compare the results.

Comment 13 Murugappan 2010-07-20 10:19:39 UTC

Aleksander Morgado, wwill compare the results and add the required tags. Thanks.

Comment 14 Murugappan 2010-07-29 14:46:27 UTC

Created attachment 166772 [details] [review]
Tracker plugin for odt, odp, ods files

Comment 15 Murugappan 2010-07-29 14:52:05 UTC

Hello Aleksander Morgado,

Sorry for the delayed response. Compared the output with 'odt2txt' output and tried to added tags to get more text content. In 'odt2txt' it adds "=", "\n\n" etc., for headings, paragaphs etc., can you please let me know whether we need to add the special identities to headings, paragraphs etc., to match exactly with 'odt2txt' output.

Attached the patch, please have a look at it. Thank you very much.

Comment 16 Aleksander Morgado 2010-07-29 14:54:15 UTC

Comment on attachment 165778 [details] [review]
Patch for tracker extractor of odt, odp, ods files.

obsolete patch

Comment 17 Aleksander Morgado 2010-07-29 14:54:33 UTC

Comment on attachment 164946 [details] [review]
Small style changes in the previous patch

obsolete patch

Comment 18 Aleksander Morgado 2010-07-29 14:55:42 UTC

> 
> Sorry for the delayed response. Compared the output with 'odt2txt' output and
> tried to added tags to get more text content. In 'odt2txt' it adds "=", "\n\n"
> etc., for headings, paragaphs etc., can you please let me know whether we need
> to add the special identities to headings, paragraphs etc., to match exactly
> with 'odt2txt' output.
> 

No real need of those additional characters, as they really seem an odt2txt artifact, not coming from the original file contents.

> Attached the patch, please have a look at it. Thank you very much.

Will review it, thanks.

Comment 19 Murugappan 2010-07-30 07:47:00 UTC

Thank you very much Aleksander.

Comment 20 Murugappan 2010-08-04 14:37:45 UTC

Created attachment 167117 [details] [review]
Tracker plugin for odt, odp, ods files

Hello Aleksander Morgado,

Attaching the patch according to the review comment. Please have a look at it.
Thanks.

Comment 21 Aleksander Morgado 2010-08-06 08:34:03 UTC

Hi hi,

> Attaching the patch according to the review comment. Please have a look at it.
> Thanks.

The contents are now properly extracted, and the output is quite similar to the one with odt2txt, but still some last thing to be fixed...

You're no longer limiting the maximum number of bytes that can be extracted from the contents of the file. For example, if I configure the max bytes to be 100 in the tracker-extract.cfg file, the extractor should extract a maximum of 100 bytes. You can get the configured threshold with tracker_config_get_max_bytes(), and you can then use the 'valid_len' output parameter of tracker_text_validate_utf8() in xml_text_handler_content() to know how many good bytes were appended to the contents string.

So, you could do something like adding a new gulong bytes_pending in ODTContentParseInfo, which initially is equal to tracker_config_get_max_bytes(). Then:

text_len = strlen (text);
tracker_text_validate_utf8(text,
                           MIN (text_len, context->bytes_pending),
                           &data->content,
                           &written_bytes);
context->bytes_pending -= written_bytes;

And of course, if context->bytes_pending reaches 0, you need to quit extracting data.

Sorry for not having found this issue before, I guess it would be the last thing to get fixed before merging it. Anyway, it's a pretty good patch!

Comment 22 Murugappan 2010-08-09 03:51:17 UTC

Created attachment 167394 [details] [review]
Tracker plugin for odt, odp, ods files

Hello Aleksander Morgado,

Tried limiting the maximum number of bytes that can be extracted
from the contents of the file: since we are using "tracker_gsf_parse_xml_in_zip()" for parsing the 'content.xml' file to stop the extraction once the maximum number of bytes is reached tried calling the "g_markup_parse_context_end_parse()" to end the parsing but it gives the assertion 'g_markup_parse_context_end_parse: assertion `!context->parsing' failed'. With the attached patch the output is  limited to the maximum number of bytes but the extraction continues till the end of the xml file.

Can you please provide your suggestions on this. Thank you very much.

Comment 23 Murugappan 2010-08-09 05:04:50 UTC

Created attachment 167396 [details] [review]
Tracker plugin for odt, odp, ods files

Hello Aleksander Morgado,

Please ignore the previous one.

Tried limiting the maximum number of bytes that can be extracted
from the contents of the file: since we are using
"tracker_gsf_parse_xml_in_zip()" for parsing the 'content.xml' file to stop the
extraction once the maximum number of bytes is reached tried calling the
"g_markup_parse_context_end_parse()" to end the parsing but it gives the
assertion 'g_markup_parse_context_end_parse: assertion `!context->parsing'
failed'. With the attached patch the output is  limited to the maximum number
of bytes but the extraction continues till the end of the xml file.

Can you please provide your suggestions on this. Thank you very much.

Comment 24 Carlos Garnacho 2010-08-16 14:56:06 UTC

Created attachment 167971 [details] [review]
patch to be applied on top of attachment 167396 [details] [review]

This is a patch to be applied on top of patch in comment #23, it basically propagates a known error when the text size is exceeded, this stops the parser and the error is simply ignored, as a plus we now also have error checking for more strange cases such as broken files or misformatted XML

Comment 25 Murugappan 2010-08-17 13:37:10 UTC

Created attachment 168077 [details] [review]
Tracker plugin for odt, odp, ods files

Thank you very much Carlos Garnacho.

Aleksander Morgado, attached the updated patch. Can you please have a look at it.

Comment 26 Aleksander Morgado 2010-08-17 14:55:26 UTC

Pushed to git master!

This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.