GNOME Bugzilla – Bug 623005
Remove odt2txt dependency in the OASIS extractor
Last modified: 2010-08-17 14:55:26 UTC
Tracker plugin for odt, odp, ods files. "odt2txt" functionality is replaced with the xml file extraction functionality.
Created attachment 164783 [details] [review] Patch for tracker plugin of odt, odp, ods
Review of attachment 164783 [details] [review]: I didn't really test the patch yet, but find below some comments on it. ::: src/tracker-extract/tracker-extract-oasis.c @@ +60,1 @@ } ODTParseInfo; See comments below, but maybe it makes sense to have another independent struct for content xml parsing, instead of re-using the same one. The main reason would be that the variables to be used in the different parsings will be different (for example, metadata variable only used during metadata parsing and content variable only used during content parsing. This would mean also using new methods for content parsing in the markup parser. @@ +86,3 @@ +static void extract_oasis_content (const gchar *uri, + ODTFileType file_type, + TrackerSparqlBuilder *metadata); Style fix needed here, parameters should be aligned to the same column. @@ +96,2 @@ extract_oasis_content (const gchar *uri, + ODTFileType file_type, TrackerSparqlBuilder *metadata) Style fix needed here, each parameter should go in a new line, and parameters aligned to the same column. @@ +109,2 @@ + /* Create parse info */ + info.metadata = metadata; info.metadata won't be used during content extraction. @@ +122,3 @@ + tracker_gsf_parse_xml_in_zip (uri, "content.xml", context); + + if (info.content) { This check won't be needed, as info.content will always be not NULL, as g_string_new ("") is called when creating the string. @@ +126,3 @@ + + content = g_string_free (info.content, FALSE); + info.content = NULL; No need to re-set info.content to NULL here, as will not be used any more. @@ +130,1 @@ + if (content) { Same as before, if content is taken from info.content, it will never be NULL, so no need to recheck it here. @@ +210,2 @@ info.uri = uri; + info.content = NULL; So the same ODTParseInfo is being used for both metadata XML parsing and contents XML parsing. It would be good to put a comment somewhere here to specify that info.styles_present, info.file_type and info.content variables are not used during metadata extraction. Or use a new different struct for content extraction better... @@ +291,3 @@ + break; + } + I wonder if it's not just better not to re-use the same methods when parsing metadata or contents. I would really use new methods for content xml parsing. @@ +374,3 @@ + tracker_text_validate_utf8 (text, -1, &data->content, NULL); + g_string_append_c (data->content, ' '); + break; Same here, I would use new methods for content xml parsing.
Created attachment 164808 [details] [review] Made review corrections in tracker plugin for odt, odp, ods file formats
Thanks for the review Aleksander Morgado. Made review corrections and attached the patch.
Comment on attachment 164783 [details] [review] Patch for tracker plugin of odt, odp, ods Setting previous patch as obsolete
Created attachment 164946 [details] [review] Small style changes in the previous patch Some minor style changes done: removing trailing whitespaces, some comment added, and declaration of new variables always at the beginning of the context. Please consider them for future updates of the patch. Now, regarding the output of the new OASIS extractor, I made one simple test with one of the ODT files I usually use for testing, and the nie:plainTextContent happens to be completely different. The new one without odt2txt just contains some lines of text, while odt2txt dumps a lot of content. Could you review that? Maybe we are losing some tags during the XML parsing which also are text content, or something...
Created attachment 164947 [details] Attaching the ODT file I use for testing So, with your patch, the text content I get seems to be just the titles in the document, with no real content from the different sections; while odt2txt fully dumps all contents. Could you try to fix that? Thanks for the patch, btw!
Created attachment 165778 [details] [review] Patch for tracker extractor of odt, odp, ods files.
Hello Aleksander Morgado, sorry for the delayed response got up in another task for the last sprint. Made the review correction, added tags to get the real content. Can you please have a look at it and please let me know whether need to get more content. Attached the patch. Thanks.
Hi again, (In reply to comment #9) > Hello Aleksander Morgado, sorry for the delayed response got up in another task > for the last sprint. Made the review correction, added tags to get the real > content. > > Can you please have a look at it and please let me know whether need to get > more content. Attached the patch. Thanks. I tested the new patch with the same ODT file as I did before (the one attached in c#7), and still, the contents don't seem to be extracted properly. Some XML tags still missing in extraction? Extraction for ODS and ODP files seem ok, though.
Hi Aleksander Morgado, I tried creating DOCX file from the attached ODT file and used the 'tracker-extract-msoffice' extractor to see what contents are extracted, I tried adding tags to get the contents similar to that. Can you please let me know is that ok or whether to add more tags.
(In reply to comment #11) > Hi Aleksander Morgado, > > I tried creating DOCX file from the attached ODT file and used the > 'tracker-extract-msoffice' extractor to see what contents are extracted, I > tried adding tags to get the contents similar to that. > > Can you please let me know is that ok or whether to add more tags. You don't really need that. Try running /usr/libexec/tracker-extract -f FILE.odt with and without your patch, and compare the results.
Aleksander Morgado, wwill compare the results and add the required tags. Thanks.
Created attachment 166772 [details] [review] Tracker plugin for odt, odp, ods files
Hello Aleksander Morgado, Sorry for the delayed response. Compared the output with 'odt2txt' output and tried to added tags to get more text content. In 'odt2txt' it adds "=", "\n\n" etc., for headings, paragaphs etc., can you please let me know whether we need to add the special identities to headings, paragraphs etc., to match exactly with 'odt2txt' output. Attached the patch, please have a look at it. Thank you very much.
Comment on attachment 165778 [details] [review] Patch for tracker extractor of odt, odp, ods files. obsolete patch
Comment on attachment 164946 [details] [review] Small style changes in the previous patch obsolete patch
> > Sorry for the delayed response. Compared the output with 'odt2txt' output and > tried to added tags to get more text content. In 'odt2txt' it adds "=", "\n\n" > etc., for headings, paragaphs etc., can you please let me know whether we need > to add the special identities to headings, paragraphs etc., to match exactly > with 'odt2txt' output. > No real need of those additional characters, as they really seem an odt2txt artifact, not coming from the original file contents. > Attached the patch, please have a look at it. Thank you very much. Will review it, thanks.
Thank you very much Aleksander.
Created attachment 167117 [details] [review] Tracker plugin for odt, odp, ods files Hello Aleksander Morgado, Attaching the patch according to the review comment. Please have a look at it. Thanks.
Hi hi, > Attaching the patch according to the review comment. Please have a look at it. > Thanks. The contents are now properly extracted, and the output is quite similar to the one with odt2txt, but still some last thing to be fixed... You're no longer limiting the maximum number of bytes that can be extracted from the contents of the file. For example, if I configure the max bytes to be 100 in the tracker-extract.cfg file, the extractor should extract a maximum of 100 bytes. You can get the configured threshold with tracker_config_get_max_bytes(), and you can then use the 'valid_len' output parameter of tracker_text_validate_utf8() in xml_text_handler_content() to know how many good bytes were appended to the contents string. So, you could do something like adding a new gulong bytes_pending in ODTContentParseInfo, which initially is equal to tracker_config_get_max_bytes(). Then: text_len = strlen (text); tracker_text_validate_utf8(text, MIN (text_len, context->bytes_pending), &data->content, &written_bytes); context->bytes_pending -= written_bytes; And of course, if context->bytes_pending reaches 0, you need to quit extracting data. Sorry for not having found this issue before, I guess it would be the last thing to get fixed before merging it. Anyway, it's a pretty good patch!
Created attachment 167394 [details] [review] Tracker plugin for odt, odp, ods files Hello Aleksander Morgado, Tried limiting the maximum number of bytes that can be extracted from the contents of the file: since we are using "tracker_gsf_parse_xml_in_zip()" for parsing the 'content.xml' file to stop the extraction once the maximum number of bytes is reached tried calling the "g_markup_parse_context_end_parse()" to end the parsing but it gives the assertion 'g_markup_parse_context_end_parse: assertion `!context->parsing' failed'. With the attached patch the output is limited to the maximum number of bytes but the extraction continues till the end of the xml file. Can you please provide your suggestions on this. Thank you very much.
Created attachment 167396 [details] [review] Tracker plugin for odt, odp, ods files Hello Aleksander Morgado, Please ignore the previous one. Tried limiting the maximum number of bytes that can be extracted from the contents of the file: since we are using "tracker_gsf_parse_xml_in_zip()" for parsing the 'content.xml' file to stop the extraction once the maximum number of bytes is reached tried calling the "g_markup_parse_context_end_parse()" to end the parsing but it gives the assertion 'g_markup_parse_context_end_parse: assertion `!context->parsing' failed'. With the attached patch the output is limited to the maximum number of bytes but the extraction continues till the end of the xml file. Can you please provide your suggestions on this. Thank you very much.
Created attachment 167971 [details] [review] patch to be applied on top of attachment 167396 [details] [review] This is a patch to be applied on top of patch in comment #23, it basically propagates a known error when the text size is exceeded, this stops the parser and the error is simply ignored, as a plus we now also have error checking for more strange cases such as broken files or misformatted XML
Created attachment 168077 [details] [review] Tracker plugin for odt, odp, ods files Thank you very much Carlos Garnacho. Aleksander Morgado, attached the updated patch. Can you please have a look at it.
Pushed to git master! This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.