GNOME Bugzilla – Bug 668032
doesn't index content of ODT files
Last modified: 2012-02-13 10:04:30 UTC
Tracker doesn't index the content of my ODT files. I discussed this on the #tracker IRC channel, and garnacho was able to reproduce it. For example, for a test file containing just one line of text, "Testtext besonderer Art und Güte", tracker-extract reports the metadata but not the content; see below. Further experimentation shows that underlined or bold text parts are indexed.

Thanks for your work
Michael Below

$ /usr/lib/tracker/tracker-extract -v 3 -f test.odt
Initializing tracker-extract...
Tracker-Message: Setting up monitor for changes to config file:'/home/mbelow/.config/tracker/tracker-extract.cfg'
Locale 'TRACKER_LOCALE_LANGUAGE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_TIME' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_COLLATE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_NUMERIC' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_MONETARY' was set to 'de_DE.UTF-8'
Initializing Storage...
Mount monitors set up for to watch for added, removed and pre-unmounts...
  Found '50 GB Dateisystem' mounted on path '/media/sda1'
  Found mount with volume and drive which can be mounted: Assuming it's removable, if wrong report a bug!
  Adding mount point with UUID: '19226086202273E9', removable: yes, optical: no, path: '/media/sda1'
Setting priority nice level to 19
Loading extractor rules... (/usr/share/tracker/extract-rules)
  Loaded rule '10-abw.rule'
  Loaded rule '10-epub.rule'
  Loaded rule '10-flac.rule'
  Loaded rule '10-gif.rule'
  Loaded rule '10-html.rule'
  Loaded rule '10-ico.rule'
  Loaded rule '10-jpeg.rule'
  Loaded rule '10-mp3.rule'
  Loaded rule '10-msoffice.rule'
  Loaded rule '10-oasis.rule'
  Loaded rule '10-pdf.rule'
  Loaded rule '10-png.rule'
  Loaded rule '10-ps.rule'
  Loaded rule '10-svg.rule'
  Loaded rule '10-tiff.rule'
  Loaded rule '10-vorbis.rule'
  Loaded rule '10-xmp.rule'
  Loaded rule '11-msoffice-xml.rule'
  Loaded rule '15-gstreamer-guess.rule'
  Loaded rule '15-playlist.rule'
  Loaded rule '90-gstreamer-generic.rule'
  Loaded rule '90-text-generic.rule'
Extractor rules loaded
Setting memory limitations: total is 1,8 GB, minimum is 256 MB, recommended is ~1 GB
  Virtual/Heap set to 922,0 MB (50% of total or MAXLONG)
Guessing mime type as '(null)'
Extracting...
  Using /usr/lib/tracker-0.12/extract-modules/libextract-oasis.so...
Extracting OASIS metadata and contents from 'file:///home/mbelow/temp/test.odt'
Parsing 'meta.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Parsing 'content.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Done (9 items)
SPARQL pre-update:
--
--
SPARQL item:
--
  a nfo:PaginatedTextDocument ;
  nie:contentCreated "2012-01-16T15:42:54" ;
  nfo:pageCount "1" ;
  nfo:wordCount "5" ;
  nco:publisher [ a nco:Contact ; nco:fullname "Michael Below"] ;
  nie:generator "LibreOffice/3.4$Unix LibreOffice_project/340m1$Build-402" ;
  nie:plainTextContent "" .
--
SPARQL where clause:
--
--
SPARQL post-update:
--
--
A (not very educated) guess about the bug: in /tracker-extract/tracker-extract-oasis.c, the function extract_oasis_content seems to use GMarkupParser to parse the OASIS content. I suspect that only some of the OASIS content is valid GMarkup.
I talked to the people on #libreoffice-dev about this; they recommend using libxml2 instead of GMarkupParser. Other suggestions were unoconv or odt2txt, but I guess you don't want to rely on external helper programs.
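To make the "general XML parser" idea concrete, here is a minimal sketch of pulling the plain text out of an ODT without external helper programs. It uses Python's zipfile and xml.etree.ElementTree as a stand-in for libxml2, so it is not the code Tracker would need, just the same approach in a few lines. odt_plain_text and make_sample_odt are hypothetical names, and the sample archive contains only a hand-written content.xml (a real ODT also carries mimetype, manifest and other entries). The sketch ignores text:s, text:tab and text:line-break, and it would count text twice if paragraphs were nested, for example inside a text box.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")

# Hand-written, simplified content.xml (assumption, not the real test.odt).
SAMPLE_CONTENT_XML = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:p>Testtext <text:span text:style-name="T1">besonderer</text:span> Art und Güte</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def odt_plain_text(odt_file):
    """Return the text of all paragraphs and headings, one per line.

    odt_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    lines = []
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            # itertext() yields the element's own text plus the text of
            # all descendants (e.g. text:span), in document order.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)


def make_sample_odt():
    """Build a minimal in-memory zip holding only content.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", SAMPLE_CONTENT_XML)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(odt_plain_text(make_sample_odt()))
```

On the sample this prints "Testtext besonderer Art und Güte": the unformatted paragraph text and the text inside the span both end up in the result.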
*** This bug has been marked as a duplicate of bug 664227 ***