Bug 668032 – doesn't index content of ODT files




Summary:	doesn't index content of ODT files


Status:	RESOLVED DUPLICATE of bug 664227

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-01-16 16:44 UTC by below
Modified:	2012-02-13 10:04 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description below 2012-01-16 16:44:25 UTC

tracker doesn't index the content of my ODT files. I discussed this on
the #tracker IRC channel, and garnacho was able to reproduce it. For
example, for a test file containing just one line of text, "Testtext
besonderer Art und Güte", tracker-extract reports the metadata but not
the content (see the output below). Further experimentation shows that
underlined or bold parts of the text are indexed.

Thanks for your work

Michael Below

$ /usr/lib/tracker/tracker-extract -v 3 -f test.odt
Initializing tracker-extract...
Tracker-Message: Setting up monitor for changes to config file:'/home/mbelow/.config/tracker/tracker-extract.cfg'
Locale 'TRACKER_LOCALE_LANGUAGE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_TIME' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_COLLATE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_NUMERIC' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_MONETARY' was set to 'de_DE.UTF-8'
Initializing Storage...
Mount monitors set up for to watch for added, removed and pre-unmounts...
Found '50 GB Dateisystem' mounted on path '/media/sda1'
  Found mount with volume and drive which can be mounted: Assuming it's  removable, if wrong report a bug!
  Adding mount point with UUID: '19226086202273E9', removable: yes, optical: no, path: '/media/sda1'
Setting priority nice level to 19
Loading extractor rules... (/usr/share/tracker/extract-rules)
  Loaded rule '10-abw.rule'
  Loaded rule '10-epub.rule'
  Loaded rule '10-flac.rule'
  Loaded rule '10-gif.rule'
  Loaded rule '10-html.rule'
  Loaded rule '10-ico.rule'
  Loaded rule '10-jpeg.rule'
  Loaded rule '10-mp3.rule'
  Loaded rule '10-msoffice.rule'
  Loaded rule '10-oasis.rule'
  Loaded rule '10-pdf.rule'
  Loaded rule '10-png.rule'
  Loaded rule '10-ps.rule'
  Loaded rule '10-svg.rule'
  Loaded rule '10-tiff.rule'
  Loaded rule '10-vorbis.rule'
  Loaded rule '10-xmp.rule'
  Loaded rule '11-msoffice-xml.rule'
  Loaded rule '15-gstreamer-guess.rule'
  Loaded rule '15-playlist.rule'
  Loaded rule '90-gstreamer-generic.rule'
  Loaded rule '90-text-generic.rule'
Extractor rules loaded
Setting memory limitations: total is 1,8 GB, minimum is 256 MB, recommended is ~1 GB
  Virtual/Heap set to 922,0 MB (50% of total or MAXLONG)
Guessing mime type as '(null)'
Extracting...
  Using /usr/lib/tracker-0.12/extract-modules/libextract-oasis.so...
Extracting OASIS metadata and contents from 'file:///home/mbelow/temp/test.odt'
Parsing 'meta.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Parsing 'content.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Done (9 items)

SPARQL pre-update:
--
--

SPARQL item:
--
 a nfo:PaginatedTextDocument ;
	 nie:contentCreated "2012-01-16T15:42:54" ;
	 nfo:pageCount "1" ;
	 nfo:wordCount "5" ;
	 nco:publisher [ a nco:Contact ;
	 nco:fullname "Michael Below"] ;
	 nie:generator "LibreOffice/3.4$Unix LibreOffice_project/340m1$Build-402" ;
	 nie:plainTextContent "" .
--

SPARQL where clause:
--
--

SPARQL post-update:
--
--

Comment 1 below 2012-01-16 17:25:38 UTC

A (not very educated) guess about the bug: In /tracker-extract/tracker-extract-oasis.c the function extract_oasis_content seems to use the GMarkupParser to parse the OASIS content. I suspect that only some of the OASIS content is valid GMarkup.
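
One way to picture the symptom in this report (plain text missing, bold or underlined text indexed): in an ODT content.xml, unformatted text sits directly inside text:p, while character-formatted runs are wrapped in text:span. The Python sketch below only illustrates that difference. It is not tracker's code, the sample XML is made up for the example, and span_only_text is a hypothetical collector that mimics the symptom; it makes no claim about what extract_oasis_content actually does.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Made-up content.xml fragment: one plain paragraph, one with a formatted run.
SAMPLE = f"""<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <office:body><office:text>
    <text:p>Testtext besonderer Art und Güte</text:p>
    <text:p>plain <text:span text:style-name="T1">bold</text:span> tail</text:p>
  </office:text></office:body>
</office:document-content>"""


def span_only_text(xml_source):
    """Hypothetical collector that only sees text inside text:span."""
    root = ET.fromstring(xml_source)
    return [span.text or "" for span in root.iter(f"{{{TEXT_NS}}}span")]


def paragraph_text(xml_source):
    """Collect all text of each text:p, including nested spans and tails."""
    root = ET.fromstring(xml_source)
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    print(span_only_text(SAMPLE))
    print(paragraph_text(SAMPLE))
```

With this sample, the span-only collector misses the plain paragraph entirely, which would match an empty nie:plainTextContent for a document without any character formatting.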

Comment 2 below 2012-01-17 10:22:21 UTC

I talked to the people on #libreoffice-dev about this; they recommend using libxml2 instead of GMarkupParser. Other suggestions were unoconv or odt2txt, but I guess you don't want to rely on external helper programs.
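
As a rough illustration of the "use a full XML parser" suggestion, here is a minimal Python sketch that reads content.xml from the ODT zip archive and gathers the text of every text:p and text:h element. Python's xml.etree.ElementTree stands in for libxml2 here, so this is not the proposed C implementation. The names extract_odt_text and make_test_odt are hypothetical, and the helper builds an archive holding only content.xml, which is enough for the sketch but is not a complete, valid ODT file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace URIs defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_TAGS = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")


def extract_odt_text(odt_file):
    """Return paragraph and heading text from an ODT file, one per line."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() walks nested elements too, so plain and formatted runs
    # both end up in the result.
    return "\n".join(
        "".join(elem.itertext()) for elem in root.iter() if elem.tag in TEXT_TAGS
    )


def make_test_odt(text):
    """Hypothetical helper: in-memory zip containing only a content.xml."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        f"<text:p>{escape(text)}</text:p>"
        "</office:text></office:body></office:document-content>"
    ).encode("utf-8")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_odt_text(make_test_odt("Testtext besonderer Art und Güte")))
```

Paragraphs nested inside other paragraphs (for example in frames or notes) would be counted twice by this simple walk; a real extractor would need to handle that.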

Comment 3 below 2012-02-13 10:04:30 UTC


*** This bug has been marked as a duplicate of bug 664227 ***