Bug 668032 - doesn't index content of ODT files
Status: RESOLVED DUPLICATE of bug 664227
Product: tracker
Classification: Core
Component: Extractor
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-extractor
Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2012-01-16 16:44 UTC by below
Modified: 2012-02-13 10:04 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description below 2012-01-16 16:44:25 UTC
Tracker doesn't index the content of my ODT files. I discussed this on
the #tracker IRC channel, and garnacho was able to reproduce it. For
example, for a test file containing just one line of text, "Testtext
besonderer Art und Güte", tracker-extract reports the metadata but not
the content (see below). Further experimentation shows that underlined
or bold parts of the text are indexed.

Thanks for your work

Michael Below

$ /usr/lib/tracker/tracker-extract -v 3 -f test.odt
Initializing tracker-extract...
Tracker-Message: Setting up monitor for changes to config file:'/home/mbelow/.config/tracker/tracker-extract.cfg'
Locale 'TRACKER_LOCALE_LANGUAGE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_TIME' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_COLLATE' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_NUMERIC' was set to 'de_DE.UTF-8'
Locale 'TRACKER_LOCALE_MONETARY' was set to 'de_DE.UTF-8'
Initializing Storage...
Mount monitors set up for to watch for added, removed and pre-unmounts...
Found '50 GB Dateisystem' mounted on path '/media/sda1'
  Found mount with volume and drive which can be mounted: Assuming it's  removable, if wrong report a bug!
  Adding mount point with UUID: '19226086202273E9', removable: yes, optical: no, path: '/media/sda1'
Setting priority nice level to 19
Loading extractor rules... (/usr/share/tracker/extract-rules)
  Loaded rule '10-abw.rule'
  Loaded rule '10-epub.rule'
  Loaded rule '10-flac.rule'
  Loaded rule '10-gif.rule'
  Loaded rule '10-html.rule'
  Loaded rule '10-ico.rule'
  Loaded rule '10-jpeg.rule'
  Loaded rule '10-mp3.rule'
  Loaded rule '10-msoffice.rule'
  Loaded rule '10-oasis.rule'
  Loaded rule '10-pdf.rule'
  Loaded rule '10-png.rule'
  Loaded rule '10-ps.rule'
  Loaded rule '10-svg.rule'
  Loaded rule '10-tiff.rule'
  Loaded rule '10-vorbis.rule'
  Loaded rule '10-xmp.rule'
  Loaded rule '11-msoffice-xml.rule'
  Loaded rule '15-gstreamer-guess.rule'
  Loaded rule '15-playlist.rule'
  Loaded rule '90-gstreamer-generic.rule'
  Loaded rule '90-text-generic.rule'
Extractor rules loaded
Setting memory limitations: total is 1,8 GB, minimum is 256 MB, recommended is ~1 GB
  Virtual/Heap set to 922,0 MB (50% of total or MAXLONG)
Guessing mime type as '(null)'
Extracting...
  Using /usr/lib/tracker-0.12/extract-modules/libextract-oasis.so...
Extracting OASIS metadata and contents from 'file:///home/mbelow/temp/test.odt'
Parsing 'meta.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Parsing 'content.xml' XML file from 'file:///home/mbelow/temp/test.odt' zip archive...
Done (9 items)

SPARQL pre-update:
--
--

SPARQL item:
--
 a nfo:PaginatedTextDocument ;
	 nie:contentCreated "2012-01-16T15:42:54" ;
	 nfo:pageCount "1" ;
	 nfo:wordCount "5" ;
	 nco:publisher [ a nco:Contact ;
	 nco:fullname "Michael Below"] ;
	 nie:generator "LibreOffice/3.4$Unix LibreOffice_project/340m1$Build-402" ;
	 nie:plainTextContent "" .
--

SPARQL where clause:
--
--

SPARQL post-update:
--
--
Comment 1 below 2012-01-16 17:25:38 UTC
A (not very educated) guess about the bug: in /tracker-extract/tracker-extract-oasis.c, the function extract_oasis_content seems to use GMarkupParser to parse the OASIS content. I suspect that only some of the OASIS content is valid GMarkup.
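One way to narrow this down without touching Tracker is to look at where the text sits inside content.xml. The sketch below is an illustration only, not Tracker's code: it uses Python's standard XML parser on a hypothetical, hand-written sample (not the reporter's test.odt) and compares two collection strategies. If an extractor picked up text only from text:span elements, the result would match the reported symptom, where bold or underlined parts are indexed and plain text is not. Whether extract_oasis_content actually behaves this way is an assumption that this report does not confirm.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Hypothetical sample: one paragraph with a single formatted (span) word.
SAMPLE = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    '<text:p>Testtext <text:span text:style-name="T1">besonderer</text:span> Art</text:p>'
    "</office:text></office:body></office:document-content>"
)


def span_text_only(xml_data):
    """Collect only the text directly inside text:span elements."""
    root = ET.fromstring(xml_data)
    return [span.text or "" for span in root.iter(f"{{{TEXT_NS}}}span")]


def all_paragraph_text(xml_data):
    """Collect all text below each text:p, including nested spans."""
    root = ET.fromstring(xml_data)
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    print(span_text_only(SAMPLE))
    print(all_paragraph_text(SAMPLE))
```

The first function returns only the formatted word, while the second returns the whole paragraph. Running the same comparison against the content.xml of a real test file would show whether the plain text is present in the archive at all.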
Comment 2 below 2012-01-17 10:22:21 UTC
I talked to the people on #libreoffice-dev about this; they recommend using libxml2 instead of GMarkupParser. Other suggestions were unoconv or odt2txt, but I guess you don't want to rely on external helper programs.
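To show what extraction with a general-purpose XML parser and no external helper program could look like, here is a minimal sketch. It uses Python's standard zipfile and xml.etree.ElementTree modules as a stand-in for libxml2, so it is not the C code Tracker would need, and the function name extract_odt_text is made up for this example. It reads content.xml from the ODT zip archive and joins the text of every text:p and text:h element.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument text namespace used for paragraphs, headings and spans
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odt_text(path):
    """Return the plain text of an ODT file, one line per paragraph or heading.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() walks nested elements, so text inside
            # text:span children is included together with plain text.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)


if __name__ == "__main__":
    import sys
    print(extract_odt_text(sys.argv[1]))
```

This sketch has known gaps: it ignores whitespace elements such as text:s, text:tab and text:line-break, and paragraphs nested inside other paragraphs (for example in frames or notes) would appear twice.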
Comment 3 below 2012-02-13 10:04:30 UTC

*** This bug has been marked as a duplicate of bug 664227 ***