Bug 615857 – add xml extraction




Summary:	add xml extraction


Status:	RESOLVED OBSOLETE

Product:	tracker
Classification:	Core
Component:	Supported Formats
Version:	git master
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-04-15 15:13 UTC by Tshepang Lekhonkhobe
Modified:	2021-05-26 22:24 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Tshepang Lekhonkhobe 2010-04-15 15:13:24 UTC

Please add extraction for xml files.

Comment 1 Tshepang Lekhonkhobe 2010-12-12 23:07:43 UTC

I'm desperate for this. It will bring me one step closer to throwing away the venerable gnome-search-tool.

Comment 2 Martyn Russell 2010-12-13 10:09:53 UTC

We accept patches ;)

The HTML extractor pretty much has all the code and boilerplate you would need in place. The problem is, *how* do you extract XML data? I mean, the elements can be unique to each kind of document, so how do you deal with those?

Comment 3 Aleksander Morgado 2010-12-13 10:24:17 UTC

(In reply to comment #2)
> We accept patches ;)
> 
> The HTML extractor pretty much has all the code and boiler plate you would need
> in place. The problem is, *how* do you extract XML data? I mean the elements
> can be unique, so how do you deal with those?

That is something we should really take care of. I can imagine lots of situations where you would want a dedicated extractor for one specific kind of XML file.

This could be done by enabling more than one specific extractor for a given MIME type. Something like:
 * tracker-extract-xml-type1.c
 * tracker-extract-xml-type2.c
 * tracker-extract-xml-type3.c
 * tracker-extract-xml-default.c

All extractors would be registered for the exact same MIME type (application/xml). When an XML file is submitted for extraction, we would:
 * Try with type1 extractor
  * If type1 extractor doesn't like the XML, try with type2 extractor
   * If type2 extractor doesn't like the XML, try with type3 extractor
    * If type3 extractor doesn't like the XML, try with default extractor

The order in which the specific (non-default) extractors are tried wouldn't matter, as long as each extractor signals when it can't process the given XML (for example, by checking for XML tags that are mandatory in the schema it supports). The default XML extractor, tried last, would just make a best-effort attempt to extract the contents (the text inside the tags) into nie:plainTextContent.
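
As a rough sketch of the fallback chain described above, written in Python rather than Tracker's C and not based on any existing tracker-extract API: every name here (extract_type1, extract_default, SPECIFIC_EXTRACTORS, extract_xml) and the <playlist>/<track> schema are hypothetical, chosen only for illustration. Each specific extractor returns None to signal that it can't handle the document, and the default extractor collects the text inside the tags.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

# Hypothetical extractor signature: return metadata, or None to decline.
Extractor = Callable[[ET.Element], Optional[dict]]


def extract_type1(root: ET.Element) -> Optional[dict]:
    # Made-up "type1" schema: a <playlist> root with at least one <track>.
    if root.tag != "playlist" or root.find("track") is None:
        return None  # decline: mandatory tags are missing
    tracks = [t.text or "" for t in root.findall("track")]
    return {"extractor": "type1", "tracks": tracks}


def extract_default(root: ET.Element) -> dict:
    # Best-effort fallback: gather all text inside the tags and
    # collapse runs of whitespace into single spaces.
    text = " ".join(" ".join(root.itertext()).split())
    return {"extractor": "default", "nie:plainTextContent": text}


# Order among the specific extractors does not matter, as long as
# each one declines documents it does not recognise.
SPECIFIC_EXTRACTORS: List[Extractor] = [extract_type1]


def extract_xml(data: str) -> dict:
    root = ET.fromstring(data)
    for extractor in SPECIFIC_EXTRACTORS:
        result = extractor(root)
        if result is not None:
            return result
    return extract_default(root)
```

With this shape, supporting another schema means appending one more function to the list; the default extractor stays last and never declines.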

Actually, this could also be applied to the text extractor, where we could enable additional specific extractors to be executed before the default one, provided that those extractors signal when they cannot process a file because it's not what they expect.

Comment 4 Sam Thursfield 2021-05-26 22:24:39 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new enhancement request ticket at
  https://gitlab.gnome.org/GNOME/tracker/-/issues/

Thank you for your understanding and your help.