Bug 680809 - Make fs-miner more robust against bad data
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Miners
Version: 0.12.x
Hardware: Other
OS: Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: tracker-general
QA Contact: Jamie McCracken
Depends on:
Blocks: 613258
 
 
Reported: 2012-07-30 00:15 UTC by Bastien Nocera
Modified: 2012-07-30 12:16 UTC
See Also:
GNOME target: ---
GNOME version: ---

Description Bastien Nocera 2012-07-30 00:15:52 UTC
The miners need to parse a number of different file types; each additional file type handled, and the growing quantity of data to index, increases the likelihood of bugs creeping in and wreaking havoc. Despite the best efforts of the Tracker team, bugs will still creep in wherever the robustness of the libraries (and, usually, their plugins) used by the miner is not as high as wanted.

The miner should be split into a watchdog plus workers. The thumbnailer service Tumbler, or Rhythmbox's metadata indexer, could be used as examples of this architecture.

Workers would keep running as long as possible, with the watchdog ensuring that files that cause problems are identified as such (making bug reporting easier), and that the impact on the user's experience is kept to a minimum.

For resource limits, rlimit could be used in the short term, with cgroups and other similar security features added in the future.
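
To make this concrete, a minimal sketch of such a split in C follows. This is not Tracker code: extract_file() and the 256 MB address-space cap are purely illustrative. The watchdog forks one worker per file, applies the rlimit in the child, and records which file made a worker crash:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the real, possibly crashy, extraction work. */
static int
extract_file (const char *path)
{
    return path != NULL ? 0 : 1;
}

/* Watchdog side: run one worker per file and survive its crashes. */
static int
run_worker (const char *path)
{
    pid_t pid = fork ();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: cap the address space so a runaway parser cannot
         * exhaust memory (the short-term rlimit suggestion). */
        struct rlimit lim = { 256UL * 1024 * 1024, 256UL * 1024 * 1024 };
        setrlimit (RLIMIT_AS, &lim);
        _exit (extract_file (path));
    }

    int status;
    waitpid (pid, &status, 0);

    if (WIFSIGNALED (status)) {
        /* Worker crashed: name the offending file (easier bug
         * reports) and keep the watchdog itself alive. */
        fprintf (stderr, "worker crashed on %s (signal %d)\n",
                 path, WTERMSIG (status));
        return -1;
    }
    return WEXITSTATUS (status) == 0 ? 0 : -1;
}

int
main (int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        run_worker (argv[i]);
    return 0;
}

The watchdog loop survives any number of worker crashes, which is the point of the split.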
Comment 1 Ivan Frade 2012-07-30 12:16:56 UTC
The mining is already split into two processes:

* miner-fs is the watchdog: it handles the inotify watches, reads the stat() information of the files, and writes the results to the database. It sends the actual extraction work to tracker-extract and respawns that extractor if it crashed (or was shut down due to inactivity).

* tracker-extract is the worker side. It takes the files to extract and deals with the unreliable external libraries that do the extraction work. It has a reasonable timeout to cancel problematic extractions (sketched below) and it already uses rlimit. The process stays alive until it has been idle for 30 seconds.
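
For illustration only (this is not how tracker-extract actually implements its timeout), a per-file deadline can be sketched in plain C with alarm() and sigsetjmp(), abandoning an extraction that hangs; do_extract() here is a hypothetical stand-in:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf timeout_env;

static void
on_alarm (int sig)
{
    (void) sig;
    siglongjmp (timeout_env, 1);
}

/* Hypothetical stand-in for one potentially hanging extraction. */
static int
do_extract (const char *path)
{
    (void) path;
    return 0;
}

static int
extract_with_timeout (const char *path, unsigned int seconds)
{
    signal (SIGALRM, on_alarm);

    if (sigsetjmp (timeout_env, 1) != 0) {
        /* The alarm fired: give up on this file and report it. */
        fprintf (stderr, "extraction of %s timed out\n", path);
        return -1;
    }

    alarm (seconds);               /* arm the per-file deadline */
    int ret = do_extract (path);
    alarm (0);                     /* disarm once it completed */
    return ret;
}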

The weakest point is that, for performance reasons, we send the files to the extractor in batches of 10, and a crash while processing one of them cancels the whole lot. This could be improved.
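
One way to improve it, sketched below with a hypothetical send_batch() helper that reports whether the extractor survived the whole batch: when a batch fails, resend its files one at a time, so only the file that actually triggers the crash is skipped instead of all 10.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical: hand n files to tracker-extract; returns true only
 * if the extractor finished the whole batch without crashing. */
static bool
send_batch (const char **files, int n)
{
    (void) files;
    (void) n;
    return true;
}

static void
process_batch (const char **files, int n)
{
    if (send_batch (files, n))
        return;

    /* The extractor died mid-batch: retry one file at a time so that
     * only the file that really triggers the crash is skipped, instead
     * of discarding the results for the whole batch of 10. */
    for (int i = 0; i < n; i++) {
        if (!send_batch (&files[i], 1))
            fprintf (stderr, "skipping bad file: %s\n", files[i]);
    }
}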

All the suggestions in this bug are already implemented and available in the latest stable release, so we can close it as already fixed. More suggestions for improving the robustness of the extraction are very welcome.