GNOME Bugzilla – Bug 616845
Avoid word-counting in the extractors
Last modified: 2010-05-18 15:08:41 UTC
Currently the extractors use the tracker_text_normalize() method from libtracker-extract, which does several things:
 * splits the input string into 'words' (more or less), up to a predefined limit.
 * removes all formatting from the string.

As the parser already does word-breaking and limits the number of words parsed, there is no need to do it in the extractor. Also, the string added in nie:plainTextContent should keep the original formatting of the input, so the extractor shouldn't change it.

On the other hand, the extractors should have a limit on the number of bytes read from the input file. This max-bytes value will be available in the extractors' config file.

This issue applies to all the extractors using tracker_text_normalize().

Side-effects: When the input text stream is limited only by number of bytes, a word will get split if the limit falls in the middle of it. As the extractor no longer has any knowledge of what a 'word' is, there is no clear solution for this.
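For illustration only, the byte-limited read proposed in the description could look roughly like the sketch below. It is plain C with stdio, not the code from the branch; `read_text_limited()` and its parameters are hypothetical names, and `max_bytes` stands in for the value that would come from the extractor config. The text is returned as read, with no normalization, so the last word may be cut in the middle when the limit is hit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libtracker-extract.
 * Reads at most max_bytes from the file at path and returns a newly
 * allocated, NUL-terminated buffer. Whitespace and line breaks are kept
 * untouched. The number of bytes actually read is stored in *n_read
 * when n_read is not NULL. Returns NULL if the file cannot be opened
 * or memory cannot be allocated. */
char *
read_text_limited (const char *path, size_t max_bytes, size_t *n_read)
{
    FILE *f;
    char *buf;
    size_t len;

    f = fopen (path, "rb");
    if (f == NULL)
        return NULL;

    buf = malloc (max_bytes + 1);
    if (buf == NULL) {
        fclose (f);
        return NULL;
    }

    /* fread() stops at max_bytes or at end of file, whichever is first */
    len = fread (buf, 1, max_bytes, f);
    fclose (f);

    buf[len] = '\0';
    if (n_read != NULL)
        *n_read = len;

    return buf;
}
```

With a limit of 8 bytes, a file starting with "hello  world" would yield "hello  w": both spaces are preserved and the second word is split, which is exactly the side-effect described above.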
This issue is now fixed in the "extractor-remove-word-counting-review" branch in gnome git.
OK, so I reviewed the branch. Just some small comments:
 - I fixed the "NIL terminated" in the docs to be "NULL"
 - Try not to prefix variable names with p_ just because they're pointers to real types. We don't generally use that policy in the code base.
 - Max_Bytes should adhere to the naming formats for the config file and GObject properties. So for GObject, it was changed to max-bytes (s/_/-/) and for the config file, to MaxBytes (so no _).

One further question: I wonder if we should use G_MAXUINT for the maximum bytes? That's quite a large upper limit though and probably not useful at all.

We also discussed internally the use of GIOChannels. If you could update that code to use the more modern GIO APIs, that would be really appreciated too, thanks.

Great patch other than that.
> - I fixed the "NIL terminated" in the docs to be "NULL"

Re-changed to "NUL" after private chat.

> - Try not to prefix p_ for variable names just because they're pointers to real
> types. We don't use that policy in the code base generally.

Ok.

> - Max_Bytes should adhere to the various formats for the config file and
> GObject properties. So for GObject, it was changed to max-bytes (s/_/-/) and
> for the config file, it was MaxBytes (so no _).

Ok.

> One further question, I wonder if we should use G_MAXUINT for the maximum
> bytes? That's quite a large upper limit though and probably not useful at all.

Yeah, agreed. Which would be a safe limit then? 10MBytes? 100MBytes?

> We also discussed internally the use of GIOChannels, if you could update that
> code to use the more modern GIO APIs that would really be appreciated too
> thanks.

Done now. Using read-from-stream in the TXT extractor and read-from-fd in the OASIS extractor.
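As a rough sketch of what a byte-limited read from a file descriptor involves, the loop below uses plain POSIX read() rather than GIO, so it is not the code from the branch. `read_fd_limited()` is a hypothetical name, and the caller is assumed to own and close the descriptor.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not the branch's implementation.
 * Reads at most max_bytes from fd into a newly allocated,
 * NUL-terminated buffer. Stops early at end of file. Stores the number
 * of bytes read in *n_read when n_read is not NULL. Returns NULL on
 * allocation failure or read error. */
char *
read_fd_limited (int fd, size_t max_bytes, size_t *n_read)
{
    char *buf;
    size_t total = 0;

    buf = malloc (max_bytes + 1);
    if (buf == NULL)
        return NULL;

    while (total < max_bytes) {
        ssize_t r = read (fd, buf + total, max_bytes - total);

        if (r < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            free (buf);
            return NULL;
        }
        if (r == 0)
            break;          /* end of file */

        total += (size_t) r;
    }

    buf[total] = '\0';
    if (n_read != NULL)
        *n_read = total;

    return buf;
}
```

The loop is needed because read() may return fewer bytes than requested, for example on pipes, so a single call is not enough to know whether the limit or the end of the input was reached.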
(In reply to comment #3)
> Yeah, agree. Which would be a safe limit then? 10MBytes? 100MBytes?

I think 10 MB is definitely enough. We can always change it on request if people complain.
(In reply to comment #4)
> (In reply to comment #3)
> > Yeah, agree. Which would be a safe limit then? 10MBytes? 100MBytes?
>
> I think 10Mb is definitely enough. We can always change it by request if people
> complain.

Change done in the branch.
Great, merged to master now. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.