GNOME Bugzilla – Bug 680897
PDF extraction finishes, then times out, no full text indexed
Last modified: 2012-10-24 15:35:01 UTC
Created attachment 219964 [details]
Output from tracker-extract

For some PDF files (an affected example is attached), Tracker does not index the full text contents on my machine. The output of tracker-extract shows that all content is extracted in 1.6 seconds, yet it times out after 10 seconds and finishes without saving any plainText content.
The file is too big for attaching here. URL: http://ubuntuone.com/5Rk3W4yq32UjALpEja210j
(In reply to comment #1)
> The file is too big for attaching here. URL:
> http://ubuntuone.com/5Rk3W4yq32UjALpEja210j

Hello Aleksander, that link doesn't work for me. Are you able to post it somewhere else? Or you could send it to me directly: martyn@lanedo.com
This error message is a bit misleading, it has to be said:

"""
Content extraction finished: 68/68 pages indexed in 1,631733 seconds, 81633 bytes extracted
"""

So we have 2 timeout mechanisms:

1. The timeout mechanism in the child process we start is 5 seconds.
2. The timeout mechanism the parent process implements is 10 seconds.

Neither of these times is representative of the errors you reported. I did some debugging here with a random PDF I found and it looks like we're ALWAYS waiting for the 10 second timeout. This is unique to the PDF extractor.

The problem to me looks like we're calling g_output_stream_write_all() in the forked child process once we have the data extracted, while at the same time the parent is blocked in sigtimedwait() for up to 10 seconds. When that finishes, we check if there is anything in the pipe and of course there is. So the issue here is really that we're getting into a deadlock of sorts: the child can't finish writing once the pipe buffer fills up until the parent reads, and the parent doesn't read until its timeout expires.

The right way forward should be to make sure the parent process reads the data written by the child process as it arrives, to avoid blocking on the g_output_stream_write_all() call. This way the child process would exit properly (instead of being killed) and this would reduce PDF extraction times by a rather large amount.

I didn't get time to work on a fix today, but I will ASAP.
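To make the deadlock and the proposed fix concrete, here is a minimal, self-contained POSIX sketch. It is not the Tracker code or the actual patch; the buffer sizes, messages, and structure are illustrative assumptions. The child writes more than a pipe buffer's worth of data (as the extractor does with a large PDF), and the parent, instead of waiting blindly on a signal, select()s on the pipe with a deadline and drains it so the child's blocking write can complete:

"""
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/wait.h>

#define TIMEOUT_SECONDS 10

int
main (void)
{
    int pipefd[2];
    pid_t pid;

    if (pipe (pipefd) == -1) {
        perror ("pipe");
        return EXIT_FAILURE;
    }

    pid = fork ();
    if (pid == 0) {
        /* Child: write 256 KiB, more than a typical 64 KiB Linux pipe
         * buffer holds. Without a reader on the other end, this write
         * would block forever -- the deadlock described above. */
        char buf[4096];
        int i;

        close (pipefd[0]);
        memset (buf, 'x', sizeof buf);
        for (i = 0; i < 64; i++) {
            if (write (pipefd[1], buf, sizeof buf) < 0)
                _exit (EXIT_FAILURE);
        }
        close (pipefd[1]);
        _exit (EXIT_SUCCESS);
    }

    /* Parent: drain the pipe as data arrives, with a deadline,
     * instead of sitting in sigtimedwait() until the timeout. */
    close (pipefd[1]);

    for (;;) {
        fd_set rfds;
        struct timeval tv = { TIMEOUT_SECONDS, 0 };
        char buf[65536];
        ssize_t n;

        FD_ZERO (&rfds);
        FD_SET (pipefd[0], &rfds);

        if (select (pipefd[0] + 1, &rfds, NULL, NULL, &tv) <= 0) {
            fprintf (stderr, "Child took too long, killing it\n");
            kill (pid, SIGKILL);
            break;
        }

        n = read (pipefd[0], buf, sizeof buf);
        if (n <= 0)
            break; /* EOF: child finished writing and can now exit */
        printf ("Read %zd bytes from child\n", n);
    }

    waitpid (pid, NULL, 0);
    return EXIT_SUCCESS;
}
"""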
Martyn, it sounds like you figured out what's going wrong, but just to be sure, here is a link to the document I tested with: http://www.schlaf-im-stroh.ch/fileadmin/user_upload/dokumente/Katalog/SiS_Brosch_2012_Gesamt.pdf
I have a patch here too, but it needs cleaning up - will try to finish this some time tomorrow! :)
I've pushed the proposed patch here: http://git.gnome.org/browse/tracker/log/?h=pdf-timeout-fix

Jürg, care to review this before I merge it? I tested with g_usleep() in the child process to make sure it works properly.
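As a rough illustration of that kind of test (assuming GLib and POSIX; the 12-second sleep is an arbitrary value chosen to exceed the 10-second deadline, and none of this is the actual patch): a child that deliberately sleeps past the parent's deadline should end up killed rather than waited on forever.

"""
#include <glib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

int
main (void)
{
    pid_t pid = fork ();

    if (pid == 0) {
        /* Child: simulate extraction that takes far too long. */
        g_usleep (12 * G_USEC_PER_SEC);
        _exit (0);
    }

    /* Parent: give the child 10 seconds, then kill it if needed. */
    sleep (10);
    if (waitpid (pid, NULL, WNOHANG) == 0) {
        g_printerr ("Child still running after 10s, killing it\n");
        kill (pid, SIGKILL);
        waitpid (pid, NULL, 0);
    }
    return 0;
}
"""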
Example debug output when things break:

"""
Extracting...
Using /usr/lib/tracker-0.14/extract-modules/libextract-pdf.so...
Parent: Content extraction now starting in child process (pid = 2756)
Parent: Child process took too long. We waited 10 seconds, so we're going to kill it!
Done (3 items)
"""

Example debug output when things work now (which is much faster):

"""
...
Child: Extracted 1773 bytes from page 96, 924460 bytes remaining
Child: Extracted 1405 bytes from page 97, 923055 bytes remaining
Child: Extracted 798 bytes from page 98, 922257 bytes remaining
Child: Content extraction finished: 98/98 pages indexed in 0.51 seconds, 126319 bytes extracted
Parent: fd is set is 1
Parent: Expected bytes to read is 126417
Child: Content extraction now finished in child process, written 126417 bytes to parent process
Parent: Bytes read is 65535, bytes remaining is 60882
Parent: Bytes read is 60882, bytes remaining is 0
Parent: Bytes read is 0, bytes remaining is 0
Parent: Finished reading all bytes
Parent: Child process exited
Parent: Data received in 0.51 seconds (timeout is 10 seconds)
Done (4 items)
"""
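The chunked "Bytes read" lines above come from a read loop on the parent side. As a rough GLib-style sketch of such a loop (the function name read_all_from_child and the 64 KiB buffer are hypothetical, not the actual Tracker identifiers):

"""
#include <glib.h>
#include <unistd.h>

static gboolean
read_all_from_child (int fd, gsize expected, GString *out)
{
    gchar buf[65536];
    gsize remaining = expected;

    g_debug ("Parent: Expected bytes to read is %" G_GSIZE_FORMAT, expected);

    while (remaining > 0) {
        gssize n = read (fd, buf, MIN (sizeof buf, remaining));

        if (n < 0)
            return FALSE;          /* read error */
        if (n == 0)
            break;                 /* EOF before all data arrived */

        g_string_append_len (out, buf, n);
        remaining -= n;
        g_debug ("Parent: Bytes read is %" G_GSSIZE_FORMAT
                 ", bytes remaining is %" G_GSIZE_FORMAT, n, remaining);
    }

    g_debug ("Parent: Finished reading all bytes");
    return remaining == 0;
}
"""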
Thanks for working on that, Martyn - can this go in 0.14.3?
I was hoping for some review before it does... any chance you can test this patch?
Martyn, I have built the branch, installed it on my box, and executed tracker-control -rs. After the reindexing, I could successfully find the file in question by searching for words contained in it. Seems to be working!
Martyn, hang on, maybe there is an issue after all. Some 15 minutes after the reindexing finished, I still have a tracker-extract process using 100% CPU. How do I figure out what it's doing?
Sorry for all the noise! After some 20 minutes, that process terminated itself peacefully. From the tracker-extract log:

29 Aug 2012, 22:12:29: Tracker: Child: Extracted 0 bytes from page 17, 1048576 bytes remaining
29 Aug 2012, 22:13:28: Tracker: Child: Extracted 0 bytes from page 18, 1048576 bytes remaining
29 Aug 2012, 22:14:10: Tracker: Child: Extracted 0 bytes from page 19, 1048576 bytes remaining
29 Aug 2012, 22:14:56: Tracker: Child: Extracted 0 bytes from page 20, 1048576 bytes remaining
29 Aug 2012, 22:15:33: Tracker: Child: Extracted 0 bytes from page 21, 1048576 bytes remaining
29 Aug 2012, 22:16:18: Tracker: Child: Extracted 0 bytes from page 22, 1048576 bytes remaining
29 Aug 2012, 22:17:08: Tracker: Child: Extracted 0 bytes from page 23, 1048576 bytes remaining
29 Aug 2012, 22:17:28: Tracker: Child: Extracted 0 bytes from page 24, 1048576 bytes remaining
29 Aug 2012, 22:17:28: Tracker: Child: Content extraction finished: 24/24 pages indexed in 1113,52 seconds, 0 bytes extracted
29 Aug 2012, 22:17:28: Tracker: Child: Content extraction now finished in child process, written 0 bytes to parent process

I wasn't able to figure out which file it struggled with so much; if you have an idea, let me know. Anyway, I think the original issue is fixed.
Alexander, thank you for the testing.

Usually you can see what's going on by setting the verbosity to debug, so that all messages are logged. You can do this with:

tracker-control --set-log-verbosity=debug

You should then be able to follow what happens with:

tail -f ~/.local/share/tracker/tracker-extract.log

In there, the file in question should be mentioned. If possible, can you attach that file here? It sounds quite broken to be taking that long. I would expect the parent to have killed the child after 10 seconds, but it didn't. We had (in master) a per-page time check in the child process, but I removed it because it seemed silly to have 2 time checks when the process will be killed anyway if it takes too long. That might be related, but I need to check. If you're able to attach the file here, that would help. Thanks,
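For reference, the per-page time check mentioned above could look roughly like this in the child's page loop. This is a hedged sketch with hypothetical names (extract_pages, EXTRACTION_DEADLINE_SECS), not the removed Tracker code; the 5-second deadline is taken from the child timeout mentioned earlier in this bug:

"""
#include <glib.h>

#define EXTRACTION_DEADLINE_SECS 5.0

static void
extract_pages (gint n_pages)
{
    GTimer *timer = g_timer_new ();
    gint i;

    for (i = 0; i < n_pages; i++) {
        /* ... extract text from page i ... */

        /* Bail out of the loop once the deadline passes, rather
         * than relying solely on the parent killing the process. */
        if (g_timer_elapsed (timer, NULL) > EXTRACTION_DEADLINE_SECS) {
            g_debug ("Child: Deadline reached after page %d, stopping", i);
            break;
        }
    }

    g_timer_destroy (timer);
}
"""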
Martyn, I think I have found the file, I will upload it here as an attachment. I have a feeling the problem is in poppler, as evince seriously chokes on this file as well (uses both cores to 100% for minutes upon opening the file). Chromium's built-in PDF reader has no issue with it at all.
Find the file here: http://ubuntuone.com/4ELfHGFXVtDAtU0lWsLT6G
(In reply to comment #14)
> Martyn, I think I have found the file, I will upload it here as an attachment.
> I have a feeling the problem is in poppler, as evince seriously chokes on this
> file as well (uses both cores to 100% for minutes upon opening the file).
> Chromium's built-in PDF reader has no issue with it at all.

I tried this file with the latest patch and it indeed times out, but the other PDFs don't, so the current fix goes a long way towards helping.

"""
Extracting...
Using /usr/lib/tracker-0.14/extract-modules/libextract-pdf.so...
Parent: Content extraction now starting in child process (pid = 26997)
Child: Extracted 0 bytes from page 1, 1048576 bytes remaining
Parent: Child process took too long. We waited 10 seconds, so we're going to kill it!
Done (7 items)
"""

You're right too, Evince completely breaks with this file. The question is, can we do anything about it?

I am going to resolve this bug as fixed for now, because I think this latest issue is not related to Tracker itself but to poppler, as you suggest, and we've fixed the timeout issues we had with normal PDFs. I would suggest you file another bug for this issue against poppler and reference this bug for details.

Thanks again for the test material!
Thanks for committing this! The issue with poppler is now tracked in https://bugs.freedesktop.org/show_bug.cgi?id=54746. Martyn, I took the liberty to CC you.
*** Bug 685378 has been marked as a duplicate of this bug. ***