Bug 680897 - PDF extraction finishes, then times out, no full text indexed
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: 0.14.x
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: tracker-extractor
QA Contact: Jamie McCracken
Duplicates: 685378
Depends on:
Blocks:
 
 
Reported: 2012-07-31 10:46 UTC by Alexander Hunziker
Modified: 2012-10-24 15:35 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Output from tracker-extract (5.99 KB, application/octet-stream)
2012-07-31 10:46 UTC, Alexander Hunziker

Description Alexander Hunziker 2012-07-31 10:46:31 UTC
Created attachment 219964 [details]
Output from tracker-extract

For some PDF files (an affected example is attached), tracker does not index the full text contents on my machine. The output of tracker-extract shows that all content is extracted in 1.6 seconds, yet it times out after 10 seconds and finishes without saving any plainText content.
Comment 1 Alexander Hunziker 2012-07-31 10:48:31 UTC
The file is too big for attaching here. URL: http://ubuntuone.com/5Rk3W4yq32UjALpEja210j
Comment 2 Martyn Russell 2012-08-15 08:34:56 UTC
(In reply to comment #1)
> The file is too big for attaching here. URL:
> http://ubuntuone.com/5Rk3W4yq32UjALpEja210j

Hello Alexander, that link doesn't work for me. Are you able to post it somewhere else? Alternatively, you could send it to me directly: martyn@lanedo.com
Comment 3 Martyn Russell 2012-08-15 10:28:06 UTC
This error message is a bit misleading, it has to be said.

"""
Content extraction finished: 68/68 pages indexed in 1,631733 seconds, 81633 bytes extracted
"""

So we have 2 timeout mechanisms:
1. The timeout mechanism in the child process we start is 5 seconds.
2. The timeout mechanism the parent process implements is 10 seconds.

Neither of these times matches the behaviour you reported.

I did some debugging here with a random PDF I found and it looks like we're ALWAYS waiting for the 10 second timeout. This is unique to the PDF extractor.

The problem, as I see it, is that the forked child process calls g_output_stream_write_all() once it has the extracted data, while at the same time the parent process is blocked in sigtimedwait() for up to 10 seconds. When that wait finishes, we check if there is anything in the pipe, and of course there is.

So the issue here is really a deadlock of sorts. The right way forward is to make sure the parent process reads the written data while the child is producing it, so the child never blocks in the g_output_stream_write_all() call. That way the child process exits properly (instead of being killed) and PDF extraction times drop by a rather large amount.

I didn't get time yet today to work on a fix, but I will ASAP.
Comment 4 Alexander Hunziker 2012-08-15 14:17:55 UTC
Martyn, sounds like you figured out what's going wrong, but just to be sure, a link to the document I tested with: http://www.schlaf-im-stroh.ch/fileadmin/user_upload/dokumente/Katalog/SiS_Brosch_2012_Gesamt.pdf
Comment 5 Martyn Russell 2012-08-15 16:34:17 UTC
I have a patch here too, but it needs cleaning up - will try to finish this some time tomorrow! :)
Comment 6 Martyn Russell 2012-08-16 15:25:54 UTC
I've pushed the proposed patch here:
http://git.gnome.org/browse/tracker/log/?h=pdf-timeout-fix

Jürg, care to review this before I merge it?

I tested with g_usleep() in the child process to make sure it works properly.
Comment 7 Martyn Russell 2012-08-16 15:29:09 UTC
Example debug output when things break:

"""
Extracting...
  Using /usr/lib/tracker-0.14/extract-modules/libextract-pdf.so...
Parent: Content extraction now starting in child process (pid = 2756)

Parent: Child process took too long. We waited 10 seconds, so we're going to kill it!
Done (3 items)
"""

Example debug output when things work now (which is much faster):

"""
...
Child: Extracted 1773 bytes from page 96, 924460 bytes remaining
Child: Extracted 1405 bytes from page 97, 923055 bytes remaining
Child: Extracted 798 bytes from page 98, 922257 bytes remaining
Child: Content extraction finished: 98/98 pages indexed in 0.51 seconds, 126319 bytes extracted
Parent: fd is set is 1
Parent: Expected bytes to read is 126417
Child: Content extraction now finished in child process, written 126417 bytes to parent process
Parent:   Bytes read is 65535,bytes remaining is 60882
Parent:   Bytes read is 60882,bytes remaining is 0
Parent:   Bytes read is 0,bytes remaining is 0
Parent:   Finished reading all bytes
Parent: Child process exited
Parent: Data received in 0.51 seconds (timeout is 10 seconds)
Done (4 items)


"""
Comment 8 Alexander Hunziker 2012-08-29 09:52:18 UTC
Thanks for working on that, Martyn - can this go in 0.14.3?
Comment 9 Martyn Russell 2012-08-29 09:57:52 UTC
I was hoping for some review before it does... any chance you can test this patch?
Comment 10 Alexander Hunziker 2012-08-29 20:04:05 UTC
Martyn, I have built the branch, installed it on my box, and executed tracker-control -rs. After the reindexing, I could successfully find the file in question by searching for words contained in it. Seems to be working!
Comment 11 Alexander Hunziker 2012-08-29 20:14:15 UTC
Martyn, hang on. Maybe there is an issue after all. Some 15 minutes after the reindexing finished, I still have a tracker-extract process using 100% CPU. How can I figure out what it's doing?
Comment 12 Alexander Hunziker 2012-08-29 20:32:43 UTC
Sorry for all the noise!

So after some 20 minutes, that process terminated itself peacefully. From the tracker-extract log:

29 Aug 2012, 22:12:29: Tracker: Child: Extracted 0 bytes from page 17, 1048576 bytes remaining
29 Aug 2012, 22:13:28: Tracker: Child: Extracted 0 bytes from page 18, 1048576 bytes remaining
29 Aug 2012, 22:14:10: Tracker: Child: Extracted 0 bytes from page 19, 1048576 bytes remaining
29 Aug 2012, 22:14:56: Tracker: Child: Extracted 0 bytes from page 20, 1048576 bytes remaining
29 Aug 2012, 22:15:33: Tracker: Child: Extracted 0 bytes from page 21, 1048576 bytes remaining
29 Aug 2012, 22:16:18: Tracker: Child: Extracted 0 bytes from page 22, 1048576 bytes remaining
29 Aug 2012, 22:17:08: Tracker: Child: Extracted 0 bytes from page 23, 1048576 bytes remaining
29 Aug 2012, 22:17:28: Tracker: Child: Extracted 0 bytes from page 24, 1048576 bytes remaining
29 Aug 2012, 22:17:28: Tracker: Child: Content extraction finished: 24/24 pages indexed in 1113,52 seconds, 0 bytes extracted
29 Aug 2012, 22:17:28: Tracker: Child: Content extraction now finished in child process, written 0 bytes to parent process


I wasn't able to figure out which file it struggled with so much; if you have an idea, let me know. Anyway, I think the original issue is fixed.
Comment 13 Martyn Russell 2012-08-30 08:35:32 UTC
Alexander, thank you for the testing.

Usually you can see what's going on by setting the verbosity to debug, so that all messages are logged. You can do this with:

  tracker-control --set-log-verbosity=debug

You should be able to follow what happens with:

  tail -f ~/.local/share/tracker/tracker-extract.log

In there, the file in question should be mentioned.

If possible, can you attach that file here? It sounds quite broken to be taking that long. I would expect the parent to have killed the child after 10 seconds, but it isn't doing so.

We had a per-page time check in the child process (in master), but I removed it because it seemed silly to have two time checks when the process will be killed anyway if it takes too long. That might be related, but I need to check. If you're able to attach the file here, that would help.

Thanks,
Comment 14 Alexander Hunziker 2012-09-02 21:31:20 UTC
Martyn, I think I have found the file; I will upload it here as an attachment. I have a feeling the problem is in poppler, as evince seriously chokes on this file as well (it uses both cores at 100% for minutes upon opening the file). Chromium's built-in PDF reader has no issue with it at all.
Comment 15 Alexander Hunziker 2012-09-02 21:33:46 UTC
Find the file here: http://ubuntuone.com/4ELfHGFXVtDAtU0lWsLT6G
Comment 16 Martyn Russell 2012-09-10 15:44:49 UTC
(In reply to comment #14)
> Martyn, I think I have found the file, I will upload it here as an attachment.
> I have a feeling the problem is in poppler, as evince seriously chokes on this
> file as well (uses both cores to 100% for minutes upon opening the file).
> Chromium's built in PDF reader has no issue with it at all.

I tried this file with the latest patch and it indeed times out.

But the other PDFs don't. So the current fix goes a long way towards helping.

"""
Extracting...
  Using /usr/lib/tracker-0.14/extract-modules/libextract-pdf.so...
Parent: Content extraction now starting in child process (pid = 26997)
Child: Extracted 0 bytes from page 1, 1048576 bytes remaining
Parent: Child process took too long. We waited 10 seconds, so we're going to kill it!
Done (7 items)
"""

You're right too, Evince completely breaks with this file.

The question is, can we do anything about this?

I am going to resolve this bug as fixed for now, because this latest issue looks unrelated to Tracker itself and is actually in poppler, as you suggest, AND we've fixed the timeout issues we had with normal PDFs.

I would suggest you submit another bug for this issue against poppler and reference this bug for details.

Thanks again for the test material!
Comment 17 Alexander Hunziker 2012-09-10 19:06:16 UTC
Thanks for committing this!

The issue with poppler is now tracked in https://bugs.freedesktop.org/show_bug.cgi?id=54746.

Martyn, I took the liberty of CCing you.
Comment 18 Martyn Russell 2012-10-24 15:35:01 UTC
*** Bug 685378 has been marked as a duplicate of this bug. ***