GNOME Bugzilla – Bug 340827
crash when switching images with mouse wheel
Last modified: 2007-05-27 14:30:51 UTC
Steps to reproduce: 1. open image in a folder 2. switch images quickly with mouse wheel 3. crash after a while Stack trace: Backtrace was generated from '/usr/bin/eog' (no debugging symbols found) Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1". (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) [Thread debugging using libthread_db enabled] [New Thread -1224742144 (LWP 8392)] [New Thread -1258984528 (LWP 8397)] [New Thread -1226155088 (LWP 8393)] (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) (no debugging symbols found) 0xffffe410 in __kernel_vsyscall ()
+ Trace 68057
Thread 1 (Thread -1224742144 (LWP 8392))
Other information: i don't remember if i had the list of images in the folder (F9) open. i just played with it before but i think it was closed at the time of the crash. this is ubuntu dapper up-to-date as of may 6, EOG 2.14.1.
OK, i checked more, and it's like that. in EOG if you focus on the image and the "image collection" pane is not open, mouse wheel will zoom in/out. however if you open the "image collection" pane and have the mouse on it and mousewheel then you'll switch images when mousewheeling. that way you can switch images VERY fast. in a big directory with big images if you just mousewheel a second or so, EOG will be busy for a while (but i must say it's impressively fast, congrats for that!). but after little time of this little game, it will crash. i already reproduced this crash 3 times, it's easy to trigger this way. what's funny is that when the bug-buddy window about the crash appears, EOG still changes pictures in the background for maybe a second before the EOG window pops. surely due to threading.
ugh... i get a crash with the exact same stack if i run a slideshow with 1 second between images and exit it (before it would normally escape) with the ESCAPE key. already got that other crash three times, i reproduce it like the first one 100% of the time. i report it in this bug, since it's the same stack and it also relates to changing images quickly, it's obviously the same bug. i hope the fix for those bugs will be released on time to be fixed for dapper, but the new EOG is otherwise great! the new UI is simply perfect.
related to slideshow exit, i can also reproduce with 2 and 3 and 4 seconds between images. i didn't try to go over 4 seconds, as it's really a lot then. this bug is very bad... slideshow is unusable for me without crashing EOG... it's so bad, i even wonder, maybe it's just a crappy ubuntu package and upstream EOG is not affected!?
Thanks for the bug report. This particular bug has already been reported into our bug tracking system, but please feel free to report any further bugs you find. *** This bug has been marked as a duplicate of 320206 ***
well now that i looked a bit more at the code i'm not sure anymore it's a duplicate. what i saw by looking at the code is that the stack is completely wrong anyway (surely because it's multithread). so the fact that the stack is similar doesn't mean for sure it's the same bug. but let's fix #320206 and if that doesn't fix that bug, we can reopen it. if that fixes it, then it was the same bug. but #320206 in the code is about full screen view (crash in ego-full-screen.c), and in this bug there is no full screen. so now that i looked at the code a bit more... i think they are probably different crashes. we'll see.
As it is mentioned in comment #5, this bug is in no way related to bug #320206. That bug occurrs inside the EogFullScreen widget, and there are no instances of it during the segmentation fault. An updated backtrace:
+ Trace 68313
Thread 3 (Thread -1508238416 (LWP 10158))
Thread 2 (Thread -1497924688 (LWP 10157))
Ubuntu bug about that: https://launchpad.net/distros/ubuntu/+source/eog/+bug/44689 error and backtrace: "** ERROR **: file eog-window.c: line 3061 (display_image_data): assertion failed : (eog_image_has_data (image, EOG_IMAGE_DATA_ALL)) aborting... Program received signal SIGABRT, Aborted.
+ Trace 68568
Thread NaN (LWP 17451)
*** Bug 344166 has been marked as a duplicate of this bug. ***
I can reproduce the stack trace in comments #6 and #7 with a directory of about 1000 files and scrolling in the collection pane using either the mouse or the keyboard (going back and forth seems to be helpful in triggering it). However I can also get this stack trace:
+ Trace 68788
Obviously a similar part of the code. In my case this appears more frequently by about a factor of five. I am using CVS HEAD as of 2006/06/12. A similar crash seems to happen in 2.14 as well.
The crash in comments 6,7, and 9 is a race condition. As you scroll through quickly you get ahead of the drawing and jobs back up in the job queue. Then if you start scrolling the other way (or click on an image that hasn't been rendered yet) you also add it to the queue (since the only check is to see if the image data is available, which it isn't yet because the first job hasn't executed yet). Now the first job executes, rendering the image and filling the priv->image structure. When the second job executes it finds the image already rendered and triggers an assertion (since priv->image != NULL). That's the cause of the trace in comment #9. I think 6 and 7 are the same cause, but a slightly different order of events. Curiously, the race condition isn't between the threads running the jobs. It is within the job queue itself. Hopefully I'll have a patch later tonight.
The stack trace in comments #6 and #7 is also due to the race condition: but in this case I suspect threads are implicated. One thread does the checks on an image then, before it can display it, the other thread deallocates the image (while running the other job).
Created attachment 67329 [details] [review] Patch to fix the image loading race condition This patch fixes the crash seen in comment #9, but not the crash caused by premature deallocation. I am not completely happy with it since I believe that there is still room for two threads to collide and cause the crash. However, the window of opportunity is a lot narrower and I am only getting the second class of race-condition crash now. I think there needs to be some sort of image-global locking, but I'm not sure of the best way to do this.
Created attachment 67385 [details] [review] A complete fix for the race conditions in image-loading This patch fixes both cases. I am fairly confident that the code is also thread-safe.
it is a bit discouraging that after three weeks, on a crash bug marked critical, there is still no apparent reaction from maintainers from a patch by a GNOME developer :-( but of course if it's fixed on time for 2.16 it doesn't matter exactly how long before. although it would be nice to have the fix in 2.14.x too. PS: can someone mark this bug NEW? it's still UNCONFIRMED for now although many people confirmed it, and i don't have the privileges to mark it "NEW".
what different does NEW make to you? Nobody seems to contest that's a bug ...
about NEW: i thought maybe (at least some) maintainers only check bugzilla for "NEW" bugs, ignoring "UNCONFIRMED" bugs. that seems sensitive for maintainers with lots of work. but i don't know, never been a maintainer of free software. from your comment, obviously i overestimated the important of that status.
Callum, (In reply to comment #13) > Created an attachment (id=67385) [edit] > A complete fix for the race conditions in image-loading > > This patch fixes both cases. I am fairly confident that the code is also > thread-safe. With this patch I can still reproduce easily the crash and backtrace of comment #7, but only that one. The ones in comment #6 and #9 seem to be gone. Should this patch fix the crash in comment #7 too?
Claudio: It should stop the cause of #7. The problem went from easily caused (with the trace from #7) to un-reproducible on my machine and with my test set (a large group of images). I suspect it can still be triggered when two threads try rendering the same image, but that case should be rare though since the window of opportunity is a lot smaller. I would think that you aren't seeing that. What exactly are you doing to cause it? In my case it was scrolling forward with either the arrow keys or the mouse-wheel until the rendering lagged and then going back (sometimes going back and forward a few times was necessary). I'll look into it.
I scroll very fast between images, back and forth, in a very stressing way. If you use your mouse-wheel this is easily achievable. I think that doing that, it is probably easy to have two threads trying to render the same image. Doing your test (scrolling forward until the rendering lagged and then going back) I couldn't reproduce the bug. I suggest you to use the mouse wheel as fast as possible. I am sorry not to have much time to look deeply into this bug atm.
Claudio: Some quick questions: what is the typical number of images in the directory you are testing with? Also, are they all images, or are there other files in the mix?
I have tested with several different use cases, including some images in a directory with many other files, a directory with only a few pictures, and a big collection. The result is always the same. Try scrolling back and forth in a reduced range very fast. The goal should be to scroll between 2 or 3 images and not more, several times very fast. You will see the crash then. (man, it would have been nice to talk about this in guadec)
I've finally got it to reproduce (but it takes a lot longer than it used to for me). Still trying to figure out whats wrong, but initial inspection shows all three threads working on different images so it probably isn't that.
*** Bug 345167 has been marked as a duplicate of this bug. ***
*** Bug 325237 has been marked as a duplicate of this bug. ***
Created attachment 68938 [details] [review] The third version of the race-condition patch. This new patch fixes several more problems. The main remaining bug (which happens a lot more frequently with optimisation turned on, this was why I wasn't seeing it initially) is thread-related and has been fixed by adding a mutex to the image. The remaining problem is a lot rarer and I didn't track down the exact cause (it involves the image data being freed too early). Instead I just put a check for the error and recovered gracefully if it ever occurred. Ultimately the solution is to rewrite the image display code since currently it is a set of asynchronous jobs with the following bad properties: a) It doesn't guarantee the order the images will be displayed (i.e. sometimes the image displayed isn't the one selected because the threads finished in a different order to the way they started). b) It can't easily skip over images if the user has moved the selection through an image quickly and doesn't actually want it displayed. This chews a *lot* of CPU time. This is my justification for taking a short-cut on the final fix: the code is broken anyway. At least it shouldn't crash now (it goes at least a factor of 100 times longer without crashing, I can't guarantee another bug isn't lurking in there).
Created attachment 69051 [details] [review] Yet another version. Fixes some extra problems. This is a fourth version of the patch which also fixes some of the criticisms I outlined in the previous comment. It doesn't actually improve the solution, but it does solve the excessive CPU use, responsiveness and makes sure that the last image selected is the one actually shown. It is an alternative to the third fix which is why I haven't marked it as obsolete.
This patch seems to work very well for me, I haven't been able to make EOG crash with it!
Attachment #69051 [details] by Callum fixes the issues. We coincide in the fact that the there is a broken design and the code should be rewritten anyway[1], therefore I am on the opinion of commiting it. Lucas, what do you think? (I even think that the image locking API could help to fix bug #320206, but I am not really sure, I should take a look at the problem again). [1] Actually, the job management was already rewrited by Lucas in the eog-ng branch, but this branch still needs some (a lot of) love before being good enough to merge it into HEAD.
I have commited the attachment #69051 [details] to HEAD in order to make a release for GNOME 2.15.90. Will do the same for branch gnome-2-14 before the next stable release. 2006-07-24 Claudio Saavedra <csaavedra@alumnos.utalca.cl> * libeog/eog-image-private.h: * libeog/eog-image.c: (eog_image_dispose), (eog_image_init), (eog_image_lock), (eog_image_unlock), (eog_image_cancel_load): * libeog/eog-image.h: * libeog/eog-job-manager.c: (thread_start_func): * libeog/eog-job.c: (eog_job_call_action): * shell/eog-window.c: (eog_window_init), (job_image_load_action), (job_image_load_finished), (handle_image_selection_changed): Add image_lock/unlock methods to ensure that there are no race conditions when loading and unloading images (Fixes bug #340827). Patch from Callum McKenzie <callum@spooky-possum.org>.
Applied with one fix: the error dialog for corrupted images (for example) wasn't being shown with this patch. Fixed this in HEAD too. Thanks!
*** Bug 441586 has been marked as a duplicate of this bug. ***