After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 651795 - GC-related crash in pygobject_dealloc
GC-related crash in pygobject_dealloc
Status: RESOLVED FIXED
Product: pygobject
Classification: Bindings
Component: gobject
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Nobody's working on this now (help wanted and appreciated)
Python bindings maintainers
Depends on:
Blocks:
 
 
Reported: 2011-06-03 16:23 UTC by Daniel Drake
Modified: 2011-06-06 13:51 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
the fix (3.50 KB, patch)
2011-06-03 16:23 UTC, Daniel Drake
accepted-commit_now Details | Review

Description Daniel Drake 2011-06-03 16:23:08 UTC
Created attachment 189168 [details] [review]
the fix

We are encountering an occasional Python crash under Sugar (written in Python based on pygobject2 and pygtk2 static bindings), which can be triggered simply by interacting with widgets, launching programs, etc. The sugar process exits with this error:

python: Modules/gcmodule.c:262: update_refs: Assertion `gc->gc.gc_refs != 0' failed.

This error occurs when Python is doing garbage collection (triggered during allocation of new objects when the number of recent allocations passes a threshold).

The comment above this assertion in the Python source code neatly explains a possible cause of this: (Modules/gcmodule.c in Python-2.7.1)

        /* Python's cyclic gc should never see an incoming refcount
         * of 0:  if something decref'ed to 0, it should have been
         * deallocated immediately at that time.
         * Possible cause (if the assert triggers):  a tp_dealloc
         * routine left a gc-aware object tracked during its teardown
         * phase, and did something-- or allowed something to happen --
         * that called back into Python.  gc can trigger then, and may
         * see the still-tracked dying object.  Before this assert
         * was added, such mistakes went on to allow gc to try to
         * delete the object again.  In a debug build, that caused
         * a mysterious segfault, when _Py_ForgetReference tried
         * to remove the object from the doubly-linked list of all
         * objects a second time.  In a release build, an actual
         * double deallocation occurred, which leads to corruption
         * of the allocator's internal bookkeeping pointers.  That's
         * so serious that maybe this should be a release-build
         * check instead of an assert?
         */

And the backtrace from this issue strongly suggests we're looking at the same thing:

http://bugzilla.redhat.com/attachment.cgi?id=409514

The backtrace of thread 1 shows that in frame #21 we are inside the tp_dealloc function for pygobect (pygobject_dealloc), and the calls that continue on from that clearly show it "calling back into Python" (PyObject_ClearWeakRefs -> PyEval_EvalFrameEx -> _PyObject_GC_New -> _PyObject_GC_Malloc) and causing further allocations to happen, which occasionally trigger the GC, leading to this crash.

pygobject_dealloc clears weak refs before asking the GC to untrack the object:

    PyObject_ClearWeakRefs((PyObject *)self);
    PyObject_GC_UnTrack((PyObject *)self);

yet the stack trace above shows that ClearWeakRefs is clearly a case of "calling into python" which leads to further allocations, meaning that we've done exactly what the comment told us not to do. We should untrack the object first.

I found an example of a python module that calls UnTrack before ClearWeakRefs:
http://www.google.com/codesearch/p?hl=en#3UmC8RnwYSc/greenlet/greenlet.c&l=653

pygi gets it right in 3 places too:
pygi-struct.c _struct_dealloc
pygi-boxed.c _boxed_dealloc
pygi-info.c _base_info_dealloc

So I'm convinced this is the right thing to do.

Testing the attached patch while launching and closing a small Sugar app from a script for over an hour indicates that it does indeed solve the problem; previously this test case would trigger the crash in less than 10 minutes.

If this passes review, I'm happy to commit (I have access)
Comment 1 Daniel Drake 2011-06-03 16:29:28 UTC
For completeness, the 3(!) bug trackers where this issue has been looked at:

http://bugzilla.redhat.com/show_bug.cgi?id=640972
http://dev.laptop.org/ticket/10724
http://bugs.sugarlabs.org/ticket/2064
Comment 2 Tomeu Vizoso 2011-06-06 10:54:46 UTC
Comment on attachment 189168 [details] [review]
the fix

But just to play safe, please make sure that 'make check' passes before pushing it.
Comment 3 Daniel Drake 2011-06-06 13:51:45 UTC
pushed as 92aca4416a7930e5, thanks!