GNOME Bugzilla – Bug 726973
most e-d-s tests fail
Last modified: 2014-06-21 16:39:36 UTC
Now that the tests all use (Folks' local fork of) GTestDBus, most of the e-d-s tests fail. The main failure mode is the segmentation fault whose backtrace can be seen in <https://bugzilla.gnome.org/show_bug.cgi?id=711807#c9>, but there are others.
Created attachment 272791 [details] [review] FolksTestDBus: flush and close connection before disposing of it g_dbus_connection_dispose() sets connection->worker = NULL, which means any attempt to send a message will crash with a segfault. If anything still has a ref to the connection - in Folks, that's often a backend reffed by the IndividualAggregator - then it'll tend to assume that it's still usable. Attempts to send messages are guarded by a check whether the connection is closed, which we can use to avoid this. --- This fixes many of the e-d-s tests: I now get 31/34 passing.
(In reply to comment #1) > This fixes many of the e-d-s tests: I now get 31/34 passing. Sorry, that was a lie. I had an additional workaround to get that far; I'll attach that once I've composed a suitable commit message.
Created attachment 272796 [details] [review] Make warnings non-fatal during "final teardown" Historically, the Folks tests have not been expected to cope gracefully with dropping off the session bus. This frequently results in warnings, which default to being fatal. For instance, enable-disable-stores fails with this fatal warning: folks-WARNING **: Error preparing persona store 'eds:test': Couldn't get view for address book ‘test’: The connection is closed The long-term solution would be to make sure that disconnection from the bus is an identifiable error domain/code that gets propagated through the stack, but that would have to be done individually per-backend. --- I don't much like this either, but it seems better than "*shrug* I have no idea whether e-d-s regressed or not" :-)
Created attachment 272802 [details] [review] tests: avoid using dconf tests/eds/perf reliably fails for me with this warning: dconf-WARNING **: failed to commit changes to dconf: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name ca.desrt.dconf was not provided by any .service files and in any case we shouldn't be changing the user's real configuration. --- I'm not sure whether this really helps this bug or not, but it seems like a good thing to do.
Again, those patches *do not* close this bug, but they might be useful as part of a full solution.
Review of attachment 272791 [details] [review]: Looks good
Review of attachment 272796 [details] [review]: Ideally, I'd rather fix the real problems but given the amount of active maintainership on the project, this is probably the best option for now.
Review of attachment 272802 [details] [review]: Looks good
I pushed these to master but I'll keep this bug open since there are still some failures. commit 71f6f79f7a9e95078d2e5a0ad2a48652197339a9 Author: Simon McVittie <simon.mcvittie@collabora.co.uk> Date: Mon Mar 24 14:16:11 2014 +0000 tests: avoid using dconf tests/eds/perf reliably fails for me with this warning: dconf-WARNING **: failed to commit changes to dconf: GDBus.Error:org.freedes and in any case we shouldn't be changing the user's real configuration. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/test-case.vala | 3 +++ 1 file changed, 3 insertions(+) commit b16bd87240424632d7881172e87b2571d3df57be Author: Simon McVittie <simon.mcvittie@collabora.co.uk> Date: Mon Mar 24 16:12:23 2014 +0000 Make warnings non-fatal during "final teardown" Historically, the Folks tests have not been expected to cope gracefully with dropping off the session bus. This frequently results in warnings, which default to being fatal. For instance, enable-disable-stores fails with this fatal warning: folks-WARNING **: Error preparing persona store 'eds:test': Couldn't get vie The long-term solution would be to make sure that disconnection from the bus is an identifiable error domain/code that gets propagated through the stack, but that would have to be done individually per-backend. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/test-case.vala | 10 ++++++++++ 1 file changed, 10 insertions(+) commit ad92c2a3a6408db165df59a0cc183643ff3f33a3 Author: Simon McVittie <simon.mcvittie@collabora.co.uk> Date: Mon Mar 24 12:37:53 2014 +0000 FolksTestDBus: flush and close connection before disposing of it g_dbus_connection_dispose() sets connection->worker = NULL, which means any attempt to send a message will crash with a segfault. If anything still has a ref to the connection - in Folks, that's often a backend reffed by the IndividualAggregator - then it'll tend to assume that it's still usable. Attempts to send messages are guarded by a check whether the connection is closed, which we can use to avoid this. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/gtestdbus.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
Created attachment 273570 [details] [review] tests: Add an assertion that the main context is empty on tear_down Leaving the global default main context with pending events is a common cause of cross-test interference which can cause spurious test failures. Try and ensure there are no pending events at the end of each test’s tear_down function.
Created attachment 273571 [details] [review] tests: Add an assertion that the main context is empty on tear_down Leaving the global default main context with pending events is a common cause of cross-test interference which can cause spurious test failures. Try and ensure there are no pending events at the end of each test’s tear_down function.
Created attachment 273572 [details] [review] tests: Manually iterate main context on tear_down in various tests The code under test in these cases doesn’t (yet) correctly ensure there are no pending events before returning. To avoid assertion failure, manually iterate the global default main context until all pending events have been handled, then continue to tear_down. In future, it would be better to fix the tests and backends themselves to not leak main context events after unprepare() has been called.
Created attachment 273573 [details] [review] bluez: Ignore missing sessions when removing sessions This occasionally happens when running the unit tests. The check only applies to unit tests, so will have no effect in production use.
These three patches don’t completely fix the EDS tests, but do seem to reduce the frequency of failures, and are definitely The Right Thing to do.
Review of attachment 273571 [details] [review]: Looks good
Review of attachment 273572 [details] [review]: This needs the same change for the Dummy backend's test lib. I'd also prefer you file a bug for the remaining issue and update the FIXMEs to reference it.
Review of attachment 273573 [details] [review]: Looks good.
Pushed with the addition of a reference to bug #727700. The lack of main context iteration in the dummy backend test library is deliberate, since I haven't been able to cause a test failure in the dummy backend due to leaking main context events. Attachment 273571 [details] pushed as 5c0c1c6 - tests: Add an assertion that the main context is empty on tear_down Attachment 273572 [details] pushed as b2bdc30 - tests: Manually iterate main context on tear_down in various tests Attachment 273573 [details] pushed as 89ba19d - bluez: Ignore missing sessions when removing sessions
I pushed the same changes to the Dummy backend that I suggested above (since the Dummy tests otherwise were failing for me). Split as two tests on account of forgetting to refresh the FIXME to point out the new bug: commit 12b7985e8f28e66420ec522cb5fcb621e8e07a78 Author: Travis Reitter <travis.reitter@collabora.co.uk> Date: Mon Apr 7 09:38:36 2014 -0700 tests: Manually iterate main context on tear_down in Dummy tests The code under test in these cases doesn’t (yet) correctly ensure there are no pending events before returning. To avoid assertion failure, manually iterate the global default main context until all pending events have been handled, then continue to tear_down. In future, it would be better to fix the tests and backends themselves to not leak main context events after unprepare() has been called. See https://bugzilla.gnome.org/show_bug.cgi?id=727700. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/dummy/test-case.vala | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) commit 45e1af8c28747154278ee5058a3b144899050abb Author: Travis Reitter <travis.reitter@collabora.co.uk> Date: Mon Apr 7 09:12:02 2014 -0700 tests: Manually iterate main context on tear_down in Dummy tests The code under test in these cases doesn’t (yet) correctly ensure there are no pending events before returning. To avoid assertion failure, manually iterate the global default main context until all pending events have been handled, then continue to tear_down. In future, it would be better to fix the tests and backends themselves to not leak main context events after unprepare() has been called. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/dummy/test-case.vala | 8 ++++++++ 1 file changed, 8 insertions(+)
Created attachment 273754 [details] [review] tests: Fix enable-disable-stores EDS test The test was occasionally spuriously failing because the persona stores were being prepare()d after being re-enabled, and these prepare() calls were being suspended part-way through when the test finished and killed its main loop. They were being resumed when the main loop was re-started during tear down — it’s re-started to wait for the singleton D-Bus connection to be destroyed, so the prepare() calls ended up consistently having their D-Bus connection drop out from beneath them, resulting in warnings. These warnings would normally be useful to have, but not in the middle of a unit test. Fix this by rearchitecting the test slightly to force each persona store to be prepared after being re-enabled. Also simplify how this is tracked.
This patch fixes the only failure I found in our code (and it was a problem with the unit test, not with folks itself, although folks should provide a much better unprepare() API; that’s a separate bug). This makes `make check -C tests/eds` pass in its entirety about 80% of the time for me. The remaining failures are characterised by: > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): folks-DEBUG: individual-aggregator.vala:521: Destroying IndividualAggregator 0x1190980 > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): eds-DEBUG: eds-backend.vala:350: Removing address book 'test'. > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): folks-DEBUG: backend.vala:45: Destroying Backend ‘telepathy’ (0x1197cc0) > … > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): GLib-GObject-WARNING **: gsignal.c:2580: instance '0x7f5b1400a0a0' has no handler with id '112' I can’t work this one out at all. As far as I can tell, the GLib code for that signal is solid, and none of this is happening during a very late run of the main loop. I guess tired eyes aren’t helping. I’ll take a better look at this another time.
Review of attachment 273754 [details] [review]: Looks good. I even had all EDS tests pass 3/5 tries. When I did get failures, they looked like this: (/home/treitter/checkout/gnome/folks/tests/eds/.libs/lt-store-removed:27913): libebook-WARNING **: book_client_view_dispose_cb: The connection is closed g_dbus_connection_real_closed: Remote peer vanished with error: Error receiving message: Connection reset by peer (g-io-error-quark, 0). Exiting. g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting. cleaning up pid 27928
(In reply to comment #21) > > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): GLib-GObject-WARNING **: gsignal.c:2580: instance '0x7f5b1400a0a0' has no handler with id '112' I also got this a separate time (with avatar-details).
Comment on attachment 273754 [details] [review] tests: Fix enable-disable-stores EDS test Pushed. Attachment 273754 [details] pushed as 7b41098 - tests: Fix enable-disable-stores EDS test
Created attachment 274837 [details] gdb script to run until crash Here is a hacky GDB script which will run a test until it crashes, letting us grab the stack trace. To reproduce the bug we've hit above, just pick an EDS test and run them like: make -C tests/eds add-persona.gdb (gdb) source run-until-crash.gdb run One major caveat: you will need to either run this with G_DEBUG=fatal-warnings (which currently causes a break much earlier than we'd like on some seemingly unrelated issue) OR you can change the g_warning() in gsignal.c:g_signal_handler_disconnect() to a g_error() and build and run against that version of GLib. Eventually, gdb will break on the fatal runtime error and you can get the backtrace
Using the script and technique above, I got two similar, seemingly-meaningful backtraces:
+ Trace 233505
Both point to an excessive signal handler disconnect related to the D-Bus proxy which makes me think we're doing something wrong in our special instance of the D-Bus setup for our EDS tests. I haven't had time to check that out, so, until I say otherwise, could you please check that area out, Philip?
(In reply to comment #23) > (In reply to comment #21) > > > > (/opt/gnome3/source/folks/tests/eds/.libs/lt-set-is-favourite:4524): GLib-GObject-WARNING **: gsignal.c:2580: instance '0x7f5b1400a0a0' has no handler with id '112' > > I also got this a separate time (with avatar-details). Some extra digging indicates this error happens just after handlers with IDs n+1 and n+2 (eg, 113 and 114) get disconnected in book_client_dispose(). Not sure how important this is but could be worth digging into (given that, if there aren't any other hints, I think we're basically going have to study D-Bus and/or GTestDBus in great detail to figure out what's going wrong).
So far, the EBookClient signal that corresponds to the failing ID in each case seems to be "closed"
Created attachment 275090 [details] [review] (SLIGHTLY BROKEN) GLib patch to resolve the "no handler with id" issue With some strategic debugging, I've worked out that this excessive GDBusConnection::closed signal handler disconnection that's happening in gio/gdbusnamewatching.c:client_unref() is due to the connection being disposed before client->disconnected_signal_handler_id gets cleared. Thus, when the EBookClient is finalized in addressbook/libebook/e-book-client.c:book_client_finalize() and unwatches its (D-Bus) addressbook proxy, there's an excessive call to g_signal_handler_disconnect() for a handler that's already been disconnected (during dispose). This patch tries to highlight a potential route for a fix though I clearly get something wrong because it has no change on the tests (I'm not familiar with GWeakRefs) but should highlight approximately what would fix this issue. And it's possible this isn't the best fix but it is bogus that GLib fails to track the disposal of the connection (then again, you could argue that it's a lot of busywork and excessive disconnections should maybe be handled silently...). Philip, based on these details, do you have an idea for a solution here?
Number 73 in an ongoing series of common GLib traps: order of operations during teardown. It's worth looking at gobject.c to see how things actually work here. In my current git checkout, assuming MyDerivedObject is-a MyBaseObject is-a GObject, you get: * begin unref with refcount == 1 - it's going to die unless a callback or another thread manages to resurrect it in time * clear thread-safe weak refs (GWeakRef) * MyDerivedObject dispose() vfunc * MyBaseObject dispose() vfunc * GObject dispose() vfunc - disconnect all its signals - destroy closures with it as object - notify non-thread-safe weak refs (g_object_weak_ref(), g_object_add_weak_pointer()) * decide that we are probably finalizing it - destroy closures with it as object, if any were added during dispose - disconnect all its signals, if any were added during dispose - notify non-thread-safe weak refs, if any were added during dispose * decide that we are definitely finalizing it (point of no return) * MyDerivedObject finalize() vfunc * MyBaseObject finalize() vfunc * GObject finalize() vfunc - run the destroy-notifiers for its qdata * actually free it If you use g_object_run_dispose(), then you just get: * MyDerivedObject dispose() vfunc * MyBaseObject dispose() vfunc * GObject dispose() vfunc - disconnect all its signals - destroy closures that have it as their object - notify non-thread-safe weak refs (g_object_weak_ref(), g_object_add_weak_pointer()) and GWeakRef-style weak refs are *not* cleared. That's probably a bug - g_object_run_dispose() should probably trash those weak refs before running the dispose vfunc. Probably my fault, I wrote the initial GWeakRef implementation. Cc'ing desrt since he also contributed heavily to it. (In reply to comment #29) > And it's possible this isn't the best fix but it is bogus that GLib fails to > track the disposal of the connection (then again, you could argue that it's a > lot of busywork and excessive disconnections should maybe be handled > silently...). The problem with that is that if an object is resurrected by being reffed in a dispose callback, it becomes valid to connect to its signals again, and old signal-connection IDs (small integers) could be reused, leading to you disconnecting someone else's signal when you thought you were disconnecting your own signal. You could maybe use the non-thread-safe weak refs as an alternative - zero out the signal ID when you are notified?
(In reply to comment #30) > If you use g_object_run_dispose(), then you just get: > > * MyDerivedObject dispose() vfunc > * MyBaseObject dispose() vfunc > * GObject dispose() vfunc > - disconnect all its signals > - destroy closures that have it as their object > - notify non-thread-safe weak refs (g_object_weak_ref(), > g_object_add_weak_pointer()) > > and GWeakRef-style weak refs are *not* cleared. That's probably a bug - > g_object_run_dispose() should probably trash those weak refs > before running the dispose vfunc. Indeed, looking at its docs, the combination of GWeakRef and g_object_run_dispose() has never behaved as documented. GWeakRef says: > ... Before the > * object's #GObjectClass.dispose method is called, every #GWeakRef > * associated with becomes empty (i.e. points to %NULL). but that isn't true when g_object_run_dispose() is used. Then again, g_object_run_dispose() says > * This functions should only be called from object system implementations. so arguably GTestDBus shouldn't be using it. *shrug*
More details: In the passing case, gdbusnamewatching correctly announces the vanishing of the service from the bus and EBookClient correctly handles that (by not trying to unwatch later, which cases the eventual warning -> error). In the failing case, gdbusnamewatching never announces the vanishing service. So, maybe our copy of GTestDBus (or the way we're using it) somehow kills the service so abruptly that gdbusnamewatching doesn't catch it (in time)?
Comment on attachment 275090 [details] [review] (SLIGHTLY BROKEN) GLib patch to resolve the "no handler with id" issue Replaced by patch on https://bugzilla.gnome.org/show_bug.cgi?id=729152
(In reply to comment #31) > (In reply to comment #30) > > If you use g_object_run_dispose(), then you just get: > > > > * MyDerivedObject dispose() vfunc > > * MyBaseObject dispose() vfunc > > * GObject dispose() vfunc > > - disconnect all its signals > > - destroy closures that have it as their object > > - notify non-thread-safe weak refs (g_object_weak_ref(), > > g_object_add_weak_pointer()) > > > > and GWeakRef-style weak refs are *not* cleared. That's probably a bug - > > g_object_run_dispose() should probably trash those weak refs > > before running the dispose vfunc. > > Indeed, looking at its docs, the combination of GWeakRef and > g_object_run_dispose() has never behaved as documented. GWeakRef says: > > > ... Before the > > * object's #GObjectClass.dispose method is called, every #GWeakRef > > * associated with becomes empty (i.e. points to %NULL). > > but that isn't true when g_object_run_dispose() is used. You called it. I created a patch that makes it clear out the weak refs (see bug #729152) and that, combined with the patch to bug #729152, the full set of Folks EDS tests pass 88 to 93% of the time; individual test cases pass 99.8% of the time in my observations. There are some very rare failures which seem to be specific to either Folks or the specific test cases themselves. I think we can reasonably defer those to their own bugs and close this (once we get the GLib patches merged). > Then again, g_object_run_dispose() says > > > * This functions should only be called from object system implementations. > > so arguably GTestDBus shouldn't be using it. *shrug* Ideally, yeah. I'm actually surprised g_object_run_dispose() isn't deprecated. But, thanks to some debug output, it looks like the place where we use it in our copy of GTestDBus, the object in question has a refcount of 54. So relying on g_object_unref() looks like it might not be feasible (and it's not totally unreasonable to be a little forceful in teardown for tests).
(In reply to comment #34) > There are some very rare failures which seem to be specific to either Folks or > the specific test cases themselves. I think we can reasonably defer those to > their own bugs and close this (once we get the GLib patches merged). Actually, they seem to mostly end like this: > g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes on an async read (g-io-error-quark, 0). Exiting. That still might be fair to defer to its own bug but I'll give it a little time while waiting on feedback for the GLib bugs.
With the patch in bug #731981, all the EDS tests pass consistently for me with ~stock GLib 2.40 (no patches which touch GTestDBus or GDBusConnection).
Created attachment 278863 [details] [review] eds: Ignore errors from unimplemented EDS methods We can assume the methods are deliberately not implemented, so we shouldn’t warn about them. This complements the patch in https://bugzilla.gnome.org/show_bug.cgi?id=731981.
Created attachment 278864 [details] [review] eds: Bypass test failures on GTestDBus teardown GTestDBus can sometimes fail on teardown, causing our EDS tests to spuriously fail, and this has become a real thorn in our side. While we wait for the proper solution to the relevant GLib bugs to be hashed out, let’s simply exit early from the tests to ignore failures in teardown. This will not affect (non-teardown) test failures from true positives, as it only happens on final teardown.
(In reply to comment #36) > With the patch in bug #731981, all the EDS tests pass consistently for me with > ~stock GLib 2.40 (no patches which touch GTestDBus or GDBusConnection). I lied. You also need these two patches, which I think are reasonable for now. The exit() patch can be reverted once the GTestDBus situation is sorted out (and any necessary changes have been copied from upstream GTestDBus to our copy).
Review of attachment 278863 [details] [review]: Looks good!
Review of attachment 278864 [details] [review]: Looks good!
The EDS tests pass for me with these patches now as well! Great way to start a weekend :) I think we're ready for a release now (whichever one of us gets to it first). It may require temporarily disabling the Tracker backend (all those tests fail for me now due possibly to some version mis-matching; I did an uninstall and fresh install of 0.16.4 with no luck) but I don't see any problem with that. Literally no one uses it. I don't think any (major?) distros build it.
Pushed. Let’s mark this as fixed. We can catch up with GTestDBus changes whenever they happen, and then revert commit 0d894ff621680587a3e16d636ca3b517cfe7d1ab. commit 0d894ff621680587a3e16d636ca3b517cfe7d1ab Author: Philip Withnall <philip.withnall@collabora.co.uk> Date: Fri Jun 20 20:35:28 2014 +0100 eds: Bypass test failures on GTestDBus teardown GTestDBus can sometimes fail on teardown, causing our EDS tests to spuriously fail, and this has become a real thorn in our side. While we wait for the proper solution to the relevant GLib bugs to be hashed out, let’s simply exit early from the tests to ignore failures in teardown. This will not affect (non-teardown) test failures from true positives, as it only happens on final teardown. https://bugzilla.gnome.org/show_bug.cgi?id=726973 tests/lib/Makefile.am | 1 + tests/lib/test-case.vala | 25 +++++++++++++++++++------ 2 files changed, 20 insertions(+), 6 deletions(-) commit ae1f281855d2f47102a61b56188f841807571537 Author: Philip Withnall <philip.withnall@collabora.co.uk> Date: Fri Jun 20 20:34:34 2014 +0100 eds: Ignore errors from unimplemented EDS methods We can assume the methods are deliberately not implemented, so we shouldn’t warn about them. This complements the patch in https://bugzilla.gnome.org/show_bug.cgi?id=731981. https://bugzilla.gnome.org/show_bug.cgi?id=726973 backends/eds/lib/edsf-persona-store.vala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)