Bug 700460 – rewrite tests to not rely on precise timing of timeouts

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 700460 - rewrite tests to not rely on precise timing of timeouts


Summary:	rewrite tests to not rely on precise timing of timeouts


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	mainloop
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	703060 755741 (view as bug list)
Depends on:
Blocks:

Reported:	2013-05-16 13:48 UTC by Colin Walters
Modified:	2018-05-24 15:20 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
0001-tests-mainloop-Fix-assertions-to-really-avoid-race-c.patch (1.28 KB, patch) 2013-07-11 15:10 UTC, Colin Walters	reviewed	Details \| Review
0001-tests-mainloop-Fix-assertions-to-really-avoid-race-c.patch (1.40 KB, patch) 2013-07-12 13:53 UTC, Colin Walters	committed	Details \| Review
Fix races in tests/unix (5.90 KB, patch) 2013-07-21 14:23 UTC, Colin Walters	reviewed	Details \| Review
don't immediately error on timeout (2.03 KB, patch) 2013-07-22 18:34 UTC, Colin Walters	committed	Details \| Review
0001-tests-mainloop-Relax-conditions-for-non-idle-systems.patch (1.16 KB, patch) 2013-07-24 01:25 UTC, Colin Walters	none	Details \| Review
mainloop: redo child source tests to not use timeouts (6.04 KB, patch) 2013-07-24 14:13 UTC, Dan Winship	committed	Details \| Review
Avoid spurious timing failures in tests (3.56 KB, patch) 2015-09-28 15:15 UTC, Dan Winship	none	Details \| Review
Avoid spurious timing failures in tests (6.03 KB, patch) 2015-11-05 16:02 UTC, Dan Winship	none	Details \| Review

Description Colin Walters 2013-05-16 13:48:23 UTC

I saw this at least once:

GLib:ERROR:../../../glib/tests/mainloop.c:494:test_recursive_child_sources: assertion failed (a == 9): (8 == 9)

The test is "orange" though because it passed later without a source code change.

Comment 1 Dan Winship 2013-05-16 14:14:57 UTC

need to rewrite this (and as many other tests as possible) to not use timeouts

Comment 2 Dan Winship 2013-06-25 16:49:35 UTC

*** Bug 703060 has been marked as a duplicate of this bug. ***

Comment 3 Dan Winship 2013-06-25 16:51:13 UTC

from 703060, same basic issue:

"GLib-GIO:ERROR:../../../gio/tests/socket.c:631:test_timed_wait: assertion
failed (poll_duration < 112000): (126884 < 112000)" }

Comment 4 Colin Walters 2013-07-11 14:59:53 UTC

The mainloop test originated with:

commit 1b01109377a4780431996d617643704351a27d8f
Author:     Matthias Clasen <mclasen@redhat.com>
AuthorDate: Tue Nov 22 23:57:18 2011 -0500

    Add some more mainloop tests

Since then the specific assertions have been revised for this exact issue:

commit 95f29687e14423ca541cac1c00137375b935168d
Author:     Martin Pitt <martinpitt@gnome.org>
AuthorDate: Wed Jun 27 11:46:28 2012 +0200

    /mainloop/timeouts test: Reduce race condition


Now looking at this code...I'm not sure how to make this test as written work at all.  The thing is, we may be delayed for an arbitrary amount of time, so for example a = b = c = 1.   In practice we've seen a = 8 in the gnome-ostre eintegrationtest.  I think the only assertions we can make of this form are "a < b".  Matthias, do you agree?

Comment 5 Colin Walters 2013-07-11 15:10:48 UTC

Created attachment 248933 [details] [review]
0001-tests-mainloop-Fix-assertions-to-really-avoid-race-c.patch

This was actually really easy to reproduce, just do:

$ while ./mainloop; do true; done &

$ pkill -SIGSTOP lt-mainloop
(wait a few seconds)
$ pkill -CONT lt-mainloop

Without this patch, it aborts pretty quickly.  With it, no race conditions.

Comment 6 Dan Winship 2013-07-11 21:48:34 UTC

Comment on attachment 248933 [details] [review]
0001-tests-mainloop-Fix-assertions-to-really-avoid-race-c.patch

I think we can also assert

  g_assert_cmpint (a, >=, b);
  g_assert_cmpint (b, >=, c);

?

But since this test is supposed to actually be testing that timeouts work (as opposed to using timeouts to test that something else works), maybe we could also have non-fatal warnings when we get unexpected results?

  if (a != 10 || b != 4 || c != 3)
    g_test_message ("Unexpected timeout results (a = %d, b = %d, c = %d); this is a bug if it occurs reliably when running on a fast, idle machine", a, b, c);

That way people running the tests by hand still might notice (other than the fact that g_test_message() is essentially a black hole...)

Comment 7 Colin Walters 2013-07-12 13:34:19 UTC

(In reply to comment #6)
> (From update of attachment 248933 [details] [review])
> I think we can also assert
> 
>   g_assert_cmpint (a, >=, b);
>   g_assert_cmpint (b, >=, c);

Yes, that is a lot nicer looking and a stronger assertion.  Fixed.

>   if (a != 10 || b != 4 || c != 3)
>     g_test_message ("Unexpected timeout results (a = %d, b = %d, c = %d); this
> is a bug if it occurs reliably when running on a fast, idle machine", a, b, c);

Eh.  The thing is that it really is a normal condition.  In multiple ways.  Hopefully "most" real-world deployments using GLib now are not actually idle.  If the computer's on, it should be doing work or scheduled to do some soon; otherwise you're just burning power.

And personally I tend to use "gnome-desktop-testing-runner -p 0" a lot - running the tests in parallel especially on a fast machine *really* cuts down on the test latency.

Comment 8 Colin Walters 2013-07-12 13:53:29 UTC

Created attachment 249005 [details] [review]
0001-tests-mainloop-Fix-assertions-to-really-avoid-race-c.patch

Comment 9 Matthias Clasen 2013-07-14 04:18:41 UTC

Review of attachment 249005 [details] [review]:

sure, if this works

Comment 10 Colin Walters 2013-07-21 14:23:37 UTC

Created attachment 249749 [details] [review]
Fix races in tests/unix

Comment 11 Dan Winship 2013-07-22 13:19:53 UTC

Comment on attachment 249749 [details] [review]
Fix races in tests/unix

>...and rewrite the test into the newly idiomatic "look ma no GMainLoop!"
>style.  The reason this style is better is because we don't want:
>
>on_timeout ()
>{
>  g_error ("timed out!");
>}

(That's orthogonal to whether you use a GMainLoop or not; you could just do

  sig_timeout = TRUE;
  g_main_loop_quit (loop);

rather than g_error() immediately.)

>Furthermore the sighup-nested test had a subtle race condition:
>Because we're running a nested main context, it's possible
>that the GLib worker thread will grab the outer main context and
>run our callback first before we process the inner loop.  Fix
>this by creating separate conditions for the nested and outer
>loops.

I don't see how this is possible; first, I'm not sure what you mean by "will grab the outer main context"; the worker thread does not do any context grabbing that I can see. And even if it did try to acquire the default GMainContext, it should fail, because nested_idle() is running from inside that context, and so the context must already be acquired by the main thread.

Comment 12 Colin Walters 2013-07-22 16:46:13 UTC

(In reply to comment #11)

> I don't see how this is possible; 

Right, OK, it's a different bug than what I thought.  The *previous* test, sighup_add_remove, adds a SIGHUP handler and removes it, then sends SIGHUP.  This leaves a pending SIGHUP state.

When the sighup_nested test then iterates the context, we still have that pending state, which terminates the loop early.

My patch "fixed" this because I changed the assertion to be that we had received two signals.

This gets to an interesting question: What are the expected semantics of adding and then removing a Unix signal watch?  Should we revert to the default behavior?  Probably - if you add a SIGTERM handler, then for some reason remove it, you'd expect SIGTERM to still terminate the process.

Comment 13 Dan Winship 2013-07-22 17:37:17 UTC

(In reply to comment #12)
> This gets to an interesting question: What are the expected semantics of adding
> and then removing a Unix signal watch?  Should we revert to the default
> behavior?  Probably - if you add a SIGTERM handler, then for some reason remove
> it, you'd expect SIGTERM to still terminate the process.

Yeah... adding and removing a signal watch should be like setting and removing a signal handler I guess.

Comment 14 Colin Walters 2013-07-22 18:15:16 UTC

Broke out a patch for that in: https://bugzilla.gnome.org/show_bug.cgi?id=704699

Comment 15 Colin Walters 2013-07-22 18:34:28 UTC

Created attachment 249830 [details] [review]
don't immediately error on timeout

Comment 16 Dan Winship 2013-07-23 18:40:18 UTC

Comment on attachment 249830 [details] [review]
don't immediately error on timeout

sure

Comment 17 Colin Walters 2013-07-23 18:44:32 UTC

Comment on attachment 249830 [details] [review]
don't immediately error on timeout

Ok, still to go is at least the rest of glib/tests/mainloop.c.

Comment 18 Colin Walters 2013-07-24 01:25:23 UTC

Created attachment 249983 [details] [review]
0001-tests-mainloop-Relax-conditions-for-non-idle-systems.patch

Comment 19 Colin Walters 2013-07-24 01:35:59 UTC

Ok, one thing we need to figure out is how to handle "should not be reached" timeouts, like the one in test_swapping_child_sources().  At the moment it's pretty easy to make this test fail since the timeout is a mere 50ms.  We could define some "reasonable test timeout" like 10 seconds.

Or alternatively, we could ditch these random timeouts, and rely on the test harness to have a timeout.

Yet another alternative would be doing something a bit more clever and emphasizing time during which the process has been scheduled for these test timeouts.  g_process_timeout_new() ?  We'd still need an outer non-scheduling timeout, but a way to schedule a timeout that only counts time spent inside its GMainContext would be pretty interesting potentially.

Comment 20 Dan Winship 2013-07-24 14:13:11 UTC

Created attachment 250043 [details] [review]
mainloop: redo child source tests to not use timeouts

The timeout-based tests could fail on slow or heavily-loaded machines.
Change them to use a counter-based "timeout" source rather than a
time-based one, so they no longer depend on wall time.

====

How about this instead? (which is more or less like your "process
timeout" idea)

Comment 21 Colin Walters 2013-07-24 14:54:33 UTC

Review of attachment 250043 [details] [review]:

This is interesting; but at a high level, we should still be testing timeout sources I think; we just have to have weaker assertions for them.  You're not modifying the assertions I already weakened, but with this counter source you could create new tests with tighter assertions.

::: glib/tests/mainloop.c
@@ +360,3 @@
+
+static gint counter;
+static gint64 last_counter_update;

Were we to make this a bit more robust, this data should be looked up via GHashTable<GMainContext,CounterSourceData> or something.

Secondly, you are not resetting the counter variable between tests, but we need to do that, right?

Comment 22 Dan Winship 2013-07-24 15:46:58 UTC

(In reply to comment #21)
> Review of attachment 250043 [details] [review]:
> 
> This is interesting; but at a high level, we should still be testing timeout
> sources I think

Sure, but we still have test_timeouts() in mainloop.c, plus all of timeout.c. I only changed the child source tests, which were using timeouts to control the ordering of events, but don't actually care about *time* per se.

> +static gint counter;
> +static gint64 last_counter_update;
> 
> Were we to make this a bit more robust, this data should be looked up via
> GHashTable<GMainContext,CounterSourceData> or something.

As originally written, the tests assumed that "time" keeps passing for the sources in the outer GMainContext even when there's an inner GMainContext running and keeping them from being dispatched. Making the counter be per-GMainContext and/or per-source would break that, requiring more rewriting of the tests. (Technically, counter should probably be per-thread, but that wasn't relevant for these tests, so I didn't bother.)

> Secondly, you are not resetting the counter variable between tests, but we need
> to do that, right?

No, when you create a new counter source, it sets its timeout relative to the current value of counter.

Comment 23 Colin Walters 2013-07-24 15:52:38 UTC

Review of attachment 250043 [details] [review]:

That all makes sense; looks good to me then!

Comment 24 Dan Winship 2013-07-24 18:22:52 UTC

Comment on attachment 250043 [details] [review]
mainloop: redo child source tests to not use timeouts

Attachment 250043 [details] pushed as 2127654 - mainloop: redo child source tests to not use timeouts

still got at least gio/tests/socket

Comment 25 Jasper St. Pierre (not reading bugmail) 2013-10-22 16:49:51 UTC

I assume this is fixed then?

Comment 26 Dan Winship 2013-10-22 16:53:29 UTC

(In reply to comment #25)
> I assume this is fixed then?

no:

(In reply to comment #24)
> still got at least gio/tests/socket

Comment 27 Dan Winship 2015-09-28 14:51:22 UTC

*** Bug 755741 has been marked as a duplicate of this bug. ***

Comment 28 Dan Winship 2015-09-28 15:15:40 UTC

Created attachment 312310 [details] [review]
Avoid spurious timing failures in tests

Slow/heavily-loaded build machines tend to have problems with tests
that involve precise timing. So if the g_socket_condition_timed_wait()
or g_async_queue_timed_pop() tests fail, just mark the test skipped
rather than failed, with a message explaining that it may or may not
indicate a bug.

Comment 29 Colin Walters 2015-09-28 16:00:48 UTC

Will we ever debug "nondeterministic skips" though?

Hmm.  On Linux, we could look at /proc/self/status:

              * voluntary_context_switches,     nonvoluntary_context_switches:
                Number of voluntary and involuntary  context  switches  (since
                Linux 2.6.23).

We could say if number is unchanged, assert stronger bounds on the monotonic clock, e.g. that no more than 1 monotonic second elapsed?

Comment 30 Colin Walters 2015-09-28 16:02:18 UTC

Review of attachment 312310 [details] [review]:

::: glib/tests/asyncqueue.c
@@ +237,3 @@
+      g_test_skip (msg);
+      g_free (msg);
+    }

This looks like it could be extracted into a helper function?

Comment 31 Dan Winship 2015-11-05 16:02:15 UTC

Created attachment 314930 [details] [review]
Avoid spurious timing failures in tests

====

(In reply to Colin Walters from comment #30)
> This looks like it could be extracted into a helper function?

OK, the new version adds "g_assert_timing()". It seemed to make sense
to make it a public function, since other packages have timing-related
tests too...

Comment 32 GNOME Infrastructure Team 2018-05-24 15:20:06 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/703.