GNOME Bugzilla – Bug 722604
Various tests are failing with 2.39.3
Last modified: 2018-05-24 16:14:16 UTC
[ I've lumped these into one bug but if you want them split out then I can do that ] Several tests fail under `make check' and/or in Ubuntu's CI environment when running the installed-tests intermittently with 2.39.3, some more intermittently than others. ERROR: asyncqueue ================= GLib (gthread-posix.c): Unexpected error from C library during 'pthread_cond_timedwait': Invalid argument. Aborting. Aborted # random seed: R02Sfeecf2f8ab125750310ad438d8538d1d # Start of asyncqueue tests ok 1 /asyncqueue/sort PASS: asyncqueue 1 /asyncqueue/sort ok 2 /asyncqueue/destroy PASS: asyncqueue 2 /asyncqueue/destroy ok 3 /asyncqueue/threads PASS: asyncqueue 3 /asyncqueue/threads ERROR: asyncqueue - missing test plan ERROR: asyncqueue - exited with status 134 (terminated by signal 6?) ERROR: timer ============ ** GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0) Aborted # random seed: R02Sfaff4fa66d3b51cb8f6556638213c224 # Start of timer tests # GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0) ERROR: timer - missing test plan ERROR: timer - exited with status 134 (terminated by signal 6?) ERROR: timer ============ ** GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: assertion failed (elapsed == elapsed2): (1e-06 == 1e-06) Aborted # random seed: R02S32481fbfbcf1cf268f33e22b0937e8a8 # Start of timer tests # GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: assertion failed (elapsed == elapsed2): (1e-06 == 1e-06) ERROR: timer - missing test plan ERROR: timer - exited with status 134 (terminated by signal 6?) ERROR: gapplication =================== (/build/buildd/glib2.0-2.39.3/debian/build/deb/gio/tests/.libs/lt-gapplication:381): GLib-GIO-ERROR **: Expected ----- got ./cmd 0 activate action1 change action2 1 exit status: 0 ----- Got (full) ----- got ./cmd 0 change action2 1 activate action1 exit status: 0 ----- Trace/breakpoint trap # random seed: R02S6ad0c8efa2d6cd27271e68464d7f4841 # Start of gapplication tests ok 1 /gapplication/no-dbus PASS: gapplication 1 /gapplication/no-dbus ok 2 /gapplication/basic PASS: gapplication 2 /gapplication/basic ok 3 /gapplication/no-appid PASS: gapplication 3 /gapplication/no-appid ok 4 /gapplication/properties PASS: gapplication 4 /gapplication/properties ok 5 /gapplication/app-id PASS: gapplication 5 /gapplication/app-id ok 6 /gapplication/quit PASS: gapplication 6 /gapplication/quit ok 7 /gapplication/local-actions PASS: gapplication 7 /gapplication/local-actions # GLib-GIO-FATAL-ERROR: Expected ----- got ./cmd 0 activate action1 change action2 1 exit status: 0 ----- Got (full) ----- got ./cmd 0 change action2 1 activate action1 exit status: 0 ----- cleaning up pid 451 ERROR: gapplication - missing test plan ERROR: gapplication - exited with status 133 (terminated by signal 5?) and also (from installed-tests) # GLib-GIO-FATAL-ERROR: Expected ----- got ./cmd 0 got ./cmd 1 cmdline ./cmd echo --abc -d environment TEST=1 getenv TEST=1 file /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W/foo properties ok cwd /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W busy idle stdin ok exit status: 0 ----- Got (full) ----- got ./cmd 0 got ./cmd 1 cmdline ./cmd echo --abc -d environment TEST=1 getenv TEST=1 file /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W/foo cwd /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W properties ok busy idle stdin ok exit status: 0 -----
(In reply to comment #0) > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: > assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0) This is probably floating-point precision lossage. > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: > assertion failed (elapsed == elapsed2): (1e-06 == 1e-06) And this *definitely* is. We should probably deprecate g_assert_cmpfloat() and replace it with a new version that takes an epsilon as well...
Created attachment 270038 [details] [review] asyncqueue: fix timeout math on 32bit systems 88182d375e13ae6519a288d5295220c83ca27e73 caught this issue in g_async_queue_timed_pop() but failed to fix the same bug in the _unlocked() variant. This is only a problem on 32bit systems. On 64bit systems, the tv_sec in a timeval is already 64 bits, so no overflow occurs.
Comment on attachment 270038 [details] [review] asyncqueue: fix timeout math on 32bit systems Attachment 270038 [details] pushed as 356fe2c - asyncqueue: fix timeout math on 32bit systems gapplication bug is patched around for now.. This still leaves the 'timer' problems to be solved here.
(In reply to comment #0) > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: > assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0) > Aborted This is a rounding error. elapsed * 1e6 might be 0.9999 but it will be truncated to 0 by (guint64). Only trouble is that I've tried hardcoding the timer return value to 1 microsecond to reproduce this issue locally and I've failed. The calculation seems to work out here. Is this failure on some weird arch? > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: > assertion failed (elapsed == elapsed2): (1e-06 == 1e-06) This one _really_ has no business failing. This is the code in question: g_timer_stop (timer); elapsed = g_timer_elapsed (timer, NULL); g_usleep (100); elapsed2 = g_timer_elapsed (timer, NULL); g_assert_cmpfloat (elapsed, ==, elapsed2); Where g_timer_elapsed() is: if (timer->active) timer->end = g_get_monotonic_time (); elapsed = timer->end - timer->start; total = elapsed / 1e6; return total; ie: this is performing the same calculation twice, and getting two different answers. What platform is this on?
(In reply to comment #4) > (In reply to comment #0) > > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: > > assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0) > > Aborted > > This is a rounding error. elapsed * 1e6 might be 0.9999 but it will be > truncated to 0 by (guint64). > > Only trouble is that I've tried hardcoding the timer return value to 1 > microsecond to reproduce this issue locally and I've failed. The calculation > seems to work out here. Is this failure on some weird arch? It failed on i386 though with different values, see https://buildd.debian.org/status/fetch.php?pkg=glib2.0&arch=i386&ver=2.39.90-1&stamp=1392761446 GLib:ERROR:/«PKGBUILDDIR»/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (2 == 1) Aborted It seems to have passed on all other linux architectures in 2.39.90-1. > > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: > > assertion failed (elapsed == elapsed2): (1e-06 == 1e-06) > > This one _really_ has no business failing. > > This is the code in question: > > g_timer_stop (timer); > > elapsed = g_timer_elapsed (timer, NULL); > g_usleep (100); > elapsed2 = g_timer_elapsed (timer, NULL); > > g_assert_cmpfloat (elapsed, ==, elapsed2); > > > Where g_timer_elapsed() is: > > if (timer->active) > timer->end = g_get_monotonic_time (); > > elapsed = timer->end - timer->start; > > total = elapsed / 1e6; > > return total; > > > ie: this is performing the same calculation twice, and getting two different > answers. What platform is this on? That one didn't fail on any linux architecture on Debian for 2.39.90-1. After that I disabled the test. Iain?
The logs, both from autopkgtest (installed-tests) and build-time testing (make check) have expired, I'm afraid. Launchpad doesn't keep the logs for failed builds. I'm going to re-enable this test for .91 so we'll see if it comes up again.
After working with Iain and Seb I'm fairly sure I understand the nature of the problem. The machine floating point registers often have "extra precision" beyond what is required for IEEE double. This machine-defined extra precision can be used to reduce propagation of rounding errors in complex calculations, and is explicitly permitted by the C specification. What we're seeing here is that the elapsed value returned from the first call (by register) is being stored in the variable, while the elapsed value returned from the second call is being left in its register for the comparison. This means that one value has higher precision than the other, which is failing the compare. I think this is a GCC bug because the spec has this to say: """ 8 Except for assignment and cast (which remove all extra range and precision), the values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. """ and indeed we are doing an assignment. I'll file this against GCC. I expect a fun argument.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60413 Meanwhile, we can work around locally by using 'volatile' to give the compiler an 'extra push' towards truncating the values.
Created attachment 270910 [details] [review] timer test: use 'volatile' for locals GCC seems to be failing to follow the letter of the C spec by allowing extra precision in floating point values to persist across assignments which are optimised away. Force its hand by using 'volatile' on the locals in question.
Seems that GCC is intentionally violating the spec in the name of performance. From its manpage: -fexcess-precision=style This option allows further control over excess precision on machines where floating-point registers have more precision than the IEEE "float" and "double" types and the processor does not support operations rounding to those types. By default, -fexcess-precision=fast is in effect; this means that operations are carried out in the precision of the registers and that it is unpredictable when rounding to the types specified in the source code takes place. When compiling C, if -fexcess-precision=standard is specified then excess precision follows the rules specified in ISO C99; in particular, both casts and assignments cause values to be rounded to their semantic types (whereas -ffloat-store only affects assignments). This option is enabled by default for C if a strict conformance option such as -std=c99 is used. As such, I don't have any expectation that this issue will be fixed upstream. We should either build GLib with -fexcess-precision=standard or -std=c99, or just accept that we will need to have the workaround in the testcase forever.
(In reply to comment #10) > We should either build GLib with -fexcess-precision=standard or -std=c99, or > just accept that we will need to have the workaround in the testcase forever. Except that really, we should stop comparing floats for equality entirely. Even with those flags, there are still existing tests that could legally fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 == strtod("123.456").
2.40.0 and 2.42.2 still fail on gcc-4.9.3 Please see the Gentoo bug: https://bugs.gentoo.org/show_bug.cgi?id=512218 As specified also there, of the workarounds proposed: - -std=c99 fails to build the package with gcc-4.9.3 (gbitlock.c:320:3: error: 'asm' undeclared (first use in this function), not the only one) - -fexcess-precision=standard does not work with gcc-4.9.3 (test "timer" still fails)
(In reply to Dan Winship from comment #11) > Even with those flags, there are still existing tests that could legally > fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 == > strtod("123.456"). If the float is just some arbitrary value, using one that can be represented exactly in binary (like 123.375) would avoid this issue. (Obviously that can't work for timings though.)
(In reply to Simon McVittie from comment #13) > (In reply to Dan Winship from comment #11) > > Even with those flags, there are still existing tests that could legally > > fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 == > > strtod("123.456"). > > If the float is just some arbitrary value, using one that can be represented > exactly in binary (like 123.375) would avoid this issue. (Obviously that > can't work for timings though.) Have you seen such a failure in gio/tests/gsettings.c? If so, a patch for it like you suggest, would be welcome.
Iain, are these tests now enabled (without failures) in Debian? Is there anything more to do here?
We are applying Attachment #270910 [details] <https://salsa.debian.org/gnome-team/glib/blob/debian/master/debian/patches/0001-timer-test-use-volatile-for-locals.patch> and with that change, test_timer_stop() seems to be reliable. Unfortunately, test_timer_basic() seems to have been unreliable even with that change, and is still commented out <https://salsa.debian.org/gnome-team/glib/blob/debian/master/debian/patches/skip-broken-timer-test.patch>. I don't know whether those test results from 2014 are still valid - to reproduce timing issues we generally have to do the build on a slow machine and then run the test in a loop.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/820.