Bug 722604 – Various tests are failing with 2.39.3

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 722604 - Various tests are failing with 2.39.3


Summary:	Various tests are failing with 2.39.3


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	general
Version:	2.39.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-01-20 14:18 UTC by Iain Lane
Modified:	2018-05-24 16:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
asyncqueue: fix timeout math on 32bit systems (1.46 KB, patch) 2014-02-23 06:19 UTC, Allison Karlitskaya (desrt)	committed	Details \| Review
timer test: use 'volatile' for locals (1.03 KB, patch) 2014-03-04 15:19 UTC, Allison Karlitskaya (desrt)	none	Details \| Review

Description Iain Lane 2014-01-20 14:18:06 UTC

[ I've lumped these into one bug but if you want them split out then I can do that ]

Several tests fail under `make check' and/or in Ubuntu's CI environment when running the installed-tests intermittently with 2.39.3, some more intermittently than others.

ERROR: asyncqueue
=================

GLib (gthread-posix.c): Unexpected error from C library during 'pthread_cond_timedwait': Invalid argument.  Aborting.
Aborted
# random seed: R02Sfeecf2f8ab125750310ad438d8538d1d
# Start of asyncqueue tests
ok 1 /asyncqueue/sort
PASS: asyncqueue 1 /asyncqueue/sort
ok 2 /asyncqueue/destroy
PASS: asyncqueue 2 /asyncqueue/destroy
ok 3 /asyncqueue/threads
PASS: asyncqueue 3 /asyncqueue/threads
ERROR: asyncqueue - missing test plan
ERROR: asyncqueue - exited with status 134 (terminated by signal 6?)

ERROR: timer
============

**
GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0)
Aborted
# random seed: R02Sfaff4fa66d3b51cb8f6556638213c224
# Start of timer tests
# GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0)
ERROR: timer - missing test plan
ERROR: timer - exited with status 134 (terminated by signal 6?)

ERROR: timer
============

**
GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: assertion failed (elapsed == elapsed2): (1e-06 == 1e-06)
Aborted
# random seed: R02S32481fbfbcf1cf268f33e22b0937e8a8
# Start of timer tests
# GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop: assertion failed (elapsed == elapsed2): (1e-06 == 1e-06)
ERROR: timer - missing test plan
ERROR: timer - exited with status 134 (terminated by signal 6?)


ERROR: gapplication
===================


(/build/buildd/glib2.0-2.39.3/debian/build/deb/gio/tests/.libs/lt-gapplication:381): GLib-GIO-ERROR **: 
Expected
-----
got ./cmd 0
activate action1
change action2 1
exit status: 0
-----
Got (full)
-----
got ./cmd 0
change action2 1
activate action1
exit status: 0
-----

Trace/breakpoint trap
# random seed: R02S6ad0c8efa2d6cd27271e68464d7f4841
# Start of gapplication tests
ok 1 /gapplication/no-dbus
PASS: gapplication 1 /gapplication/no-dbus
ok 2 /gapplication/basic
PASS: gapplication 2 /gapplication/basic
ok 3 /gapplication/no-appid
PASS: gapplication 3 /gapplication/no-appid
ok 4 /gapplication/properties
PASS: gapplication 4 /gapplication/properties
ok 5 /gapplication/app-id
PASS: gapplication 5 /gapplication/app-id
ok 6 /gapplication/quit
PASS: gapplication 6 /gapplication/quit
ok 7 /gapplication/local-actions
PASS: gapplication 7 /gapplication/local-actions
# GLib-GIO-FATAL-ERROR: 
Expected
-----
got ./cmd 0
activate action1
change action2 1
exit status: 0
-----
Got (full)
-----
got ./cmd 0
change action2 1
activate action1
exit status: 0
-----

cleaning up pid 451
ERROR: gapplication - missing test plan
ERROR: gapplication - exited with status 133 (terminated by signal 5?)

and also (from installed-tests)

# GLib-GIO-FATAL-ERROR: 
Expected
-----
got ./cmd 0
got ./cmd 1
cmdline ./cmd echo --abc -d
environment TEST=1
getenv TEST=1
file /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W/foo
properties ok
cwd /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W
busy
idle
stdin ok
exit status: 0
-----
Got (full)
-----
got ./cmd 0
got ./cmd 1
cmdline ./cmd echo --abc -d
environment TEST=1
getenv TEST=1
file /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W/foo
cwd /tmp/adt-run.QJFFc8/dsc0t-installed-tests-testtmp/tmpdir/test-tmp-glib_gapplication.test-ZJ8L9W
properties ok
busy
idle
stdin ok
exit status: 0
-----

Comment 1 Dan Winship 2014-02-15 23:39:13 UTC

(In reply to comment #0)
> GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic:
> assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0)

This is probably floating-point precision lossage.

> GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop:
> assertion failed (elapsed == elapsed2): (1e-06 == 1e-06)

And this *definitely* is.

We should probably deprecate g_assert_cmpfloat() and replace it with a new version that takes an epsilon as well...

Comment 2 Allison Karlitskaya (desrt) 2014-02-23 06:19:31 UTC

Created attachment 270038 [details] [review]
asyncqueue: fix timeout math on 32bit systems

88182d375e13ae6519a288d5295220c83ca27e73 caught this issue in
g_async_queue_timed_pop() but failed to fix the same bug in the _unlocked()
variant.

This is only a problem on 32bit systems.  On 64bit systems, the tv_sec
in a timeval is already 64 bits, so no overflow occurs.

Comment 3 Allison Karlitskaya (desrt) 2014-02-23 06:21:01 UTC

Comment on attachment 270038 [details] [review]
asyncqueue: fix timeout math on 32bit systems

Attachment 270038 [details] pushed as 356fe2c - asyncqueue: fix timeout math on 32bit systems

gapplication bug is patched around for now..

This still leaves the 'timer' problems to be solved here.

Comment 4 Allison Karlitskaya (desrt) 2014-03-03 03:34:22 UTC

(In reply to comment #0)
> GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic:
> assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0)
> Aborted

This is a rounding error.  elapsed * 1e6 might be 0.9999 but it will be truncated to 0 by (guint64).

Only trouble is that I've tried hardcoding the timer return value to 1 microsecond to reproduce this issue locally and I've failed.  The calculation seems to work out here.  Is this failure on some weird arch?

> GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop:
> assertion failed (elapsed == elapsed2): (1e-06 == 1e-06)

This one _really_ has no business failing.

This is the code in question:

  g_timer_stop (timer);

  elapsed = g_timer_elapsed (timer, NULL);
  g_usleep (100);
  elapsed2 = g_timer_elapsed (timer, NULL);

  g_assert_cmpfloat (elapsed, ==, elapsed2);


Where g_timer_elapsed() is:

  if (timer->active)
    timer->end = g_get_monotonic_time ();

  elapsed = timer->end - timer->start;

  total = elapsed / 1e6;

  return total;


ie: this is performing the same calculation twice, and getting two different answers.  What platform is this on?

Comment 5 Emilio Pozuelo Monfort 2014-03-03 23:52:46 UTC

(In reply to comment #4)
> (In reply to comment #0)
> > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:38:test_timer_basic:
> > assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (1 == 0)
> > Aborted
> 
> This is a rounding error.  elapsed * 1e6 might be 0.9999 but it will be
> truncated to 0 by (guint64).
> 
> Only trouble is that I've tried hardcoding the timer return value to 1
> microsecond to reproduce this issue locally and I've failed.  The calculation
> seems to work out here.  Is this failure on some weird arch?

It failed on i386 though with different values, see

https://buildd.debian.org/status/fetch.php?pkg=glib2.0&arch=i386&ver=2.39.90-1&stamp=1392761446

GLib:ERROR:/«PKGBUILDDIR»/./glib/tests/timer.c:38:test_timer_basic: assertion failed (micros == ((guint64)(elapsed * 1e6)) % 1000000): (2 == 1)
Aborted

It seems to have passed on all other linux architectures in 2.39.90-1.

> > GLib:ERROR:/build/buildd/glib2.0-2.39.3/./glib/tests/timer.c:57:test_timer_stop:
> > assertion failed (elapsed == elapsed2): (1e-06 == 1e-06)
> 
> This one _really_ has no business failing.
> 
> This is the code in question:
> 
>   g_timer_stop (timer);
> 
>   elapsed = g_timer_elapsed (timer, NULL);
>   g_usleep (100);
>   elapsed2 = g_timer_elapsed (timer, NULL);
> 
>   g_assert_cmpfloat (elapsed, ==, elapsed2);
> 
> 
> Where g_timer_elapsed() is:
> 
>   if (timer->active)
>     timer->end = g_get_monotonic_time ();
> 
>   elapsed = timer->end - timer->start;
> 
>   total = elapsed / 1e6;
> 
>   return total;
> 
> 
> ie: this is performing the same calculation twice, and getting two different
> answers.  What platform is this on?

That one didn't fail on any linux architecture on Debian for 2.39.90-1. After that I disabled the test. Iain?

Comment 6 Iain Lane 2014-03-04 09:34:36 UTC

The logs, both from autopkgtest (installed-tests) and build-time testing (make check) have expired, I'm afraid. Launchpad doesn't keep the logs for failed builds.

I'm going to re-enable this test for .91 so we'll see if it comes up again.

Comment 7 Allison Karlitskaya (desrt) 2014-03-04 14:52:16 UTC

After working with Iain and Seb I'm fairly sure I understand the nature of the problem.

The machine floating point registers often have "extra precision" beyond what is required for IEEE double.  This machine-defined extra precision can be used to reduce propagation of rounding errors in complex calculations, and is explicitly permitted by the C specification.

What we're seeing here is that the elapsed value returned from the first call (by register) is being stored in the variable, while the elapsed value returned from the second call is being left in its register for the comparison.  This means that one value has higher precision than the other, which is failing the compare.

I think this is a GCC bug because the spec has this to say:

"""

8  Except for assignment and cast (which remove all extra range and precision), the values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type.

"""

and indeed we are doing an assignment.

I'll file this against GCC.  I expect a fun argument.

Comment 8 Allison Karlitskaya (desrt) 2014-03-04 15:17:12 UTC

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60413

Meanwhile, we can work around locally by using 'volatile' to give the compiler an 'extra push' towards truncating the values.

Comment 9 Allison Karlitskaya (desrt) 2014-03-04 15:19:51 UTC

Created attachment 270910 [details] [review]
timer test: use 'volatile' for locals

GCC seems to be failing to follow the letter of the C spec by allowing extra
precision in floating point values to persist across assignments which are
optimised away.

Force its hand by using 'volatile' on the locals in question.

Comment 10 Allison Karlitskaya (desrt) 2014-03-04 15:45:19 UTC

Seems that GCC is intentionally violating the spec in the name of performance.

From its manpage:

       -fexcess-precision=style
           This option allows further control over excess precision on
           machines where floating-point registers have more precision
           than the IEEE "float" and "double" types and the processor
           does not support operations rounding to those types.  By
           default, -fexcess-precision=fast is in effect; this means
           that operations are carried out in the precision of the
           registers and that it is unpredictable when rounding to the
           types specified in the source code takes place.  When
           compiling C, if -fexcess-precision=standard is specified
           then excess precision follows the rules specified in ISO
           C99; in particular, both casts and assignments cause values
           to be rounded to their semantic types (whereas
           -ffloat-store only affects assignments).  This option is
           enabled by default for C if a strict conformance option
           such as -std=c99 is used.


As such, I don't have any expectation that this issue will be fixed upstream.  We should either build GLib with -fexcess-precision=standard or -std=c99, or just accept that we will need to have the workaround in the testcase forever.

Comment 11 Dan Winship 2014-03-04 16:03:07 UTC

(In reply to comment #10)
> We should either build GLib with -fexcess-precision=standard or -std=c99, or
> just accept that we will need to have the workaround in the testcase forever.

Except that really, we should stop comparing floats for equality entirely. Even with those flags, there are still existing tests that could legally fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 == strtod("123.456").

Comment 12 Fabio Scaccabarozzi 2015-07-18 10:48:14 UTC

2.40.0 and 2.42.2 still fail on gcc-4.9.3

Please see the Gentoo bug: https://bugs.gentoo.org/show_bug.cgi?id=512218

As specified also there, of the workarounds proposed:
- -std=c99 fails to build the package with gcc-4.9.3 (gbitlock.c:320:3: error: 'asm' undeclared (first use in this function), not the only one)
- -fexcess-precision=standard does not work with gcc-4.9.3 (test "timer" still fails)

Comment 13 Simon McVittie 2017-12-21 14:37:06 UTC

(In reply to Dan Winship from comment #11)
> Even with those flags, there are still existing tests that could legally
> fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 ==
> strtod("123.456").

If the float is just some arbitrary value, using one that can be represented exactly in binary (like 123.375) would avoid this issue. (Obviously that can't work for timings though.)

Comment 14 Philip Withnall 2017-12-21 15:18:09 UTC

(In reply to Simon McVittie from comment #13)
> (In reply to Dan Winship from comment #11)
> > Even with those flags, there are still existing tests that could legally
> > fail. Eg, gio/tests/gsettings.c effectively asserts that 123.456 ==
> > strtod("123.456").
> 
> If the float is just some arbitrary value, using one that can be represented
> exactly in binary (like 123.375) would avoid this issue. (Obviously that
> can't work for timings though.)

Have you seen such a failure in gio/tests/gsettings.c? If so, a patch for it like you suggest, would be welcome.

Comment 15 Philip Withnall 2018-05-03 10:54:33 UTC

Iain, are these tests now enabled (without failures) in Debian? Is there anything more to do here?

Comment 16 Simon McVittie 2018-05-03 15:48:00 UTC

We are applying Attachment #270910 [details] <https://salsa.debian.org/gnome-team/glib/blob/debian/master/debian/patches/0001-timer-test-use-volatile-for-locals.patch> and with that change, test_timer_stop() seems to be reliable.

Unfortunately, test_timer_basic() seems to have been unreliable even with that change, and is still commented out <https://salsa.debian.org/gnome-team/glib/blob/debian/master/debian/patches/skip-broken-timer-test.patch>.

I don't know whether those test results from 2014 are still valid - to reproduce timing issues we generally have to do the build on a slow machine and then run the test in a loop.

Comment 17 GNOME Infrastructure Team 2018-05-24 16:14:16 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/820.