Bug 791706 - test suite times out on arm64 when running the closures test
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: general
Version: 2.54.x
OS: Other Linux
Importance: Normal major
Target Milestone: 2.58
Assigned To: gtkdev
Depends on:
Blocks:
Reported: 2017-12-17 14:39 UTC by Michael Biebl
Modified: 2018-05-24 20:02 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Optimised closure ref_count accesses (2.91 KB, patch)
2018-01-18 22:34 UTC, Punit Agrawal
needs-work Details | Review
Implement g_closure_[ref|unref]() in terms of 16-bit atomics (4.03 KB, patch)
2018-01-25 00:11 UTC, Punit Agrawal
needs-work Details | Review

Description Michael Biebl 2017-12-17 14:39:23 UTC
Version: 2.54.2
Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880883

In Debian we have the problem that glib does not build reliably on arm64; more specifically, the test suite quite often fails with a timeout when it runs on certain (slower) arm64 hardware.

See https://buildd.debian.org/status/logs.php?pkg=glib2.0&arch=arm64

This seems to happen when the closures test is run, which is killed by the buildd after 150min.
If you want to read the whole discussion, see the referenced downstream bug report. I'm copying here the replies from Punit Agrawal, which seem to be the most relevant:


I had a chance to dig into this further and think I now know what's going on.

I ran the test on an AMD Seattle system (same as arm-arm-04) and
noticed that the closures test (tests/refcount/closures) didn't finish
in >150 minutes. Profiling the test using "perf" pointed to
g_closure_ref() and g_closure_unref() as taking the bulk of execution
time. This confirms Emilio's suspicion in #20.

$ perf report

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt


Copying the annotated profile of g_closure_ref(). The first column is the % of samples at each instruction as observed by "perf record":

        │        INC_ASSIGN (closure, ref_count, &new_ref_count);
   0.20 │ 30:    ldr    w1, [x19]
   0.31 │        str    w1, [x29, #32]
   1.06 │        ldr    w0, [x29, #32]
   0.32 │        ldr    w0, [x29, #32]
   0.24 │        ldr    w2, [x29, #32]
        │        add    w0, w0, #0x1                  <------- A
        │        bfxil  w2, w0, #0, #15
   0.13 │        str    w2, [x29, #32]
   2.35 │        ldr    w0, [x29, #32]
   0.46 │        ldr    w2, [x29, #32]
        │        and    w0, w0, #0x7fff
  54.73 │ 5c:    ldxr   w3, [x19]                     <------- B
        │        cmp    w3, w1
   0.02 │      ↓ b.ne   70
   0.97 │        stlxr  w4, w2, [x19]
        │      ↑ cbnz   w4, 5c
  38.56 │ 70:    dmb    ish
        │      ↑ b.ne   30
        │        g_return_val_if_fail (new_ref_count > 1, NULL);
   0.00 │        cmp    w0, #0x1
        │      ↓ b.ls   c8
        │
        │        return closure;


ldxr and the dmb instruction together make up ~92% of all the samples.

Also the generated code looks quite bad - notice the redundant loads
to w0/w2. It's as if the code was compiled with optimisations turned
off - which is not the case (it's compiled with -O2).

The duplicate loads also lead to a large window between incrementing
the reference count [A] and attempting the compare-and-swap starting at [B].
Another thread performing the same accesses between A and B will cause
the cache line to bounce between the two cores and also fail the ldxr,
restarting the loop.
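
To make the A..B window concrete, here is a rough C equivalent of what the INC_ASSIGN/CHANGE_FIELD loop does. This is an illustrative reconstruction, not the actual glib macro, and it assumes ref_count occupies the low 15 bits of the first 32-bit word of GClosure:

/* Illustrative reconstruction (not the actual glib macro) of the
 * read-modify-CAS pattern that INC_ASSIGN/CHANGE_FIELD expands to. */
#include <glib-object.h>

static GClosure *
sketch_closure_ref (GClosure *closure)
{
  guint *word = (guint *) closure;   /* ref_count lives in the first word */
  guint old_word, new_word, new_ref_count;

  do
    {
      old_word = *word;                            /* plain load           */
      new_ref_count = (old_word & 0x7fff) + 1;     /* [A] bump the field   */
      new_word = (old_word & ~0x7fffu)             /* splice field back in */
                 | (new_ref_count & 0x7fff);
    }
  /* [B] try to publish with a CAS; a concurrent update to any field in
   * this word between A and B forces another trip around the loop. */
  while (!__sync_bool_compare_and_swap (word, old_word, new_word));

  g_return_val_if_fail (new_ref_count > 1, NULL);
  return closure;
}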

Aside: The x86 assembly for ref/unref also has duplicate loads but it
seems to not mind them much.

On further inspection, I noticed the following definition of the
GClosure structure:

struct _GClosure
{
  /*< private >*/
  volatile              guint    ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical reasons
     as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile              guint    meta_marshal_nouse : 1;
  volatile              guint    n_guards : 1;
  volatile              guint    n_fnotifiers : 2;      /* finalization notifiers */
  volatile              guint    n_inotifiers : 8;      /* invalidation notifiers */
  volatile              guint    in_inotify : 1;
  volatile              guint    floating : 1;
  /*< protected >*/
  volatile              guint    derivative_flag : 1;
  /*< public >*/
  volatile              guint    in_marshal : 1;
  volatile              guint    is_invalid : 1;
...
}

Specifically, the ref_count is defined as a volatile bit field. That,
coupled with the implementation of the increment, explains the bad
assembly sequence generated for g_closure_ref()/g_closure_unref().

AFAICT, there is no need for the ref_count (or the other bit fields in
the structure) to be volatile, especially as they seem to be updated
with gcc __sync builtins.

I have a few ideas on how to improve the code by re-factoring
INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in
gobject/gclosure.c. I'll give those a shot and see if they lead to an
improvement in performance. Happy to discuss any suggestions or try
patches and report back.

It might also be worth raising/discussing the findings here with
upstream. I am not at all familiar with glib and they might be able to
help explain the code structure and have suggestions for improvement.




Since I hadn't dug into this enough... ;)

I also ran a few iterations of the test on some other ARM64 platforms,
Juno R2 and Juno R0 (this one is the same as arm-arm-01). On the R2
the test always finished in a couple of minutes. But on Juno R0, it
sometimes finished in a couple of minutes and at other times got tied
up in knots, running for 15min (at which point I terminated it). After
some experiments with restricting the cores the test ran on (using
taskset), I concluded that when the test ran on the A57s (incidentally
the same cores as on the AMD Seattle system) it took a long time but
not when it was running on the A53s.

So it seems the micro-architecture behaviour combined with the poor
code sequence makes this especially painful, while other cores seem
to chug through without breaking a sweat.
Comment 1 Philip Withnall 2018-01-04 12:59:36 UTC
The `volatile` qualifiers in GClosure seem to have been added in commit f881779a8cf1904757a5cb292279ee87a0252df1 (13 years ago) to get atomics working on s390. I think there’s definitely scope to revisit this.

Reworking g_closure_ref() and g_closure_unref() to use more specialised code, rather than CHANGE_FIELD, seems like a good idea. You should be able to use g_atomic_int_inc() and friends, rather than a manual compare-and-swap.
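
For reference, a minimal sketch of that direction, assuming ref_count could be pulled out of the bitfield into its own gint (which is what the attached patch tries; comment 4 below explains why it breaks ABI). g_atomic_int_inc() and g_atomic_int_dec_and_test() are existing GLib API; the struct layout here is hypothetical:

#include <glib-object.h>

typedef struct
{
  gint ref_count;              /* a full int, usable with g_atomic_int_*() */
  guint in_marshal : 1;
  guint is_invalid : 1;
  /* ... remaining GClosure fields ... */
} SketchClosure;

static void
sketch_ref (SketchClosure *closure)
{
  g_atomic_int_inc (&closure->ref_count);
}

static void
sketch_unref (SketchClosure *closure)
{
  /* Decrement and test for zero as one atomic operation. */
  if (g_atomic_int_dec_and_test (&closure->ref_count))
    {
      /* finalize and free the closure here */
    }
}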
Comment 2 Punit Agrawal 2018-01-18 22:32:10 UTC
(In reply to Philip Withnall from comment #1)
> Reworking g_closure_ref() and g_closure_unref() to use more specialised
> code, rather than CHANGE_FIELD, seems like a good idea. You should be able
> to use g_atomic_int_inc() and friends, rather than a manual compare-and-swap.

I followed your suggestion and came up with the attached patch. 

It is bigger than I expected due to the need to move ref_count out of the bit field to use the atomic* helpers. The move also means there are now wasted bytes in the structure. I wasn't sure if moving the ref_count to the end of the definition would have any adverse effects.

The generated code looks a lot tighter.

Would something like the below patch be acceptable? (Needs testing on the Arm system where the ref/unref was particularly slow)
Comment 3 Punit Agrawal 2018-01-18 22:34:05 UTC
Created attachment 367050 [details] [review]
Optimised closure ref_count accesses
Comment 4 Philip Withnall 2018-01-19 12:10:14 UTC
Review of attachment 367050 [details] [review]:

Unfortunately, as a struct in a public header, we can’t change the size of GClosure, as that would be an ABI break. Sorry, I should have noticed that before.

Three options I can think of:
 • Rework the code in CHANGE_FIELD to be more efficient on arm64.
 • Use a custom refcounting implementation with the existing field. (This might just end up looking like CHANGE_FIELD though.)
 • Combine the ref_count:15 and meta_marshal_nouse:1 bitfields into a single gint16 ref_count, and use 16-bit atomic intrinsics on that. I don’t know if the arm64 implementation of a 16-bit atomic increment/decrement-and-test is native or a compiler fallback though.
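
As an illustration of the third option, here is a sketch using the names that appear in the later attachment (ClosureInt16, vint16, INC_REF_COUNT). The proposed g_atomic_int16_inc() never landed in GLib, so the GCC __sync builtin it would wrap stands in for it. The sketch assumes ref_count:15 and meta_marshal_nouse:1 are allocated in the first 16 bits of the struct and that meta_marshal_nouse stays zero, so incrementing the halfword only ever touches ref_count:

#include <glib-object.h>

typedef union
{
  GClosure closure;
  volatile gint16 vint16;   /* aliases ref_count:15 + meta_marshal_nouse:1 */
} ClosureInt16;

#define INC_REF_COUNT(_closure)                               \
  G_STMT_START {                                              \
    ClosureInt16 *cunion = (ClosureInt16 *) (_closure);       \
    __sync_fetch_and_add (&cunion->vint16, 1);                \
  } G_STMT_END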
Comment 5 Punit Agrawal 2018-01-25 00:08:52 UTC
(In reply to Philip Withnall from comment #4)
> Review of attachment 367050 [details] [review] [review]:

Thanks a lot for taking a look at the patch.

> 
> Unfortunately, as a struct in a public header, we can’t change the size of
> GClosure, as that would be an ABI break. Sorry, I should have noticed that
> before.

Too bad the structure can't be updated. Being able to split out the ref_count into a separate field would've made the code a lot cleaner and also brought GClosure closer to other ref_counting schemes in glib.

> 
> Two options I can think of:
>  • Rework the code in CHANGE_FIELD to be more efficient on arm64.

I had a stab at this but was able to remove only two instructions (coming from _SET_NEW) outside the critical section between changing the value of the field and the atomic CAS.

>  • Use a custom refcounting implementation with the existing field. (This
> might just end up looking like CHANGE_FIELD though.)
>  • Combine the ref_count:15 and meta_marshal_nouse:1 bitfields into a single
> gint16 ref_count, and use 16-bit atomic intrinsics on that. I don’t know if
> the arm64 implementation of a 16-bit atomic increment/decrement-and-test is
> native or a compiler fallback though.

Using 16-bit atomics gave the best improvement in generated code (patch attached). If you think the change can stick, it should improve the generated code regardless of architecture.
Comment 6 Punit Agrawal 2018-01-25 00:11:46 UTC
Created attachment 367406 [details] [review]
Implement g_closure_[ref|unref]() in terms of 16-bit atomics
Comment 7 Philip Withnall 2018-02-22 13:10:47 UTC
Thanks for your patience. This is on my review queue (in 10th place out of 31 bugs), but I don’t think I’m going to get time to review it thoroughly before the 2.56 code freeze (2018-03-05). In any case, this is a sufficiently deep-seated change that it should probably be merged early in the 2.58 cycle to get some good testing time before 2.58 (rather than really late in the 2.56 cycle).

Is it going to be shipped in Debian as a vendor patch at all? If so, that would provide us with useful testing data.
Comment 8 Philip Withnall 2018-02-22 13:19:14 UTC
Review of attachment 367406 [details] [review]:

::: glib/gatomic.h
@@ +182,3 @@
+    (void) (0 ? *(atomic) ^ *(atomic) : 0);                                  \
+    (void) __sync_fetch_and_sub ((atomic), 1);                               \
+  }))

If these are going to be in the gatomic.h public header, they are public API, and should be treated as such, so:
 • List them in docs/reference/glib/glib-sections.txt
 • Add tests in glib/tests/atomic.c
 • Implement the Windows and non-compiler-intrinsic based implementation in gatomic.c (otherwise GClosure will stop working on those platforms)
 • Split the gatomic changes out into a separate commit from the gclosure.c changes
 • Change g_atomic_int16_dec() to g_atomic_int16_dec_and_test() for consistency with the existing 32-bit API and increased usefulness
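
For concreteness, a possible shape for that g_atomic_int16_dec_and_test(), modelled on the gatomic.h hunk quoted above; this is a sketch of a proposed API, not something that exists in GLib:

/* __sync_fetch_and_sub() returns the value *before* the subtraction,
 * so the count has just hit zero iff the old value was 1.  The XOR
 * expression is a compile-time type check, as in the existing macros. */
#define g_atomic_int16_dec_and_test(atomic)                   \
  (G_GNUC_EXTENSION ({                                        \
    (void) (0 ? *(atomic) ^ *(atomic) : 0);                   \
    __sync_fetch_and_sub ((atomic), 1) == 1;                  \
  }))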

::: gobject/gclosure.c
@@ +141,3 @@
+G_STMT_START {                                                   \
+  ClosureInt16 *cunion = (ClosureInt16*) _closure;               \
+  g_atomic_int16_inc(&cunion->vint16);                           \

Nitpick: Coding style requires a space before `(` in function calls and declarations (in several places in this patch).

@@ +561,3 @@
   g_return_val_if_fail (closure->ref_count < CLOSURE_MAX_REF_COUNT, NULL);
 
+  INC_REF_COUNT(closure);

Nitpick: Coding style requires a space before `(`.

@@ +623,1 @@
+  if (closure->ref_count == 0)

These two lines should be combined to a single line which calls g_atomic_int16_dec_and_test() to atomically decrement *and* check that the reference count has hit zero, otherwise there’s a race condition here in the (closure->ref_count == 0) check.
Comment 9 Philip Withnall 2018-04-11 12:05:14 UTC
Punit, any progress on this?
Comment 10 Punit Agrawal 2018-04-13 14:12:25 UTC
(In reply to Philip Withnall from comment #9)
> Punit, any progress on this?

Hi Philip,

Apologies for not getting back earlier. Looks like this one slipped through the cracks.

TBH, I'm not sure about the best way to progress this patch. Outline below.

The issue for Debian builds was worked around by reducing the test iteration count - after which there were no longer timeouts during builds.

Although the patch improves the generated assembly for g_closure_ref/unref, further testing revealed that it still bottlenecks on the particular micro-architecture on which the issue was reported.

It's an improvement but doesn't fix the problem. As such, I am of the opinion that we should park these changes until either the glib atomics are updated, or we have further reports of issues in this area.

Thoughts?
Comment 11 Philip Withnall 2018-04-17 21:58:58 UTC
(In reply to Punit Agrawal from comment #10)
> (In reply to Philip Withnall from comment #9)
> > Punit, any progress on this?
> 
> Hi Philip,
> 
> Apologies for not getting back earlier. Looks like this one slipped through
> the crack.
> 
> TBH, I'm not sure about the best way to progress this patch. Outline below.
> 
> The issue for Debian builds was worked around by reducing the test iteration
> count - after which there was no longer timeouts during builds.

Yeah, but that doesn’t solve the underlying problem.

> Although the patch improves the generated assembly for g_closure_ref/unref,
> further testing revealed that it still bottlenecks on the particular
> micro-architecture on which the issue was reported.

Are the bottlenecks still on the same instructions, or on other parts of the GClosure implementation now? Are we always going to bottleneck here because a GClosure (presumably) mostly fits into a single arm64 cache line, and hence all atomic operations on a GClosure will conflict for access to the same cache line, or something like that?

> It's an improvement but doesn't fix the problem. As such, I am of the
> opinion to park these change until - either glib atomics are updated, or we
> have further reports of issues in this area.

I would prefer to push this through and get the changes in, assuming that fixing the rest of the bottleneck won’t require these changes to be overwritten again. If we can benefit the generated assembly for all architectures with a switch to 16-bit atomics, that would be worthwhile.
Comment 12 Punit Agrawal 2018-05-07 20:16:16 UTC
(In reply to Philip Withnall from comment #11)
> (In reply to Punit Agrawal from comment #10)
> > (In reply to Philip Withnall from comment #9)
> > > Punit, any progress on this?
> > 
> > Hi Philip,
> > 
> > Apologies for not getting back earlier. Looks like this one slipped through
> > the crack.
> > 
> > TBH, I'm not sure about the best way to progress this patch. Outline below.
> > 
> > The issue for Debian builds was worked around by reducing the test iteration
> > count - after which there was no longer timeouts during builds.
> 
> Yeah, but that doesn’t solve the underlying problem.

Agreed - it is a bodge to get the build through. Doesn't really change behaviour.

To refresh my memory on the behaviour with the patch, I had another go at testing it on the machine where the problem was detected and found that the closures test ran in ~20min (in the two runs I did). Previously this was taking >150 minutes without the patch.

I can only surmise that I hadn't run the test for long enough to see the reduced runtime, and had instead looked only at the profile and extrapolated no change in behaviour. More below.

> 
> > Although the patch improves the generated assembly for g_closure_ref/unref,
> > further testing revealed that it still bottlenecks on the particular
> > micro-architecture on which the issue was reported.
> 
> Are the bottlenecks still on the same instructions, or on other parts of the
> GClosure implementation now? Are we always going to bottleneck here because
> a GClosure (presumably) mostly fits into a single arm64 cache line, and
> hence all atomic operations on a GClosure will conflict for access to the
> same cache line, or something like that?

My theory is that, due to the way the closure refcounting was
implemented, each refcount change takes a number of iterations around
the loop to perform the update atomically.

The patch makes the atomic update a lot tighter (compare with the
sequence from comment 0). First column is percentage samples. ldxrh
and stlxrh are half-word exclusive load and store instructions.

 46.23 │ 30:   ldxrh  w0, [x19]
       │       add    w0, w0, #0x1
  0.20 │       stlxrh w1, w0, [x19]
       │     ↑ cbnz   w1, 30
 48.64 │       dmb    ish
       │       g_return_val_if_fail (closure->ref_count > 1, NULL);
  1.66 │       ldr    w0, [x19]
       │       and    w0, w0, #0x7fff
  0.03 │       cmp    w0, #0x1
       │     ↓ b.ls   a0


The profile shows that the exclusive load is still the hottest
instruction, but due to the reduced length of the critical section,
the atomic increment/decrement succeeds more often than not. At
least, that's my theory. :)

> 
> > It's an improvement but doesn't fix the problem. As such, I am of the
> > opinion to park these change until - either glib atomics are updated, or we
> > have further reports of issues in this area.
> 
> I would prefer to push this through and get the changes in, assuming that
> fixing the rest of the bottleneck won’t require these changes to be
> overwritten again. If we can benefit the generated assembly for all
> architectures with a switch to 16-bit atomics, that would be worthwhile.

Agreed - I've started addressing your review comments in preparation
for uploading the patches.
Comment 13 Punit Agrawal 2018-05-07 20:25:37 UTC
One query about your review comment.

(In reply to Philip Withnall from comment #8)
> Review of attachment 367406 [details] [review] [review]:
> 

[...]


> @@ +561,3 @@
>    g_return_val_if_fail (closure->ref_count < CLOSURE_MAX_REF_COUNT, NULL);
>  
> +  INC_REF_COUNT(closure);
> 
> Nitpick: Coding style requires a space before `(`.
> 
> @@ +623,1 @@
> +  if (closure->ref_count == 0)
> 
> These two lines should be combined to a single line which calls
> g_atomic_int16_dec_and_test() to atomically decrement *and* check that the
> reference count has hit zero, otherwise there’s a race condition here in the
> (closure->ref_count == 0) check.

Can you elaborate on what the race is? The decrementing update and the comparison will be two separate instructions, and the thread can still get switched out after the decrement but before the comparison.
Comment 14 Philip Withnall 2018-05-09 10:30:16 UTC
(In reply to Punit Agrawal from comment #13)
> > @@ +623,1 @@
> > +  if (closure->ref_count == 0)
> > 
> > These two lines should be combined to a single line which calls
> > g_atomic_int16_dec_and_test() to atomically decrement *and* check that the
> > reference count has hit zero, otherwise there’s a race condition here in the
> > (closure->ref_count == 0) check.
> 
> Can you elaborate what the race is? The decrementing update and the
> comparison will be two separate instructions and the thread can still get
> switched out after the decrement but before the comparison.

If the decrement and the comparison are not atomic wrt each other, it’s possible for one thread to unref() and reduce the refcount to 0, but get descheduled before the comparison. Another thread could then be scheduled and resurrect the object from the dead with a ref() (which should not be allowed). The first thread may then continue to finalise everything, depending on whether it re-fetches the refcount for its comparison, or whether it uses a register copy.

This is precisely the use case that atomic decrement-and-test was designed to handle.
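
A self-contained illustration of the race, using the existing 32-bit GLib atomics (the 16-bit variants proposed in the patch are analogous); the Obj type and its functions are invented for the example:

#include <glib.h>

typedef struct { gint ref_count; } Obj;

static void
obj_finalize (Obj *obj)
{
  g_free (obj);
}

/* Racy: another thread can call ref() between the decrement and the
 * zero check, resurrecting the object, yet this thread may still go
 * on to finalize it. */
static void
obj_unref_racy (Obj *obj)
{
  g_atomic_int_add (&obj->ref_count, -1);
  if (g_atomic_int_get (&obj->ref_count) == 0)
    obj_finalize (obj);
}

/* Safe: the decrement and the zero test happen as a single atomic
 * operation, so no other thread can slip in between. */
static void
obj_unref_safe (Obj *obj)
{
  if (g_atomic_int_dec_and_test (&obj->ref_count))
    obj_finalize (obj);
}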
Comment 15 GNOME Infrastructure Team 2018-05-24 20:02:47 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1316.