GNOME Bugzilla – Bug 791706
test suite times out on arm64 when running the closures test
Last modified: 2018-05-24 20:02:47 UTC
Version: 2.54.2
Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880883

In Debian we have the problem that glib does not build reliably on arm64; more specifically, the test suite fails rather often with a timeout when it runs on certain (slower) arm64 hardware. See https://buildd.debian.org/status/logs.php?pkg=glib2.0&arch=arm64

This seems to happen when the closures test is run, which is killed by the buildd after 150 min. If you want to read the whole bug report, see the referenced downstream bug report. I'm copying here the reply from Punit Agrawal, which seems to be the most relevant one:

I had a chance to dig into this further and think I now know what's going on.

I ran the test on an AMD Seattle system (same as arm-arm-04) and noticed that the closures test (tests/refcount/closures) didn't finish in >150 minutes. Profiling the test using "perf" pointed to g_closure_ref() and g_closure_unref() as taking the bulk of execution time. This confirms Emilio's suspicion in #20.

$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:uppp'
# Event count (approx.): 1497293182208
#
# Overhead  Command   Shared Object               Symbol
# ........  ........  ..........................  ............................................
#
    67.85%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_ref
    29.10%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_unref
     2.45%  closures  libgobject-2.0.so.0.5400.2  [.] g_closure_invoke
     0.22%  closures  closures                    [.] test_closure
     0.11%  closures  closures                    [.] g_closure_unref@plt

Copying the annotated profile of g_closure_ref(). The first column is the percentage of samples at each instruction as observed by "perf record":

       │        INC_ASSIGN (closure, ref_count, &new_ref_count);
  0.20 │ 30:    ldr     w1, [x19]
  0.31 │        str     w1, [x29, #32]
  1.06 │        ldr     w0, [x29, #32]
  0.32 │        ldr     w0, [x29, #32]
  0.24 │        ldr     w2, [x29, #32]
       │        add     w0, w0, #0x1        <------- A
       │        bfxil   w2, w0, #0, #15
  0.13 │        str     w2, [x29, #32]
  2.35 │        ldr     w0, [x29, #32]
  0.46 │        ldr     w2, [x29, #32]
       │        and     w0, w0, #0x7fff
 54.73 │ 5c:    ldxr    w3, [x19]           <------- B
       │        cmp     w3, w1
  0.02 │      ↓ b.ne    70
  0.97 │        stlxr   w4, w2, [x19]
       │      ↑ cbnz    w4, 5c
 38.56 │ 70:    dmb     ish
       │      ↑ b.ne    30
       │        g_return_val_if_fail (new_ref_count > 1, NULL);
  0.00 │        cmp     w0, #0x1
       │      ↓ b.ls    c8
       │
       │        return closure;

The ldxr and dmb instructions together make up ~92% of all the samples. The generated code also looks quite bad - notice the redundant loads to w0/w2. It's as if the code was compiled with optimisations turned off, which is not the case (it's compiled with -O2).

The duplicate loads also lead to a large window between incrementing the reference [A] and trying to do a compare and swap starting at [B]. Another thread performing the same accesses between A and B will cause the cache line to bounce between the two cores and also make the ldxr fail, restarting the loop.

Aside: the x86 assembly for ref/unref also has duplicate loads, but x86 seems not to mind them much.

On further inspection, I noticed the following definition of the GClosure structure:
struct _GClosure
{
  /*< private >*/
  volatile guint ref_count : 15;
  /* meta_marshal is not used anymore but must be zero for historical
   * reasons as it was exposed in the G_CLOSURE_N_NOTIFIERS macro */
  volatile guint meta_marshal_nouse : 1;
  volatile guint n_guards : 1;
  volatile guint n_fnotifiers : 2;  /* finalization notifiers */
  volatile guint n_inotifiers : 8;  /* invalidation notifiers */
  volatile guint in_inotify : 1;
  volatile guint floating : 1;
  /*< protected >*/
  volatile guint derivative_flag : 1;
  /*< public >*/
  volatile guint in_marshal : 1;
  volatile guint is_invalid : 1;
  ...
};

Specifically, ref_count is defined as a volatile bit field. That, coupled with the implementation of the increment, explains the bad assembly sequence generated for g_closure_ref()/g_closure_unref(). AFAICT there is no need for ref_count (or the other bit fields in the structure) to be volatile, especially as they seem to be updated with the GCC __sync builtins.

I have a few ideas on how to improve the code by refactoring INC_ASSIGN/DEC_ASSIGN to use specialised versions of CHANGE_FIELD in gobject/gclosure.c. I'll give those a shot and see if they lead to an improvement in performance. Happy to discuss any suggestions or try patches and report back.

It might also be worth raising/discussing the findings here with upstream. I am not at all familiar with glib, and they might be able to help explain the code structure and have suggestions for improvement.

Since I hadn't dug into this enough... ;) I also ran a few iterations of the test on some other ARM64 platforms, Juno R2 and Juno R0 (the latter is the same as arm-arm-01). On the R2 the test always finished in a couple of minutes. On the Juno R0 it sometimes finished in a couple of minutes, but sometimes it seemed to get tied up in knots and took over 15 minutes (at which point I terminated the test). After some experiments with restricting the cores the test ran on (using taskset), I concluded that the test took a long time when it ran on the A57s (incidentally the same cores as on the AMD Seattle system) but not when it ran on the A53s.

So it seems the micro-architecture behaviour, combined with the poor code sequence, is making this especially painful on some cores, while others chug through without breaking a sweat.
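For readers not familiar with gobject/gclosure.c, the update pattern being discussed looks roughly like the sketch below. This is a simplified reconstruction of what the INC_ASSIGN/CHANGE_FIELD macros do, not the actual code; in particular, the real macros re-pack all of the bit fields shown above rather than using explicit masks.

/* Simplified sketch of the read-modify-CAS loop used for bit-field
 * updates in gclosure.c (illustrative only, not the real macro). */
static void
closure_ref_sketch (GClosure *closure)
{
  guint old_int, new_int;

  do
    {
      /* volatile forces a fresh load on every access to the bit
       * fields, which is where the redundant ldr instructions in the
       * arm64 disassembly come from */
      old_int = *(volatile guint *) closure;

      /* bump the 15-bit ref_count field inside the first word,
       * leaving the other bit fields untouched */
      new_int = (old_int & ~0x7fffu) | ((old_int + 1) & 0x7fffu);
    }
  /* retry until no other thread modified the word in between */
  while (!g_atomic_int_compare_and_exchange ((gint *) closure,
                                             (gint) old_int,
                                             (gint) new_int));
}

Every instruction between the load and the compare-and-exchange widens the window in which another core can touch the same cache line and force a retry, which is the contention described above.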
The `volatile` qualifiers in GClosure seem to have been added in commit f881779a8cf1904757a5cb292279ee87a0252df1 (13 years ago) to get atomics working on s390. I think there’s definitely scope to revisit this. Reworking g_closure_ref() and g_closure_unref() to use more specialised code, rather than CHANGE_FIELD, seems like a good idea. You should be able to use g_atomic_int_inc() and friends, rather than a manual compare-and-swap.
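For a plain integer counter, the pattern being suggested here would look roughly like the sketch below. This is hypothetical: it assumes ref_count were a standalone (non-bit-field) gint, which, as noted in comment #4 below, GClosure cannot have without an ABI break.

/* Hypothetical sketch only: assumes a standalone gint ref_count
 * rather than the 15-bit bit field GClosure actually has. */
static void
take_ref_sketch (gint *ref_count)
{
  /* single atomic increment, no CAS retry loop needed */
  g_atomic_int_inc (ref_count);
}

static void
drop_ref_sketch (gint *ref_count)
{
  /* decrement and zero check happen as one atomic operation */
  if (g_atomic_int_dec_and_test (ref_count))
    {
      /* last reference dropped: finalize here */
    }
}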
(In reply to Philip Withnall from comment #1)
> Reworking g_closure_ref() and g_closure_unref() to use more specialised
> code, rather than CHANGE_FIELD, seems like a good idea. You should be able
> to use g_atomic_int_inc() and friends, rather than a manual compare-and-swap.

I followed your suggestion and came up with the attached patch. It is bigger than I expected due to the need to move ref_count out of the bit field in order to use the atomic helpers. The move also means there are now wasted bytes in the structure. I wasn't sure if moving ref_count to the end of the definition would have any adverse effects.

The generated code looks a lot tighter. Would something like the patch below be acceptable? (It needs testing on the Arm system where the ref/unref was particularly slow.)
Created attachment 367050 [details] [review]
Optimised closure ref_count accesses
Review of attachment 367050 [details] [review]:

Unfortunately, as a struct in a public header, we can’t change the size of GClosure, as that would be an ABI break. Sorry, I should have noticed that before.

Three options I can think of:
 • Rework the code in CHANGE_FIELD to be more efficient on arm64.
 • Use a custom refcounting implementation with the existing field. (This might just end up looking like CHANGE_FIELD though.)
 • Combine the ref_count:15 and meta_marshal_nouse:1 bit fields into a single gint16 ref_count, and use 16-bit atomic intrinsics on that. I don’t know whether the arm64 implementation of a 16-bit atomic increment/decrement-and-test is native or a compiler fallback though.
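For reference, 16-bit atomic helpers of the kind suggested in the last option could presumably be built on the GCC __sync builtins, which accept 2-byte operands; roughly as in the sketch below. The helper names are illustrative, not existing GLib API, and whether the compiler emits native instructions or a library fallback depends on the target.

#include <glib.h>

/* Hypothetical sketch of 16-bit atomic helpers built on the GCC
 * __sync builtins (names are illustrative, not existing GLib API). */
static inline void
atomic_int16_inc_sketch (guint16 *atomic)
{
  __sync_fetch_and_add (atomic, 1);
}

static inline gboolean
atomic_int16_dec_and_test_sketch (guint16 *atomic)
{
  /* __sync_sub_and_fetch returns the new value; zero means the
   * last reference was just dropped */
  return __sync_sub_and_fetch (atomic, 1) == 0;
}

With GCC on arm64 (without the ARMv8.1 LSE atomics) these lower to ldxrh/stlxrh exclusive-access loops, consistent with the disassembly quoted in comment #12 below.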
(In reply to Philip Withnall from comment #4)
> Review of attachment 367050 [details] [review] [review]:

Thanks a lot for taking a look at the patch.

> Unfortunately, as a struct in a public header, we can’t change the size of
> GClosure, as that would be an ABI break. Sorry, I should have noticed that
> before.

Too bad the structure can't be updated. Being able to split ref_count out into a separate field would have made the code a lot cleaner and also brought GClosure closer to the other refcounting schemes in glib.

> Three options I can think of:
> • Rework the code in CHANGE_FIELD to be more efficient on arm64.

I had a stab at this but was only able to shave two instructions (coming from _SET_NEW) off the critical section between changing the value of the field and the atomic CAS.

> • Use a custom refcounting implementation with the existing field. (This
> might just end up looking like CHANGE_FIELD though.)
> • Combine the ref_count:15 and meta_marshal_nouse:1 bit fields into a single
> gint16 ref_count, and use 16-bit atomic intrinsics on that. I don’t know
> whether the arm64 implementation of a 16-bit atomic increment/decrement-and-test
> is native or a compiler fallback though.

Using 16-bit atomics gave the best improvement in the generated code (patch attached). If you think the change can stick, it should improve the code regardless of architecture.
Created attachment 367406 [details] [review]
Implement g_closure_[ref|unref]() in terms of 16-bit atomics
Thanks for your patience. This is on my review queue (in 10th place out of 31 bugs), but I don’t think I’m going to get time to review it thoroughly before the 2.56 code freeze (2018-03-05). In any case, this is a sufficiently deep-seated change that it should probably be merged early in the 2.58 cycle, to get some good testing time before 2.58 is released (rather than landing really late in the 2.56 cycle).

Is it going to be shipped in Debian as a vendor patch at all? If it is, that would provide us with useful testing data.
Review of attachment 367406 [details] [review]:

::: glib/gatomic.h
@@ +182,3 @@
+    (void) (0 ? *(atomic) ^ *(atomic) : 0); \
+    (void) __sync_fetch_and_sub ((atomic), 1); \
+  }))

If these are going to be in the gatomic.h public header, they are public API and should be treated as such, so:
 • List them in docs/reference/glib/glib-sections.txt
 • Add tests in glib/tests/atomic.c
 • Add the Windows and non-compiler-intrinsic-based implementations in gatomic.c (otherwise GClosure will stop working on those platforms)
 • Split the gatomic changes out into a separate commit from the gclosure.c changes
 • Change g_atomic_int16_dec() to g_atomic_int16_dec_and_test(), for consistency with the existing 32-bit API and increased usefulness

::: gobject/gclosure.c
@@ +141,3 @@
+G_STMT_START { \
+  ClosureInt16 *cunion = (ClosureInt16*) _closure; \
+  g_atomic_int16_inc(&cunion->vint16); \

Nitpick: Coding style requires a space before `(` in function calls and declarations (in several places in this patch).

@@ +561,3 @@
   g_return_val_if_fail (closure->ref_count < CLOSURE_MAX_REF_COUNT, NULL);

+  INC_REF_COUNT(closure);

Nitpick: Coding style requires a space before `(`.

@@ +623,1 @@
+  if (closure->ref_count == 0)

These two lines should be combined into a single line which calls g_atomic_int16_dec_and_test() to atomically decrement *and* check that the reference count has hit zero; otherwise there’s a race condition in the (closure->ref_count == 0) check.
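Based on the hunks quoted in this review, the patch appears to overlay a 16-bit integer on the first two bytes of GClosure (which hold ref_count:15 and meta_marshal_nouse:1) so they can be updated with the new 16-bit atomics. A rough reconstruction follows; this is inferred from the review hunks, not taken from the attachment itself, and g_atomic_int16_inc() is the API proposed by the patch, not existing GLib API.

/* Rough reconstruction from the review hunks above. */
typedef union
{
  GClosure closure;
  guint16  vint16;   /* aliases ref_count:15 + meta_marshal_nouse:1 */
} ClosureInt16;

#define INC_REF_COUNT(_closure)                              \
  G_STMT_START {                                             \
    ClosureInt16 *cunion = (ClosureInt16 *) (_closure);      \
    g_atomic_int16_inc (&cunion->vint16);                    \
  } G_STMT_END

Since meta_marshal_nouse must stay zero and g_closure_ref() checks CLOSURE_MAX_REF_COUNT first (as shown in the hunk above), a 16-bit increment should only ever change the ref_count bits, though the exact bit-field layout is of course compiler- and endian-dependent.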
Punit, any progress on this?
(In reply to Philip Withnall from comment #9)
> Punit, any progress on this?

Hi Philip,

Apologies for not getting back earlier. Looks like this one slipped through the cracks.

TBH, I'm not sure about the best way to progress this patch. Outline below.

The issue for the Debian builds was worked around by reducing the test iteration count, after which there were no longer timeouts during builds.

Although the patch improves the generated assembly for g_closure_ref/unref, further testing revealed that it still bottlenecks on the particular micro-architecture on which the issue was reported. It's an improvement but doesn't fix the problem. As such, I am of the opinion that we should park these changes until either the glib atomics are updated, or we have further reports of issues in this area.

Thoughts?
(In reply to Punit Agrawal from comment #10)
> (In reply to Philip Withnall from comment #9)
> > Punit, any progress on this?
>
> Hi Philip,
>
> Apologies for not getting back earlier. Looks like this one slipped through
> the cracks.
>
> TBH, I'm not sure about the best way to progress this patch. Outline below.
>
> The issue for the Debian builds was worked around by reducing the test
> iteration count, after which there were no longer timeouts during builds.

Yeah, but that doesn’t solve the underlying problem.

> Although the patch improves the generated assembly for g_closure_ref/unref,
> further testing revealed that it still bottlenecks on the particular
> micro-architecture on which the issue was reported.

Are the bottlenecks still on the same instructions, or on other parts of the GClosure implementation now? Are we always going to bottleneck here because a GClosure (presumably) mostly fits into a single arm64 cache line, and hence all atomic operations on a GClosure will conflict for access to the same cache line, or something like that?

> It's an improvement but doesn't fix the problem. As such, I am of the
> opinion that we should park these changes until either the glib atomics are
> updated, or we have further reports of issues in this area.

I would prefer to push this through and get the changes in, assuming that fixing the rest of the bottleneck won’t require these changes to be rewritten again. If we can improve the generated assembly for all architectures with a switch to 16-bit atomics, that would be worthwhile.
(In reply to Philip Withnall from comment #11)
> (In reply to Punit Agrawal from comment #10)
> > (In reply to Philip Withnall from comment #9)
> > > Punit, any progress on this?
> >
> > Hi Philip,
> >
> > Apologies for not getting back earlier. Looks like this one slipped
> > through the cracks.
> >
> > TBH, I'm not sure about the best way to progress this patch. Outline below.
> >
> > The issue for the Debian builds was worked around by reducing the test
> > iteration count, after which there were no longer timeouts during builds.
>
> Yeah, but that doesn’t solve the underlying problem.

Agreed - it is a bodge to get the build through. It doesn't really change behaviour.

To refresh my memory of the behaviour with the patch, I had another go at testing it on the machine where the problem was detected and found that the closures test ran in ~20 min (in the two runs I did). Previously this was taking >150 minutes without the patch. I can only surmise that I hadn't run the test for long enough to see the reduced runtime, but had only looked at the profile and extrapolated that there was no change in behaviour. More below.

> > Although the patch improves the generated assembly for g_closure_ref/unref,
> > further testing revealed that it still bottlenecks on the particular
> > micro-architecture on which the issue was reported.
>
> Are the bottlenecks still on the same instructions, or on other parts of the
> GClosure implementation now? Are we always going to bottleneck here because
> a GClosure (presumably) mostly fits into a single arm64 cache line, and
> hence all atomic operations on a GClosure will conflict for access to the
> same cache line, or something like that?

My theory is that, due to the way the closure refcounting was implemented, each refcount change takes a number of iterations around the loop to perform the update atomically. The patch makes the atomic update a lot tighter (compare with the sequence from comment 0). The first column is the percentage of samples; ldxrh and stlxrh are half-word exclusive load and store instructions.

 46.23 │ 30:    ldxrh   w0, [x19]
       │        add     w0, w0, #0x1
  0.20 │        stlxrh  w1, w0, [x19]
       │      ↑ cbnz    w1, 30
 48.64 │        dmb     ish
       │        g_return_val_if_fail (closure->ref_count > 1, NULL);
  1.66 │        ldr     w0, [x19]
       │        and     w0, w0, #0x7fff
  0.03 │        cmp     w0, #0x1
       │      ↓ b.ls    a0

The profile shows that the exclusive load is still the hottest instruction, but thanks to the shorter critical section the atomic increment/decrement succeeds more often than not. At least that's my theory. :)

> > It's an improvement but doesn't fix the problem. As such, I am of the
> > opinion that we should park these changes until either the glib atomics
> > are updated, or we have further reports of issues in this area.
>
> I would prefer to push this through and get the changes in, assuming that
> fixing the rest of the bottleneck won’t require these changes to be
> rewritten again. If we can improve the generated assembly for all
> architectures with a switch to 16-bit atomics, that would be worthwhile.

Agreed - I've started addressing your review comments in preparation for uploading the patches.
One query about your review comment.

(In reply to Philip Withnall from comment #8)
> Review of attachment 367406 [details] [review] [review]:
> [...]
> @@ +561,3 @@
>    g_return_val_if_fail (closure->ref_count < CLOSURE_MAX_REF_COUNT, NULL);
>
> +  INC_REF_COUNT(closure);
>
> Nitpick: Coding style requires a space before `(`.
>
> @@ +623,1 @@
> +  if (closure->ref_count == 0)
>
> These two lines should be combined into a single line which calls
> g_atomic_int16_dec_and_test() to atomically decrement *and* check that the
> reference count has hit zero; otherwise there’s a race condition in the
> (closure->ref_count == 0) check.

Can you elaborate on what the race is? The decrementing update and the comparison will be two separate instructions, and the thread can still get switched out after the decrement but before the comparison.
(In reply to Punit Agrawal from comment #13)
> > @@ +623,1 @@
> > +  if (closure->ref_count == 0)
> >
> > These two lines should be combined into a single line which calls
> > g_atomic_int16_dec_and_test() to atomically decrement *and* check that the
> > reference count has hit zero; otherwise there’s a race condition in the
> > (closure->ref_count == 0) check.
>
> Can you elaborate on what the race is? The decrementing update and the
> comparison will be two separate instructions, and the thread can still get
> switched out after the decrement but before the comparison.

If the decrement and the comparison are not atomic with respect to each other, it’s possible for one thread to unref() and reduce the refcount to 0, but get descheduled before the comparison. Another thread could then be scheduled and resurrect the object from the dead with a ref() (which should not be allowed). The first thread may then continue to finalise everything, depending on whether it re-fetches the refcount for its comparison or whether it uses a register copy. This is precisely the use case that atomic decrement-and-test was designed to handle.
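A sketch of the two patterns being contrasted here (illustrative only, reusing the g_atomic_int16_* helpers and ClosureInt16 union proposed in the patch, which are not existing GLib API; the finalization body is elided):

/* Racy pattern (what the review warns against): the decrement and the
 * zero check are separate operations.  Thread A can drop the count to
 * 0 and be descheduled before the check; thread B can then ref() the
 * closure back to life while A goes on to finalize it (or A tests a
 * stale register copy). */
g_atomic_int16_dec (&cunion->vint16);
if (closure->ref_count == 0)
  {
    /* ... run finalization notifiers, free the closure ... */
  }

/* Safe pattern: decrement and zero test are a single atomic
 * operation, so exactly one thread observes the drop to zero. */
if (g_atomic_int16_dec_and_test (&cunion->vint16))
  {
    /* ... run finalization notifiers, free the closure ... */
  }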
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1316.