GNOME Bugzilla – Bug 756241
Race condition in parallel DOT runs
Last modified: 2015-12-30 10:19:47 UTC
Created attachment 312910 [details] [review] Hacky fix for the crash I'm running doxygen on a 24-core machine, and doxygen spawns that many threads to run dot. There is a concurrency bug in the reference counting of the QCString class which is creating a sporadic use-after-free error which exhibits in different ways. The problem goes away when setting DOT_NUM_THREADS = 1 In my dot config file, I have DOT_PATH = /tools/bin Sometimes, doxygen will segfault, with a backtrace like:
+ Trace 235553
Sometimes, all instances of dot execution will fail with: sh: /tools/bin/: is a directory Problems running dot: exit code=126, command='/tools/bin/', arguments='...' (notice the lack of "dot" being appended to the command above). Sometimes glibc throws a malloc error. Sometimes it all runs just fine. Here is what I think the root cause is: The LSHeader::refCount variable in qcstring.h is not modified atomically. This affects the reference counting on the string returned by Config_getString("DOT_PATH") in DotRunner::run(). Because the threads all start at roughly the same time on different CPUs and the reference count is not incremented atomically, the value of refCount ends up being less than it should be. This leads to the string being prematurely freed by LSData::dispose(), which results in all of the various crashes above. It's a small race window but when you've got 24 cores contending for the same cache line the chances of hitting it increase significantly. If I replace the increments and decrements of refCount with appropriate calls to the gcc builtin __sync_add_and_fetch() to make these accesses atomic (see terribly hacky attached patch), the bug seems to be resolved. Previously doxygen would crash 2/10 times, and with the attached patch it crashed 0/30 times. Sorry I can't share a test case to reproduce it, but hopefully this information is sufficient to resolve the issue. The bug is quite sensitive to the input data to doxygen and happens much more reliably with certain sets of input sources files, but then disappears after minor changes. I could readily reproduce the bug with doxygen 1.8.9.1, and this is the version I tested the fix on. However, on 1.8.10 and also on git master ffff6954, I couldn't reproduce it. But I believe that's more likely because other changes have affected the timing around dot generation, and the bug is in fact still there.
Hi Bernard, Thanks for the clear description and analysis. The QCString class has recently been reimplemented using reference counting, and this doesn't work well with multi-threading as you observed. I've pushed the following fix: https://github.com/doxygen/doxygen/commit/f196c9f1d69238a814ff3152103f3bd310efdf0d This fix introduces a simple constant string class that is used during DotRunner::run(). Please check if this solves the problem.
This bug was previously marked ASSIGNED, which means it should be fixed in doxygen version 1.8.11. Please verify if this is indeed the case. Reopen the bug if you think it is not fixed and please include any additional information that you think can be relevant (preferably in the form of a self-contained example).