After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 756241 - Race condition in parallel DOT runs
Race condition in parallel DOT runs
Status: RESOLVED FIXED
Product: doxygen
Classification: Other
Component: general
1.8.9.1
Other Linux
: Normal normal
: ---
Assigned To: Dimitri van Heesch
Dimitri van Heesch
Depends on:
Blocks:
 
 
Reported: 2015-10-08 12:33 UTC by Bernard B
Modified: 2015-12-30 10:19 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Hacky fix for the crash (1.27 KB, patch)
2015-10-08 12:33 UTC, Bernard B
none Details | Review

Description Bernard B 2015-10-08 12:33:12 UTC
Created attachment 312910 [details] [review]
Hacky fix for the crash

I'm running doxygen on a 24-core machine, and doxygen spawns that many threads to run dot. There is a concurrency bug in the reference counting of the QCString class which is creating a sporadic use-after-free error which exhibits in different ways.

The problem goes away when setting DOT_NUM_THREADS = 1

In my dot config file, I have DOT_PATH = /tools/bin

Sometimes, doxygen will segfault, with a backtrace like:

  • #0 memcpy
    from /lib64/libc.so.6
  • #1 QCString::LSData::resize
    at ../qtools/qcstring.h line 417
  • #2 QCString::StringRep::resize
    at ../qtools/qcstring.h line 673
  • #3 QCString::fill
    at ../qtools/qcstring.h line 245
  • #4 QCString::operator+=
    at ../qtools/qcstring.h line 312
  • #5 operator+
    at ../qtools/qcstring.h line 780
  • #6 DotRunner::run
    at dot.cpp line 813
  • #7 DotWorkerThread::run
    at dot.cpp line 1193
  • #8 QThreadPrivate::start
    at qthread_unix.cpp line 87
  • #9 start_thread
    from /lib64/libpthread.so.0
  • #10 clone
    from /lib64/libc.so.6

Sometimes, all instances of dot execution will fail with:

sh: /tools/bin/: is a directory
Problems running dot: exit code=126, command='/tools/bin/', arguments='...'

(notice the lack of "dot" being appended to the command above).

Sometimes glibc throws a malloc error.

Sometimes it all runs just fine.

Here is what I think the root cause is: The LSHeader::refCount variable in qcstring.h is not modified atomically. This affects the reference counting on the string returned by Config_getString("DOT_PATH") in DotRunner::run(). Because the threads all start at roughly the same time on different CPUs and the reference count is not incremented atomically, the value of refCount ends up being less than it should be. This leads to the string being prematurely freed by LSData::dispose(), which results in all of the various crashes above. It's a small race window but when you've got 24 cores contending for the same cache line the chances of hitting it increase significantly.

If I replace the increments and decrements of refCount with appropriate calls to the gcc builtin __sync_add_and_fetch() to make these accesses atomic (see terribly hacky attached patch), the bug seems to be resolved. Previously doxygen would crash 2/10 times, and with the attached patch it crashed 0/30 times.

Sorry I can't share a test case to reproduce it, but hopefully this information is sufficient to resolve the issue. The bug is quite sensitive to the input data to doxygen and happens much more reliably with certain sets of input sources files, but then disappears after minor changes.

I could readily reproduce the bug with doxygen 1.8.9.1, and this is the version I tested the fix on. However, on 1.8.10 and also on git master ffff6954, I couldn't reproduce it. But I believe that's more likely because other changes have affected the timing around dot generation, and the bug is in fact still there.
Comment 1 Dimitri van Heesch 2015-10-17 15:27:58 UTC
Hi Bernard,

Thanks for the clear description and analysis. The QCString class has recently been reimplemented using reference counting, and this doesn't work well with multi-threading as you observed.  

I've pushed the following fix:
https://github.com/doxygen/doxygen/commit/f196c9f1d69238a814ff3152103f3bd310efdf0d

This fix introduces a simple constant string class that is used during DotRunner::run().

Please check if this solves the problem.
Comment 2 Dimitri van Heesch 2015-12-30 10:19:47 UTC
This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.11. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information 
that you think can be relevant (preferably in the form of a self-contained example).