GNOME Bugzilla – Bug 731752
Write after free cross thread race in PipeCapture::_OnReadable()
Last modified: 2014-07-15 16:48:33 UTC
Created attachment 278563 [details] [review] Write after free race debug patch 1 Hardware: Dual core VirtualBox guest OS: Fedora 20 GParted current master tip: GPARTED_0_19_0-2-gbfaa53e (With some print debugging applied) Compile with: ./autogen.sh CXXFLAGS='-g -O0' && make clean && cd src && make Run under valgrind: su root valgrind --track-origins=yes ./gpartedbin 2>&1 | tee out Test: Just load GParted and immediately close This appears to be a race between two threads: main(), set_devices_thread(). _OnReadable() callback is sometimes being called in the main() thread after the object destructor has run in the set_devices_thread() thread. Seems to happen every time under valgrind and quite often when running directly. Fragment of output under valgrind: (Full output attached as: write-after-free-race-valgrind-1.txt) --8<-- D: tid=20464 main() ... D: tid=20474 GParted_Core::set_devices_thread() ... D: tid=20474 execute_command(command="dumpe2fs -h /dev/sda1", output, error, use_C_locale=1) D: tid=20474 this=0x138054a0 PipeCapture::PipeCapture() D: tid=20474 this=0x138054f0 PipeCapture::PipeCapture() D: tid=20474 this=0x138054a0 PipeCapture::connect_signal() D: tid=20474 this=0x138054f0 PipeCapture::connect_signal() D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x13805518 D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() D: tid=20474 this=0x138054f0 PipeCapture::~PipeCapture() D: tid=20474 this=0x138054a0 PipeCapture::~PipeCapture() D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x138054c8 ==20464== Thread 1: ==20464== Invalid write of size 4 ==20464== at 0x49120E: GParted::PipeCapture::_OnReadable(_GIOChannel*, GIOCondition, void*) (PipeCapture.cc:56) ==20464== by 0x38662492A5: g_main_context_dispatch (gmain.c:3066) ==20464== by 0x3866249627: g_main_context_iterate.isra.24 (gmain.c:3713) ==20464== by 0x3866249A39: g_main_loop_run (gmain.c:3907) ==20464== by 0x3D33D45C26: gtk_main (gtkmain.c:1257) ==20464== by 0x46A122: GParted::GParted_Core::set_devices(std::vector<GParted::Device, std::allocator<GParted::Device> >&) (GParted_Core.cc:156) ==20464== by 0x4A87EB: GParted::Win_GParted::menu_gparted_refresh_devices() (Win_GParted.cc:1259) ==20464== by 0x4A8780: GParted::Win_GParted::on_show() (Win_GParted.cc:1253) ==20464== by 0x3D35B2009C: Gtk::Widget_Class::show_callback(_GtkWidget*) (widget.cc:3855) ==20464== by 0x3867210297: g_closure_invoke (gclosure.c:777) ==20464== by 0x3867221B86: signal_emit_unlocked_R (gsignal.c:3516) ==20464== by 0x386722A0F1: g_signal_emit_valist (gsignal.c:3330) ==20464== Address 0x138054c8 is not stack'd, malloc'd or (recently) free'd ==20464== Fragment of output run direct: (Full output attached as: write-after-free-race-direct-1.txt) --8<-- D: tid=20626 main() ... D: tid=20635 GParted_Core::set_devices_thread() ... D: tid=20635 execute_command(command="udevadm settle --timeout=1", output, error, use_C_locale=0) D: tid=20635 this=0x7f319f48f540 PipeCapture::PipeCapture() D: tid=20635 this=0x7f319f48f590 PipeCapture::PipeCapture() D: tid=20635 this=0x7f319f48f540 PipeCapture::connect_signal() D: tid=20635 this=0x7f319f48f590 PipeCapture::connect_signal() D: tid=20626 data=0x7f319f48f540 PipeCapture::_OnReadable() D: tid=20626 data=0x7f319f48f540 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f568 D: tid=20626 data=0x7f319f48f590 PipeCapture::_OnReadable() D: tid=20635 this=0x7f319f48f590 PipeCapture::~PipeCapture() D: tid=20635 this=0x7f319f48f540 PipeCapture::~PipeCapture() D: tid=20626 data=0x7f319f48f590 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f5b8 ... D: tid=20635 execute_command(command="btrfs filesystem show /dev/sdb3", output, error, use_C_locale=1) D: tid=20635 this=0x7f319f48f230 PipeCapture::PipeCapture() D: tid=20635 this=0x7f319f48f280 PipeCapture::PipeCapture() D: tid=20635 this=0x7f319f48f230 PipeCapture::connect_signal() D: tid=20635 this=0x7f319f48f280 PipeCapture::connect_signal() D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable() D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable() D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f258 D: tid=20626 data=0x7f319f48f280 PipeCapture::_OnReadable() D: tid=20635 this=0x7f319f48f280 PipeCapture::~PipeCapture() D: tid=20635 this=0x7f319f48f230 PipeCapture::~PipeCapture() D: tid=20635 execute_command(command="btrfs filesystem show /dev/sdb1", output, error, use_C_locale=1) D: tid=20626 data=0x7f319f48f280 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f2a8
Created attachment 278564 [details] Debug output 1 - run under valgrind
Created attachment 278565 [details] Debug output 2 - run direct
Hi Mike, Thank you for reporting this problem with a high level of detail. Using the attachment in comment #0 I am able to reproduce the behaviour using the gparted 0.19.0 code in a single CPU VMWare Player ubuntu 14.04 virtual machine (wow, that was a mouthful ;-). I have been reading up on pipes and signal handling in an effort to understand what is occurring. Use reentrant functions for safer signal handling -- How and when to employ reentrancy to keep your code bug free http://www.ibm.com/developerworks/linux/library/l-reent/index.html The problem with PipeCapture being destroyed before all the signals are handled might be similar to a problem we had earlier where PipeCapture was created and destroyed before we could connect the signal handlers. To address the earlier problem in Utils::execute_command() we had to lock a mutex so we had time to setup PipeCapture for output and error streams before connecting the input/output signal handler. We might need to implement something similar to prevent the PipeCapture destruction before all signals have been handled and all data read. Phillip, Your thoughts on this issue would be appreciated. Curtis
Without fully understanding how I might implement a mutex to ensure that all data is read and all signals are handled before closing the pipes, I tried adding a simple sleep(1) before closing the pipes. For example: diff --git a/src/Utils.cc b/src/Utils.cc index 7ae1450..2bbad68 100644 --- a/src/Utils.cc +++ b/src/Utils.cc @@ -547,6 +547,8 @@ int Utils::execute_command( const Glib::ustring & command, status.cond.wait( status.mutex ); status.mutex.unlock(); } + //Simplistic attempt to ensure all signals handled before closing + sleep( 1 ); close( out ); close( err ); return status.exit_status; In my testing, this non-elegant solution does appear to permit enough time for the data to be read and signals handled before the pipes are closed. I realize that this is not an ideal solution, and is instead a work-around that might not work in extreme situations. A better solution would somehow guarantee that the data was read and signals handled before the pipes are closed.
Hi Curtis, I think we need to add the code I previously got Phillip to exclude, back in to the end of PipeCapture::OnReadable(): bug 697727 comment #53 This will remove the _OnReadable() callback before the signal_eof signal is generated and should prevent the glib event loop detecting further change on the IO Channel and queuing another callback. There is still the issue of at least PipeCapture member variable sourceid being read and written by both threads and it effectively being used as control between those threads on whether to call g_source_remove() from the destructor or not. For safe multi-threaded access to a single variable 2 things need to be true: 1) Compiler must generate memory access, no register caching of the value within a function. (In C this is what volatile does); 2) CPU must not reorder memory loads and stores (memory barriers). Don't know how to do either of these in C++ class members / Glib. It might be that with changing the end of OnReadable() to this: // signal completion if( sourceid > 0 ) g_source_remove( sourceid ); sourceid = 0; eof(); return false; or this: // signal completion g_source_remove( sourceid ); sourceid = 0; eof(); return false; an empty PipeCapture destructor will work correctly, but I don't really know the guarantees of the glib IO Channel API. I will be a over a week before I have the time to research and test this properly. Thanks Mike
(In reply to comment #3) > Use reentrant functions for safer signal handling > -- How and when to employ reentrancy to keep your code bug free > http://www.ibm.com/developerworks/linux/library/l-reent/index.html In the terminology of this IBM developer works document, reading and writing PipeCapture member variable sourceid by multiple threads would be considered global variable access.
Created attachment 278960 [details] [review] Onreadable use after free fix (Draft 1) Hi Curtis, Managed to find some time to put together this draft patch. Seems to work on Fedora 20. Needs a better commit message and thoroughly testing. Thanks, Mike
More work required. On CentOS 5.10 GParted sometimes reports these messages: (gpartedbin:10252): GLib-CRITICAL **: g_source_remove: assertion `tag > 0' failed When the call flow is like this: set_devices_thread main thread connect_signal() call g_io_add_watch() Command executed, EOF occurs on the IO Channel OnReadable() g_source_remove(0) above assertion g_io_add_watch() returns and assigns sourceid
Created attachment 279960 [details] [review] Write after free debug patch (v2) Here a debug 2 patch which reports more a bit more details about what is going on. It's what I used when documenting exactly what is happening for following patch v1 ...
Created attachment 279962 [details] [review] Onreadable use after free fix (v1) Hi Curtis, Here's a patch to fix this bug. Turns out that we don't need to use any Glib Mutex's and Cond's. Instead just use the standard capability of Glib. Quoting from the patch commit message: This is happening because PipeCapture member variable sourceid is being saved, in a different thread, just so the _OnReadable() callback can be removed. However a glib IOChannel callback, type GIOFunc(), returning false will be automatically removed. GLib Reference Manual 2.26 / IO Channels https://developer.gnome.org/glib/2.26/glib-IO-Channels.html#GIOFunc GIOFunc() Returns : the function should return FALSE if the event source should be removed Therefore fix by just not saving the event sourceid at all, and not calling g_source_remove() to manually remove the callback, but instead letting glib automatically remove the callback when it returns false. Please excuse the enormous commit message. It seemed necessary. Tested successfully on Fedora 20 and CentOS 5.10. Thanks, Mike
Hi Mike, Thank you for persevering on with this problem. Crashes are nasty business that we with to eliminate. I'm busy this weekend but plan to extensively test this early next week on a variety of distros. Curtis
Hi Curtis, First to clarify previous testing mentioned in comment #10: 1) Run GParted under valgrind. Load and immediately exit Expect: No invalid write of size 4 in PipeCapture::_OnReadable() Passed successfully on: Fedora 20 CentOS 5.10 CentOS 6.5 Second to perform some extra testing so hopefully you don't need to do quite so much :-) 1) Load GParted and exit Expect: No Gtk or Glib errors on the console 2) Perform a few operations, such as: create new partition, label and check Expect: External commands executed successfully, output viewable 3) Copy partition into empty space (repeating testing for bug #697727 "Segfault in Gparted 0.15.0 when copying partition") Expect: Successfully partition copy Passed successfully on: CentOS 5.10 CentOS 6.5 Debian 6 Fedora 14 Fedora 19 Fedora 20 Kubuntu 12.04 LTS Ubuntu 13.10 Xubuntu 14.04 LTS [NOTE:1] [NOTE:1] When expanding tree in operation results the following messages are generated: (gpartedbin:6475): GLib-CRITICAL **: Source ID 3337 was not found when attempting to remove it Already documented as bug #731740 "Expand 2nd level Progress Details yields GLib-CRITICAL Source ID message". Thanks, Mike
Hi Mike, Great work on the patch in comment #10. And a big thank you for the extra testing on a variety of distros -- that saved me lots of time enabling me to finish testing during the weekend. :-) Regarding the detailed commit message, I think it is warranted especially with the trouble we've had with elusive random crashes with asynchronous thread handling. For my testing I performed the four test steps listed in comment #12 and some additional random operations such as resize and move for good measure. All of these tests passed successfully on: Debian 7 fedora 12 <--- the oldest GNU/Linux VM I have :-) openSUSE 12.3 openSUSE 13.1 ubuntu 10.04 LTS kubuntu 12.04 LTS <--- My physical development computer ubuntu 14.04 LTS [Same as Note 1 in comment #12] As such, the patch in comment #10 has been committed to the git repository for inclusion in the next release of GParted. The relevant git commit can be viewed at the following link: Prevent cross thread write after free in _OnReadable() (#731752) https://git.gnome.org/browse/gparted/commit/?id=0fcfd18061a45f208415971f190818a41cc59f47 Since this patch fixes a potential crash that might result in a loss of data, if there are no immediate objections then I will plan a 0.19.1 release soon. Curtis
Hi Curtis, I think we should re-test patch 278205 "Multi-device btrfs support (v2)" from bug 723842 "GParted resizes the wrong filesystem", just to see if there are any more crashed which we can identify in existing code before releasing 0.19.1. Mike
Good suggestion Mike. I will begin testing on Debian/ubuntu with the patch for bug 723842 "GParted resizes the wrong filesystem". This will help us build confidence in this patch set, or else show where more improvement is needed. I will hold off on releasing 0.19.1 while testing bug 723842. Curtis
Hi Curtis, I've also re-tested patchset v2 on Fedora 20 doing several resizes of multi-device btrfs successfully so from the point of view of crashes bug 731752 - "Write after free cross thread race in PipeCapture::OnReadable()" has done the job. Q: Include Multi-device btrfs support patch in GParted 0.19.1? I think not. We make 0.x.1 releases with minimal changes on top of 0.x.0 releases just to address crashes and other serious issues. While resizing the wrong device in a multi-device btrfs is bad I think it is a low volume use case and this patchset adds lots of other enhancements which for me makes it 0.y.0 release material. I think we should release 0.19.1 with what is already committed to master and no more. Thanks, Mike
Above comment #16, was meant for bug 723842 - "GParted resizes the wrong filesystem" instead.
It sounds like we are both in agreement that the patch in comment #10 is stable and does fix the problem outlined in this report. If there are no objections, I will start the 0.19.1 release process soon. Curtis
I hadn't seen this thread until now because I'm not on the CC list ( I guess I need to get bugzilla to cc me on all gparted bugs ). After seeing the new point release announcement I updated my git repo and read the change log and at first was thinking that this might fix some of the crash reports I've seen on the ubuntu bug tracker lately, but looking at the changes, I think they are incorrect. The reason that I saved the sourceid and had the dtor kill the source was to handle the case of the PipeCapture object being destroyed prior to reading eof from the pipe. By removing that code, and instead relying on the return from OnReadable to kill the source, gparted will now not be able to handle that case. I believe that case can arise during a cancel event but will need to study the code some more to be sure.
Hi Phillip, Can you explain a case where you think that the PipeCapture will be destroyed before the callback returns false to get the callback unregistered? I have just quickly tested the following cases which work correctly: 1) Kill external command while Applying pending operations. -> Executable receives SIGTERM and exits PipeCapture callback calls signal_eof.emit() and returns false PipeCapture destructor called 2) Cancel external command from Applying pending operations dialog. -> Executable receives SIGINT and exits PipeCapture callback call signal_eof.emit() and returns false PipeCapture destructor called Thanks, Mike
Now that I think about it, I suppose that I did have the cancel path wait for eof so I suppose there currently isn't a way to have the PipeCapture be destroyed before eof, but I'm a little worried that is likely to change at some time in the future and then this would bite us in the ass. I'll have to think on it some more.
Hi Phillip and Mike, In order to check if we re-introduced problems with moves and cancels I have run a series of tests similar to those we used in troubleshooting Bug 697727 - Segfault in Gparted 0.15.0 when copying partition. Specifically I created an 10238 MiB ext4 partition on my kubuntu 12.04 physical development machine. I then moved the partition left or right by a small overlapping amount. For the Cancel half-way I moved left to be sure that the start of the partition was moved first and hence would have to be rolled back. Test results are as follows: Success - Five move only operations (operation takes 11 minutes) Success - Five move then Cancel about half way through move (~11 minutes) Success - Five move then cancel with Force Cancel then Cancel Operations By "Success" I mean that the operation finished or cancelled as expected. Based on these tests I believe that the problem we encountered with GParted 0.15.0, improved with 0.16.0, and then implemented the fix in a second location in 0.16.1 is still addressed. Curtis
The patch to address this bug report was included in GParted 0.19.1 released on July 15, 2014.