Bug 731752 – Write after free cross thread race in PipeCapture::_OnReadable()

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 731752 - Write after free cross thread race in PipeCapture::_OnReadable()


Summary:	Write after free cross thread race in PipeCapture::_OnReadable()


Status:	RESOLVED FIXED

Product:	gparted
Classification:	Other
Component:	application
Version:	GIT HEAD
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gparted maintainers alias
QA Contact:	gparted maintainers alias

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-06-17 00:19 UTC by Mike Fleetwood
Modified:	2014-07-15 16:48 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Write after free race debug patch 1 (5.39 KB, patch) 2014-06-17 00:19 UTC, Mike Fleetwood	none	Details \| Review
Debug output 1 - run under valgrind (27.62 KB, text/plain) 2014-06-17 00:22 UTC, Mike Fleetwood		Details
Debug output 2 - run direct (13.34 KB, text/plain) 2014-06-17 00:23 UTC, Mike Fleetwood		Details
Onreadable use after free fix (Draft 1) (2.71 KB, patch) 2014-06-22 22:55 UTC, Mike Fleetwood	none	Details \| Review
Write after free debug patch (v2) (8.21 KB, patch) 2014-07-05 19:29 UTC, Mike Fleetwood	none	Details \| Review
Onreadable use after free fix (v1) (8.52 KB, patch) 2014-07-05 19:34 UTC, Mike Fleetwood	none	Details \| Review

Description Mike Fleetwood 2014-06-17 00:19:43 UTC

Created attachment 278563 [details] [review]
Write after free race debug patch 1

Hardware:
    Dual core VirtualBox guest
OS:
    Fedora 20
GParted current master tip:
    GPARTED_0_19_0-2-gbfaa53e
    (With some print debugging applied)
Compile with:
    ./autogen.sh CXXFLAGS='-g -O0' && make clean && cd src && make
Run under valgrind:
    su root
    valgrind --track-origins=yes ./gpartedbin 2>&1 | tee out
Test:
    Just load GParted and immediately close


This appears to be a race between two threads: main(),
set_devices_thread().  _OnReadable() callback is sometimes being called
in the main() thread after the object destructor has run in the
set_devices_thread() thread.  Seems to happen every time under valgrind
and quite often when running directly.


Fragment of output under valgrind:
(Full output attached as: write-after-free-race-valgrind-1.txt)
--8<--
D: tid=20464 main()
...
D: tid=20474 GParted_Core::set_devices_thread()
...
D: tid=20474 execute_command(command="dumpe2fs -h /dev/sda1", output, error, use_C_locale=1)
D: tid=20474 this=0x138054a0 PipeCapture::PipeCapture()
D: tid=20474 this=0x138054f0 PipeCapture::PipeCapture()
D: tid=20474 this=0x138054a0 PipeCapture::connect_signal()
D: tid=20474 this=0x138054f0 PipeCapture::connect_signal()
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054f0 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x13805518
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable()
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable()
D: tid=20474 this=0x138054f0 PipeCapture::~PipeCapture()
D: tid=20474 this=0x138054a0 PipeCapture::~PipeCapture()
D: tid=20464 data=0x138054a0 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x138054c8
==20464== Thread 1:
==20464== Invalid write of size 4
==20464==    at 0x49120E: GParted::PipeCapture::_OnReadable(_GIOChannel*, GIOCondition, void*) (PipeCapture.cc:56)
==20464==    by 0x38662492A5: g_main_context_dispatch (gmain.c:3066)
==20464==    by 0x3866249627: g_main_context_iterate.isra.24 (gmain.c:3713)
==20464==    by 0x3866249A39: g_main_loop_run (gmain.c:3907)
==20464==    by 0x3D33D45C26: gtk_main (gtkmain.c:1257)
==20464==    by 0x46A122: GParted::GParted_Core::set_devices(std::vector<GParted::Device, std::allocator<GParted::Device> >&) (GParted_Core.cc:156)
==20464==    by 0x4A87EB: GParted::Win_GParted::menu_gparted_refresh_devices() (Win_GParted.cc:1259)
==20464==    by 0x4A8780: GParted::Win_GParted::on_show() (Win_GParted.cc:1253)
==20464==    by 0x3D35B2009C: Gtk::Widget_Class::show_callback(_GtkWidget*) (widget.cc:3855)
==20464==    by 0x3867210297: g_closure_invoke (gclosure.c:777)
==20464==    by 0x3867221B86: signal_emit_unlocked_R (gsignal.c:3516)
==20464==    by 0x386722A0F1: g_signal_emit_valist (gsignal.c:3330)
==20464==  Address 0x138054c8 is not stack'd, malloc'd or (recently) free'd
==20464== 


Fragment of output run direct:
(Full output attached as: write-after-free-race-direct-1.txt)
--8<--
D: tid=20626 main()
...
D: tid=20635 GParted_Core::set_devices_thread()
...
D: tid=20635 execute_command(command="udevadm settle --timeout=1", output, error, use_C_locale=0)
D: tid=20635 this=0x7f319f48f540 PipeCapture::PipeCapture()
D: tid=20635 this=0x7f319f48f590 PipeCapture::PipeCapture()
D: tid=20635 this=0x7f319f48f540 PipeCapture::connect_signal()
D: tid=20635 this=0x7f319f48f590 PipeCapture::connect_signal()
D: tid=20626 data=0x7f319f48f540 PipeCapture::_OnReadable()
D: tid=20626 data=0x7f319f48f540 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f568
D: tid=20626 data=0x7f319f48f590 PipeCapture::_OnReadable()
D: tid=20635 this=0x7f319f48f590 PipeCapture::~PipeCapture()
D: tid=20635 this=0x7f319f48f540 PipeCapture::~PipeCapture()
D: tid=20626 data=0x7f319f48f590 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f5b8
...
D: tid=20635 execute_command(command="btrfs filesystem show /dev/sdb3", output, error, use_C_locale=1)
D: tid=20635 this=0x7f319f48f230 PipeCapture::PipeCapture()
D: tid=20635 this=0x7f319f48f280 PipeCapture::PipeCapture()
D: tid=20635 this=0x7f319f48f230 PipeCapture::connect_signal()
D: tid=20635 this=0x7f319f48f280 PipeCapture::connect_signal()
D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable()
D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable()
D: tid=20626 data=0x7f319f48f230 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f258
D: tid=20626 data=0x7f319f48f280 PipeCapture::_OnReadable()
D: tid=20635 this=0x7f319f48f280 PipeCapture::~PipeCapture()
D: tid=20635 this=0x7f319f48f230 PipeCapture::~PipeCapture()
D: tid=20635 execute_command(command="btrfs filesystem show /dev/sdb1", output, error, use_C_locale=1)
D: tid=20626 data=0x7f319f48f280 PipeCapture::_OnReadable() if (!rc) &(pc->sourceid)=0x7f319f48f2a8

Comment 1 Mike Fleetwood 2014-06-17 00:22:33 UTC

Created attachment 278564 [details]
Debug output 1 - run under valgrind

Comment 2 Mike Fleetwood 2014-06-17 00:23:20 UTC

Created attachment 278565 [details]
Debug output 2 - run direct

Comment 3 Curtis Gedak 2014-06-18 19:53:40 UTC

Hi Mike,

Thank you for reporting this problem with a high level of detail.
Using the attachment in comment #0 I am able to reproduce the
behaviour using the gparted 0.19.0 code in a single CPU VMWare Player
ubuntu 14.04 virtual machine (wow, that was a mouthful ;-).

I have been reading up on pipes and signal handling in an effort to
understand what is occurring.

   Use reentrant functions for safer signal handling
     -- How and when to employ reentrancy to keep your code bug free
   http://www.ibm.com/developerworks/linux/library/l-reent/index.html

The problem with PipeCapture being destroyed before all the signals
are handled might be similar to a problem we had earlier where
PipeCapture was created and destroyed before we could connect the
signal handlers.

To address the earlier problem in Utils::execute_command() we had to lock
a mutex so we had time to setup PipeCapture for output and error streams
before connecting the input/output signal handler.

We might need to implement something similar to prevent the
PipeCapture destruction before all signals have been handled and
all data read.


Phillip,

Your thoughts on this issue would be appreciated.

Curtis

Comment 4 Curtis Gedak 2014-06-19 01:03:37 UTC

Without fully understanding how I might implement a mutex to ensure that all data is read and all signals are handled before closing the pipes, I tried adding a simple sleep(1) before closing the pipes.

For example:

diff --git a/src/Utils.cc b/src/Utils.cc
index 7ae1450..2bbad68 100644
--- a/src/Utils.cc
+++ b/src/Utils.cc
@@ -547,6 +547,8 @@ int Utils::execute_command( const Glib::ustring & command,
                status.cond.wait( status.mutex );
                status.mutex.unlock();
        }
+       //Simplistic attempt to ensure all signals handled before closing
+       sleep( 1 );
        close( out );
        close( err );
        return status.exit_status;


In my testing, this non-elegant solution does appear to permit enough time for the data to be read and signals handled before the pipes are closed.

I realize that this is not an ideal solution, and is instead a work-around that might not work in extreme situations.  A better solution would somehow guarantee that the data was read and signals handled before the pipes are closed.

Comment 5 Mike Fleetwood 2014-06-19 20:06:05 UTC

Hi Curtis,

I think we need to add the code I previously got Phillip to exclude,
back in to the end of PipeCapture::OnReadable():
    bug 697727 comment #53

This will remove the _OnReadable() callback before the signal_eof signal
is generated and should prevent the glib event loop detecting further
change on the IO Channel and queuing another callback.

There is still the issue of at least PipeCapture member variable
sourceid being read and written by both threads and it effectively being
used as control between those threads on whether to call
g_source_remove() from the destructor or not.

For safe multi-threaded access to a single variable 2 things need to be
true:
1) Compiler must generate memory access, no register caching of the
   value within a function.  (In C this is what volatile does);
2) CPU must not reorder memory loads and stores (memory barriers).
Don't know how to do either of these in C++ class members / Glib.

It might be that with changing the end of OnReadable() to this:
       // signal completion
       if( sourceid > 0 )
               g_source_remove( sourceid );
       sourceid = 0;
       eof();
       return false;
or this:
       // signal completion
       g_source_remove( sourceid );
       sourceid = 0;
       eof();
       return false;
an empty PipeCapture destructor will work correctly, but I don't really
know the guarantees of the glib IO Channel API.


I will be a over a week before I have the time to research and test this
properly.

Thanks
Mike

Comment 6 Mike Fleetwood 2014-06-19 20:26:02 UTC

(In reply to comment #3)
>    Use reentrant functions for safer signal handling
>      -- How and when to employ reentrancy to keep your code bug free
>    http://www.ibm.com/developerworks/linux/library/l-reent/index.html

In the terminology of this IBM developer works document, reading and
writing PipeCapture member variable sourceid by multiple threads would
be considered global variable access.

Comment 7 Mike Fleetwood 2014-06-22 22:55:43 UTC

Created attachment 278960 [details] [review]
Onreadable use after free fix (Draft 1)

Hi Curtis,

Managed to find some time to put together this draft patch.  Seems to
work on Fedora 20.  Needs a better commit message and thoroughly
testing.

Thanks,
Mike

Comment 8 Mike Fleetwood 2014-06-22 23:14:58 UTC

More work required.  On CentOS 5.10 GParted sometimes reports these
messages:

(gpartedbin:10252): GLib-CRITICAL **: g_source_remove: assertion `tag > 0'
failed

When the call flow is like this:

set_devices_thread             main thread

connect_signal()
  call g_io_add_watch()
                                Command executed, EOF occurs on the
                                IO Channel
                                OnReadable()
                                  g_source_remove(0)
                                    above assertion
  g_io_add_watch() returns
  and assigns sourceid

Comment 9 Mike Fleetwood 2014-07-05 19:29:29 UTC

Created attachment 279960 [details] [review]
Write after free debug patch (v2)

Here a debug 2 patch which reports more a bit more details about what is
going on.  It's what I used when documenting exactly what is happening
for following patch v1 ...

Comment 10 Mike Fleetwood 2014-07-05 19:34:22 UTC

Created attachment 279962 [details] [review]
Onreadable use after free fix (v1)

Hi Curtis,

Here's a patch to fix this bug.  Turns out that we don't need to use any
Glib Mutex's and Cond's.  Instead just use the standard capability of
Glib.  Quoting from the patch commit message:

 This is happening because PipeCapture member variable sourceid is being
 saved, in a different thread, just so the _OnReadable() callback can be
 removed.  However a glib IOChannel callback, type GIOFunc(), returning
 false will be automatically removed.

     GLib Reference Manual 2.26 / IO Channels
     https://developer.gnome.org/glib/2.26/glib-IO-Channels.html#GIOFunc

     GIOFunc()

     Returns : the function should return FALSE if the event source
               should be removed

 Therefore fix by just not saving the event sourceid at all, and not
 calling g_source_remove() to manually remove the callback, but instead
 letting glib automatically remove the callback when it returns false.

Please excuse the enormous commit message.  It seemed necessary.

Tested successfully on Fedora 20 and CentOS 5.10.

Thanks,
Mike

Comment 11 Curtis Gedak 2014-07-05 22:27:29 UTC

Hi Mike,

Thank you for persevering on with this problem.  Crashes are nasty business that we with to eliminate.  I'm busy this weekend but plan to extensively test this early next week on a variety of distros.

Curtis

Comment 12 Mike Fleetwood 2014-07-06 09:48:34 UTC

Hi Curtis,

First to clarify previous testing mentioned in comment #10:

1) Run GParted under valgrind.  Load and immediately exit
   Expect: No invalid write of size 4 in PipeCapture::_OnReadable()

Passed successfully on:
Fedora 20
CentOS 5.10
CentOS 6.5


Second to perform some extra testing so hopefully you don't need to do
quite so much :-)

1) Load GParted and exit
   Expect: No Gtk or Glib errors on the console

2) Perform a few operations, such as: create new partition, label and
   check
   Expect: External commands executed successfully, output viewable

3) Copy partition into empty space (repeating testing for bug #697727
   "Segfault in Gparted 0.15.0 when copying partition")
   Expect: Successfully partition copy

Passed successfully on:
CentOS 5.10
CentOS 6.5
Debian 6
Fedora 14
Fedora 19
Fedora 20
Kubuntu 12.04 LTS
Ubuntu 13.10
Xubuntu 14.04 LTS    [NOTE:1]

[NOTE:1]  When expanding tree in operation results the following
messages are generated:
    (gpartedbin:6475): GLib-CRITICAL **: Source ID 3337 was not found
    when attempting to remove it
Already documented as bug #731740 "Expand 2nd level Progress Details
yields GLib-CRITICAL Source ID message".

Thanks,
Mike

Comment 13 Curtis Gedak 2014-07-06 21:25:31 UTC

Hi Mike,

Great work on the patch in comment #10.  And a big thank you for the
extra testing on a variety of distros -- that saved me lots of time
enabling me to finish testing during the weekend.  :-)

Regarding the detailed commit message, I think it is warranted
especially with the trouble we've had with elusive random crashes with
asynchronous thread handling.

For my testing I performed the four test steps listed in comment #12
and some additional random operations such as resize and move for good
measure.

All of these tests passed successfully on:

Debian 7
fedora 12          <--- the oldest GNU/Linux VM I have :-)
openSUSE 12.3
openSUSE 13.1
ubuntu  10.04 LTS
kubuntu 12.04 LTS  <--- My physical development computer
ubuntu  14.04 LTS  [Same as Note 1 in comment #12]

As such, the patch in comment #10 has been committed to the git
repository for inclusion in the next release of GParted.

The relevant git commit can be viewed at the following link:

Prevent cross thread write after free in _OnReadable() (#731752)
https://git.gnome.org/browse/gparted/commit/?id=0fcfd18061a45f208415971f190818a41cc59f47

Since this patch fixes a potential crash that might result in a loss
of data, if there are no immediate objections then I will plan a
0.19.1 release soon.

Curtis

Comment 14 Mike Fleetwood 2014-07-07 12:16:17 UTC

Hi Curtis,

I think we should re-test patch 278205 "Multi-device btrfs support (v2)"
from bug 723842 "GParted resizes the wrong filesystem", just to see
if there are any more crashed which we can identify in existing code 
before releasing 0.19.1.

Mike

Comment 15 Curtis Gedak 2014-07-07 15:04:20 UTC

Good suggestion Mike.  I will begin testing on Debian/ubuntu with the patch for bug 723842 "GParted resizes the wrong filesystem".  This will help us build confidence in this patch set, or else show where more improvement is needed.  I will hold off on releasing 0.19.1 while testing bug 723842.

Curtis

Comment 16 Mike Fleetwood 2014-07-08 07:20:22 UTC

Hi Curtis,

I've also re-tested patchset v2 on Fedora 20 doing several resizes of
multi-device btrfs successfully so from the point of view of crashes
bug 731752 - "Write after free cross thread race in
PipeCapture::OnReadable()" has done the job.

Q: Include Multi-device btrfs support patch in GParted 0.19.1?
I think not.  We make 0.x.1 releases with minimal changes on top of
0.x.0 releases just to address crashes and other serious issues.  While
resizing the wrong device in a multi-device btrfs is bad I think it is a
low volume use case and this patchset adds lots of other enhancements
which for me makes it 0.y.0 release material.
I think we should release 0.19.1 with what is already committed to
master and no more.

Thanks,
Mike

Comment 17 Mike Fleetwood 2014-07-08 13:33:53 UTC

Above comment #16, was meant for bug 723842 - "GParted resizes the wrong
filesystem" instead.

Comment 18 Curtis Gedak 2014-07-08 14:46:35 UTC

It sounds like we are both in agreement that the patch in comment #10 is stable and does fix the problem outlined in this report.

If there are no objections, I will start the 0.19.1 release process soon.

Curtis

Comment 19 Phillip Susi 2014-07-11 02:34:59 UTC

I hadn't seen this thread until now because I'm not on the CC list ( I guess I need to get bugzilla to cc me on all gparted bugs ).  After seeing the new point release announcement I updated my git repo and read the change log and at first was thinking that this might fix some of the crash reports I've seen on the ubuntu bug tracker lately, but looking at the changes, I think they are incorrect.

The reason that I saved the sourceid and had the dtor kill the source was to handle the case of the PipeCapture object being destroyed prior to reading eof from the pipe.  By removing that code, and instead relying on the return from OnReadable to kill the source, gparted will now not be able to handle that case.  I believe that case can arise during a cancel event but will need to study the code some more to be sure.

Comment 20 Mike Fleetwood 2014-07-11 07:17:19 UTC

Hi Phillip,

Can you explain a case where you think that the PipeCapture will be
destroyed before the callback returns false to get the callback
unregistered?

I have just quickly tested the following cases which work correctly:

1) Kill external command while Applying pending operations.
   -> Executable receives SIGTERM and exits
      PipeCapture callback calls signal_eof.emit() and returns false
      PipeCapture destructor called

2) Cancel external command from Applying pending operations dialog.
   -> Executable receives SIGINT and exits
      PipeCapture callback call signal_eof.emit() and returns false
      PipeCapture destructor called

Thanks,
Mike

Comment 21 Phillip Susi 2014-07-11 14:19:13 UTC

Now that I think about it, I suppose that I did have the cancel path wait for eof so I suppose there currently isn't a way to have the PipeCapture be destroyed before eof, but I'm a little worried that is likely to change at some time in the future and then this would bite us in the ass.  I'll have to think on it some more.

Comment 22 Curtis Gedak 2014-07-11 19:12:23 UTC

Hi Phillip and Mike,

In order to check if we re-introduced problems with moves and cancels I have run a series of tests similar to those we used in troubleshooting Bug 697727 - Segfault in Gparted 0.15.0 when copying partition.

Specifically I created an 10238 MiB ext4 partition on my kubuntu 12.04 physical development machine.  I then moved the partition left or right by a small overlapping amount. For the Cancel half-way I moved left to be sure that the start of the partition was moved first and hence would have to be rolled back.

Test results are as follows:

Success - Five move only operations (operation takes 11 minutes)
Success - Five move then Cancel about half way through move (~11 minutes)
Success - Five move then cancel with Force Cancel then Cancel Operations

By "Success" I mean that the operation finished or cancelled as expected.

Based on these tests I believe that the problem we encountered with GParted 0.15.0, improved with 0.16.0, and then implemented the fix in a second location in 0.16.1 is still addressed.

Curtis

Comment 23 Curtis Gedak 2014-07-15 16:48:33 UTC

The patch to address this bug report was included in GParted 0.19.1 released on July 15, 2014.