After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 697727 - Segfault in Gparted 0.15.0 when copying partition
Segfault in Gparted 0.15.0 when copying partition
Status: RESOLVED FIXED
Product: gparted
Classification: Other
Component: application
0.15.0
Other Linux
: Normal major
: ---
Assigned To: gparted maintainers alias
gparted maintainers alias
: 696810 698010 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2013-04-10 15:17 UTC by Ian
Modified: 2013-04-30 16:31 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
GDB backtrace etc (121.67 KB, text/plain)
2013-04-14 10:11 UTC, Mike Fleetwood
  Details
gdb backtrace for core file in comment 8 (76.09 KB, text/plain)
2013-04-14 16:18 UTC, Curtis Gedak
  Details
Valgrind log (9.18 KB, text/plain)
2013-04-14 19:32 UTC, Mike Fleetwood
  Details
bad-glibmm.patch (1.99 KB, patch)
2013-04-15 14:34 UTC, Phillip Susi
none Details | Review
Patch v2 to fix segfault / bad-glibmm (2.82 KB, patch)
2013-04-16 15:29 UTC, Curtis Gedak
none Details | Review
0001-Avoid-glibmm-GSource-bug-crash-697727.patch (3.28 KB, patch)
2013-04-16 15:52 UTC, Phillip Susi
none Details | Review
0001-Avoid-glibmm-GSource-bug-crash-697727.patch (3.41 KB, patch)
2013-04-17 13:49 UTC, Phillip Susi
none Details | Review
0001-Avoid-glibmm-GSource-bug-crash-again-697727.patch (3.86 KB, patch)
2013-04-26 01:37 UTC, Phillip Susi
none Details | Review
0001-Avoid-glibmm-GSource-bug-crash-again-697727.patch (4.12 KB, patch)
2013-04-26 13:35 UTC, Phillip Susi
none Details | Review

Description Ian 2013-04-10 15:17:20 UTC
Just downloaded the latest liveCD today to clone my ubuntu installation onto a larger hard drive

Old drive, connected to USB-SATA cable:
 sdc1 - 13.97GB partition, EXT3, bootable
 sdc2 - 13.97GB partition, EXT3
 sdc3 - 13.97GB partition, EXT3
 sdc4 - extended partition
 sdc5 - 7.something GB partition, swap 
 sdc6 - 200something GB partition, EXT3

New drive (desired changes to sda):
 copy sdc1 to new partition 1
 resize new partition 1 to 32GB
 copy sdc2 to new partition 2
 copy sdc3 to new partition 3
 create extended partition (new partition 4)
 copy sdc5 to new partition 5
 copy sdc6 to new partition 6
 resize new partition 6 to [remainder of available space]

All new parititions aligned to MiB, not cylinder.


Hit start.  Expand the "show details" arrows as they become available.

New partitions 1 and 2 are copied successfully.  New partition 3 successfully e2fsck's, but the copy operation segfaults.  Re-open gparted, and re-create all operations from "copy sdc3" onward.  Again, segfault on copy, within the first 60 seconds.

(Tried to run the same from the terminal, but no stack trace was printed... just "Segmentation Fault").

Re-open gparted, and just copy sdc3.  Hold my breath.  No segfault (didn't open details).

Re-create all operations from "create extended paritition" onwards.  Watch details.  Segfault within the first 60 seconds of copy operation.  

Re-create all operations from "create extended paritition" onwards.  Don't watch details.  15 minutes of a 3 hour copy operation done with no issues.  Will update bug if it completes successfully.
Comment 1 Curtis Gedak 2013-04-12 16:26:26 UTC
Thank you Ian for reporting this problem.

Is the problem still reproducible for you?

If so, then we might work together to troubleshoot and find the exact problem.
Comment 2 Ian 2013-04-12 17:17:21 UTC
The problem happened again during the "big" 3-hour copy, so I went back and completed it with the gParted version that's on the latest Ubuntu LiveCD.

I'd be happy to help troubleshoot this; how can I get more debug info (and crash info) from the running program?  As I said, running it from the terminal was uninformative :(
Comment 3 Ian 2013-04-12 17:18:02 UTC
And I should add, the problem happened without me expanding the "details".  So, unfortunately it seems to be something intermittent.
Comment 4 Phillip Susi 2013-04-12 19:13:06 UTC
Can you run "ulimit -c unlimited" to enable core dumps, and attach it to this report?
Comment 5 Curtis Gedak 2013-04-12 20:33:05 UTC
If I recall correctly, the default location for the core dump is the working directory, and the default name of the core dump is "core".

http://linux.die.net/man/5/core
Comment 6 Curtis Gedak 2013-04-13 17:08:49 UTC
I have added a warning to avoid using GParted 0.15.0 for partition moves, resizes, copies, and checks to the GParted web site download page and news page while we investigate this problem.

In the meantime both Mike Fleetwood and I are running copy operations trying to reproduce the bug to acquire a core dump.
Comment 7 Curtis Gedak 2013-04-13 17:43:33 UTC
There have been at least 3 reports of crashes with GParted 0.15.0.

The other two are:

Bug #696810 - segfault resizing ext4 partition with gparted-0.15.0-1 iso

"Applying pending operations" grey screen
http://gparted-forum.surf4.info/viewtopic.php?id=16863

I suggest that we track our troubleshooting in this report.
Comment 8 Curtis Gedak 2013-04-13 18:15:29 UTC
After many attempts I have finally been able to get GParted to crash.

Since the core dump file is too large to upload to bugzilla
(the file is ~3.5MB), you can find the core dump temporarily hosted
at:
http://gparted.org/curtis/core-dump-move-10238MiB-ext4-1MiB-to-right.gz

The core dump file has been renamed from
   core
to
   core-dump-move-10238MiB-ext4-1MiB-to-right
 and has been compressed with gzip.

The crash occurred while moving a 10238 MiB ext4 partition by 1 MiB to
the right.  This was done within a virtual machine running the
gparted-live-0.15.0-3-i486.iso image.

Phillip, your help analyzing this core dump would be greatly
appreciated.
Comment 9 Patrick Verner 2013-04-13 18:41:15 UTC
GParted Live is using e2fsck from e2fsprogs-1.42.5

This is in the changelog for the 1.42.6 release:

Fixed a potential seg fault in e2fsck when there is an I/O error while
reading the superblock.

I don't know if this is the cause, but it sure could be.
Comment 10 Curtis Gedak 2013-04-13 18:55:33 UTC
Hi Patrick,

Thanks for the heads up about e2fsck.  We certainly want to get e2fsprogs upgraded in the live image to avoid this problem.  :-)

My thoughts are that a crash in e2fsck should not result in a crash of GParted because the e2fsck command is spawned from the gpartedbin process.

The other item is that in creating the crash dump in comment #8, e2fsck had finished successfully, and GParted was in the process of actually moving the partition to the right.  The progress indicators were updating properly when all of a sudden *CRASH* and the GParted GUI disappeared.  Hence e2fsck should not have been executing at the time of the crash.

In case it is relevant, before running gparted from a terminal with "sudo gparted", I ran the command "ulimit -c unlimited" as suggested by Phillip in comment #4.
Comment 11 Ian 2013-04-13 19:32:27 UTC
The original crashes that I experienced happened after the e2fsck commands had completed successfully.  Or rather, whether or not e2fsck completed successfully, the progress bar had switched over to show the progress of the copy operation.
Comment 12 Mike Fleetwood 2013-04-14 10:11:06 UTC
Created attachment 241488 [details]
GDB backtrace etc

Yesterday I copied 3 partitions: 20 G, 50 G and 130 G on my Fedora 14
desktop using GParted master (git 76a9cd3).  All successfully.  Might
be a distribution / library version specific issue.

Today I have reproduced the crash on Debian 6 in Virtual Box by copying
a 4 G partition into unallocated space.  Had expanded the details in
the operation dialog to see the amount copied updating.

After installing all the debug symbol packages I have useful information
from gdb.  Full details from below commands attached.

# gdb gpartedbin core
backtrace
backtrace full
thread apply all backtrace
info registers
x/16i $pc
quit

The top of the stack looks like this:
  • #0 _int_free
    at malloc.c line 4948
  • #1 *__GI___libc_free
    at malloc.c line 3739
  • #2 IA__g_free
    at /build/buildd-glib2.0_2.24.2-1-i386-AScyie/glib2.0-2.24.2/glib/gmem.c line 191
  • #3 IA__gdk_region_destroy
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkregion-generic.c line 365
  • #4 queue_item_free
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/x11/gdkgeometry-x11.c line 151
  • #5 gdk_window_queue
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/x11/gdkgeometry-x11.c line 188
  • #6 _gdk_x11_window_queue_antiexpose
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/x11/gdkgeometry-x11.c line 258
  • #7 gdk_window_process_updates_internal
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkwindow.c line 5358

Initial reaction is it looks like a crash freeing memory from Gtk.
Comment 13 Curtis Gedak 2013-04-14 16:18:55 UTC
Created attachment 241508 [details]
gdb backtrace for core file in comment 8

Thank you Mike for sharing the commands you used in your investigation.

I used similar commands to get details on the core dump from comment #8.  Since I was running gpartedbin as compiled in the GParted Live, the backtrace does not have the additional details as if gparted were compiled with the "-g" option.

After installing all the debug symbol packages I too have the useful information
from gdb.  Full details from below commands attached.

$ gdb /usr/sbin/gpartedbin ./core | tee -a gdb-log.txt
backtrace
backtrace full
thread apply all backtrace
info registers
x/16i $pc
quit

The top of the stack looks like this:

(gdb) backtrace
  • #0 malloc_consolidate
    at malloc.c line 5161
  • #1 _int_malloc
    at malloc.c line 4373
  • #2 *__GI___libc_malloc
    at malloc.c line 3660
  • #3 _cairo_xlib_surface_create_internal
    at /build/buildd-cairo_1.12.2-3-i386-jzS4pg/cairo-1.12.2/src/cairo-xlib-surface.c line 1503
  • #4 cairo_xlib_surface_create
    at /build/buildd-cairo_1.12.2-3-i386-jzS4pg/cairo-1.12.2/src/cairo-xlib-surface.c line 1680
  • #5 _gdk_windowing_create_cairo_surface
    at /build/buildd-gtk+2.0_2.24.10-2-i386-Tg7Q_2/gtk+2.0-2.24.10/gdk/x11/gdkdrawable-x11.c line 1553
  • #6 gdk_pixmap_create_cairo_surface
    at /build/buildd-gtk+2.0_2.24.10-2-i386-Tg7Q_2/gtk+2.0-2.24.10/gdk/gdkpixmap.c line 596
  • #7 _gdk_drawable_create_cairo_surface
    at /build/buildd-gtk+2.0_2.24.10-2-i386-Tg7Q_2/gtk+2.0-2.24.10/gdk/gdkdraw.c line 1973
  • #8 IA__gdk_window_begin_paint_region
    at /build/buildd-gtk+2.0_2.24.10-2-i386-Tg7Q_2/gtk+2.0-2.24.10/gdk/gdkwindow.c line 2995
  • #9 IA__gtk_main_do_event
    at /build/buildd-gtk+2.0_2.24.10-2-i386-Tg7Q_2/gtk+2.0-2.24.10/gtk/gtkmain.c line 1608


My initial reaction is that this crash is due to a failure to allocate memory.  This is different from the crash you observed in comment #12.

This makes me think we might have some sort of race condition occurring.  This is just a guess at this time.
Comment 14 Mike Fleetwood 2013-04-14 19:32:11 UTC
Created attachment 241521 [details]
Valgrind log

Hi Curtis,

I now have a core dump from malloc() too.  The top of the stack looks
like:

  • #0 malloc_consolidate
    at malloc.c line 5145
  • #1 _int_malloc
    at malloc.c line 4373
  • #2 *__GI___libc_malloc
    at malloc.c line 3661
  • #3 *INT_cairo_create
    at /build/buildd-cairo_1.8.10-6-i386-6jOwL1/cairo-1.8.10/src/cairo.c line 154
  • #4 setup_backing_rect_method
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkwindow.c line 4284
  • #5 gdk_window_clear_backing_region
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkwindow.c line 4313
  • #6 IA__gdk_window_begin_paint_region
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkwindow.c line 2887
  • #7 IA__gtk_main_do_event
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gtk/gtkmain.c line 1571
  • #8 _gdk_window_process_updates_recurse
    at /build/buildd-gtk+2.0_2.20.1-2-i386-TNeM25/gtk+2.0-2.20.1/gdk/gdkwindow.c line 5181


I have also run valgrind malloc debugger.  Like this:

# valgrind --trace-origins=yes --log-file=gparted-valgrind.log gpartedbin

GParted doesn't crash under valgrind but I see this error reported which
is possibly relevant:

==4340== Invalid write of size 4
==4340==    at 0x4975DD3: Glib::SignalIdle::connect(sigc::slot<bool, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil> const&, int) (main.cc:92)
==4340==    by 0x810427D: GParted::copy_blocks::copy_thread() (Copy_Blocks.cc:137)
==4340==    by 0x49700C1: call_thread_entry_slot (slot.h:440)
==4340==    by 0x4D6C6CE: g_thread_create_proxy (gthread.c:1893)
==4340==    by 0x4F17954: start_thread (pthread_create.c:300)
==4340==    by 0x4FF81DD: clone (clone.S:130)
==4340==  Address 0x61615e8 is 8 bytes inside a block of size 12 free'd
==4340==    at 0x4023881: operator delete(void*) (vg_replace_malloc.c:387)
==4340==    by 0x4974814: (anonymous namespace)::SourceConnectionNode::destroy_notify_callback(void*) (main.cc:85)
==4340==    by 0x4D41913: g_source_callback_unref (gmain.c:1077)
==4340==    by 0x4D41F26: g_source_destroy_internal (gmain.c:856)
==4340==    by 0x4D42355: g_main_context_dispatch (gmain.c:1985)
==4340==    by 0x4D45FE7: g_main_context_iterate (gmain.c:2591)
==4340==    by 0x4D46526: g_main_loop_run (gmain.c:2799)
==4340==    by 0x469EE18: gtk_main (gtkmain.c:1219)
==4340==    by 0x428FE86: Gtk::Main::run_impl() (main.cc:536)
==4340==    by 0x428FC81: Gtk::Main::run() (main.cc:483)
==4340==    by 0x8104590: GParted::copy_blocks::copy() (Copy_Blocks.cc:187)
==4340==    by 0x80A87AF: GParted::GParted_Core::copy_filesystem(Glib::ustring const&, Glib::ustring const&, long long, long long, long long, long long, long long, GParted::OperationDetail&, bool, long long&, bool) (GParted_Core.cc:2755)


I have a second "Invalid write of size 4" too.  Full log attached.

It looks like the heap is getting corrupted leading to nondeterministic
crashes when calling malloc() and free().  Valgrind is saying it's all
happening in glib(mm)/libstdc++/gtk(mm).  Could be a bug in that code,
or a race between multiple threads performing malloc() and free() in an
unsafe way.  Not sure really.

Mike
Comment 15 Mike Fleetwood 2013-04-14 19:39:35 UTC
Hi Curtis,

Should we remove GParted 0.15.0 application and Live CD from the download
pages too?  Both gparted.org and sf.net.
    http://www.gparted.org/download.php
    http://sourceforge.net/projects/gparted/files/

Thanks,
Mike
Comment 16 Curtis Gedak 2013-04-14 20:22:46 UTC
(In reply to comment #15)
> Should we remove GParted 0.15.0 application and Live CD from the download
> pages too?  Both gparted.org and sf.net.
>     http://www.gparted.org/download.php
>     http://sourceforge.net/projects/gparted/files/

Thanks Mike for pointing this out.

Since many other sites pick up and distribute these versions, I thought it best not to have these versions disappear mysteriously into the ether.  Instead I have added a warning README to each directory containing GParted 0.15.0 download files.  I have also set the default SF download link to the gparted-live-0.14.1-6-i486.iso file.

On the GParted web site front page I have updated the direct links to the respective GParted 0.14.1 versions.


On the topic of the segmentation fault, I wonder if we need semaphores or something so that our Copy_Blocks thread and main Dialog_Progress thread play nice together.  My guess is that perhaps the updates from Copy_Blocks are perhaps getting out of sync with the Dialog_Progress thread updating the GUI.  It is also possible, as you point out, that we have found a bug glib(mm)/libstdc++/gtk(mm) but I always suspect our own code first.


Phillip, if you are out there we would certainly appreciate your thoughts on this segmentation fault matter.
Comment 17 Phillip Susi 2013-04-15 13:34:39 UTC
Sorry, I was a little busy doing my taxes over the weekend.  Good work so far, while catching up valgrind was going to be my first recommendation.  At first glance, it looks like this is that bug I found in gtkmm before that I thought was benign, but it appears now it may not be.  That bug was being tracked in #561885.
Comment 18 Mike Fleetwood 2013-04-15 13:48:07 UTC
I've git bisected it to:
    bd9e16f22fe3dfae73064369aadeda5de10b61be
    Thread the internal copy algorithm (#685740)

Probably as expected.


Also [re]discovered that several of Phillip's patches don't even compile
on Debian 6.  This makes git bisecting harder!  As per comments at:
    https://bugzilla.gnome.org/show_bug.cgi?id=685740#c8
    https://bugzilla.gnome.org/show_bug.cgi?id=685740#c9
several patches used Glib::Threads::Mutex, rather than Glib::Mutex,
which is too new for Debian 6.

In future we need to ensure all patches compile for git bisectability.
Comment 19 Phillip Susi 2013-04-15 13:50:11 UTC
Yep, that's what it is.  There is a race condition between SignalIdle::connect
and the main thread completing the idle event, and destroying the idle object
before connect finishes with it.  The bug will ultimately need fixed in glibmm,
but as a work around, we can change gparted to avoid using the buggy glibmm
GSource wrappers and instead use glib directly.
Comment 20 Phillip Susi 2013-04-15 14:34:05 UTC
Created attachment 241568 [details] [review]
bad-glibmm.patch

I hacked this up real quick.  It should do the trick, but I won't have time to test it until tonight.  I'm not even sure it will compile, but if you guys want to give it a try go for it.
Comment 21 Curtis Gedak 2013-04-15 16:16:08 UTC
Thank you Mike for your work so far to identify the root cause of this problem.  It looks like valgrind is a very powerful tool.  Do you have a good link to a tutorial so that I could learn valgrind?

Regarding catching individual commits that do not compile on all distributions, I agree that this is a good goal.  However it might prove to be difficult to achieve.  In order to do so we would have to test each commit on each distribution.  We can try our best, but with the labour intensive nature of this goal I do not think we will be able to catch all occurrences.  Having said that, it is a worthy goal to strive for.


Phillip, I'm just beginning to test your patch in comment #20.  When trying to compile I get the following message:

Copy_Blocks.cc: In function ‘gboolean GParted::_set_progress_info(gpointer)’:
Copy_Blocks.cc:60:6: error: ‘bool GParted::copy_blocks::set_progress_info()’ is private
Copy_Blocks.cc:100:31: error: within this context
make[2]: *** [Copy_Blocks.o] Error 1
make[2]: Leaving directory `/home/gedakc/workspace/gparted.dev2/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/gedakc/workspace/gparted.dev2'
make: *** [all] Error 2

To work around this I made copy_blocks::set_progress_info() into a public method in include/Copy_Blocks.h.
Comment 22 Mike Fleetwood 2013-04-15 17:05:56 UTC
Valgrind:

Nice introduction article
"Using Valgrind to Find Memory Leaks and Invalid Memory Use"
        http://www.cprogramming.com/debugging/valgrind.html


On Valgrind's home page they have Quick Start and User Manual documents
        http://valgrind.org/
Comment 23 Curtis Gedak 2013-04-15 18:41:18 UTC
My testing of the patch in comment #20 has resulted in a crash.  Unfortunately I had not run "ulimit -c unlimited" beforehand so there is no core dump file in the working directory.

I will rerun and attempt to get a stack trace.
Comment 24 Curtis Gedak 2013-04-15 18:54:53 UTC
Ignore comment #23.  I think I might have been testing the wrong gpartedbin.
Comment 25 Curtis Gedak 2013-04-15 23:34:50 UTC
I have now successfully run 20 tests of moving a 10238 MiB ext4 partition by 1 MiB to the left/right using code based on the patch in comment #20.  As such the patch shows great promise.

I have also run the code through valgrind and did not discover any references to copy_blocks in the output.  The valgrind command I used was:

sudo valgrind --track-origins=yes --log-file=gparted-valgrind.log src/gpartedbin


Mike, would you be able to test with valgrind to confirm if the problem you discovered is now gone?


Phillip, would you be able to enhance the patch to be a git patch including comments?
Comment 26 Mike Fleetwood 2013-04-16 12:19:58 UTC
Curtis and Phillip, If one of you can upload a working patch I'll
repeat my testing on Debian 6 with and without valgrind.
Comment 27 Curtis Gedak 2013-04-16 15:29:50 UTC
Created attachment 241648 [details] [review]
Patch v2 to fix segfault / bad-glibmm

Attached is the git diff of the fix I have been testing for GParted 0.15.0.  It should also work with the git head too.

The diff file was generated with:

   git diff > bug697727-segfault-fix.diff

To apply the patch, change into the top level of the gparted directory where the README file is and use the command:

   patch -p1 < bug697727-segfault-fix.diff
Comment 28 Phillip Susi 2013-04-16 15:52:31 UTC
Created attachment 241652 [details] [review]
0001-Avoid-glibmm-GSource-bug-crash-697727.patch

I converted it to a proper git patch with log.
Comment 29 Curtis Gedak 2013-04-16 19:53:07 UTC
A further 10 tests of moving a 10238 MiB ext4 partition right then left using the patch in comment #27 (which is the same as the git patch in comment #28) has been successful.  No crashes occurred.

To check if the GParted 0.15.0 still crashed I ran the same tests in the same environment using the gpartedbin executable from GParted Live 0.15.0-3.  This time gparted 0.15.0 crashed on the first test.  Often it takes several tries.

Mike,

If your testing and check with valgrind looks good, then we can consider committing this patch and preparing for a GParted 0.16.0 release.  One minor update for the patch is to add the bug number and one line description to the patch comment similar to our example on the GParted git page.
   http://gparted.org/git.php

My thoughts are that the stuff already committed to the git master branch is all good.  We could also consider including some of your other patches if you would like these in the release as well.  I'm specifically thinking of this report which I can quickly test and commit:

   Bug #697946 - Absolutely full NTFS reported as partition error
Comment 30 Mike Fleetwood 2013-04-17 10:36:42 UTC
Tested Phillip's patch from comment #28.  Valgrind shows the invalid
writes have gone.  Also run my test case of coping 4 GiB partition on
Debian 6 successfully 16 times, where as before it would crash on 1st or
2nd try.  Patch passes testing by me.


Can the patch summary also mention the number of the bug being worked
around in glibmm, formatted as "bug #561885"?
(Will allow easier linkage especially with https://git.gnome.org/ 
converting it to a hyperlink).


Curtis,

Feel free to include Bug #697946 in GParted 0.16.0.
Comment 31 Phillip Susi 2013-04-17 13:49:29 UTC
Created attachment 241745 [details] [review]
0001-Avoid-glibmm-GSource-bug-crash-697727.patch

How's this?
Comment 32 Kjell Ahlstedt 2013-04-17 14:42:06 UTC
I've been working with the glibmm bug 561885 and other glibmm bugs.

I can't draw any conclusion from the gdb backtraces in this bug. The crashes
may or may not be caused by the fault described in bug 561885.

The valgrind log in comment 14, on the other hand, is very informative. The
illegal memory accesses are definitely caused by the fault described in glibmm
bug 396958. That bug has been partly fixed in glibmm 2.35.8 and later. Even
after that fix, Glib::SignalIdle::connect() is not completely thread-safe. This
paragraph was added to the reference documentation:

* This method is not thread-safe. You should call it, or manipulate the
* returned sigc::connection object, only from the thread where the SignalIdle
* object's MainContext runs.

This remaining thread-unsafety is very difficult to remove. Don't expect it to
be done in the foreseeable future.

Glib::SignalIdle::connect_once() is more thread-safe in glibmm 2.32.0 and
later, where bug 396963 has been completely fixed. The remaining thread-
unsafety is described like so:

* Because sigc::trackable is not thread-safe, if the slot represents a
* non-static method of a class deriving from sigc::trackable, and the slot is
* created by sigc::mem_fun(), connect_once() should only be called from
* the thread where the SignalIdle object's MainContext runs. You can use,
* say, boost::bind() or, in C++11, std::bind() or a C++11 lambda expression
* instead of sigc::mem_fun().

Phillip's patch in comment 31 shows that the fault is expected to be in
Glib::signal_idle().connect() and Glib::signal_child_watch().connect(). Those
functions don't use a C++ wrapper around the created GSource object. They are
affected by bug 396958, but not by bug 561885.
Comment 33 Phillip Susi 2013-04-17 16:02:20 UTC
This crash is caused by a corrupted heap.  The heap is corrupted by glibmm writing to the ConnectionNode after it has already been freed by the destroy notify callback, as shown by the valgrind trace.

More specifically, Glib::SignalIdle::connect() calls conn_node->install(source) after having called g_source_attach().  The main thread can dispatch the idle event, and call the destroy_notify_callback, which deletes the SourceConnectionNode, leaving the background thread to call conn_node->install on a now invalid pointer.

It appears that the connect_once variants do the reverse order, so should be safe, aside from trying to unref an already freed source object, which I suspect may be capable of causing the same kind of heap corruption, though it is probably far less likely.
Comment 34 Curtis Gedak 2013-04-17 16:27:51 UTC
Thank you Kjell for commenting on this problem, and Phillip for working towards a resolution.

If I understand comment #32 and comment #33 correctly, these indicate that even with the patch from comment #31, the code is still not 100% thread safe.

Is my understanding correct?

If so, then is there a way that we can make the code 100% thread safe using Glib, or some other library?
Comment 35 Phillip Susi 2013-04-17 17:39:57 UTC
No Curtis, he is saying he made a change to glibmm to improve the situation on that end, but it doesn't fix it completely.  By avoiding the use of glibmm in this area gparted should be fully safe on all versions of glibmm, so we should be good to go with that patch.
Comment 36 Curtis Gedak 2013-04-17 17:52:13 UTC
Thank you Phillip for the clarification.  This afternoon I will test the patch in comment #31 along with the git repository head, before committing the patch.

NOTE:  I am moving this report from component livecd to application since it affects more than just the Live CD.
Comment 37 Curtis Gedak 2013-04-17 17:58:20 UTC
*** Bug 698010 has been marked as a duplicate of this bug. ***
Comment 38 Kjell Ahlstedt 2013-04-17 18:27:02 UTC
(In reply to comment #33)
> This crash is caused by a corrupted heap.  The heap is corrupted by glibmm
> writing to the ConnectionNode after it has already been freed by the destroy
> notify callback, as shown by the valgrind trace.
>
> More specifically, Glib::SignalIdle::connect() calls conn_node->
> install(source) after having called g_source_attach().  The main thread can
> dispatch the idle event, and call the destroy_notify_callback, which deletes
> the SourceConnectionNode, leaving the background thread to call conn_node->
> install on a now invalid pointer.

Correct! This is what has been fixed with bug 396958, but it was fixed only
recently (2013-02-21 in the git repository). Now conn_node->install(source) is
called before g_source_attach().

That fix makes Glib::SignalIdle::connect() much more thread-safe, but not
completely so, as explained by the added documentation that I cited in
comment 32.

> It appears that the connect_once variants do the reverse order, so should be
> safe, aside from trying to unref an already freed source object, which I
> suspect may be capable of causing the same kind of heap corruption, though it
> is probably far less likely.

The connect_once() methods were fixed almost a year before the connect()
methods (2012-04-04).

Where do the connect_once() methods unref an already freed source object?
Note that the connect() and connect_once() methods create a GSource
object, but no Glib::Source wrapper. Neither Glib::Source::unreference() nor
SourceCallbackData::destroy_notify_callback() is called.
SourceConnectionNode::destroy_notify_callback() is called.
The call to g_source_unref() in Glib::SignalIdle::connect() is needed because
g_idle_source_new() adds one ref count, and g_source_attach() adds a second
ref count. g_idle_add() also calls g_source_unref() after g_source_attach().

(In reply to comment #34)
> If I understand comment #32 and comment #33 correctly, these indicate that
> even with the patch from comment #31, the code is still not 100% thread safe.

Even with the latest fixes in glibmm, the Glib::Signal*::connect() methods are
not completely thread-safe. It's not safe to call them in one thread and add a
source to a main loop that runs in another thread.

With Phillip's patch in comment 31, Glib::Signal*::connect() are not used
(unless there are other uses of them in gparted). This is probably the safest
way out of your problem, especially since you want to use gparted also with
fairly old versions of glibmm. (In bug 561885 comment 66, Mike Fleetwood
mentions glibmm 2.24.2 from 2010-05-03.)

As I said in comment 32, don't expect the connect() methods to be fully
thread-safe in the foreseeable future.

Of course I don't know if there are other problems in gparted, apart from those
that the patch in comment 31 fixes.
Comment 39 Kjell Ahlstedt 2013-04-17 18:35:50 UTC
Sorry, I happened to reset component and status by mistake.
Changing back to application and NEW.
Comment 40 Phillip Susi 2013-04-17 18:49:29 UTC
I was just lamenting to Curtis yesterday that the damn bugzilla "midair collision" nonsense is such a confusing mess.

The connect_once() methods call g_source_unref() right before they return.  I was thinking that the source may already have been destroyed by the main thread at this point, but now that I look again, I guess not.  It looks like the once versions are already safe, but you said that was fixed a year ago, and gparted wants to support older versions of gtkmm, so I suppose we'll just stick with using glib directly.
Comment 41 Curtis Gedak 2013-04-17 18:56:46 UTC
Thank you Kjell for your explanations in comment #38.  We appreciate your assistance with this problem, and with maintenance of Free Software in general.  :-)

Phillip's patch will work well for GParted because we try to maintain backward compatibility at least for the currently supported major GNU/Linux distributions.  This means that for now it will have to work with some distros that are more than a year old (e.g., ubuntu 10.04).
Comment 42 Curtis Gedak 2013-04-17 19:02:06 UTC
Thank you Kjell for your explanations in comment #38.  We appreciate your assistance with this problem, and with maintenance of Free Software in general.  :-)

Phillip's patch will work well for GParted because we try to maintain backward compatibility at least for the currently supported major GNU/Linux distributions.  This means that for now it will have to work with some distros that are more than a year old (e.g., ubuntu 10.04).
Comment 43 Curtis Gedak 2013-04-17 19:39:40 UTC
I just successfully ran 10 more moves of a 10238 MiB ext4 partition with no problems using the git head plus the patch in comment #31.  This patch passes my testing.

The patch in comment #31 has been committed to the Gnome repository for inclusion in the next release of GParted on April 24th.

The relevant git commit can be viewed at the following link:

Avoid glibmm GSource bug/crash (#697727)
https://git.gnome.org/browse/gparted/commit/?id=c36934aca5d0691953fc0eed15ee61bc027ca086
Comment 44 Curtis Gedak 2013-04-17 19:42:47 UTC
Updating the title of this report because the problem is not limited to the Live CD image.

Changing from:
   Segfault in livecd Gparted v 0.15.0-3 when copying partition
To:
   Segfault in Gparted 0.15.0 when copying partition
Comment 45 Curtis Gedak 2013-04-24 16:29:47 UTC
The code change to address this report has been included in GParted 0.16.0 released on April 24, 2013.
Comment 46 Phillip Susi 2013-04-26 01:37:22 UTC
Created attachment 242493 [details] [review]
0001-Avoid-glibmm-GSource-bug-crash-again-697727.patch

Missed one.. this patch gets it.
Comment 47 Mike Fleetwood 2013-04-26 13:01:43 UTC
Tested the patch on CentOS 5.9 and it stopped these assertion errors
found on CentOS 5.9:

(gpartedbin:9670): GLib-CRITICAL **: g_source_unref: assertion `source != NULL' failed


Reviewing the code I though that the PipeCapture constructor needed to
initialise sourceid = 0 to ensure that the destructor never depended on
assing an uninitialised variable.  But doing that gave this error 6 
time:

(gpartedbin:5836): GLib-CRITICAL **: g_source_remove: assertion `tag > 0' failed

Planning to use valgrind to check if there is any uninitialised memory 
access going on when PipeCapture::PipeCapture doesn't initialise
sourceid = 0.
Comment 48 Phillip Susi 2013-04-26 13:35:13 UTC
Created attachment 242564 [details] [review]
0001-Avoid-glibmm-GSource-bug-crash-again-697727.patch

Woops, good catch.  Yes, it should be initialized to zero, and tested for >0 before trying to remove the source in OnReadable.

Updated patch.
Comment 49 Curtis Gedak 2013-04-26 16:27:06 UTC
Thanks Mike for re-opening this bug report and reviewing the patch so quickly.  Thanks also Phillip for coming up with another patch.

My initial testing on OpenSUSE 11.2 shows that this patch does address the problem with g_source_unref messages.

Next I will test this patch in a variety of different VMs and report back here.
Comment 50 Curtis Gedak 2013-04-26 19:55:13 UTC
To test the patch in comment #48 I performed a half dozen operations
with various file systems on various disk devices in various GNU/Linux
distributions.


VIRTUAL MACHINES TESTED
-----------------------

   OpenSUSE 11.2, 12.2

   Fedora 16, 17, 18

   Ubuntu 10.04, 12.04, 12.10, 13.04

   Debian 6

The testing was successful in all of the above VMs with no crashes and
no g_source_unref messages.


PHYSICAL MACHINE TESTED
-----------------------

   Kubuntu 12.04

All tests on my physical development machine, which includes an Intel
Fake RAID, completed successfully.
Comment 51 Mike Fleetwood 2013-04-26 20:11:31 UTC
Hi Phillip,

I'm still seeing a race between threads but I don't know if it's safe or
not?

Sometimes the main thread is running PipeCapture::OnReadable() and
getting status==Glib::IO_STATUS_EOF before the set_devices_thread doing
the file system specific querying commands has run
PipeCapture::connect_signal().  At the moment I think this is working
because sourceid is initialised and being accessed by different threads.
I don't know if this is safe though.


Hopefully self explanatory debug output:

D: [15742] main()
D: [15746] GParted_Core::set_devices_thread()
...


blikd example with normal expected sequence of events:

D: [15746] Utils::execute_command(command="blkid", &output, &error, use_C_locale=1)
D: [15746] PipeCapture::PipeCapture(fd=10, &string) sourceid=0
D: [15746] PipeCapture::PipeCapture(fd=12, &string) sourceid=0
D: [15746] PipeCapture::connect_signal() sourceid=g_io_add_watch(); sourceid=30
D: [15746] PipeCapture::connect_signal() sourceid=g_io_add_watch(); sourceid=31
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_NORMAL) return true
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_NORMAL) return true
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_EOF) sourceid=30
D: [15742] PipeCapture::OnReadable() return false
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_EOF) sourceid=31
D: [15742] PipeCapture::OnReadable() return false
D: [15746] PipeCapture::~PipeCapture() sourceid=0
D: [15746] PipeCapture::~PipeCapture() sourceid=0


e2label example with questionable sequence of events:

D: [15746] Utils::execute_command(command="e2label /dev/sda1", &output, &error, use_C_locale=1)
D: [15746] PipeCapture::PipeCapture(fd=10, &string) sourceid=0
D: [15746] PipeCapture::PipeCapture(fd=12, &string) sourceid=0
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_NORMAL) return true
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_EOF) sourceid=0
D: [15742] PipeCapture::OnReadable() return false
D: [15746] PipeCapture::connect_signal() sourceid=g_io_add_watch(); sourceid=51
D: [15742] PipeCapture::OnReadable() (status==Glib::IO_STATUS_EOF) sourceid=0
D: [15746] PipeCapture::connect_signal() sourceid=g_io_add_watch(); sourceid=52
D: [15742] PipeCapture::OnReadable() return false
D: [15746] PipeCapture::~PipeCapture() sourceid=52
D: [15746] PipeCapture::~PipeCapture() sourceid=51
Comment 52 Phillip Susi 2013-04-26 21:09:02 UTC
That should be fine.  It just so happens that the foreground thread finishes before the background has had a chance to store the sourceid.  It gets stored after, then the destructor cleans up the source instead of OnReadable().  As long as it gets cleaned up at some point it should be fine.  Come to think of it, it's cleaned up automatically as soon as OnReadable() returns false, whether or not sourceid is still 0.
Comment 53 Mike Fleetwood 2013-04-27 15:38:46 UTC
Hi Phillip,

Do you see any problem if this change was made?

@@ -80,9 +80,6 @@ bool PipeCapture::OnReadable( Glib::IOCondition condition )
        if (status != Glib::IO_STATUS_EOF)
                std::cerr << "Pipe IOChannel read failed" << std::endl;
        // signal completion
-       if( sourceid > 0 )
-               g_source_remove( sourceid );
-       sourceid = 0;
        eof();
        return false;
 }

It means that the sourceid memory location is only ever accessed from
a single thread / CPU core.  If there are any extended storing of values
in registers rather than memory by compiler optimisations; out of order
execution by the CPU; or lack of memory barriers between CPUs accessing
memory there won't be any unexpected behaviour.

This does of cause continue to assume that g_io_add_watch() and
g_source_remove() are thread safe against the glib main loop polling
the list of events.  But that seems to be a given.

Thanks,
Mike
Comment 54 Phillip Susi 2013-04-27 18:15:25 UTC
That should be fine since the return false implicitly removes the source.
Comment 55 Mike Fleetwood 2013-04-28 15:38:01 UTC
Hi Phillip,

A couple of questions I'm not sure about:

1) Is it correct that 'Glib::RefPtr<Glib::IOChannel> channel' gets
   destroyed when ~PipeCapture() runs?

2) As used in OnReadable() what functions or object methods are update()
   and eof()?  I wasn't able to find out.

Thanks,
Comment 56 Phillip Susi 2013-04-28 18:21:36 UTC
(In reply to comment #55)
> 1) Is it correct that 'Glib::RefPtr<Glib::IOChannel> channel' gets
>    destroyed when ~PipeCapture() runs?

Why wouldn't it be?

> 2) As used in OnReadable() what functions or object methods are update()
>    and eof()?  I wasn't able to find out.

I don't understand.
Comment 57 Mike Fleetwood 2013-04-29 13:21:20 UTC
(In reply to comment #56)
> (In reply to comment #55)
> > 1) Is it correct that 'Glib::RefPtr<Glib::IOChannel> channel' gets
> >    destroyed when ~PipeCapture() runs?
> 
> Why wouldn't it be?

I was just checking because I don't know glib(mm) very well at all and
destructors don't get called when an there is only a pointer to it in
the object currently getting destroyed.


> > 2) As used in OnReadable() what functions or object methods are update()
> >    and eof()?  I wasn't able to find out.
> 
> I don't understand.

Do you have a link for documentation on the update() and eof() functions
because I don't know what they are.
Comment 58 Phillip Susi 2013-04-29 13:32:52 UTC
(In reply to comment #57)
> I was just checking because I don't know glib(mm) very well at all and
> destructors don't get called when an there is only a pointer to it in
> the object currently getting destroyed.

That's why it's a RefPtr and not a plain pointer; RefPtr releases the reference count on the object when it is destroyed.

> Do you have a link for documentation on the update() and eof() functions
> because I don't know what they are.

They are sigc signals ( declared in PipeCapture.h ): they call any connected handlers.  I'm using them to forward information on when the pipe hits eof or has just read data ( so we can parse the progress output and update the progress bar ) to the callers of PipeCapture.  If they are interested in these events, they connect a slot to the relevant signal.
Comment 59 Mike Fleetwood 2013-04-29 15:28:04 UTC
Hi Curtis,

I've been testing the patch from comment #48 with the change I mentioned
in comment #53.  Tested various create, delete and copy FS and partition
operations on Fedora 14, CentOS 5.9 and Debian 6.  Unless you have any
reason not to I'll apply this upstream shortly.

Thanks,
Mike
Comment 60 Curtis Gedak 2013-04-29 15:31:44 UTC
Hi Mike,

As a double check, I would like to test the same version that you tested.  Namely the patch from comment #48 with the change mentioned in comment #53.

I will report back here with the test results.

Curtis
Comment 61 Curtis Gedak 2013-04-29 17:02:51 UTC
*** Bug 696810 has been marked as a duplicate of this bug. ***
Comment 62 Curtis Gedak 2013-04-29 17:13:12 UTC
Hi Mike,

My testing of the patch from comment #48 with the changes you mentioned in comment #53 has gone well.  I tested with OpenSUSE 11.2 and Debian 6.  I plan to continue testing up until we release 0.16.1

Please feel free to commit the modified patch.

I will tentatively plan for a 0.16.1 release after this modified patch is committed, and one is available for the recent problem with
Bug #695279 - GParted doesn't compile on RHEL / CentOS 5.9

Thanks,
Curtis
Comment 63 Mike Fleetwood 2013-04-29 17:49:23 UTC
Thanks Phillip,

I've pushed the commit for inclusion in the next release of GParted:

Avoid glibmm GSource bug/crash (again) (#697727)
https://git.gnome.org/browse/gparted/commit/?id=9475731ac82e2e29bd604bf73cf2282763c58aeb
Comment 64 Curtis Gedak 2013-04-30 16:31:24 UTC
The fix to address this bug report has been included in GParted 0.16.1 released on April 30, 2013.