GNOME Bugzilla – Bug 790418
"Unable to inform the kernel of the change" message may lead to corrupted partition table
Last modified: 2018-03-19 17:08:42 UTC
Created attachment 363780 [details]
Message log for GParted

I just used GParted to resize some partitions on my HDD. I booted from the Live CD (USB) so that I wouldn't have any mounting issues.

This is my (new) partition list:

    Model: ATA Samsung SSD 840 (scsi)
    Disk /dev/sdb: 120GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    Disk Flags:

    Number  Start   End     Size    Type      File system  Flags
     1      1049kB  62,9GB  62,9GB  primary   ntfs
     2      62,9GB  63,4GB  472MB   primary   ntfs
     3      63,4GB  64,5GB  1074MB  primary   ext4
     4      64,5GB  120GB   55,6GB  extended               lba
     5      64,5GB  120GB   55,6GB  logical                lvm

1. Windows
2. Windows recovery
3. /boot
4-5. Extended partition containing a logical partition, which is a physical volume containing logical volumes with my Fedora Linux & swap.

The goal was to:
a) Shrink partition 1 (Windows)
b) Move 3, 4, 5 back to the empty space
c) Grow 4 & 5 to the free space.

I unmounted & deactivated all partitions first.

In the midst of its operations, GParted showed the following dialog:

    libparted

    Partition(s) 5 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.

    Cancel | Ignore

Okay, I clicked Cancel, as the message suggests, then rebooted into the Live CD again... and my LVM partition was corrupted! The partition table looked okay, with everything moved back, but I couldn't mount or access the partition, not even with lvscan.

The solution was testdisk, which came with the Live CD (what a savior!). It turned out that although the partition table had been updated, the partitions themselves had not yet been moved when I canceled as instructed in the dialog box. So with its deep scan feature I was able to recover a version of the partition table in which the Windows partition had already been shrunk but the others were still in their original place.

Then I redid the whole thing, this time pressing Ignore in the dialog box, which came up 3 times during the operations. This time it was fine, and I then simply used lvresize and resize2fs to grow the logical volumes.

So:
- maybe this is not even a bug?
- maybe this is a bug in libparted?
- but the message dialog is shown by GParted, and it is misleading because doing what it says leads to partition table corruption that is hard to recover from, which is something the average GParted user might not want to do.
- ignoring the message leads to proper results, but there is also a dangerous Cancel button there.

I attached the log files; the above messages are listed at the end.
Hi Gergely,

Thank you for the detailed report.

Unfortunately in your case you were moving the start of partition sdb5 to the left. This was written to disk, but the kernel couldn't be informed of the change. Therefore the data in sdb5 was not moved either. This resulted in tools no longer finding the LVM PV signatures at the start of sdb5, so the partition would have appeared as unknown. You ended up doing the right thing by reverting the last partition change to recover your data.

I suspect that this is similar to previous bugs that we have worked around in GParted [1][2]. It is possibly caused by asynchronous udev triggered actions interfering with following libparted actions related to resizing (or removing and re-adding) partitions, leading to your error:

    Partition(s) 5 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.

So far I haven't been able to reproduce that error, however I have on occasion managed to produce this error:

    Error informing the kernel about modifications to partition /dev/sdb1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdb1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.

Both errors are from the same portion of the code in libparted, used to inform the kernel of partition additions, removals and changes.
http://git.savannah.gnu.org/cgit/parted.git/tree/libparted/arch/linux.c?id=v3.2#n3080

I'll see what I can do in GParted to work around this issue.

Mike

Previous bugs:
[1] Bug 604298 - Problems resizing file systems with gparted-live-0.5.0-3
    (Operations failing reporting "Error informing the kernel about modifications to partition ...")
[2] Bug 762941 - Operations sometimes failing with: No such file or directory
Thanks for your comment Mike. Let me know if you need further info.

Yes, cancelling (& rebooting) at the moment of the message led to a corrupted state in which the partition table had been updated but the data on the HDD had not.

If you look at the log, I'm pretty sure that the error came up after growing the partition:

    grow partition from 19.94 GiB to 51.75 GiB  00:02:31  ( SUCCESS )
    old start: 192614400
    old end: 234440703
    old size: 41826304 (19.94 GiB)
    new start: 125902848
    new end: 234440703
    new size: 108537856 (51.75 GiB)

(and before moving the filesystem)

I hope that you can work around this issue. It seemed to be reproducible for me...
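The sizes in that log fragment can be cross-checked from the sector numbers. A minimal sketch, assuming the 512-byte logical sector size shown in the partition listing in comment 0; the function names part_size_sectors and sectors_to_gib are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers; they assume 512-byte logical sectors.

# Size in sectors of a partition given its inclusive start and end sectors.
part_size_sectors() {
    echo $(( $2 - $1 + 1 ))
}

# Convert a sector count to GiB, rounded to two decimals.
sectors_to_gib() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", s * 512 / (1024 * 1024 * 1024) }'
}

old=$(part_size_sectors 192614400 234440703)
new=$(part_size_sectors 125902848 234440703)
echo "old size: $old ($(sectors_to_gib "$old") GiB)"
echo "new size: $new ($(sectors_to_gib "$new") GiB)"
```

The two sizes differ by 66711552 sectors, which is the distance the start of the partition moved to the left while the end stayed where it was.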
Created attachment 364072 [details]
Test ped_disk_commit_to_os() reporting errors

Here is a simple test program that I have written which reproduces the fault. All it does is use libparted to inform the kernel of the current partitions on a named hard drive. Doesn't make any changes.

Example setup
-------------
Just use any spare disk with a few partitions. I used:
* Partition table: MSDOS
* 3 primary partitions containing

    # parted /dev/sdb unit MiB print
    Model: ATA VBOX HARDDISK (scsi)
    Disk /dev/sdb: 8192MiB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    Disk Flags:

    Number  Start    End      Size     Type     File system  Flags
     1      1.00MiB  1025MiB  1024MiB  primary
     2      1025MiB  2049MiB  1024MiB  primary
     3      2049MiB  3073MiB  1024MiB  primary

Example successful test
-----------------------
    # ./c-test-0011 /dev/sdb
    calling ped_disk_commit_to_os(lp_disk) ...
    ped_disk_commit_to_os(lp_disk) succeeded
    # echo $?
    0

Example failed test
-------------------
    # ./c-test-0011 /dev/sdb
    calling ped_disk_commit_to_os(lp_disk) ...
    Error: Error informing the kernel about modifications to partition /dev/sdb1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdb1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
    Error: Failed to add partition 1 (Device or resource busy)
    ped_disk_commit_to_os(lp_disk) failed
    # echo $?
    1

How to monitor udev and trace the test program
----------------------------------------------
To monitor the kernel and udev events, while tracing the test program for relevant ioctl() OS calls, do this.

    # udevadm monitor &
    # strace -i ioctl ./c-test-0011 /dev/sdb
Correction for strace command. "-i" -> "-e". Should be:

    # strace -e ioctl ./c-test-0011 /dev/sdb
Created attachment 364078 [details]
Test case 1: CentOS 7 failure monitored and traced

Test case 1: CentOS 7 VirtualBox VM failure
-------------------------------------------
On CentOS 7 it is generally quite easy to get the test case to fail. Just run it half a dozen times or so for it to fail with the following error.

    Error: Error informing the kernel about modifications to partition /dev/sdb1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdb1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.

Attached file is the full trace from this failure.

Libparted call chain:

    ped_disk_commit_to_os()          parted/libparted/disk.c
      linux_disk_commit()            parted/libparted/arch/linux.c
        _disk_sync_part_table()      parted/libparted/arch/linux.c
          _blkpg_add_partition()     parted/libparted/arch/linux.c
            ped_exception_throw(..., message="Error informing the kernel about modifications to ...)

http://git.savannah.gnu.org/cgit/parted.git/tree/libparted/arch/linux.c?h=v3.1#n2385

Log fragment:

    ioctl(3, BLKPG, {BLKPG_ADD_PARTITION, flags=0, datalen=152, data={start=1048576, length=1073741824, pno=1, devname="/dev/sdb1", volname=""}}) = -1 EBUSY (Device or resource busy)
    Error: Error informing the kernel about modifications to partition /dev/sdb1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdb1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
    Error: Failed to add partition 1 (Device or resource busy)
    ped_disk_commit_to_os(lp_disk) failed

In summary, what happens is that libparted 3.1 tells the kernel to remove all partitions (ignoring failures because the partition is busy) and then tells the kernel to re-add all partitions. The kernel then also generates udev userspace events which are processed asynchronously. However in this failure case the kernel reported the device as busy when re-adding partition sdb1. Hence libparted reported the error and failed the ped_disk_commit_to_os() call.
Created attachment 364175 [details]
Test case 2: CentOS 7 success monitored and traced

Test case 2: CentOS 7 VirtualBox VM success
-------------------------------------------
Attached file is the full trace from a successful test.

Log fragment:

    ioctl(3, BLKPG, {BLKPG_ADD_PARTITION, flags=0, datalen=152, data={start=1048576, length=1073741824, pno=1, devname="/dev/sdb1", volname=""}}) = 0
    ioctl(3, BLKPG, {BLKPG_ADD_PARTITION, flags=0, datalen=152, data={start=1074790400, length=1073741824, pno=2, devname="/dev/sdb2", volname=""}}) = 0
    ioctl(3, BLKPG, {BLKPG_ADD_PARTITION, flags=0, datalen=152, data={start=2148532224, length=1073741824, pno=3, devname="/dev/sdb3", volname=""}}) = 0
    KERNEL[916986.831800] add /devices/pci0000:00/0000:00:0d.0/ata4/host3/target3:0:0/3:0:0:0/block/sdb/sdb1 (block)
    KERNEL[916986.831815] add /devices/pci0000:00/0000:00:0d.0/ata4/host3/target3:0:0/3:0:0:0/block/sdb/sdb2 (block)
    KERNEL[916986.831821] add /devices/pci0000:00/0000:00:0d.0/ata4/host3/target3:0:0/3:0:0:0/block/sdb/sdb3 (block)
    ped_disk_commit_to_os(lp_disk) succeeded

The only real difference is that all the user space udev rules triggered by informing the kernel of the removal of the partitions completed a fraction earlier, before libparted got around to re-adding the partitions.
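The start and length values in the BLKPG_ADD_PARTITION calls are byte counts, and they correspond to the MiB figures in the "parted /dev/sdb unit MiB print" listing from comment 3. A small sketch of that conversion; the helper name mib_to_bytes is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: convert MiB to bytes (1 MiB = 1048576 bytes).
mib_to_bytes() {
    echo $(( $1 * 1048576 ))
}

# The three partition start offsets and the common partition size, in MiB,
# as listed in the example setup.
for mib in 1 1025 2049 1024; do
    echo "${mib} MiB = $(mib_to_bytes "$mib") bytes"
done
```

The results match start=1048576, start=1074790400, start=2148532224 and length=1073741824 in the trace.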
Test case 3: Fedora 27 VirtualBox VM failure
--------------------------------------------
On Fedora 27 it is less likely that the test case fails. Have to run the test program continuously in a loop to get it to fail, but it only takes a minute or less.

    # while ./c-test-0011 /dev/sdb
    > do
    >     sleep 0
    > done
    ... output trimmed ...
    calling ped_disk_commit_to_os(lp_disk) ...
    Error: Partition(s) 1 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
    ped_disk_commit_to_os(lp_disk) failed

This is the same error Gergely reported.

So far I have not been able to trigger the failure while stracing the test program. It must modify the timing enough for the error to never occur.

    # while strace -e ioctl ./c-test-0011 /dev/sdb
    > do
    >     sleep 0
    > done

Libparted 3.2 call chain to this error:

    ped_disk_commit_to_os()          parted/libparted/disk.c
      linux_disk_commit()            parted/libparted/arch/linux.c
        _disk_sync_part_table()      parted/libparted/arch/linux.c
          ped_exception_throw(..., message="Partition(s) %s on %s have been written, but we have ...)

http://git.savannah.gnu.org/cgit/parted.git/tree/libparted/arch/linux.c?h=v3.2#n3077
Test program c-test-0011.c (attachment 364072 [details]) is basically doing a similar job to partprobe. So does partprobe fail? Partprobe doesn't seem to fail when just run a few times on the command line, however when run in a loop we soon hit a timing coincidence and get it to fail.

Test case 4: CentOS 7 VirtualBox VM failure of partprobe
--------------------------------------------------------
    # while partprobe /dev/sdb
    > do
    >     sleep 0
    > done
    Error: Partition(s) 1 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.

Test case 5: Fedora 27 VirtualBox VM failure of partprobe
---------------------------------------------------------
    # while partprobe /dev/sdb
    > do
    >     sleep 0
    > done
    Error: Partition(s) 1 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
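The while loops in these test cases stop at the first failure but do not say how many runs it took. A possible variant that counts the runs and gives up after a fixed number is sketched here; the function name runs_until_failure and its argument order are assumptions for illustration, not part of the attached test program:

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX times and report the
# first run that fails.
# Usage: runs_until_failure MAX COMMAND [ARGS...]
runs_until_failure() {
    max=$1
    shift
    n=1
    while [ "$n" -le "$max" ]; do
        # Discard the command's stdout; its error messages stay visible on stderr.
        if ! "$@" >/dev/null; then
            echo "failed on run $n"
            return 1
        fi
        n=$(( n + 1 ))
    done
    echo "no failure in $max runs"
    return 0
}

# Example (needs root and a spare disk, as in the test cases above):
#   runs_until_failure 1000 partprobe /dev/sdb
```

Because the failure is timing dependent, the run count will vary between attempts and between machines.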
Created attachment 364312 [details]
Test case 6: CentOS 7 comparison of c-test-0011 and partprobe logs

In the CentOS 7 VM, c-test-0011 is much more likely to fail than partprobe. c-test-0011 will usually fail within half a dozen runs when run manually on the command line, whereas partprobe has never failed when run singly on the command line, but only ever in a continuous while loop.

On CentOS 7, the parted-3.1-28.el7 RPM contains this change:

    $ rpm -q --changelog parted
    ...
    * Thu May 26 2016 Brian C. Lane <bcl@redhat.com> - 3.1-26
    - partprobe: Open the device once for probing
      Resolves: rhbz#1339705

* rhbz#1339705 - ceph-disk prepare: Error: partprobe /dev/vdb failed : Error: Error informing the kernel about modifications to partition /dev/vdb1 -- Device or resource busy.
  https://bugzilla.redhat.com/show_bug.cgi?id=1339705

Which is this upstream parted commit, post 3.2, but back ported into CentOS 7, parted-3.1-27.el7:

    http://git.savannah.gnu.org/cgit/parted.git/commit/?id=cfafa4394998a11f871a0f8d172b13314f9062c20
    Author: Brian C. Lane <bcl@redhat.com>
    Date: Wed May 25 09:00:04 2016 -0700

        partprobe: Open the device once for probing

        Previously there were 3 open/close pairs for the device, which may
        result in triggering extra udev actions. Instead, open it once at
        the start of process_dev and close it at the end.

Comparing the source of c-test-0011.c and partprobe.c (from CentOS 7 parted-3.1-28.el7), the sequences of libparted APIs called look like this:

    CMD:   c-test-0011 /dev/sdb           partprobe /dev/sdb
    APIs:  ped_device_get("/dev/sdb")     ped_device_get("/dev/sdb")
                                          ped_device_open()
                                          ped_disk_probe()
           ped_disk_new()                 ped_disk_new()
           ped_disk_commit_to_os()        ped_disk_commit_to_os()
                                          ped_disk_destroy()
                                          ped_device_close()

Test case 6: CentOS 7 VirtualBox VM comparing c-test-0011 and partprobe
-----------------------------------------------------------------------
Tested both of these, capturing udev events and stracing additional OS calls, thus:

    Terminal 1:             Terminal 2:
    # udevadm monitor
                            # strace -e open,ioctl,close ./c-test-0011 /dev/sdb
    ^C
    # udevadm monitor
                            # strace -e open,ioctl,close partprobe /dev/sdb
    ^C

Captured log files:
    c-test-0011-udevadm-monitor.log
    c-test-0011-strace.log
    partprobe-udevadm-monitor.log
    partprobe-strace.log

Examining the log files shows that without ped_device_open() to hold a file handle open to the device, libparted internally opens and closes the device an extra time in order to issue ioctl()s for ped_disk_new() and ped_disk_commit_to_os() separately.

    $ fgrep 'open("/dev/sdb"' c-test-0011-strace.log
    open("/dev/sdb", O_RDONLY) = 3
    open("/dev/sdb", O_RDWR) = 3
    open("/dev/sdb", O_RDWR) = 3
    $ fgrep 'open("/dev/sdb"' partprobe-strace.log
    open("/dev/sdb", O_RDONLY) = 3
    open("/dev/sdb", O_RDWR) = 3

This leads to fewer udev events being triggered by partprobe, making the timing dependent triggering of the error less likely.

    $ fgrep 'KERNEL[' c-test-0011-udevadm-monitor.log | wc -l
    20
    $ fgrep 'KERNEL[' partprobe-udevadm-monitor.log | wc -l
    13
Created attachment 364331 [details]
GParted details of move operation with multiple libparted messages

Hi Curtis,

At the moment GParted reports all libparted messages in the operation results at the end of each operation. I am thinking that they should be reported after each step so that it is clear which libparted messages occurred when.

As an example see the attached GParted details. I did a combined resize and move operation. It has a couple of genuine as well as some debugging exceptions. Can you work out when all the exceptions occurred and whether they were resolved successfully or not (clicked [Ignore] or [Cancel])?

Answer:
1st shrink partition:
    synthetic exception clicked [Ignore] for success.
2nd grow partition:
    informing kernel exception clicked [Ignore] for success.
    informing kernel exception clicked [Ignore] for success.
    synthetic exception clicked [Ignore] for success.
3rd shrink partition:
    synthetic exception clicked [Cancel] for failure.

I'm planning to change the code accordingly unless you have a reason not to.

Thanks,
Mike
Hi Mike, If you've got some ideas on how to improve the libparted feedback and tracking then please do proceed. If I recall, earlier versions of GParted did not even permit the user to respond to libparted questions/messages. Phillip made great strides in this area when he added a libparted exception handling popup around 0.12.0. Curtis
Raised separate bug 790842 to "Report libparted messages into operation details at the point at which they occur".
Created attachment 364581 [details] [review]
Avoid errors informing the kernel about partition changes (v1)

Hi Curtis,

Here's patchset v1 for this.

One thing I am still wondering about is whether to fix the following. User does a resize/move which involves moving the start of a partition to the left. Failure of this step still leaves the file system inaccessible. Thinking that we might have to try to undo that partition change.

Thanks,
Mike
Hi Mike,

Thanks for the patch set. I'll take a look at it soon.

> One thing I am still wondering about is whether to fix the following.
> User does a resize/move which involves moves the start of a partition to
> the left. On failure of this still leaves the file system inaccessible.
> Thinking that we might have to try to undo that partition change.

I think that as long as it is 100% safe to undo the partition change then it is okay to do so.

One situation that qualifies is when there is no overlap when moving the partition. By this I mean the end of the new partition location must be less than the start of the old partition location. Otherwise when moving left there is a chance that the start of the old partition may have been overwritten. In such a case it is not valid to restore the old partition boundaries. Of course you could add code to track whether this situation actually occurred.

Another issue is that many times when I've seen a partition move fail it is due to hardware errors. Such hardware errors may make the situation worse by trying to further write to the drive to change the partition boundaries.

At least those are my thoughts. :-)

Curtis
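The no-overlap condition described in this comment (the end of the new location is less than the start of the old location) can be written as a simple check. This is only a sketch of the stated rule, not code from GParted; the function name move_left_has_no_overlap and the sector numbers are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper. Arguments are sector numbers.
# Succeeds when the new partition location ends before the old location
# started, so a move to the left cannot have overwritten the old data.
# Usage: move_left_has_no_overlap OLD_START NEW_END
move_left_has_no_overlap() {
    old_start=$1
    new_end=$2
    [ "$new_end" -lt "$old_start" ]
}

# Illustrative values only, not taken from the attached logs.
if move_left_has_no_overlap 4096 4095; then
    echo "no overlap: old boundaries can be restored"
else
    echo "overlap: start of the old partition may have been overwritten"
fi
```

A check like this says nothing about the hardware error case raised in the same comment, which would need separate handling.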
Hi Mike,

Thanks for the patch set with very detailed comments. I appreciate the explanations.

In reviewing patch set v1 from comment #13 I came up with the following minor comment:

P2/2 - Wait for the kernel and udev to settle partitions for a second time

Minor wording issue: Change "too" to "to".

CHANGE FROM:
    and re-adds the partitions again. Need to wait for these to complete
    too prevent any following step failing with missing partition device
    ^^^
    nodes.

TO:
    and re-adds the partitions again. Need to wait for these to complete
    to prevent any following step failing with missing partition device
    nodes.

I will make this change prior to committing the patch set.

In my testing I found no regressions.

If there are no objections then I will commit this patch set in the next day or so.

Curtis
Hi Curtis,

I am only talking about rolling back a single failed step to modify the partition boundaries, as implemented by resize_move_partition(). Not about restoring the original partition as it was at the start of the operation.

Given that implementing a resize/move operation involves a specific ordered sequence of partition adjustments, file system moves and file system resizes, failing at any point must be a safe place to stop. So stopping either before or after a single partition adjustment is safe, and therefore rolling back just that single failed partition adjustment is safe.

It seems however that in some cases rolling back the partition change would be preferable and in other cases it would not. A couple of simple examples:
1) Fails to grow partition before growing the FS. Better to roll back.
2) Fails to shrink partition after shrinking the FS. Better to keep the new partition if at all possible.

On the failure front, GParted_Core::commit() has two steps which could fail: writing to the disk and informing the kernel. Also, with GPT having two copies of the partition table, it would appear to be possible for one to succeed and the second to fail.

Going to analyse all the cases when resize_move_partition() is used more carefully to see when partition adjustment rollback is preferred.

Thanks,
Mike
Hi Mike,

If you identify some use cases where it is better to roll back than to leave the partition table as is, then I am certainly open to such improvements.

Thanks again for patch set v1 in comment #13. I have committed this patch set to the git repository.

The relevant git commits can be viewed at the following links:

Avoid libparted failing to inform the kernel about partition changes (#790418)
https://git.gnome.org/browse/gparted/commit/?id=f49f0bb2b8114d867469dcbed6b501e69c078b0d

Wait for the kernel and udev to settle partitions for a second time (#790418)
https://git.gnome.org/browse/gparted/commit/?id=2f53876c0fc8bceddabe739c298e19e7939d9ad7

Curtis
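The second commit title refers to waiting for the kernel and udev to settle. For manual testing on the command line, a comparable wait can be sketched with "udevadm settle", which blocks until the udev event queue is empty or the timeout expires. This is only a command-line analogue under that assumption, not the code in the patch set; the function name with_settle, the SETTLE_CMD override and the 10 second timeout are hypothetical:

```shell
#!/bin/sh
# Hypothetical wrapper: wait for pending udev events before and after
# running a command. SETTLE_CMD can be overridden, for example on a
# machine without udev.
SETTLE_CMD=${SETTLE_CMD:-"udevadm settle --timeout=10"}

with_settle() {
    $SETTLE_CMD          # wait for events queued by earlier steps
    "$@"
    status=$?
    $SETTLE_CMD          # wait for events triggered by this command
    return "$status"     # report the wrapped command's exit status
}

# Example (needs root and a spare disk):
#   with_settle partprobe /dev/sdb
```

Waiting like this narrows the timing window for the error; it should not be read as a guarantee that the error cannot occur.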
So, the patches applied in comment 17 work, but they still don't completely prevent GParted/libparted encountering an error informing the kernel of the updated partitions.

On CentOS 7 with libparted 3.1 GParted can still fail. One captured failure from just deleting a partition in GParted has this call chain:

    delete_partition()
      commit()
        commit_to_os()
          ped_disk_commit_to_os()

Libparted encountered an error and reported this exception:

    Error informing the kernel about modifications to partition /dev/sdb1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdb1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.

At the moment I don't understand the trigger for this error, because the new code in commit() gets libparted to keep a single file handle open over the ped_disk_commit_to_dev() and ped_disk_commit_to_os() calls, and I thought the kernel and udev only fired events when the open file handle was closed, which is after the call to ped_disk_commit_to_os().

Mike
Raised separate bug 791875 "Rollback specific failed partition change steps" to prevent leaving the user's data apparently lost when partition changes fail. This is meant to provide a complete solution to the issue faced by Gergely as described in comment 0. Selecting Cancel in response to a libparted exception reporting failure to inform the kernel will roll back selected partition changes, ensuring that boundaries match the file system data and preventing partitions from being reported as unknown and the user's data from being lost.
This enhancement was included in the GParted 0.31.0 release on March 19, 2018.