GNOME Bugzilla – Bug 701560
various improvements for g_file_set_contents()
Last modified: 2013-06-21 21:03:19 UTC
See patch.
Created attachment 245955 [details] [review] g_file_set_contents(): use posix_fallocate() Extents-based filesystems like knowing in advance how much data will be written to a file in order to prevent fragmentation. If we have it, use posix_fallocate() before writing data in g_file_set_contents().
In general, this function could maybe use to be cleaned up a little. There are two sore spots: a) we do a displayname conversion on the hope that there might be an error (since all error messages use it) and free it at the end b) for some reason we use fdopen() and fwrite() instead of just write() ie: two unnecessary memory allocations.
Review of attachment 245955 [details] [review]: Looks fine, one minor comment. ::: glib/gfileutils.c @@ +1065,3 @@ + * on the underlying filesystem. + */ + posix_fallocate (fd, 0, length); I like casting to (void) if we're ignoring the error, since it placates static analysis tools.
Also: from my research online it seem that ext3 and ext4 (after Linux 2.6.30) support omitting the fsync() before rename() with the default mount options. It would probably be nice to drop the fsync() for those filesystem as well (since we already do it for btrfs). The question is if we attempt to detect the 'bad' mount options and do the fsync anyway. I'd say no, because if people set those options then they are indicating that they are interested in the extra performance (even if at the cost of living life on the edge).
Comment on attachment 245955 [details] [review] g_file_set_contents(): use posix_fallocate() Attachment 245955 [details] pushed as d3be43f - g_file_set_contents(): use posix_fallocate()
Created attachment 245958 [details] [review] g_file_set_contents(): don't allocate display name g_file_set_contents() sets a GError in the event of various failures that count occur. It uses g_filename_display_name() in order to get the filename to include in the messages. Make it so that we only allocate this string in the actual event of an error instead of doing it always.
Review of attachment 245958 [details] [review]: If we're doing code cleanup, this code is crying out for a helper function that sets the error from errno and handles the display name there, like I did with https://git.gnome.org/browse/glib/commit/?id=7fc2ab9493dbb480e2b6815813af9cf8bbfc081e
Created attachment 245959 [details] [review] g_file_set_contents(): use unistd instead of stdio Use a normal write() system call instead of fdopen() and fwrite(). This will definitely work on UNIX system and should work on Windows as well... As an added bonus, we can use g_close() now as well.
Review of attachment 245959 [details] [review]: ::: glib/gfileutils.c @@ +1051,3 @@ errno = 0; + n_written = write (fd, contents, length); Er...you need to loop here. And handle EINTR. @@ +1123,2 @@ + if (!g_close (fd, err)) + goto out; This change is losing the unlink()-on-error, right?
Created attachment 245960 [details] [review] g_file_set_contents(): don't fsync on ext3/4 ext3 and ext4 (for quite some time) with default mount options don't need fsync() to ensure safety of replace-by-rename. Stop doing that for these filesystems. Note: this patch also impacts ext2, which is probably not safe, but I don't know of any way to check ext2. vs the others because they all have the same magic numbers (short of opening /proc/mount).
(In reply to comment #9) > Er...you need to loop here. And handle EINTR. For a local file? Seriously? > This change is losing the unlink()-on-error, right? Good catch. Thanks.
Review of attachment 245960 [details] [review]: We write out to a tmp file even if there wasn't an already extant file though; it looks like before we could lose data on BTRFS in this case, but you're now extending it to the far more widely deployed ext4 system. If we're going to do this we should elaborate a bit on the guarantees in the docs. All we say now is "This write is atomic in the sense that it is first written to a temporary file which is then renamed to the final name.", which doesn't really help much. We could either: 1) Make it explicit we defer to the filesystem semantics for rename() (maybe this is what the existing bit is trying to say, but we should at least mention data consistency) 2) Make it explicit we fsync() always (and do that) 3) Only fsync() in the case where we aren't renaming over an existing file (at the small cost of an lstat() of the target, this is what Gio does) You should probably back up the assertion in the first paragraph with some links. How about http://lwn.net/Articles/322823/ ? I don't like much that we're just tossing away potential safety without any idea of the relative performance increase and safety guarantees. There's lots of code that calls g_file_set_contents() now...
Created attachment 245962 [details] [review] g_file_set_contents(): don't allocate display name g_file_set_contents() sets a GError in the event of various failures that count occur. It uses g_filename_display_name() in order to get the filename to include in the messages. Factor out the error handling to make it easier to allocate the display name only when we need it (instead of allocating it every time).
Created attachment 245963 [details] [review] g_file_set_contents(): use unistd instead of stdio Use a normal write() system call instead of fdopen() and fwrite(). This will definitely work on UNIX system and should work on Windows as well... As an added bonus, we can use g_close() now as well.
(In reply to comment #12) > Review of attachment 245960 [details] [review]: > > We write out to a tmp file even if there wasn't an already extant file though; > it looks like before we could lose data on BTRFS in this case, but you're now > extending it to the far more widely deployed ext4 system. Check out the logic that's already here, though: if (g_lstat (dest_file, &statbuf) == 0 && statbuf.st_size > 0 && fsync (fileno (file)) != 0) { ie: we have only ever called fsync() in the "dest already exists and is non-empty" case. We are not losing any extra safety here. It seems that you're arguing we should do the opposite: add an explicit sync for the case that the file does _not_ exist. I can understand that this is nice from the standpoint of "by the time this function returns, your data is on disk" but consider that this is not really true anyway. The metadata for the rename has not been written out yet. I think the original intention here was simply that "this function should not destroy what was already there". We can't really make any further guarantees than this without becoming (much) slower. I think that this original intention is right, and this is exactly the guarantee that ext3/4 are providing these days, so we don't need to do more than that.
Review of attachment 245962 [details] [review]: This looks a lot nicer, one minor thing: ::: glib/gfileutils.c @@ +1013,3 @@ +format_error_message (GError **error, + const gchar *format_string, + const gchar *filename) Minor: Can we put the filename first since it's not actually used by the format string directly? Otherwise the 37 neurons in my brain that fire on "mismatched format string and arguments!" keep triggering when reading the code below.
Review of attachment 245963 [details] [review]: Looks right to me.
Created attachment 245965 [details] [review] g_file_set_contents: change {posix_ => }fallocate Use fallocate() instead of posix_fallocate() so that we just fail instead of getting the emulated version from the libc.
Comment on attachment 245962 [details] [review] g_file_set_contents(): don't allocate display name Attachment 245962 [details] pushed as c152ceb - g_file_set_contents(): don't allocate display name Pushed with suggested change, and also removed the now-unused variable.
(In reply to comment #15) > ie: we have only ever called fsync() in the "dest already exists and is > non-empty" case. We are not losing any extra safety here. I missed that; OK then. > It seems that you're arguing we should do the opposite: add an explicit sync > for the case that the file does _not_ exist. I can understand that this is > nice from the standpoint of "by the time this function returns, your data is on > disk" but consider that this is not really true anyway. The metadata for the > rename has not been written out yet. Many of my applications want transactional semantics - either the file is there with the correct data, or it's not there. Think cache files. So it'd be OK if at the end of g_file_set_contents() and the system crashed, the tmpfile was still in place and the new file didn't exist. In that case the app could just recreate it (although ideally we'd clean up the tmpfile, that's https://bugzilla.gnome.org/show_bug.cgi?id=629301#c2 ) But anyways I'm fine with the current semantics then. > > I think the original intention here was simply that "this function should not > destroy what was already there". Yeah.
Comment on attachment 245963 [details] [review] g_file_set_contents(): use unistd instead of stdio Attachment 245963 [details] pushed as e40435e - g_file_set_contents(): use unistd instead of stdio
Review of attachment 245965 [details] [review]: You could reference David's commit in your commit message or here in the bug (at the moment it was just on IRC), otherwise looks good.
Comment on attachment 245965 [details] [review] g_file_set_contents: change {posix_ => }fallocate Attachment 245965 [details] pushed as c828aef - g_file_set_contents: change {posix_ => }fallocate
(In reply to comment #22) > Review of attachment 245965 [details] [review]: > > You could reference David's commit in your commit message or here in the bug > (at the moment it was just on IRC), otherwise looks good. <mclasen> desrt: around ? <mclasen> desrt: seen this ? https://git.gnome.org/browse/gnome-disk-utility/commit/?id=d96d30229782682b1aecf697ec9ba573b7ea6aca <mclasen> I wonder how that relates to https://git.gnome.org/browse/glib/commit/?id=d3be43fcc5165b7680c9073438ad60a3652c1703
About the "don't bother with fsync" thing, my biggest concern is ext2. How much do we care? Should we figure out a way to work around this? I see two reasonable solutions here: 1) ignore that ext2 exists, and don't ever fsync on these filesystems 2) figure out a way to detect the difference, including the possibility of opening /proc/mounts and parsing it, looking for the filesystem that we are on The reason that I think that option 2 is better than "always fsync" is because - I assume ext2 will be a very rare case indeed - in the common case (ext3, ext4) the cost of opening and parsing /proc/mounts will be far less than the cost of the fsync(). That said, I'd be fine with just skipping the fsync() for ext2 as well... It's 2013, after all...
Review of attachment 245960 [details] [review]: I agree, and i don't think ext2 is commonly used enough for this to be a real life problem.
Attachment 245960 [details] pushed as 9d0c17b - g_file_set_contents(): don't fsync on ext3/4
Ok, there are two issues here we need to address: 1) We're no longer calling fsync() for *new* files on ext4. This is a very common deployment, and this can easily lead to bad corruption. I suspect most of the people hitting dconf database issues are people who are trying new logins. 2) Whether ext4 is really doing what we expect, and for what versions. There was AFAIK no actual testing of this patch in real world conditions.
Er nevermind, 1) was already discussed here: https://bugzilla.gnome.org/show_bug.cgi?id=701560#c15 But I have my doubts about 2).
Ok yeah, it's trivial to reproduce 0 size files on unclean shutdown after this patch. Which was apparently backported to the glib-2-36 branch?!? Why? System: Fedora 19 x86_64, Linux 3.9.4 Filesystem: ext4 (data=ordered) on LVM (Fedora default)
Created attachment 247362 [details] simple test program to just generate many files with g_file_set_contents() and verify them
Created attachment 247364 [details] [review] 0001-Revert-g_file_set_contents-don-t-fsync-on-ext3-4.patch
Sorry, not zero-length files, but files filled with NUL. This is AIUI characteristic of data blocks not being committed to disk. It's worth a quick peursal of: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt The docs claim that both data=ordered(*) and auto_da_alloc(*) are defaults. The combination of these are what theoretically allow replace-by-rename to be safe. Anyways, this reproduces easily in both my Fedora 19 VM (ext4-on-LVM) as well as my gnome-ostree VM (Linux 3.6.0 vanilla upstream, ext4 on bare virtio block). More data here would be useful - is this a recent regression?
Comment on attachment 247364 [details] [review] 0001-Revert-g_file_set_contents-don-t-fsync-on-ext3-4.patch Ryan gave a +1 to this on IRC.
I wrote an email to ext4 list about this issue asking for clarification. This smells like a kernel bug but I'm willing to entertain the possibility that our use of fallocate() voids the guarantee in the ext4 documentation about replace-by-rename being safe... If that's the case then I'd either prefer that they fix the kernel or that we drop our use of fallocate() rather than bring back fsync(). If the whole point of putting my data at risk is for ext4 to perform delayed allocation then using fallocate() is probably a waste of time anyway.....
colin: I don't know the code history, but on the simple 'end user experience' end, it sure quacks and walks like a regression: several people have shown up in IRC or on the downstream bug report lately reporting that their dconf databases have been zapped, and I don't recall *ever* hearing of a case of that prior to this month. It's happened to me personally and to mjg59 and to a bunch of other folks.
See also: https://bugzilla.gnome.org/show_bug.cgi?id=702785 for fixing up the dconf side of this.
*** Bug 702394 has been marked as a duplicate of this bug. ***
Ted Ts'o's reply to my email, for reference: http://www.spinics.net/lists/linux-ext4/msg38778.html Seems like the official line from the maintainer (despite what is written in the ext4 documentation) is that there are no guarantees. I'm trying to get him to update his docs. Meanwhile, with this issue addressed by Colin's reverts, I guess this is closed again...