GNOME Bugzilla – Bug 91494
Please, make sure that it is able to edit gigabytes large files
Last modified: 2021-05-25 17:46:58 UTC
Old ghex seemed to try loading everything into memory at once and
segfaulted on large files (which mostly need hex editing :) ). Please avoid
such lame behaviour in Gnome2.
True, but how should one do it? Using mmap() would be a nice way; the
only problem is that the file on disk would then be modified directly...
Jaka, use mmap and with msync you can back out the changes that were
made to the memory if necessary.
*** Bug 137841 has been marked as a duplicate of this bug. ***
You could also try implementing a paging solution, kind of like lfhex
(http://freshmeat.net/projects/lfhex) does it.
mmap() won't work for large (>=2 GiB) files (on x86, anyway)
A paging solution doesn't seem too difficult, seeing how access to the data is
via the HexDocument API.
Any news on this?
> mmap() won't work for large (>=2 GiB) files (on x86, anyway)
> A paging solution doesn't seem too difficult,
Yeah, you could map just a window of 1 MB around the current display position, or something. You could even get clever and make madvise(2) calls when the user is paging up/down through the file.
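A rough sketch of the window idea, in Python only because its mmap module is a thin wrapper over mmap(2); ghex itself is C. The function name map_window and the scratch-file demo are made up for illustration, and it assumes a non-empty file with 0 <= position < file_size. The madvise(2) hints are left out.

```python
import mmap
import tempfile

WINDOW = 1 << 20  # 1 MiB window, as suggested above


def map_window(fd, file_size, position, window=WINDOW):
    """Map a read-only window of up to `window` bytes around `position`.

    Returns (mapping, start), where `start` is the file offset of mapping[0].
    """
    gran = mmap.ALLOCATIONGRANULARITY
    start = max(0, position - window // 2)
    start -= start % gran  # mmap offsets must be aligned to the granularity
    length = min(window, file_size - start)
    return mmap.mmap(fd, length, access=mmap.ACCESS_READ, offset=start), start


# Demo on a sparse 3 MiB scratch file holding a single marker byte.
with tempfile.NamedTemporaryFile() as f:
    f.truncate(3 << 20)
    f.seek(2_000_000)
    f.write(b"\xAB")
    f.flush()
    view, start = map_window(f.fileno(), 3 << 20, 2_000_000)
    marker = view[2_000_000 - start]  # translate file offset to window index
    view_len = len(view)
    view.close()
```

Only the window is ever mapped, so the >= 2 GiB address-space limit on 32-bit x86 stops mattering; the caller just has to remap when the display position leaves the window.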
You could mmap with MAP_PRIVATE and copy changed pages to the file on save. It doesn't look like it's possible to change a map from private to shared, i.e. ask the kernel to make your copied-on-write version the official version that should be written to disk. If you want to implement edit != save, or undo, probably you should just be read()ing from the file instead of messing around with mmap. Unless mmap avoids changing the API of the objects...
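To make the MAP_PRIVATE variant concrete, here is a minimal sketch, assuming a Unix system (os.pwrite) and using Python's mmap.ACCESS_COPY, which is its name for a private copy-on-write mapping. The class PrivateMapDoc and its dirty-page set are hypothetical; the kernel does not report which pages were copied, so the editor has to track them itself.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


class PrivateMapDoc:
    """Edits land in a copy-on-write mapping; save() writes dirty pages back."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR)
        # ACCESS_COPY: writes change memory only, never the underlying file.
        self.map = mmap.mmap(self.fd, 0, access=mmap.ACCESS_COPY)
        self.dirty = set()  # page numbers touched since the last save

    def set_byte(self, offset, value):
        self.map[offset] = value
        self.dirty.add(offset // PAGE)

    def save(self):
        for page in sorted(self.dirty):
            start = page * PAGE
            # Slicing stops at the end of the map, so a short last page is fine.
            os.pwrite(self.fd, self.map[start:start + PAGE], start)
        self.dirty.clear()

    def close(self):
        self.map.close()
        os.close(self.fd)


# Demo: the file on disk only changes once save() is called.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(2 * PAGE))
    path = f.name
doc = PrivateMapDoc(path)
doc.set_byte(PAGE + 5, 0xFF)
with open(path, "rb") as f:
    before_save = f.read()[PAGE + 5]
doc.save()
with open(path, "rb") as f:
    after_save = f.read()[PAGE + 5]
doc.close()
os.remove(path)
```

Note that there is no cheap undo here: once a page has been modified, the original contents are only on disk, which is part of why plain read() plus an edit list may be the simpler design.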
BTW, mmap()ing all of a file isn't the fastest way to read through it. The TLB misses hurt more than making read() system calls. mmap of small regions with MAP_POPULATE is probably good, though. Not sure if reusing the same chunk of virtual address space helps, but probably. Probably not a big deal, though. Using huge pages cuts down on TLB misses, but then you have to wait for the whole 2MB to page in before the process can run again. And having the one or two 4kB pages that fit in the displayed window sooner is probably a lot better.
Only read-only mappings are portable in glib2, so surely it's best to use those. That way MapViewOfFile() can be used on Windows without relying on OS-specific semantics, which are buggy in many kernel versions anyway... Modifications can be stored on the heap and merged for display and file-save through the HexDocument interface.
That way the UI, file-save behaviour and undo/redo don't need massive changes.
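A minimal sketch of that design, in Python for brevity: a read-only mapping plus a heap-side table of overwritten bytes, merged on read. OverlayDocument and its methods are hypothetical names and not the real HexDocument API, and only overwrite edits are modelled (no inserts).

```python
import mmap
import tempfile


class OverlayDocument:
    """Read-only mapping of the file plus a dict of overwritten bytes."""

    def __init__(self, fileobj):
        self.base = mmap.mmap(fileobj.fileno(), 0, access=mmap.ACCESS_READ)
        self.changes = {}  # offset -> new byte value (overwrite edits only)

    def set_byte(self, offset, value):
        if not 0 <= offset < len(self.base):
            raise IndexError(offset)
        self.changes[offset] = value

    def read(self, offset, length):
        """Return file contents for display with pending edits merged in."""
        end = min(offset + length, len(self.base))
        chunk = bytearray(self.base[offset:end])
        for pos, value in self.changes.items():
            if offset <= pos < end:
                chunk[pos - offset] = value
        return bytes(chunk)

    def undo(self, offset):
        """Drop the pending edit at `offset`, if any."""
        self.changes.pop(offset, None)


# Demo: the edit is visible through read() but the mapping is untouched.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello world")
    f.flush()
    doc = OverlayDocument(f)
    doc.set_byte(0, ord("J"))
    merged = doc.read(0, 5)
    on_disk = bytes(doc.base[:5])
    doc.undo(0)
    reverted = doc.read(0, 5)
    doc.base.close()
```

read() scans every pending change here, which is fine for a handful of edits; a sorted structure would be the obvious upgrade if the change list grows.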
It will be necessary to write modified files to a temporary file and rename them anyway in order to prevent bugs like the following (or disk full conditions etc.) from mangling the user's data:
I think the performance costs of that approach will be unnoticeable compared to the cost of rendering font glyphs etc. I doubt anyone will be typing or even pasting in gigabytes of changes. :)
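For reference, the write-to-temporary-then-rename save might look roughly like this. It is a sketch in Python, not ghex code; save_atomically is a made-up name, and a real implementation would also carry over the original file's permissions and ownership, which mkstemp does not do.

```python
import os
import tempfile


def save_atomically(path, chunks):
    """Write `chunks` to a temp file beside `path`, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so the rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the data on disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Demo: replace a small file and check nothing is left behind.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "image.bin")
    with open(target, "wb") as f:
        f.write(b"old data")
    save_atomically(target, [b"new ", b"data"])
    with open(target, "rb") as f:
        saved = f.read()
    leftovers = os.listdir(d)
```

If the disk fills up or the program dies mid-write, the original file is still intact because it is only replaced after the new copy is complete.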
> It will be necessary to write modified files to a temporary file and rename
> them anyway in order to prevent bugs like the following (or disk full
> conditions etc.) from mangling the users data:
A read-only mapping plus edit changes lends itself pretty well to overwrite in-place saves: just walk through the change list, seek and write. (or pwrite(2), if available).
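Something like the following is what I mean by walking the change list, sketched in Python and assuming a Unix system for os.pwrite. The {offset: byte} change format and the names coalesce and save_in_place are hypothetical; a robust version would also loop on short writes.

```python
import os
import tempfile


def coalesce(changes):
    """Turn {offset: byte} into sorted (offset, bytes) runs of adjacent offsets."""
    runs = []
    for offset in sorted(changes):
        if runs and runs[-1][0] + len(runs[-1][1]) == offset:
            runs[-1][1].append(changes[offset])  # extends the previous run
        else:
            runs.append((offset, bytearray([changes[offset]])))
    return [(off, bytes(data)) for off, data in runs]


def save_in_place(path, changes):
    """Overwrite only the changed bytes; the file size never changes."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset, data in coalesce(changes):
            os.pwrite(fd, data, offset)  # positioned write, no seek needed
        os.fsync(fd)
    finally:
        os.close(fd)


# Demo on a ten-byte scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789")
    demo_path = f.name
save_in_place(demo_path, {1: ord("a"), 2: ord("b"), 8: ord("z")})
with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)
```

The same code works unchanged on a block device, since it never truncates or extends the target.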
Hmm, maybe do this only if none of the changes are inserts. I hadn't realized that ghex2 supported insert mode, which is unusual for a hex editor. Insert mode might have to be disabled for block devices unless there is a save routine that will shuffle all the following data over to make room. In that case, you have a lot of I/O to do, and will leave the file somewhat corrupted if you crash part way through. (Let alone if the system crashes.)
When I reported that Ubuntu bug, I suggested that modifying in-place would be the way to go, since I didn't know about insert mode. Modifying in-place (on non-sparse files) never gets ENOSPC, and hex editors are often used on large files like disk images, WAV files, or even block devices. Block devices _must_ be modified in-place, and I wouldn't want a hex editor that wrote a second copy of a multi-GB file when it could have just overwritten the changed page or disk block. We're talking 10 ms save time (1 disk seek) vs. maybe 2 minutes for a large file.
If you're worried about ENOSPC, use posix_fallocate(3) before doing any writing. On Linux >= 2.6.23, it uses fallocate(2)...
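A small sketch of reserving space up front, using Python's os.posix_fallocate wrapper (available on Linux; not on every platform). The helper name reserve_space is made up, and it only treats ENOSPC as the expected failure.

```python
import errno
import os
import tempfile


def reserve_space(fd, total_size):
    """Allocate blocks for bytes 0..total_size; return False if the disk is full."""
    try:
        os.posix_fallocate(fd, 0, total_size)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise
    return True


# Demo: reserving 64 KiB on an empty scratch file also extends its size.
with tempfile.TemporaryFile() as f:
    reserved = reserve_space(f.fileno(), 1 << 16)
    size_after = os.fstat(f.fileno()).st_size
```

If this returns False, nothing has been written yet, so the save can be abandoned with the file contents untouched.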
I don't know glib's APIs, so sorry if my suggestions aren't very useful. I do know that usual behaviour for a hex editor is to modify in-place, since usually they don't support inserting bytes anywhere, even at the end.
Yes, good point, I didn't think about block devices. Indeed the only sane way for that is to edit in place with pwrite(2) or such. And in that case insert mode is just meaningless anyway. Probably we ought to fstat(2) and check for non-regular files and then disable insert mode.
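The check itself is tiny. A sketch in Python (the C version would use fstat(2) and the S_ISREG macro in the same way); insert_mode_allowed is a hypothetical name.

```python
import os
import stat
import tempfile


def insert_mode_allowed(fd):
    """Insert/delete edits only make sense for regular files, which can be resized."""
    return stat.S_ISREG(os.fstat(fd).st_mode)


# Demo: a regular scratch file passes, a pipe does not.
with tempfile.TemporaryFile() as f:
    regular_ok = insert_mode_allowed(f.fileno())
r, w = os.pipe()
pipe_ok = insert_mode_allowed(r)
os.close(r)
os.close(w)
```

Block and character devices fail the same test, so they would fall through to overwrite-only editing.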
The problem with fallocate() and so on is that it's Linux specific (or POSIX at best). Maybe not a problem since current code-base is using lstat(2) anyway.
But when you're editing regular files the reasons to write the full thing out are not just ENOSPC; think about a power cut or a program crash. Program crashes are bad enough to think about (for programmers anyway), but wiping out a few GBs of the user's data is guaranteed to annoy them. Maybe the save code ought to do backup by default (for regular files) and allow the user to disable this if they don't want to wait. How about that?
In the meantime I've started hacking on a portable mmap patch using GMappedFile. First cut will be without any insert mode.
> think about power-cut or the program crashes.
Yes, that's a major concern for saves that used insert mode, requiring shifting the rest of the file. But overwriting a few blocks can't cause the whole file to be lost, no matter what happens. You could end up with some blocks updated and some blocks not, though.
I've found with XFS, if you lock the machine hard right after writing a new file and moving it over an old file, you often end up with an empty file after a reboot. (In my case, it was xorg.conf, and the lockup was caused by the X server I started right after saving, to see if I'd finally got multi-seat working... I made xorg.conf a symlink to an NFS filesystem. I don't know why I didn't just run sync before startx, though. Anyway...) So writing a new file and replacing the old file is not always better, esp. for small files. For large files (relative to free RAM), the FS would have to start allocating space before the write is finished, but delayed allocation might still be a problem at the end of the file.
> Maybe the save code ought to do backup by default (for regular files) and
> allow user to disable this if they don't want to wait. How about that?
That's ok for edits with inserts. I think the default for pure overwrite edits should be to rewrite just the changed blocks. This is what people expect a hex editor to do.
behaviour could be as follows:
0. if there are no insert/delete edits, overwrite in place.
1. if inserts/deletes are only at the tail of the file, simply append/truncate.
2. if file size > 100MB, warn about slow saves when enabling insert mode.
Assuming someone writes a robust algorithm for saving insert/delete edits by shuffling data, we would need a heuristic to decide when to use it, at least by default:
3. if the inserts/deletes affect < 25MB and < 5% of the file size, overwrite in place.
3b (maybe have a higher threshold where you prompt the user asking how to save)
4. if the file is > 100MB, or if the filesystem doesn't have enough space for a copy, always prompt before saving a new file and replacing (if that's what the following rules say to do.)
5. default: save new file and replace.
This doesn't take into account the risk of bugs in the rewrite-in-place code when saving edits with lots of adds and deletes nearby, or whatever else might hit corner cases in an algorithm that walked through an edit list calculating shuffle distances. Or how recoverable the save process might be if interrupted part-way. (i.e. whether there will usually be data that's not in the file, only in ghex2's buffers.) To make it safe, ghex2 might have to implement a journaling algorithm and use fsync a lot, and that's starting to get silly. That's why I suggested limiting insert/delete overwrite in place to cases where it affects a small part of a large file, with an absolute size cap. So it will never be used in cases where it's going for very long.
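Written out as code, the save-time rules above come to roughly the following. This is a sketch in Python with made-up names; "affected_bytes" is my reading of "affects" as the amount of data that would have to be shifted, can_shuffle stands for whether a robust in-place shuffling routine exists at all, and rule 2 is left out because it applies when insert mode is enabled, not at save time.

```python
from enum import Enum

MB = 1_000_000


class Save(Enum):
    OVERWRITE = "overwrite changed blocks in place"
    RESIZE_TAIL = "append to or truncate the end of the file"
    SHUFFLE = "shift data in place to make room"
    PROMPT = "ask before writing a full copy"
    COPY = "write a new file and rename it over the original"


def choose_save(file_size, free_space, has_resizing_edits,
                tail_only, affected_bytes, can_shuffle=False):
    """Pick a save strategy following rules 0, 1, 3, 4 and 5 above."""
    if not has_resizing_edits:                          # rule 0
        return Save.OVERWRITE
    if tail_only:                                       # rule 1
        return Save.RESIZE_TAIL
    if (can_shuffle and affected_bytes < 25 * MB
            and affected_bytes < 0.05 * file_size):     # rule 3
        return Save.SHUFFLE
    if file_size > 100 * MB or free_space < file_size:  # rule 4
        return Save.PROMPT
    return Save.COPY                                    # rule 5
```

For example, a 1 GB image with 10 MB of shifted data gives Save.SHUFFLE when can_shuffle is true and Save.PROMPT otherwise, while the same edit on a 10 MB file with plenty of free space falls through to Save.COPY.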
Another reason to overwrite in place is that sometimes you might want to edit a file that another program has open, e.g. a disk image used by qemu. Replacing the disk image with a modified snapshot could lose data if the virtual system under qemu remounts read-write and makes changes to the block device. Those changes will go to the (deleted) original, not the edited copy, so will be lost when qemu exits.
So ghex should probably always warn when enabling insert mode, even for small files, since the user might be editing a file that another process has open or mapped.
Copying large files in an almost-full filesystem will lead to fragmentation of the file, too. If someone had a disk image that was contiguous, and ghex replaced it with a fragmented file that made I/O slower in a virtual machine, that would be a bad thing.
Again, it's a matter of what people expect a hex editor to do. By all means, have a preference setting for save mode, but at least warn when e.g. enabling insert mode, or anything that will make it not overwrite in place.
I'm glad to hear you're working on it. I hope ghex will soon be a fully functional hex editor. Sorry I don't have time to help with the code; I have too many other things going on...
Yeah, I think the point you're making is along the lines of "Hell, you're using a hex editor, you (presumably) know what you're doing" :)
As for editing files another process has open or mapped, there's no way of telling in general whether it will have the intended effect or just blow up in your face. But I'm happy to give people all the rope they need on that one because it may indeed be the user's full intention to do that, and for good reasons (debugging, fault injection, reverse engineering and other creative hacks).
So I'm convinced. It won't be a problem to implement two different file-save functions (re-write/overwrite). And the principles of your heuristic are good, i.e. overwrite where it's possible to do so safely, new-file if not, and some sort of way for the user to force re-write if that's what they want. Oh and of course, some sort of feedback so that they know what is going on.
I'm still quite busy with school etc., but making incremental progress with the patch, so watch this space.
> "Hell, you're using a hex editor, you (presumably) know what you're doing" :)
_exactly_. And that's why it's important to follow the principle of least surprise (I think ESR talks about that in his book, The Art of Unix Programming).
> and some sort of way for the user to force re-write if that's what they want.
It might be ok to _sometimes_ put up a dialog that gives the user the choice between save methods. Maybe always before actually doing a re-write save. e.g. "These edits will be slow to save and require a lot of I/O to shuffle bytes in the original file. It may be safer to save to a new file and then replace the original (like a typical text editor)." with buttons for both ways, and help. Obviously you don't want to bother the user with that every time. Unless you have code that will do the shuffling on insert edits, I don't think there are any cases where it's hard to decide what to do.
Thanks for listening to my suggestions, and happy hacking.
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).
If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
and create a new enhancement request ticket at
Thank you for your understanding and your help.