GNOME Bugzilla – Bug 738121
gzip stream to compress history data
Last modified: 2021-06-10 14:56:31 UTC
Here are a few sample implementations of random access on top of gzip:

http://sourceforge.net/projects/razip/
https://code.google.com/p/idzip/
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c

The idea is very simple: break the input into chunks of 16 kB or 64 kB and gzip each one. Keep a listing of the chunks in whatever data structure, and have an LRU cache of one or two uncompressed chunks.
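To make that concrete, here's a minimal sketch (not vte code; the chunk_index_t name and the 64 kB size are just assumptions) of compressing one chunk with zlib and remembering where it landed. An LRU cache of one or two inflated chunks would sit on top of such an index.

#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

#define CHUNK_SIZE (64 * 1024)

typedef struct {
    uint64_t compressed_offset;   /* where this chunk starts in the compressed stream */
    uint32_t compressed_length;   /* how many bytes it occupies there */
} chunk_index_t;

/* Deflate one chunk into 'out'; record its location in 'idx'.
 * Returns 0 on success, -1 on a zlib error. */
static int
compress_chunk (const unsigned char *in, size_t in_len,
                unsigned char *out, size_t out_cap,
                uint64_t out_offset, chunk_index_t *idx)
{
    uLongf out_len = out_cap;

    if (compress2 (out, &out_len, in, in_len, Z_BEST_SPEED) != Z_OK)
        return -1;

    idx->compressed_offset = out_offset;
    idx->compressed_length = (uint32_t) out_len;
    return 0;
}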
Good idea. I think it should be implemented along with encryption. (gzip below, encryption above, correct? I don't think the encrypted data is compressible.)
For both layers, it's crucial that they don't impose a significant performance regression when catting huge files.
Correct.
From bug 738601, because it rather belongs here:

> > E.g. instead of runlength encoding of attributes and UTF-8, what if we store
> > raw Unicode and attributes for each cell? How effective would gzip be on top of
> > this, rather than on the current structure?
>
> That's definitely an interesting idea. I don't think gzip is the best compression
> algorithm to deal with that, but I'm sure it would be acceptable. However, it
> might mean that we pay 8 times the gzip overhead of what we do currently for
> ASCII text and no attributes.

The primary factor in choosing a compression algorithm is probably that it must be super fast. The second is that it should compress reasonably well. After picking an algorithm, we could experiment with which data format it compresses better (e.g. UTF-8 vs. UCS-4, runlength-encoded attributes vs. repeating the attrs for each cell; see the hypothetical layouts below).

> Keep a listing of chunks in whatever datastructure

That data structure can also grow very large, so it should be part of the file stream rather than reside in memory. row_stream requires quick access to a given logical offset, so maybe we won't be able to compress that one. The other two are only accessed via pointers stored in row_stream, so if necessary, row_stream could be modified to contain the offset of the gzip block (and an offset within it), so that the other two streams can also be accessed quickly. This way we could maybe avoid keeping the list of chunks.
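For illustration, two hypothetical per-cell layouts (these are not vte's actual structures) for the formats discussed above, i.e. repeating the attributes for every cell versus runlength-encoding them next to the text:

#include <stdint.h>

/* Option A: raw storage, attributes repeated for each cell (8 bytes/cell);
 * the compressor is left to squeeze out the redundancy. */
typedef struct {
    uint32_t codepoint;   /* UCS-4 character */
    uint32_t attrs;       /* colors, bold, etc., duplicated per cell */
} raw_cell_t;

/* Option B: runlength-encoded attributes stored alongside UTF-8 text;
 * much smaller before compression, but more work to produce. */
typedef struct {
    uint32_t attrs;       /* one attribute word ...                   */
    uint32_t run_length;  /* ... covering this many consecutive cells */
} attr_run_t;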
Behdad found this in bug 738601 comment 62: http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
In bug 738601 comment 63 I said I was not going to implement compression, but I'm not so sure about that anymore :)

I was thinking about it for a while... Initially I was hoping to have compressed blocks of 4 kB (or some other fixed) size, but that's quite problematic, e.g. what if one decompresses to a giant amount of data? Now I believe that your initial proposal, "break input in chunks of 16kb or 64kb, gzip each", is the right approach.

This would all go on top of the current one-fd + encryption + buffering (as in the said ticket, rc1 at the moment). The next layer above that buffering would be the gzip one, and then there we'd need another buffering layer, which would pretty much be the same code. I don't see how we could do without two buffering layers: the topmost batching up 16 kB or 64 kB of plain data to be compressed, the lower one batching up 4 kB (or so) of compressed data to be encrypted and stored.

We'd need a method converting the logical offset (the head of the stream at the time of appending) to a pair of numbers: the offset (head) of the middle buffering layer (that is, a gzip synchronization point's compressed offset), and the amount of buffered data in the topmost layer (that is, the additional offset in the yet-uncompressed data). These two numbers would be stored in row_stream. They allow retrieval of the block: the first tells us the file offset from where we need to read a chunk (the length of the compressed data could be stored as the first few bytes of this chunk), then we decompress and advance further according to the second offset.

Truncating at the head would be tricky (as always) but looks easily doable. Advancing the tail: we can avoid decompressing at the tail. Just do nothing until a full gzip block is advanced past, and then throw it away completely (just as it's done with the encryption now); see the sketch below.

attr_stream and text_stream could be compressed this way. row_stream would remain uncompressed (we need direct random access), but a simple attribute(packed) reduces it from 24 to 17 bytes per row, and with a bit of bitpacking we can easily do 16.
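A rough sketch of the tail-advance idea (names and the fixed 64 kB-per-block assumption are mine, nothing here is actual vte code):

#include <stdint.h>

typedef struct {
    uint64_t tail_block_start;   /* logical offset where the oldest kept gzip block begins */
    uint64_t tail_block_end;     /* logical offset just past that block */
} gzip_stream_tail_t;

/* Nothing is decompressed at the tail: a block is only dropped once the
 * new tail has moved past its end. */
static void
advance_tail (gzip_stream_tail_t *t, uint64_t new_tail,
              void (*drop_block) (uint64_t block_start))
{
    while (new_tail >= t->tail_block_end) {
        drop_block (t->tail_block_start);
        /* Here we assume each block holds exactly 64 kB of uncompressed
         * data; a real implementation would look up the next block's size. */
        t->tail_block_start = t->tail_block_end;
        t->tail_block_end += 64 * 1024;
    }
}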
A bit of gzip compression ratio test:

Orig file: 58,484,115 bytes (100%)  (output of "ls -lR --color=never")

                              fastest (-1)         default (-6)
compressed as a whole          8,883,232 (15.1%)    7,011,365 (11.9%)
compressed each 64kB block     9,207,353 (15.7%)    7,466,815 (12.7%)
compressed each 16kB block    10,193,389 (17.4%)    8,717,291 (14.9%)
compressed each 4kB block     12,432,766 (21.2%)   11,354,304 (19.4%)

I'll probably vote for fastest with 16kB or 64kB. But let's have the code first.
http://www.oberhumer.com/opensource/lzo/

lzop -1 with 64kB blocks gives a compressed size of 12,507,517 bytes (21.3%), about a third bigger than gzip's corresponding size, but it runs 4x faster. Not bad!
Encryption (AES-256-GCM) brought a higher than anticipated performance regression, about 20% with my favorite test file (catting it went from 7s to 8.3s). Lzopping the whole file (in a single run) takes 0.15s. So I'm hoping for an overall performance win from having compression (there's much less to encrypt/decrypt :))
Isn't lzo's GPL license an issue?
If the row_stream's size per entry (16 / 17 / 24 bytes) becomes a problem, we can do special layers that encode that type of stream better. Let's think a bit more about this. I don't know how much of the middle layer leaks into row_stream; I'd like to minimize that.
(In reply to comment #10)
> Isn't lzo's GPL license an issue?

Hmm, sure it is. Let's look for another compression library then, or go with zlib. Vte could easily expose compression hooks via the API, so gnome-terminal could ask it to use lzo instead of vte's default zlib. And then g-t is linked against lzo, which is fine. How does that sound?

(In reply to comment #11)
> If the row_stream's size per entry (16 / 17 / 24) becomes a problem, we can do
> special layers that encode that type of stream better.

I think it'll be 21 bytes with attribute packed, or fewer if we do some bitpacking. The additions are the offsets within the unzipped block (2 bytes per offset).

> Lets think a bit more about this. I don't know how much of the middle layer
> leaks into row_stream. Like to minimize that.

Me too, but I'm afraid we can't avoid some leakage. I'd introduce a struct VteStreamOffset { gsize major; uint16 minor; }, as sketched below.

For an unzipped stream, minor would always be zero and major would be the offset. The caller would be allowed to do arithmetic on it (as needed for row_stream). For a zipped stream, major would point to the beginning of the gzip block, and minor would be an offset within it after decompression. The only thing a caller would need to do is handle these as opaque structures: store them in row_stream unmodified, and pass them unmodified to read, advance_tail, truncate etc. requests. It wouldn't be able to do any arithmetic, but it wouldn't need to either.
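A minimal sketch of that struct, assuming the semantics above (this isn't committed API, just how I picture it):

#include <glib.h>

typedef struct {
    gsize   major;   /* unzipped stream: the plain offset;
                        zipped stream: offset of the start of the gzip block */
    guint16 minor;   /* unzipped stream: always 0;
                        zipped stream: offset within the decompressed block */
} VteStreamOffset;

/* Callers treat this as opaque: store it in row_stream unmodified and pass
 * it back unmodified to read / advance_tail / truncate. */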
(In reply to comment #10)
> Isn't lzo's GPL license an issue?

That only means that vte becomes _effectively_ GPL2+, which is fine, IMO.
(In reply to comment #13)
> That only means that vte becomes _effectively_ GPL2+, which is fine, IMO.

What do you mean by "_effectively_"?

There might be non-GPL vte-based apps which would need to stick with an older vte or choose something else. Also, I believe all contributors would need to agree to such a switch (e.g. what if someone posted a vte patch to make it better for his commercial app?). I'd prefer not to go this route, if possible.
(In reply to comment #14)
> (In reply to comment #13)
>
> > That only means that vte becomes _effectively_ GPL2+, which is fine, IMO.
>
> What do you mean by "_effectively_"?

GPL is viral: it can only be used with other GPL code. That would normally make LGPL software incompatible with GPL. However, the LGPL has a provision that says anyone can relicense it to GPL. That's what happens when you use LGPL code with GPL: you are relying on that conversion provision at run time.

Using a GPL library in vte makes it always require that conversion. That's what Christian means by "effectively". I.e. non-GPL apps won't be legally able to use it anymore.

We can possibly add a libvtegpl that uses lzo. But that's too much trouble. Maybe we can ask the lzo maintainer to consider relicensing. Or just find something else (if not gzip).

> There might be non-GPL vte-based apps which would need to stick with an older
> vte or choose something else. Also I believe all contributors would need to
> agree to such a switch (e.g. what if someone posted a vte patch to make it
> better for his commercial app?). I'd prefer not to go this route, if possible.
I started implementing this. It's not hopeless, but it's more complicated than I thought; it will be more complex than the snake or encryption stuff.

The ring is full of places where we do arithmetic on offsets (even in text_stream), and these will need to be reworked. E.g. now we know in advance how much data we'll read for a row just by subtracting two offsets; from now on it'll be more complicated. One possible approach is to store in row_stream both the logical uncompressed offset and the (compressed major, uncompressed minor) offset described above; however, that would increase its size to 40-ish bytes, making it probably larger than the compressed text stream. So I don't want to do this, I'd only want to store the (major, minor) offset. Let's drop this idea.

Pointer arithmetic (advance/retreat by a value, or get the diff between two) could be moved down to the particular stream implementation. The only problem is that with compression, when the minor value overflows 64k (or underflows 0) we need a disk read to know how much to add to / subtract from major.

My first idea for the disk storage format is 2 bytes encoding the length of the compressed data, followed by the compressed data. (Special case: if it couldn't be compressed, store 0 as the length followed by the 64k-long uncompressed block.) Finally the length could be repeated; we might need this to allow walking backwards in the file, not sure yet. A sketch of this format follows below.

With this format, I'm afraid rewrap would suffer a serious performance degradation. Currently the most typical case (a paragraph which is ASCII 32-126 only) is rewrapped without reading text_stream, because we know how it wraps and we can do offset arithmetic without problems. With compression, we'd need to read a block at every compression boundary. Assuming that 64k blocks compress to ~20k, this would mean reading and decrypting every ~5th 4k block from the disk, just to get 2 bytes: the compressed length of the block.

We could maybe group some of these 2-byte values, e.g. store the lengths of 100 consecutive compressed blocks in 200 consecutive bytes. That makes the implementation harder (especially with truncate), and (in the case of limited scrollback) makes the file larger, because we'd need to defer advance_tail until all these blocks are thrown away. Or keep the offsets for the blocks near the tail in memory; yet some more complication. Plus, the (major, minor) offset scheme would change to 3 numbers: the offset of this "superblock", the sequence number of the compressed block within it, and the offset after decompressing that block.

I'll keep thinking about a solution which is efficient yet simple enough.
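To make the proposed on-disk format concrete, here is a hedged sketch (invented helper name, nothing committed) of encoding one block: a 2-byte little-endian length, the payload, and the length repeated at the end so we could also walk backwards:

#include <stdint.h>
#include <string.h>

#define BLOCK_RAW_SIZE (64 * 1024)

/* compressed_len == 0 means the 64k block is stored uncompressed.
 * 'out' must have room for the payload plus 4 bytes of framing.
 * Returns the number of bytes written. */
static size_t
encode_block (uint16_t compressed_len, const unsigned char *payload,
              unsigned char *out)
{
    size_t payload_len = compressed_len ? compressed_len : BLOCK_RAW_SIZE;

    out[0] = (unsigned char) (compressed_len & 0xff);   /* leading length */
    out[1] = (unsigned char) (compressed_len >> 8);
    memcpy (out + 2, payload, payload_len);
    out[2 + payload_len] = out[0];                      /* trailing copy,         */
    out[3 + payload_len] = out[1];                      /* for walking backwards  */
    return payload_len + 4;
}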
I like the snake BTW. We should totally use that name! Though, personally I grew up calling it Nibbles.
:) And how about "boa" for encryption in the rc1 patches over there? I might have gone a bit too far, but once you get used to it it's easier to read than some VteFileStreamEncryptingLayer. And it's private to that file anyway. For compression, we should look up which kind of snake is the strongest in squeezing/compressing its food :)
Love the "boa" too!
Recovering from a full disk will be another problematic story. The bad news: we might have to give it up, or it'll complicate things even further. The good news: by designing it right there's a chance to finally get this right, including recovering from this situation :)

We wouldn't be able to walk forward (or backward) in a compressed stream without external help, since a read error in the middle makes us lose sync and there's no way to synchronize back. So instead of doing arithmetic on the offsets of such streams, we'd need to rely more heavily on the offsets as stored in the uncompressed row_stream. Let's design the whole thing with this in mind.
A totally different approach occurred to me just now, which might be much easier to implement. It could be totally transparent to the caller, without any of that offset-mapping hassle.

What if we keep every 64k boundary as a synchronization point, compress each 64k block individually, and leave gaps (sparse blocks) on the file system between them? Roughly: assume for example that a 64kB block compresses to 18000 bytes. Store this in 4096 * 5 = 20480 bytes, and the next 65536 - 20480 = 4096 * 11 = 45056 bytes will be sparse blocks.

The exact math would be: assuming that we operate with 64kB units, we have 65536 / 4096 * 4080 = 65280 bytes of payload that we can encrypt (due to the encryption's integrity data), which in turn allows us to try to compress blocks of 65280 - 2 = 65278 bytes (because we also need to store the length, and to be able to denote if the block was uncompressible). So the compression layer would apply an offset mapping with the fixed ratio 65280 / 65278, exactly as the encryption layer already applies the ratio 4096 / 4080 (see the sketch below).

Advantages:
- very easy to implement
- the implementation only touches vtestream-file.h
- row_stream can also be compressed

Disadvantage:
- an additional ~2k wasted per 64k block on average (effectively a ~3 percentage point loss in compression ratio)

Being able to compress row_stream seems to outweigh this 3pp regression by far. We can further increase the compressed unit size to 128 or 256 kB if we wish.
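The fixed-ratio mapping boils down to something like this (a sketch under the 65278/65280 numbers above, analogous to the existing 4080/4096 mapping in the encryption layer; the function name is made up):

#include <stdint.h>

#define PLAIN_PER_BLOCK 65278u   /* bytes of plain data we try to compress per unit */
#define LOWER_PER_BLOCK 65280u   /* what that unit occupies in the layer below */

/* Map an offset in the plain stream to an offset in the layer below;
 * the remainder identifies the position within the block's uncompressed
 * payload, so reading it means inflating that whole block. */
static uint64_t
map_offset_down (uint64_t plain_offset)
{
    uint64_t block = plain_offset / PLAIN_PER_BLOCK;
    uint64_t rest  = plain_offset % PLAIN_PER_BLOCK;

    return block * LOWER_PER_BLOCK + rest;
}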
Note for finite scrollback: we should implement the punch-hole stuff, otherwise the 64kB units in the first part of the snake (where we don't truncate back) would eventually get filled up more and more as they keep being overwritten with compressed data of varying size. According to fallocate(2), Linux 3.7 supports punching holes on xfs, ext4, btrfs and tmpfs. This looks good enough. Users of other file systems or other OSes are a bit out of luck (but hey, it'll still be way better for them than it used to be).
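The punching itself would be just a fallocate(2) call; a minimal example (the helper name and the offsets are illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Release the tail of a 64 kB unit that the freshly written (compressed)
 * data no longer covers. */
static int
punch_unused_tail (int fd, off_t unit_start, off_t bytes_written)
{
    off_t hole_len = 64 * 1024 - bytes_written;

    if (hole_len <= 0)
        return 0;
    if (fallocate (fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                   unit_start + bytes_written, hole_len) == -1) {
        perror ("fallocate");   /* e.g. EOPNOTSUPP on an unsupported FS */
        return -1;
    }
    return 0;
}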
Created attachment 292220 [details] [review]
Compression v1

Here's a damn simple proof of concept; it seems to work fine. It's just a new layer between Boa and VteFileStream, and hardly any other change. No need for a second buffering layer, no need to expose two different implementations to the outside world. As expected, it brings a performance improvement (since there's much less to encrypt/decrypt :))

"du" doesn't work on /proc/12345/fd/*, so a temporary change is needed in vteutils.c not to unlink the file and not to hit the O_TMPFILE branch. My usual test case (colored ls -lR), with unlimited scrollback:

before:
37556 vte9OCSQX
10116 vteW8PSQX
17692 vteYWPSQX

after:
1944 vte2HZCQX
6884 vte370GQX
3208 vteQ8DCQX

(I don't know which file corresponds to which, but it doesn't matter at all. It's 1/5th of the original disk usage.)

TODO:
- Punch holes in the file.
- The lowest snake layer, even though it receives 4k blocks, should probably align its teleports at 64k. With that, even systems without punch support would likely keep many blocks unused. Now that we align at 4k, a 4k block that's skipped might be the initial 4k block of a 64k unit next time, so after just a few turnarounds of the snake all the 4k blocks will contain actual data rather than holes.
- Tons of code cleanup, unittests etc.
(Note to self: du -D)
It's still too complicated :) Let's forget about using two different block sizes and work with a large one (~64k) only. And let's move compression and encryption into the same layer: Boa would just take a block of almost 64k in size, compress it, encrypt it, and store whatever the result is. (How come we didn't start with this approach right away? :))

Modify the snake so that it doesn't expect a full 64k block to write, but allows a block of any shorter size. The OS will figure out the sparse block stuff without any further help from us. See the sketch below for the rough write path.

A slight disadvantage is that the compressed block's length would be stored unencrypted at the beginning of every 64k block, leaking a bit of metainfo, but I don't care about that too much. We can round it a bit (the value rounded to the FS block size can anyway be read from the directory entry if someone gains raw access to the disk). Or we can actually encrypt it (in a separate run from the rest) if we really want to.
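Roughly what I have in mind for the write path (a sketch only; boa_encrypt() and snake_write_block() stand in for the existing code and are not real function names):

#include <string.h>
#include <zlib.h>

#define BOA_BLOCK 65536

extern void boa_encrypt (unsigned char *buf, size_t len);              /* placeholder */
extern void snake_write_block (const unsigned char *buf, size_t len);  /* placeholder */

static void
boa_write_block (const unsigned char *plain, size_t plain_len)
{
    unsigned char out[2 + BOA_BLOCK + 1024];   /* room for incompressible data */
    uLongf clen = sizeof (out) - 2;

    if (compress2 (out + 2, &clen, plain, plain_len, Z_BEST_SPEED) != Z_OK
        || clen >= plain_len) {
        /* Store uncompressed, marked with length 0. */
        memcpy (out + 2, plain, plain_len);
        clen = plain_len;
        out[0] = out[1] = 0;
    } else {
        out[0] = (unsigned char) (clen & 0xff);   /* length prefix stays unencrypted */
        out[1] = (unsigned char) (clen >> 8);
    }
    boa_encrypt (out + 2, clen);
    snake_write_block (out, 2 + clen);   /* any shorter-than-64k block is fine now */
}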
Hahahah. Sounds promising. Thanks for working on this!
Created attachment 292234 [details] [review] Compression v2 v2, as promised
Catting my usual test file (667000 lines), with 10000 scrollback lines in vte:

Before this patch (but with snake+encryption):

/proc/28735/fd$ du -D 16 17 18
1220 16
96 17
376 18

After this patch:

/proc/1253/fd$ du -D 16 17 18
96 16
40 17
12 18
Smaller *and* faster? I'll take that! What's the speedup like?
It's quite hard to measure because there's a giant variance (maybe a headless mode could help produce better numbers), and sometimes my laptop enters a state where it becomes slower by like 20% and I don't know how to get it out of that mode :) I can hear the fans spin up when vte or gcc does some heavy work; maybe it also slows down the CPU after a while. I really don't know.

But basically, when it's in a "good mood", the timing with my favorite test case is approximately:

current vte git (with 6 fds):   7.7s
snake:                          7.0s
snake+encryption:               8.3s
snake+encryption+compression:   8.1s

So it's not a big difference, but it's sure good to see. With lzo instead of zlib I'd expect somewhere around 7.6s. [Right now compression leaves only 1/5th the amount of data to encrypt (1.3s becomes 0.3s or so), so the 8.1s probably breaks down to about 7.0 base + 0.3 encryption + 0.8 compression, which would probably become around 0.4 encryption + 0.2 compression with lzo.]

The new code is now full of g_asserts; not sure if enclosing them in #ifdef VTE_DEBUG would make it noticeably faster, probably not.
Newer patch in bug 738601. (It goes in between two other patches maintained there, so it's simpler to keep it at one place.)
A friend of mine pointed this out to me: https://code.google.com/p/snappy/
I looked at this, among many other compression libs. My concerns:

- I'm not sure about the licence. A bit of searching suggests that the old BSD licence is incompatible with the LGPL, but the new one is probably compatible. We'd need confirmation.

- Snappy seems to be near the end of the scale where it's really fast but doesn't compress that well. This is probably not what we want: a better but slightly slower compression not only makes vte use less disk, but probably also makes it faster overall due to less encryption work.

- I'm reluctant to bring in an unusual dependency for vte, and for the whole of Gnome as well. Snappy, along with a few similar ones, is hardly used by any apps. It has a package in Ubuntu and Fedora, but probably not in some smaller distros, and it might also be inconvenient for those compiling vte manually.

With these three criteria, I didn't find any library that could replace zlib. So far lzo seems to be the best candidate to me, if we can clear up the GPL<->LGPL issue.
Let's continue with gzip then. All BSD code around these days uses the new BSD licence, which is compatible with the LGPL.
My timing numbers were inaccurate (or the code has changed since then, I don't know); it seems that punching holes all the time makes it noticeably slower, e.g. the overall 8.1s becomes around 9.1s in the best case, and often 10s-ish. Further increasing our block size doesn't seem to help, so it's probably not the context switch and entering the kernel that's slow, but the actual work done inside. (For the record: I use ext4.)

Maybe we could do something like punching holes only at every Nth turnaround of the snake. That way, let's say, for 9 turnarounds the 64k blocks would all slowly grow (as sometimes bigger data is written, sometimes smaller data is written but not truncated back), and at the 10th turnaround each 64k block would be punched first, so that if there was an extremely large block it doesn't keep using the disk for vte's whole lifetime. I'll think about it. It's not super critical, but I'd hate to see finite scrollback performing noticeably worse than infinite.

Just for fun: after a "cat /dev/urandom", text_stream's 64k blocks compress to about 27k. This is because the stream is not random at all: it's the UTF-8-decoded-and-reencoded (plus escape sequences dropped) version of urandom, containing tons of \xEF\xBF\xBD (replacement character). Blocks of "base64 /dev/urandom" compress to about 50k.
That would work. In the meantime, let's keep thinking!

Or, can we commit the two other layers first? Or maybe commit this and keep improving later?
Can you also add lzo as a compile-time option?

Perhaps disable punching at all if it's too slow. That might not save disk space, but it wouldn't make it worse either; meanwhile compression still reduces I/O a lot.

Maybe report the hole punch performance issue to glibc or the kernel?
> Or maybe commit this and keep improving later?

I'll probably do this...

> Perhaps disable punching at all if it's too slow.

...or this :)

> Or, can we commit the two other layers first?

I wouldn't want to switch back the order of gzip and encryption. Let's get the encryption reviewed in its final design, with some kind of compression already there underneath.

-----

I don't want to support two compression libraries (but if you implement it on top of my patches, I don't mind). LZO's source says "Special licenses for commercial and other applications which are not willing to accept the [GPL] are available by contacting the author." I'll write a PoC, see if it's noticeably faster than zlib, and if it is then I'll contact the author. Bringing in a new dependency for the whole Gnome project is still an issue; I'm not sure if it needs some kind of approval or the like. Or we could go for minilzo, which is something we'd ship in vte's source tree and link statically.
Just wondering... when overwriting the middle of a file which contains actual data (let's assume punch_hole didn't work), is it any cheaper to write a full FS block than a partial one? E.g. when overwriting the first 4000 bytes, does someone need to perform a read first to get the remaining 96 bytes and then write 4096? If so, is it the kernel or the HDD/SSD hardware, and is it the filesystem block size (typically 4k) or the sector(?) size (512 bytes???) that matters? If such an extra read is required behind the scenes and slows things down, I can pad the boa block with zeros accordingly before giving it to snake.
Answering my previous question: I couldn't measure any difference (using O_DSYNC).

-----

I've put in code to only punch the hole every 16th time, and it hardly brings any improvement, pretty much as if I always punched it. If I modify it to every 256th time then it's performing great, as expected. I don't understand.
Interesting. Definitely needs further investigation...
Patch from bug 738601 comment 102 committed. Leaving open to perhaps investigate the punch_hole performance issue.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2137.