GNOME Bugzilla – Bug 738601
Implement vtestream with one fd
Last modified: 2015-01-18 16:30:57 UTC
There are a couple of different approaches we can take. One interesting approach is to use mmap and mremap as suggested in these articles:

https://www.mikeash.com/pyblog/friday-qa-2012-02-03-ring-buffers-and-mirrored-memory-part-i.html
https://www.mikeash.com/pyblog/friday-qa-2012-02-17-ring-buffers-and-mirrored-memory-part-ii.html

That comes at the expense of using address space and wouldn't work with our plans for gzip and encryption. So, let's not talk about that one.

Another approach I can think of is, if we are a bit more careful, we can get away with just one descriptor. This is how:

Currently, vtestream-file uses two FDs as magazines. It fills one with one full scrollback worth of history and puts it aside, then starts filling the second one. When the second one is full, we know we don't need the first one anymore, so we truncate and reuse that one, and repeat.

The nice thing about this is that it's very simple. But it has its own costs, namely two FDs per stream. That's six per terminal. It's true that for unlimited scrollback we just need one FD per stream, but still, it would be nice to make it one per stream for the limited-scrollback case as well.

Here's one way I can think of. It just needs more housekeeping. Say, we start filling in the fresh file:

1: aaaaaaaa

At this point we notice that we have one scrollback full of history. So we mark and continue; from here on is our second page:

2: aaaaaaaa|bbbbbbbbbbbbb

When we've got enough 'b's, we don't need 'a's anymore, so we start from the beginning again, overwriting 'a's:

3: cccccaaa|bbbbbbbbbbbbb|

This is all good. If we got enough 'c's while there are still 'a's to overwrite, we truncate the file and continue writing 'd's. This is similar to case 2. Otherwise, we run out of 'a's and have not got enough 'c's yet. In this case, we just make a note and jump over the 'b's:

4: cccccccc|bbbbbbbbbbbbb|ccc

We just need to remember the order of things in that map. At some point we've got enough 'c's, and we don't need 'b's anymore. In this case, we continue writing and leave the 'b' area unused:

5: cccccccc|bbbbbbbbbbbbb|ccccc|dddddddddddddddddddddddddddd

Whenever we've got enough 'd's, we don't need any of the 'b's or 'c's, so we think of them both as junk, rewind, and start writing 'e's at the beginning, which is the same situation as case 3.

So, we have to handle up to five different cases, but I believe it would work. At any time there will be three contiguous ranges to care about.

WDYT?
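(Illustrative sketch only, not VTE code: one way the bookkeeping for the above could look. At most three contiguous ranges map logical stream offsets to file offsets; the segment/st/fd names here are hypothetical and GLib types are assumed.)

typedef struct {
        gsize st_tail, st_head;   /* logical stream offsets covered by this range */
        gsize fd_tail, fd_head;   /* where this range lives inside the file */
} Segment;

/* Resolve a logical offset to a file offset by scanning the live ranges
 * in order; returns FALSE if the offset has scrolled out or isn't written yet. */
static gboolean
segment_lookup (const Segment seg[3], gsize offset, gsize *fd_offset)
{
        int i;
        for (i = 0; i < 3; i++) {
                if (offset >= seg[i].st_tail && offset < seg[i].st_head) {
                        *fd_offset = seg[i].fd_tail + (offset - seg[i].st_tail);
                        return TRUE;
                }
        }
        return FALSE;
}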
(In reply to comment #0)
> mmap and mremap
> wouldn't work with our plans for gzip and encryption.

Are you sure they wouldn't work?

> Another approach I can think of is, if we are a bit more careful, we can get
> away with just one descriptor. This is how:

SGTM. In the meantime, I'd like to put a bit of thinking into whether we really need 3 streams, or whether we could cut it down to 1 or 2 with some clever data structure.

E.g. instead of run-length encoding of attributes and UTF-8, what if we store raw Unicode and attributes for each cell? How effective would gzip be on top of this, rather than on the current structure?

Also, for small (like 10000 lines) scrollbacks I'd consider the possibility of using a malloc'ed segment instead of an actual file. If we store the same format in memory, all we'd need to do is every once in a while increase or maybe decrease the amount of allocated memory. Maybe this could be done in a way that doesn't lead to fragmentation, I'm not sure yet.
(In reply to comment #0)
> 5: cccccccc|bbbbbbbbbbbbb|ccccc|dddddddddddddddddddddddddddd

Maybe an easier implementation of the same concept is to devote every even 4kB block of the file to one scrollback worth of data, and every odd 4kB block to the other one. That is, make it fixed: aaaabbbbaaaabbbbaaaabbbb...
Or... could we just simply multiplex all the 6 current files into one, using this pattern (aaaabbbbccccddddeeeeffffaaaa...)? Even if the current 6 files are of vastly different length, the resulting sparse file won't take up more disk space than currently. ... except if one of the multiplexed files (which is not the largest) shrinks, so far we could shrink its file, but can we now wipe out the middle of the multiplexed file, as if it was never allocated? Linux seems to be able to do this (fallocate(2) FALLOC_FL_PUNCH_HOLE), how about others?
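(A quick sketch of the offset mapping this multiplexing implies; 4096 and 6 are just the numbers mentioned above, not actual VTE constants.)

#define MUX_BLOCK   4096
#define MUX_STREAMS 6

/* Map stream s's logical offset into the shared file, interleaving
 * MUX_BLOCK-sized blocks: aaaabbbbccccddddeeeeffffaaaa... */
static gsize
mux_offset (unsigned int s, gsize logical_offset)
{
        gsize block  = logical_offset / MUX_BLOCK;
        gsize within = logical_offset % MUX_BLOCK;

        return (block * MUX_STREAMS + s) * MUX_BLOCK + within;
}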
(In reply to comment #1)
> (In reply to comment #0)
> > mmap and mremap
> > wouldn't work with our plans for gzip and encryption.
>
> Are you sure they wouldn't work?

I mean, we sure could use them. The way I see it, the benefits are much smaller since we read/write compressed+encrypted data in chunks, so we might as well call read/write. Don't know. One problem with mmap is that under memory pressure it will segfault the process, whereas with read/write we gracefully handle errors.

> In the mean time, I'd like to put a bit of thinking into whether we really need
> 3 streams, or we could cut it down to 1 or 2 with some clever data structure.
>
> E.g. instead of runlength encoding of attributes and UTF-8, what if we store
> raw Unicode and attributes for each cell? How effective would gzip be on top of
> this, rather than on the current structure?

That's definitely an interesting idea. I don't think gzip is the best compression algorithm to deal with that, but I'm sure it would be acceptable. However, it might mean that we pay 8 times the gzip overhead of what we currently do for ASCII text with no attributes.

> Also, for small (like 10000 lines) scrollbacks I'd consider the possibility to
> use a malloc'ed segment instead of an actual file. If we store the same format
> in the memory, all we'd need to do is every once in a while increase or maybe
> decrease the amount of allocated memory. Maybe this could be done in a way
> that doesn't lead to fragmentation, I'm not sure yet.

My idea for that case is to set the buffer sizes such that a small-enough scrollback never fills a buffer, so it never goes through the compression+encryption to begin with. Would be nice if we can avoid a separate codepath for that.

Re avoiding 3 streams... The nice thing about the way this is implemented now is that it handles all kinds of corner cases, like veeeeeeeeeeeeeeeeeeeeeeeeeeeery long lines, or maaaaaaaaaaaaaaaaaaaaany short lines, equally efficiently. This is achieved mainly by having:

- Random access mapping from delta ("line", better known as paragraph, number) to start of text and attrs (costing one FD), and
- Random access to text (costing one FD).

A nice side-effect of keeping UTF-8 is that search and copy / save-buffer can be made fast. But 1. we currently don't use this property, and 2. those are rare operations compared to normal operation of the scrollback. So we should be ready to remove that feature if it makes sense.

Separating cell attrs was just a size optimization. Separating the other two was more about performance characteristics.

For example, one way to reduce one FD is to merge the attrs and utf8 streams. UTF-8 encoding has unused byte patterns (a 0xFF byte, for example, never occurs in UTF-8; there are a few of those). So we can use one of those as an escape sequence for attribute changes.
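(Just to illustrate the escape-byte idea, under the assumption that we pick 0xFF; the record layout and helper below are made up for the example, not a proposed format.)

#define VTE_ATTR_ESCAPE 0xFF   /* 0xFF (and 0xFE) never occur in valid UTF-8 */

/* Append an attribute-change record inline into the text stream. */
static void
append_attr_change (GString *text, gconstpointer attr, gsize attr_len)
{
        g_string_append_c (text, (gchar) VTE_ATTR_ESCAPE);
        g_string_append_len (text, (const gchar *) attr, (gssize) attr_len);
}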
(In reply to comment #3)
> Or... could we just simply multiplex all the 6 current files into one, using
> this pattern (aaaabbbbccccddddeeeeffffaaaa...)? Even if the current 6 files are
> of vastly different length, the resulting sparse file won't take up more disk
> space than currently.

Humm. Not sure I follow. How do you make that conclusion?

> ... except if one of the multiplexed files (which is not the largest) shrinks,
> so far we could shrink its file, but can we now wipe out the middle of the
> multiplexed file, as if it was never allocated? Linux seems to be able to do
> this (fallocate(2) FALLOC_FL_PUNCH_HOLE), how about others?
(In reply to comment #5)
> Humm. Not sure I follow. How do you make that conclusion?

$ dd if=/dev/urandom of=sparsefile bs=1024 seek=1048576 count=1
$ ls -l sparsefile
-rw-rw-r-- 1 egmont egmont 1073742848 Oct 17 10:13 sparsefile
$ du -h sparsefile
4.0K    sparsefile

Blocks that are not written but skipped over are not allocated on the disk (they might take up some space in the inode and in the indirect blocks though, I'm not sure about this). All you need to guarantee is that the multiplexing size unit is a multiple of the file system block size (the latter typically being 4k; it can be queried with statfs).
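(For reference, a minimal sketch of querying that filesystem block size; statfs() is Linux-specific, statvfs() would be the portable variant.)

#include <sys/vfs.h>

static long
fs_block_size (const char *path)
{
        struct statfs sfs;

        if (statfs (path, &sfs) != 0)
                return 4096;   /* reasonable fallback */
        return (long) sfs.f_bsize;
}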
(In reply to comment #4)
> For example, one way to reduce one
> FD is to merge the attrs and utf8 streams. utf8 encoded has unused patterns (a
> 0xFF byte for example is never used in UTF-8; there are a few of those). So we
> can use that as a escape sequence for attribute changes.

I like this idea. One caveat though is that the current attribute has to be repeated at the beginning of every line (maybe optimized to only do this if it's a non-default attribute), which might have a noticeable penalty in the rare case that the buffer is full of very short (or empty) lines. But I'm happy to live with that.
Er... wait a bit. We need to know the starting attribute at the beginning of every *line* (not paragraph), but the text stream is left intact on a rewrap. So this data can't reside there, it should be moved to row_stream (along with pointers to the start of each row, it should also contain the attributes that row starts with).
(In reply to comment #4)
> 0xFF byte for example is never used in UTF-8; there are a few of those). So we
> can use that as a escape sequence for attribute changes.

<just_for_fun>
Can the terminal area ever contain the escape character itself? I don't think so. So, we could actually use the standard ANSI codes in the UTF-8 stream. :) Then if anyone ever digs into those deleted files in /tmp and strips off the to-be-implemented gzip/encryption/multiplexing layers, they would be able to cat that data into terminals, with the proper attributes. It's of zero practical use, but sounds cool.

It's harder to implement (e.g. we probably don't want to use vteseq's parser when reading from the scrollback, plus we need to implement the opposite direction too and make sure they are absolutely consistent). It might waste more space with complex color changes, but might save space with simpler ones (e.g. just the bold flag toggled). A binary dump of the cell attribute structure is definitely an easier way to go.
</just_for_fun>
(In reply to comment #0)
> at this point we notice that we have one scrollback full of history. So we
> mark and continue, from here on is our second page:
>
> 2: aaaaaaaa|bbbbbbbbbbbbb
> 3: cccccaaa|bbbbbbbbbbbbb|
> 4: cccccccc|bbbbbbbbbbbbb|ccc
> 5: cccccccc|bbbbbbbbbbbbb|ccccc|dddddddddddddddddddddddddddd

Let's not forget, as per bug 336238 comment 74 (and 102), we no longer have the concept of "one scrollback full of history". Instead, we have a head pointer (~write pointer in the linked articles) and a tail (~read) pointer. I guess your approach still works, but one's brain would explode a couple of times before implementing it correctly :D

Comments 2/3 here sound easier to implement to me, because they would separate the "paging" (head/tail pointer) logic and the actual mapping to FDs into two separate layers underneath each other. It feels to me that your approach wouldn't allow separation of these two.
Trying to continue your approach, here's how we could implement an arbitrarily growing circular buffer in a single file. H and T stand for head and tail (inclusive), * for other special positions.

We start simply writing data sequentially.

1. abcd
   T  H

The tail also starts advancing, but we continue as if nothing happened.

2. abcdefg
     T   H

At some point, based on heuristics (e.g. when H <= 2*T) we wrap to the beginning.

3. klcdefghij
    H  T    *

If tail wraps around before head reaches it, there's no problem at all, e.g.:

4. klmnopqhij
   T     H

But if head reaches the tail, we continue by appending to the file (continuing the 3rd bullet point).

5. klmnopqhijrst
            *T *  H

When tail wraps around but there's still a hole, we keep advancing with the head.

6. klmnopqhijrstuv
     T    * *    H

At one point, tail teleports to the last block.

7. klmnopqhijrstuvwxy
             T     H

Goto 3.

It's not that hopeless after all. WDYT? :)
(In reply to comment #3)
> Or... could we just simply multiplex all the 6 current files into one, using
> [...]
> (fallocate(2) FALLOC_FL_PUNCH_HOLE)

The more I think about it, the less I like this approach because of its portability limitations. It's only supported on Linux, and only for a few file systems.

However, it would be nice to add some punch-hole calls to our current (or next) implementation of the stream for blocks that are no longer used (i.e. currently when the tail is advanced). If that call fails, nothing happens. If the call succeeds, we free up precious disk space.

If – for the sake of simplicity – we assume that scrollback data is filled up totally evenly, the overall file sizes for a stream currently fluctuate between N and 2*N; with successfully punching holes it would remain N (plus epsilon) all the time.
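(A hedged sketch of what such an opportunistic punch-hole call could look like; FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE and is Linux- and filesystem-specific, hence the silent fallback.)

#define _GNU_SOURCE
#include <fcntl.h>

/* Try to give back to the OS a block that's behind the stream's tail.
 * If the filesystem doesn't support it, we simply keep wasting the space,
 * exactly as before. */
static void
punch_unused_block (int fd, off_t offset, off_t length)
{
        (void) fallocate (fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, length);
}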
(In reply to comment #4)
> For example, one way to reduce one
> FD is to merge the attrs and utf8 streams. utf8 encoded has unused patterns (a
> 0xFF byte for example is never used in UTF-8; there are a few of those). So we
> can use that as a escape sequence for attribute changes.

Currently the rewrap code has a significant optimization for paragraphs that contain plain ASCII (32-126) only (this bit is stored in row_stream): it can rewrap those paragraphs without looking at text_stream. Inlining attribute changes would degrade performance: this optimization would also require that there's no attribute change within the given paragraph; otherwise the slower branch would be executed.

On the other hand, changing the streams to contain UCS-4 (with gzip on top) would allow extending this optimization to paragraphs containing any single-width characters.
Created attachment 289081 [details] [review]
Implement stream in one fd, v1

Here's a first implementation of the idea from comment 11. It's moderately tested, and hopefully doesn't corrupt the buffer. The code cries for unit tests but they aren't there yet.

When cating a large file, it introduces a 1.5x slowdown. Apparently glibc's stdio buffering does a bad job when overwriting an existing area of a file. Needs further investigation; maybe we'll need to do caching ourselves. Aligning the boundary between the segments might also help a bit.

Crashes when switching to the alternate screen (bug 738964).
Note to self: In advance_tail's 2->1 state transition I forgot to truncate the file. About PUNCH_HOLE: I'll add it only when everything else is fine and the patch is committed. It's an optional additional improvement (conditionally to OS and filesystem support).
Thanks Egmont. I think we should remove stdio and have our own chunked stream, which also does the compression+encryption. It's so much easier to do those three at the same time.
Review of attachment 289081 [details] [review]:

Very interesting implementation. Might be possible to simplify it a bit, but in general, I think it's sound, even if not as compact as the implementation it replaces.

::: src/vtestream-file.h
@@ +475,3 @@
+
+  /* TODO add comment here about the heuristics */
+  if (G_UNLIKELY (stream->state == 1 && 2 * stream->segment[0].fd_tail > stream->segment[0].fd_head)) {

How about something like this instead:

  segment[0].fd_tail > segment[0].fd_head - segment[0].fd_tail + (2*BLOCK_SIZE/3)

where BLOCK_SIZE is 512.

Another option would be to move this transition to the end of advance_tail, such that the common case for unlimited scrollback doesn't go through this check at all.

@@ +498,1 @@
+  last_segment = VTE_FILE_STREAM_SEGMENTS(stream) - 1;

Might be cleaner to remove all the redundant members you have, and have a stream->current_segment that is used for appends directly. Also, perhaps have a safe_len or something which is the maximum data we can directly write at stream->current_segment without any state changes. That should simplify this function a lot. WDYT?

@@ +511,3 @@
+  int i;
+
+  // FIXME if offset+len >= stream->segment[0].tail: should memset(0) and then partially read

Good point.
(In reply to comment #17)
> Very interesting implementation. Might be possible to simplify it a bit, but
> in general, I think it's sound, even if not as compact as the implementation it
> replaces.

Well, my brain did explode while writing/debugging it, as I said it would :)

> How about something like this instead:
>
> segment[0].fd_tail > segment[0].fd_head - segment[0].fd_tail + (2*BLOCK_SIZE/3)

I was definitely thinking about something like this. Not sure why the 2/3, I'd be up for something like 8k.

> Another option would be to move this transition to the end of advance_tail,
> such that the common case for unlimited-scrollback doesn't go through this
> check at all.

Sounds like a micro-optimization. You can only mark for transition there, and then you still need the check when appending data; or alternatively, you need to safely handle a segment of length 0, which is error-prone in the long run. (Should be backed up with unit tests and then it's good, but why complicate it any further.)

> @@ +498,1 @@
> + last_segment = VTE_FILE_STREAM_SEGMENTS(stream) - 1;
>
> Might be cleaner to remove all the redundant members you have

I was thinking about it, e.g. I used to have stream_tail, fd_tail and size, but it turned out size was the least used. The code became more readable by storing stream_head and fd_head, even if one of them is redundant.

> and have a
> stream->current_segment that is used for appends directly.

Did you just vote for removing redundancy? This would be another redundant variable :D

It's always a trade-off: either we need to maintain some variables consistently, or a read becomes more complicated and slower. It's still such a micro-optimization with probably no measurable impact that code readability should probably be the most important factor. Not sure though what's best for that :) I'll probably change the code in ways I find nicer, I don't know yet what that will mean.

> Also, perhaps have
> a safe_len or something which is the maximum data we can directly write at
> stream->current_segment without any state changes. That should simplify this
> function a lot. WDYT?

You still have to compute it, either here or somewhere else. Plus, it'd be another derived redundant value stored. Computing it somewhere and using it at just one place somewhere else doesn't sound like a simple design to me; I'd rather compute it right where I need it.
I'm sure you spent a lot of fun time considering alternatives :). Happy to go with whatever design you end up with. Thanks!
Seems to me that glibc's stdio caching does a poor job when mixing freads and fwrites to the same file (even if to different parts thereof). Apparently it keeps throwing away its cache; moreover, it needs to lseek back and forth all the time (as it uses read/write instead of pread/pwrite).

This can be experienced with my previous patch with a limited scrollback, and the phenomenon only occurs in row_stream (whose tail needs to be read to know how much to advance the tail of the other two streams by). VTE with my patch is just as fast as previously if you have infinite scrollback; the slowdown only occurs with a finite one. Coincidentally, so far the writing position always "escaped" to the other file as soon as the reading position reached that file, that is, we never hit this bottleneck. With the one-fd approach we obviously need to read and write the same file.

I can see two possible solutions:

- Trick stdio into believing that these are different files, by calling fdopen() twice on a given fd. (Not sure if it works, or glibc is too clever to find out that we're cheating. Might be very tricky because at certain places we have to manually invalidate the read cache.)

- Ditch stdio, do our own caching on top of pread/pwrite. What I'm thinking is that if caching is done above the mapping from logical stream offset to physical file offset, rather than below as it is currently, then it's probably easier and we don't have to bother with invalidating the cache, as reads/writes to the same file offset influence each other.
> - Trick stdio into believing

man fdopen: "The file position indicator of the new stream is set to that belonging to fd". So this clearly wouldn't work.

Let's go for our own very simple caching on top of pread/pwrite.

Another optimization we could do is not to read the tail of row_stream and advance the tail of the other two streams every time, but do it only every Nth occasion.
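(A minimal sketch of the kind of read cache meant here, assuming 4 kB aligned blocks; GLib types and MIN() assumed. This is just the shape of the idea on top of pread(), not the eventual VTE code.)

#include <unistd.h>
#include <string.h>

#define CACHE_BLOCK 4096

typedef struct {
        int   fd;
        gsize cached_offset;        /* aligned; (gsize) -1 means empty */
        char  cache[CACHE_BLOCK];
} ReadCache;

static gboolean
cache_read (ReadCache *rc, gsize offset, gsize len, char *out)
{
        while (len > 0) {
                gsize aligned = offset - offset % CACHE_BLOCK;
                gsize skip    = offset - aligned;
                gsize n       = MIN (len, CACHE_BLOCK - skip);

                if (rc->cached_offset != aligned) {
                        if (pread (rc->fd, rc->cache, CACHE_BLOCK, (off_t) aligned) < 0)
                                return FALSE;
                        rc->cached_offset = aligned;
                }
                memcpy (out, rc->cache + skip, n);
                out += n; offset += n; len -= n;
        }
        return TRUE;
}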
(In reply to comment #21)
> > - Trick stdio into believing
>
> man fdopen: "The file position indicator of the new stream is set to that
> belonging to fd". So this clearly wouldn't work.
>
> Let's go for our own very simple caching on top of pread/pwrite.

Ok, let's step back and design caching, fewer-fds, compression, and encryption together. Because they can't be added as separate layers without significant costs.

For compression, we either have to keep an open compressing state as we write to it, OR accumulate a chunk of data and compress it all at once. Same for encryption. These both fit perfectly into the caching. So, we gather a full chunk at a time, and compress+encrypt+write it synchronously. If we keep an LRU cache of two uncompressed/decrypted chunks around, that guarantees good scrolling behavior. These two should be separate from the writing head's chunk.

I think it's worth thinking about them together now that you want to reinvent stdio caching, as some of the work you spend on reducing the number of fds will have to be thrown away / revalidated when the other layers are added. For example, the fact that compression changes the chunk length from constant to variable has implications on how you implement the actual I/O layer.

> Another optimization we could do is not to read the tail of row_stream and
> advance the tail of the other two streams every time, but do it only every Nth
> occasion.

That's an interesting idea. When we implement caching/compression/encryption, we should make sure advancing the tail doesn't cause a decompression/decryption.
(In reply to comment #22)
> For compression, we either have to keep an open compressing state as we write
> to it, OR, accumulate a chunk of data and compress it all at once. Same about
> encryption. These both fit perfectly into the caching. So, we gather a full
> chunk at a time, and compress+encrypt+write it synchronously.

Sounds good.

> If we keep an
> LRU cache of two uncompressed/denecrypted chunks around, that guarantees good
> scrolling behavior. These two should be separate from the writing head's
> chunk.

Yes.

> I think it's worth thinking about them together now that you want to reinvent
> stdio caching, as some of the work you spend on reducing number of fds will
> have to be thrown away / revalidated when the other layers are added.

Even if some of my work will be thrown away, I'd like to do one-fd and pread/pwrite caching only first. This is because we can already see the end of this tunnel nearby, and we'll have a safe checkpoint where VTE is definitely better than before.

Compression/encryption will be tricky, we'll probably have to experiment with a few approaches, see if altering the stream formats (e.g. UCS-4 with inline attributes) helps or not, evaluate the performance costs etc. It'll be a much more complicated task than one-fd, probably with complications we can't foresee. I don't know if I'll have time to do it anytime in the near future (let alone that I have zero experience both with compression and with encryption :)). I'd like to push one-fd into 0.40, while compression+encryption might be delayed by several more years due to lack of eng resources.

> For
> example, that compression changes the chunk length from constant to variable
> has implications on how you implement the artual I/O layer.

Definitely.

> > Another optimization we could do is not to read the tail of row_stream and
> > advance the tail of the other two streams every time, but do it only every Nth
> > occasion.
>
> That's an interesting idea. When we implement caching/compression/encryption,
> we should make sure advancing the tail doesn't cause a
> decompression/deencryption.

In row_stream we need direct access (e.g. when dragging the scrollbar) and we need to frequently read its tail (unless we come up with a substantially different approach), so probably we should do no compression and only weak encryption there. This means the stream's behavior needs to be parameterized. The other two streams could probably receive compression and better encryption.

Let's not forget that when rewrapping, we need to read tons of data from text_stream; this might cause a performance problem. We'd need a compression+encryption whose costs are low compared to the cost of reading from disk and rewrapping.
Ok, you have very good points. I wanted to raise this issue to your attention. If you implement one of these items without the others, that's still one item we can check as done. :)
I have not forgotten about this issue, I just don't have too much time nowadays.

I had a couple of approaches to implement the above-mentioned single-fd stream, with caching (no compression/encryption), but things always got way out of hand and made me drop the code and think further. Implementing the single-fd structure and caching together in the same layer is practically impossible. Implementing caching as the lower layer right above pread/pwrite, and the single-fd structure above that, is still very cumbersome with tons of special cases.

Seems to me that the simplest approach would be to leave the single-fd structure as it is (patch from comment 14), and implement caching on top of that - which is still easy to get wrong :) Finally I have a half proof of concept that works (only the writes are cached, reads are not - it's already as fast as current mainstream vte).

The caching layer would basically batch up append requests until we have a full 4kB block, and also delay the advance_tail calls until a 4kB block can be dropped from the tail. (This guarantees that advance_tail works on the part that's already written to the file, and doesn't advance into the write cache.) We'd always read aligned 4kB blocks, caching 1 or 2 LRU pages.

Truncating is also a bit tricky. It's a rare user-initiated event (it can only occur when the window is made taller - correct?) so I'd go for a solution that's a local hack without influencing the rest of the code. That is: read the new partial block at the head, and place that in the write cache (as if it hadn't been written yet); it'll be overwritten with itself.
Created attachment 290778 [details] [review] Implement stream in one fd, v2 There's a lot to clean up in this code, but we're getting there :) Finally caching is working fine. The code also brings a slight (~ 3-5%) speed improvement when cating a large file with finite scrollback.
Created attachment 290872 [details] [review] Implement stream in one fd, v3 (just a git update)
Now that with this patch we only read and write aligned 4kB-blocks, I believe implementing encryption (but no gzip) became fairly trivial. Behdad, do you have any experience which encryption method and which library to use? It should be super fast, yet reasonably secure.
Awesome. Let's do this!

See https://bugzilla.gnome.org/show_bug.cgi?id=664611#c40

I'd say we want AES. The key size can be configurable at compile time (128, 192, 256). We need to create a random key per stream, and then choose a block cipher mode to go with it:

https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation

We probably don't want ECB. CBC sounds right to me. In which case, for random access, this means that for every block, we need to keep a key in memory. As such, I think a block size of, say, 32kb might be better than 4kb.

As for implementation, let's start with OpenSSL, and see what everyone else switches to over time. Alternatives are GnuTLS and LibreSSL.
This seems to be useful: http://stackoverflow.com/questions/2575131/symmetric-encryption-performance-questions . It says "Good implementations of AES ... require about 10-20 CPU cycles per byte to encrypt." The benchmark file I've been using since bug 572210 is 42 MB large, cating it just dropped from ~7.5s to ~7.2s with the patch above; assuming 20 CPU cycles for each byte @ 2.3 GHz, plus a factor of 2 (encrypting at head, decrypting at tail) gives an expected additional 0.73 seconds (that is, ~10%) increase. I'd be super happy with that!
> plus a factor of 2 (encrypting at head, decrypting at tail) I was stupid here. We don't decrypt the large text_stream at tail, only the much smaller row_stream. I didn't count with the encryption of the two small streams, but even with them I'm hoping for a less than 10% degradation.
(In reply to comment #29) > We probably don't want ECB. CBC sounds right to me. In which case, for random > access, this means that for every block, we need to keep a key in memory. These streams can grow arbitrarily large, so let's not go this way. I think we should have one single key for all the blocks within a stream.
(In reply to comment #32) > (In reply to comment #29) > > > We probably don't want ECB. CBC sounds right to me. In which case, for random > > access, this means that for every block, we need to keep a key in memory. [My understanding of CBC was wrong; but the conclusions are even worse.] > These streams can grow arbitrarily large, so let's not go this way. I think we > should have one single key for all the blocks within a stream. Fair enough. Looks like we want CTR mode: https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29 For each stream we need a random key and nonce, and then use the file offset as counter.
I'm not sure I get the overall structure right, and maybe 2am is not the best time to figure it out :) So, I think AES encrypts 128 bits == 16 bytes in a single run, and within a 4kB block we'd chain up 4096/16 == 256 AES-blocks using CBC. Am I right? According to the wiki page, the initialization vector doesn't need to be secure, but ideally should be unique. We could e.g. use the stream offset to guarantee this. We should study a bit what already existing filesystem or blockdevice encryption stuff (e.g. cryptoloop) do. I'm sure they also only require a single master key, and somehow generate the initialization vectors automatically. (Editing: funny that we're cross-posting basically the same.)
I should have been more clear. AES keys are 128 / 192 / 256 bits; each run encrypts one 16-byte block. Within a 4kb block, we chain using CBC. We need an initialization-vector for this. For the stream we need a key, and an initialization-vector (IV). The IV can possibly be a pointer value or something if it doesn't have to be secure. Across runs of 4kb, we use IV+fileOffset as the initialization vector for use within that single block.
(In reply to comment #29) > As for implementation lets start with OpenSSL, and see what everyone else > switch to over time. Alternatives are gnuTLS and LibreSSL. We can't use openssl since its licence is GPL-incompatible.
Created attachment 290937 [details] [review] Encryption, v1 This patch adds encryption (on top of "one fd, v3"). Performance degradation is somewhere betwee 5% - 10%, on top of the improvement brought by pwrite/pread + our own caching. That is, hardly any degradation on top of current mainstream vte.
Created attachment 290938 [details] [review] Encryption, v2
Seems to me that libgcrypt is used by far more apps than gnutls. I'll give it a try. Can I hardcode the requirement for the crypto library? Or do I need to make it optional for whatever reason (technical, legal [export laws etc.])?
No need to make it optional or configurable at the configure.ac level, IMHO. Not sure if we need to make it configurable (on/off) at the vte API level. On which lib to use: https://fedoraproject.org/wiki/FedoraCryptoConsolidation#Medium_Term_Goals
Thanks Egmont. I checked the v1 / v2 and they look great. Lets start cleaning this up and landing. gcrypt sounds like the better choice since it looks like gnutls uses gcrypt internally?
Actually I'm not sure whether libgcrypt or nettle is preferred.
A really old document suggests that nettle is much faster than gcrypt (but still slower than OpenSSL, which I hear is the most optimized, using modern CPU special operations etc). http://panthema.net/2008/0714-cryptography-speedtest-comparison/
(In reply to comment #40)
> No need to make it optional or configurable at the configure.ac level, IMHO.
> Not sure if we need to make it configurable (on/off) at the vte API level.

I wouldn't want to make an API for it. I don't see a point in an app toggling this back and forth.

> On which lib to use:
> https://fedoraproject.org/wiki/FedoraCryptoConsolidation#Medium_Term_Goals

Seems NSS is the winner, GnuTLS is the runner-up. I couldn't find the NSS API docs yet, just an example here: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/nss_sample_code/NSS_Sample_Code_sample2 which seems quite complicated, and the initialization with no database (and its possible interference with a vte-based app doing its own NSS) scares me. Maybe we're just fine with the gnutls code I created.

(In reply to comment #41)
> gcrypt sounds like the better choice since it looks like gnutls uses gcrypt
> internally?

On Utopic there's libgnutls26 (2.12.23, using libgcrypt) and libgnutls28 (3.2.11, using libnettle4), so it looks like they either changed, or it's a configure option and Ubuntu changed it. My code requires the newer gnutls 3.2.x.

(In reply to comment #43)
> http://panthema.net/2008/0714-cryptography-speedtest-comparison/

If we can still trust this, and also Fedora's recommendation, then gnutls is actually a very good choice. Maybe NSS is better but I'm a bit afraid of that (will take a look anyway). OpenSSL looks faster but is indeed out of the question for license issues.
Given the weird mapping from logical offset to file offset, and the filesystem layer below us, it might happen that someone steals a laptop with a running vte, finds its key in memory, finds unused filesystem blocks containing data that was long ago scrolled out at the top of this vte, and is able to decrypt them.

Of course there's nothing we can do to protect data that's still there in vte's scrollback, but it would be nice if we could safely destroy it once it is scrolled out (maybe not immediately, but after a reasonable amount of further output).

One thing we could do is to explicitly zero the blocks when advancing the tail of the stream. This might have a noticeable performance cost.

Another idea is to zero the blocks whenever we release them back to the OS (that is, truncate the file or (in a future version) punch a hole). This could happen at a slower rate than writes, e.g. if the physical end of the file is beyond the desired one, then on every 16th pwrite() request we also zero the trailing block and truncate there. But (talking about the future version with punch_hole) if we're doing it at a slower rate then it defeats the purpose of punching holes (we don't actually win by having a smaller file), and if we're doing it at the normal rate then we're back at the performance regression. Hah, complicated.

Yet another idea is to periodically rotate the secret key, maybe have at most two of them at a time, pretty much like the current paging between two files goes. The scrollback contents would be encrypted with key[0] up to a point and key[1] from that point. When the tail reaches the boundary, so that key[0] is no longer used, it gets overwritten by key[1] and we generate a new key[1] to be used from now on. At this point the old key is forgotten for good, and leftover unused blocks are no longer decryptable.

Does this make any sense, or am I overcomplicating it? :)
It's certainly something to think about. We definitely should study what has been done in the literature, or ask someone more familiar with crypto at least. I'd say lets get the simple version cleaned up and committed, then find someone to review and recommend design changes.
Ok, since Fedora recommends GNUTLS (with nettle as crypto backend, but nettle never used directly by applications), I think we should go with GNUTLS. NSS on the other hand is complicated.
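(A hedged sketch of per-stream key setup with the GnuTLS 3.x API, just to make the idea concrete; the algorithm constant is shown as AES-256-GCM, i.e. the mode the thread eventually settles on, error handling is omitted, and whether GNUTLS_RND_KEY is the right randomness level is exactly the kind of thing that needs review.)

#include <string.h>
#include <gnutls/gnutls.h>
#include <gnutls/crypto.h>

static gnutls_cipher_hd_t
stream_cipher_new (void)
{
        unsigned char key[32];                 /* AES-256: one random key per stream */
        gnutls_datum_t dkey = { key, sizeof (key) };
        gnutls_cipher_hd_t hd;

        gnutls_rnd (GNUTLS_RND_KEY, key, sizeof (key));
        gnutls_cipher_init (&hd, GNUTLS_CIPHER_AES_256_GCM, &dkey, NULL);
        memset (key, 0, sizeof (key));         /* wipe the local copy; see the later comments about this */
        return hd;
}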
Created attachment 291278 [details] [review] Implement stream in one fd, v4 ChPe or Behdad, at this point could you please review if I'm doing the GObject inheritance correctly? There is the base abstract VteStream, extended by VteUnbufferedFileStream, further extended by VteFileStream. The "normal" (buffering) one is used by default, but by changing the _new() calls in ring.c it's easy to get the unbuffered slower version to work, useful for testing only one layer instead of two. I'm particularly curious about the way _vte_file_stream_init() calls its parent, and how an explicit ".parent" struct member is looked up sometimes. And all the chrome related to inheritance. Thx!
Created attachment 291287 [details] [review] Implement stream in one fd, v5 Unittests added. Plus some bugfixes.
Created attachment 291288 [details] [review] Implement stream in one fd, v5 (sorry, here's the full patch instead of an incremental one)
Created attachment 291290 [details] [review] Implement stream in one fd, v6 I believe it's finally okay and bugfree now (fingers crossed) and the code is also mostly cleaned up. A bit more cleanup won't hurt, though.
Regarding the encryption patch:

- I have a vague memory that the compiler might optimize away the memset(key,...). Perhaps check what the best practice for that is.

- For the IV I'm hearing suggestions to use 96 bits: "The nonce should be 12 bytes long -- other lengths cause GCM to do stuff to cope and you don't want that." Not sure how that translates to gnutls, but we should make sure, if it's using only parts of our 256 bits, that it's using the parts we want (including the offset).

- Perhaps add the offset to the random bits instead of replacing it.
(In reply to comment #52)
> Regarding the encryption patch:
>
> - I have a vague memory that the compiler might optimize away the
> memset(key,...). Perhaps check what the best-practice for that is.

Interesting point indeed :)

> - For IV I'm hearing suggestions to use 96 bits: "The nonce should be 12
> bytes long -- other lengths cause GCM to do stuff to cope and you don't want
> that." Not sure how that translates to gnutls, but we should make sure, if
> it's using only parts of our 256 bits, that it's using the parts we want
> (including the offset).

The GnuTLS API tells us how large the IV is (it is 128 bits, even with AES256). I'm not sure if it's legal to feed it a smaller one (I guess it is, because the API allows this).

> - Perhaps add the offset to the random bits instead of replacing it.

So, there's the key, which is as random as it can be, and there's the IV, which as far as I understand doesn't even have to be random, it could be just the plain offset. Of course it can't hurt to make it more random. In my current setup it's 8 bytes stream offset + 8 bytes random.

One more thing, regarding the IV (copying from what I just sent in email): We sometimes "reset" the streams to a particular offset (e.g. when the "reset" command is issued). Then the current way of generating the IV starts to repeat. As far as I understand, the IV doesn't need to be cryptographically strong, but should be unique. Currently some bytes of the IV encode the offset, some other bytes are random. My idea is to increment that random value by 1 on a reset. This is fast, doesn't drain the random entropy, and guarantees uniqueness. (A "reset" in this sense is when the stream is empty, so we don't need to remember the previous IV.)
Sounds good. I'll update you here as I learn more. In the mean time, here's what I sent you via email:

==================================
The biggest feedback I've received so far indeed has been to never ever reuse an IV. As such, what if:

- We represent the IV as 96 bits, three int32's? Populate randomly as base,

- Add the block number (i.e. offset / 4096) to the first int32 in the IV for each block. This won't overflow for 16TB worth of data. Later on we can look into implementing carry, or use 64-bit ops,

- On reset, move the IV base forward, such that the next block IV will be the same as if the reset had not happened.

People have also suggested GCM instead of CTR. GCM has an integrity check at the expense of needing to store a 16-byte tag at the end of the encrypted data. I've now asked if we can just use CTR if we don't care about integrity. Will let you know.
==================================

Your idea about the IV works too, if we can make sure gnutls doesn't throw away parts of the IV.

Humm. How do we handle truncate vs IV-reuse?
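(The 96-bit IV scheme from the email above, sketched out; the names are hypothetical and, as the email notes, carry into the next word is left for later.)

typedef struct {
        guint32 word[3];            /* populated randomly once, as the base */
} Iv96;

/* Derive the per-block IV: add the block number into the first word.
 * No carry into word[1] yet, which is fine below 16TB worth of data. */
static Iv96
iv_for_block (Iv96 base, guint64 offset)
{
        base.word[0] += (guint32) (offset / 4096);
        return base;
}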
(In reply to comment #54)
> Humm. How do we handle truncate vs IV-reuse?

I'd like to "cheat" here a little. Truncate only happens when you make the terminal widget taller and cross a power-of-two boundary. And only once per each power of two, because the ring is never made smaller. In this case, we truncate the stream by a rowful of data, that is, maybe like 100 bytes in row_stream, 16/24 in the others. This has a quite small chance of crossing a 4kB boundary and hence the last encrypted block being thrown away and its IV getting reused.

I can modify the code not to physically truncate the file in this case, meaning that the new block with the same IV will, if written to the same file offset, replace the previous block on the disk. Now, it might still happen, with an even smaller chance, that due to an advance_tail in between, the one-fd logic decides to place this block at a different file offset than previously...

If anyone, based on this information, can find two such blocks on the disk that used the same IV, and use this information to decrypt the data, then I give up software development :D
Haha. Fine with me as long as we document these all.
To correct myself: we truncate by multiple rows of data, maybe a kilobyte or so, so the chance of crossing a 4kB boundary there is not that small.
Created attachment 291429 [details] [review] Implement stream in one fd, v7 I really hope this one works perfectly :)
If you are confident in the patch, I think we should land it and see what breaks :D.
Will do if it survives a few days for me. I'll do another round of review and minor cleanups myself. In the mean time, reviews from you guys are welcome :)
So, I heard back re crypto GCM vs CTR.

First, as you also noted before, the requirement is that (key, nonce) be unique. So we don't really have to populate the IV with random bits. But doing so doesn't hurt, so that's fine.

The recommendation I received is to implement GCM mode, which implements integrity as well as confidentiality. We currently don't require integrity, but for example we will if we want to provide scrollback persistence a la the OS X terminal. Here's what AGL had to say on this point:

"""
I find, in general, that all systems tend to drift towards the standard (integrity+confidentiality) threat model over time. Additionally I find that developers always mentally assume that threat model whenever encryption is involved -- whatever the initial actual threat model is. So I'm going to stick with recommending GCM, including storing and checking the tags as anything less invites a problem in the future.
"""

Biggest problem with GCM is that we have to store 16 bytes of tag data after the encrypted block, which adds yet another set of offsets to keep track of...

My personal judgement is to document all these very clearly and go ahead with your current model.
Someone also brought this to my attention: http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
(In reply to comment #62)
> http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf

This is about compressed streams only, right? Good to keep it for future reference, but I'm really not going to do compression now.

(In reply to comment #61)
> The recommendation I received is to implement GCM mode

Thinking about it a bit more, handling the two kinds of offsets won't be as hard as I first thought. I'll try to give it a go.

---

The current vtestream has some tricks to handle out-of-disk. I don't quite get how that's supposed to work, could you please enlighten me?

We just ignore a write() failure, and try to truncate() [in the sense of growing a file] before subsequent writes, so that the file is large enough. This was important for the stdio-based model; I think this code could be dropped from _file_write() now that we use pwrite().

A corresponding _file_read() either returns FALSE (if the offset is beyond EOF) or a block full of zeros (a sparse block, if we could later write data to the file). I'm not sure the caller handles these two cases equally. There'll be a third rare case, namely the integrity checksum of the encrypted block not matching. I guess we'll just return FALSE in this case too.

As for text_stream, I can see how a zero block causes just a part of the scrollback to be missing, that's all. But for {row,attr}_stream, if there's a missing block then I'm afraid things go totally out of our hands and the rest of vte's code is likely to crash or whatever. Am I missing something?

---

Is there a name for the math operator that finds the largest number that's not larger than a given number and is divisible by another given one? E.g. whatsyourname(173, 10) = 170. I call it ALIGN() in my latest patch, but if there's an official name I'd be happy to learn and use that :)
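(For what it's worth, this operation is usually just called rounding down to a multiple, or "floor" alignment; a minimal version:)

#define ALIGN_DOWN(x, n)        ((x) - (x) % (n))      /* any n > 0: ALIGN_DOWN(173, 10) == 170 */
#define ALIGN_DOWN_POW2(x, n)   ((x) & ~((n) - 1))     /* cheaper form when n is a power of two */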
(In reply to comment #62)
> http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf

After a quick scroll-through, my key takeaway is that prefixing the data with some pseudo-random bytes cannot hurt; in the case of compression it can even help a lot. (These are bytes that we don't need to remember, as opposed to the key or IV; they are thrown away when decrypting.)

Could you please ask the security guy whether it makes sense for us to do this, even without compression? If I implement GCM, which needs room for some additional data, adding this too wouldn't be a problem. I'm thinking of something damn simple like adding up timestamps, or a few encrypted bytes as read back from these streams, maybe even in a static buffer (shared by all FDs and VteTerminals).

Or maybe by doing this I'm not improving security at all, just leaking some more data to the attackers??
I was about to ask something which I thought would be a very stupid question. It wasn't clear to me what's the right stacking order of encrypting and computing the hash, maybe encrypt the hash too. After a bit of searching, it turns out that this is actually a valid question: http://en.wikipedia.org/wiki/Authenticated_encryption -> EtM, E&M or MtE. Could you please ask the guy which one he recommends for our use case?
This is what AGL said before to me:

AES-GCM is a sound AEAD construction that will provide the "seal" and "open" semantics that I find that most people expect. Nettle's AES-GCM interface isn't very clean, but is documented here: http://www.lysator.liu.se/~nisse/nettle/nettle.html#GCM (scroll down to "6.4.2.3 GCM-AES interface")

I would recommend generating a random key per file by reading from /dev/urandom. Holding that key in memory will provide the same protection as holding the scrollback buffer in memory. (Of course, systems have to have encrypted swap to ensure that the key itself isn't written to disk, but that's no different than storing the scrollback buffer in memory.)

AES-GCM requires that (key,nonce) pairs be unique for all time. If the key is randomly generated then that can be achieved by having a 96-bit counter per-file that is incremented for each chunk that's encrypted. You will need to know, external to the encryption, the length of each block and what its nonce is.

Nettle calls the nonce an IV. You must call gcm_aesXXX_set_iv before each gcm_aesXX_(de|en)crypt call. (I assume that the interface will stop you from encrypting two chunks with the same nonce, but don't bet on it.) The nonce should be 12 bytes long -- other lengths cause GCM to do stuff to cope and you don't want that.

You must call gcm_aesXXX_digest after each gcm_aesXX_(de|en)crypt call. When encrypting, you need to get the 16 bytes and store it with the ciphertext (usually immediately afterwards). When decrypting you must compare the 16 byte result to the stored tag. Do not process decrypted plaintext before comparing the tag. If you abort on a tag mismatch then you're probably safe but otherwise (and maybe even then), use a constant-time compare to check the tags:

  uint8_t x = 0;
  for (unsigned i = 0; i < 16; i++) {
    x |= calculated_tag[i] ^ stored_tag[i];
  }

  if (x != 0) {
    abort();
  }
(In reply to comment #64) > (In reply to comment #62) > > > http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf > > After a quick scroll-through, my key takeaway is that prefixing the data with > some pseudo-random bytes cannot hurt; in case of compression it can even help a > lot. (These are bytes that we don't need to remember, as opposed to key or IV, > these are thrown away when decrypting.) > > Could you please ask the security guy whether it makes sense for us to do this, > even without compression? If I implement GCM which needs room for some > additional data, adding this too wouln't be a problem. I'm thinking of > something damn simple like adding up timestamps, or a few encrypted bytes as > read back from these streams, maybe even in a static buffer (shared by all FDs > and VteTerminals). > > Or maybe by doing this I'm not improving security at all, just leaking some > more data to the attachers?? I was under the impression that the nonce does that for us? Anyway, I don't like adding extra bits "just-in-case", if the crypto recommendation doesn't require it.
(In reply to comment #63)
> (In reply to comment #62)
> > http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
>
> This is about compressed streams only, right? Good to keep it for future
> reference, but I'm really not going to do compression now.
>
> (In reply to comment #61)
> > The recommendation I received is to implement GCM mode
>
> Thinking about it a bit more, handling the two kinds of offsets won't be that
> hard as I first thought. I'll try to give it a go.

Sure, it's not hard, you just need to change your blocksize from 4096 to (4096-16) and do a couple of conversions. The slower div/mul compared to bitwise shifts is the only thing that bothers me :D. Well, and the fact that we have to validate data upon reading. I'm divided on which way to go.

> Current vtestream has some tricks to handle out-of-disk. I don't quite get how
> that's supposed to work, could you please enlighten me?

This definitely changed when I was switching to stdio.

> We just ignore a write() failure, and try to truncate() [in the sense to grow a
> file] before subsequent writes, so that the file is large enough. This was
> important for the stdio-based model, I think this code could be dropped from
> _file_write() now that we use pwrite().

If pwrite() extends the file as needed, then you are right indeed.

> A corresponding _file_read() either returns FALSE (if the offset is beyond EOF)
> or a block full of zeros (sparse block, if we could later write data to the
> file). I'm not sure the caller handles these two cases equally. There'll be a
> third rare case, namely integrity checksum of the encrypted block doesn't
> match. I guess we'll just return a FALSE in this case too.
>
> As for text_stream, I can see how a zero block causes just a part of the
> scrollback to be missing, that's all.

Yes. And that has worked in practice, in my testing. My goal was that history recording should resume working if space becomes available on the partition, but I never saw *that* working. Whenever my system goes out of space, all terminals lose their history, and it never comes back. Would be nice to fix that.

> But for {row,attr}_stream, if there's a missing block then I'm afraid things go
> totally out of our hands and the rest of vte's code is likely to crash or
> whatever. Am I missing something?

You are not missing anything... I intentionally designed the attr-stream such that reading back zeros would not be catastrophic. Kinda same about row_stream. But looks like we don't set zeros either? I was under the impression that I did that.

> Is there a name for the math operator that finds the largest number that's not
> larger than a given number and is dividable by another given one? E.g.
> whatsyourname(173, 10) = 170. I call it ALIGN() in my latest patch, but if
> there's an official name I'd be happy to learn and use that :)

Don't know :).
Created attachment 291502 [details] [review] Encryption, v3 This is still the previous encryption (AES_256_CBC), ported to one-fd patch v7.
Review of attachment 291502 [details] [review]:

::: configure.ac
@@ +217,3 @@
 PKG_CHECK_MODULES([GTK],[gtk+-$GTK_API_VERSION >= $GTK_REQUIRED])
+### TODO: why is the 3rd arg needed, is it okay?

The third arg is not needed if you consume LIBGNUTLS_LIBS later. Since we are requiring this, you can simply add the gnutls >= 3.2.0 requirement to the VTE_PKGS line in configure.ac and be done.

::: src/vtestream-file.h
@@ +600,3 @@
 stream->rbuf_offset = 1; /* Invalidate the read cache. */
+
+ gnutls_global_init ();

Should we subclass the encrypted stream from the buffered stream? The stream types / naming / etc. are getting a bit confusing.
Created attachment 291513 [details] [review] Encryption, v4 GCM with integrity checking. 4080-byte buffers are encrypted, integrity stuff added, written into FS 4096-byte blocks. Missing bit: altering IV on a stream reset. I believe it's basically okay now. Time to stress-test and start a massive code cleanup (e.g. the naming is indeed confusing, even for myself ;))
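(A hedged sketch of the per-block layout just described, not the attachment itself: 4080 bytes of plaintext plus the 16-byte GCM tag fill one 4096-byte file block. GnuTLS 3.x calls are used; the IV construction here is only schematic, see the IV discussion elsewhere in this bug.)

#define FS_BLOCK  4096
#define TAG_SIZE  16                       /* gnutls_cipher_get_tag_size() for GCM */
#define DATA_SIZE (FS_BLOCK - TAG_SIZE)    /* 4080 */

static gboolean
encrypt_block (gnutls_cipher_hd_t hd, guint64 offset,
               const char *plain /* DATA_SIZE bytes */,
               char *out /* FS_BLOCK bytes */)
{
        guint64 iv[2] = { offset, 0 };     /* must be unique per (key, block) */

        gnutls_cipher_set_iv (hd, iv, sizeof (iv));
        if (gnutls_cipher_encrypt2 (hd, plain, DATA_SIZE, out, DATA_SIZE) < 0)
                return FALSE;
        gnutls_cipher_tag (hd, out + DATA_SIZE, TAG_SIZE);
        return TRUE;
}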
(In reply to comment #55)
> > Humm. How do we handle truncate vs IV-reuse?
>
> I'd like to "cheat" here a little.

I kept thinking about it a lot, and here's the best I found:

When truncating, the lowest unbuffered layer could actually not totally undo the bookkeeping of the previous append, but keep the segment[0..2].{st,fd}_{tail,head} values unmodified and only decrement stream->head. This would allow it to keep the same behavior via the API, yet guarantee that the data will be overwritten at the exact previous location.

The append() call becomes more complicated: first, if stream->head is lower than st_head (i.e. there was a truncate), then fill up the file offsets as mapped by the segment[0..2].{st,fd}_{tail,head} values (same logic as for read()s), and once that's done, continue with the current snake logic. Luckily, in turn, the truncate() logic becomes much simpler.

[Another idea I had was: since the IV doesn't have to be secret, we could just have an ever-incremented value disassociated from the file offset, and store it at the beginning of 4k-blocks. But I didn't like it because it leaks data and helps an attacker reconstruct the sequence of filesystem blocks.]
(In reply to comment #66)
> You must call gcm_aesXXX_digest after each gcm_aesXX_(de|en)crypt
> call.

BIG FAT TODO: double check that gnutls_cipher_tag() is indeed the wrapper above gcm_aesXXX_digest().

(In reply to comment #67)
> Anyway, I don't
> like adding extra bits "just-in-case", if the crypto recommendation doesn't
> require it.

Fair enough.

(In reply to comment #68)
> You are not missing anything... I intentionally designed the attr-stream such
> that reading back zeros would not be catastrophic. Kinda same about
> row_stream.

I never had this in mind, e.g. at rewrap. I probably broke this big time. :P

(In reply to comment #70)
> Should we subsclass encrypted stream from buffered stream?

I'd like to give it its own layer. Not sure if it can be a subclass: append() receives the offset as a new parameter, it needs it for the IV. So it'll be a bit ugly, with a different API than the layers below and above it. Anyway, encryption along with the 4096/4080 offset mapping definitely belongs in a new layer.
Thinking even more... The code evolved so that the underlying nonbuffering part has the same API as the buffering layer above, meaning you can easily change ring.c to use the unbuffered one. You get slower and unencrypted behavior. It was useful for testing and writing the buffering code, but not that useful anymore. I don't think it makes sense to maintain this identical API any longer. The underlying layer should serve the needs of the buffering-encrypting layers. It could insist on working with 4k blocks, allowing to eliminate some ugly loops. It could also stop supporting explicit truncation, instead it could offer random-access-writes to any offset that's either within the currently covered region, or immediately follows that. It'd be closer API-wise to the encrypt_and_write() method by always requiring the offset. To clean up the naming mess, I think I'll call this layer "snake". Seriously :) It totally looks like one particular row from the well-known Nokia game.
Created attachment 291772 [details] [review]
Properly reset the ring

Currently the ring can't be reset. On a terminal reset (\ec) the old ring is thrown away and a new one is created, which means 3 new files, and in turn 3 new encryption keys. We don't want a two-byte escape sequence to be able to drain the entropy pool by 96 bytes (in fact, we don't want any escape sequence to drain the entropy pool at all). So reset the existing ring, keeping its streams and their encryption keys.
Comment on attachment 291772 [details] [review]
Properly reset the ring

committed with minor changes :)
Created attachment 291834 [details] [review] Implement stream in one fd, rc1
Created attachment 291837 [details] [review]
Encryption, rc1

Here's the first version which is hopefully ready :) Two patches on top of each other, as usual, so that encryption is separated from the rest. Please review and test, thanks! Behdad, could you please also ask the security expert to take a look?
Further useful info about AES GCM, including IV reuse: http://crypto.stackexchange.com/questions/18420/aes-gcm-disadvantage
Just occurred to me: it's not only when the terminal is made taller that we truncate the head (and hence reuse the IV). It also happens to the whole onscreen contents at a horizontal resize with rewrap: the rewrap algorithm is implemented on the stream, so first the whole onscreen contents are frozen (written to the stream) and then rewrapped. This is just FYI for evaluating whether reusing the same IV (and overwriting the previous file system block of the same IV) is acceptable. I still hope it is.
Note: AES-256 GCM brought a higher than anticipated performance regression, about 20% with my favorite test file (cat'ing it went from 7s to 8.3s, the reference being the one-fd patch only; current vte is somewhere halfway in between). I hope that compression (bug 738121) will help us here: a simple compression algorithm is fast, and it leaves much less data to encrypt.
Review of attachment 291837 [details] [review]:

::: src/vtestream-file.h
@@ +533,3 @@
+ */
+ struct _VteIv {
+

nit: it might be simpler to make offset be a guint64, then the padding isn't needed.

@@ +577,3 @@
+ /* Strong random for the key. */
+ gnutls_rnd_refresh();
+ void (*reset) (VteBoa *boa, gsize offset);

GNUTLS_RND_KEY's behaviour isn't fully documented, but it might be that it does a blocking read from /dev/random. That's probably going to be a problem for some people—they won't want their terminal hanging like that.

Opinions vary, but I tend to go with "just read /dev/urandom"; however that translates into GnuTLS calls.

@@ +582,3 @@
+ datum_key.size = VTE_CIPHER_KEY_SIZE;
+ gnutls_cipher_init(&boa->cipher_hd, VTE_CIPHER_ALGORITHM, &datum_key, NULL);
+ gboolean (*read) (VteBoa *boa, gsize offset, char *data);

The key will remain in memory in boa->cipher_hd in any case so it's unclear how useful this is. There isn't a good, portable way to clear memory securely, sadly. The closest is memset_s, but I don't know if that's included in your minimum requirements.

@@ +588,3 @@
+ /* Weak random for IV's reset_counter, zeros for the rest. */
+ memset(&boa->iv, 0, sizeof(boa->iv));
+ gsize (*head) (VteBoa *boa);

The reset counter does not need to be random.

@@ +601,3 @@
+
+ gnutls_cipher_deinit (boa->cipher_hd);
+G_DEFINE_TYPE (VteBoa, _vte_boa, VTE_TYPE_SNAKE)

A global deinit of GnuTLS might be surprising to applications that are using it for other things.

@@ +632,3 @@
+
+#ifndef VTESTREAM_MAIN
+ g_assert_cmpuint (gnutls_cipher_get_tag_size(VTE_CIPHER_ALGORITHM), ==, VTE_CIPHER_TAG_SIZE);

I don't know the context well enough to know if it's impossible to overwrite data at the same location without calling _vte_boa_reset first. Hopefully that's true, but I'd be tempted to maintain a high-water mark in the code and assert that a write never regresses it unless _vte_boa_reset has been called.
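Re the last point, something like this is what I have in mind (a sketch only; the field and function names are made up, not from the patch):

#include <glib.h>

/* Remember the highest block index ever handed to the encrypt path and
 * assert that writes never go backwards unless a reset happened in
 * between, so a nonce can't silently be reused. */
typedef struct {
        gsize    high_water;   /* first block index not yet written */
        gboolean was_reset;
} NonceGuard;

static void
nonce_guard_check (NonceGuard *guard, gsize block_index)
{
        if (guard->was_reset) {
                guard->high_water = block_index;
                guard->was_reset = FALSE;
        }
        g_assert_cmpuint (block_index, >=, guard->high_water);
        guard->high_water = block_index + 1;
}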
Review of attachment 291837 [details] [review]:

::: src/vtestream-file.h
@@ +533,3 @@
+ */
+ struct _VteIv {
+ gsize offset;

I did it this way so that the code can easily be modified to use another encryption algorithm just by altering the #define's at the top.

@@ +577,3 @@
+ /* Strong random for the key. */
+ gnutls_rnd_refresh();
+ gnutls_rnd(GNUTLS_RND_KEY, key, VTE_CIPHER_KEY_SIZE);

I have no preference here; I might look into what gnutls does (but it's not guaranteed via its interface, the implementation might change and we wouldn't notice).

@@ +582,3 @@
+ datum_key.size = VTE_CIPHER_KEY_SIZE;
+ gnutls_cipher_init(&boa->cipher_hd, VTE_CIPHER_ALGORITHM, &datum_key, NULL);
+ /* FIXME: 738601#c52 the compiler might optimize this away, how to make sure it's erased?

That one will be (hopefully) wiped out by gnutls when closing the stream; trusting gnutls is our best bet there. Here it's a copy that we handle ourselves. Maybe if it's not local on the stack but inside VteBoa then gcc doesn't have that much of a chance to optimize it away.

@@ +588,3 @@
+ /* Weak random for IV's reset_counter, zeros for the rest. */
+ memset(&boa->iv, 0, sizeof(boa->iv));
+ gnutls_rnd(GNUTLS_RND_NONCE, &boa->iv.reset_counter, sizeof(boa->iv.reset_counter));

I know :) but it doesn't hurt either. I might initialize it to 0.

@@ +601,3 @@
+
+ gnutls_cipher_deinit (boa->cipher_hd);
+ gnutls_global_deinit ();

According to the gnutls_global_init() manpage, they do reference counting and only deinit when the counter drops to zero.

@@ +632,3 @@
+
+#ifndef VTESTREAM_MAIN
+ boa->iv.offset = offset;

It was discussed above that, unfortunately, in rare cases the last one or two IVs get reused (their blocks get overwritten). It's kind of an "undo" and "redo differently", happening when the terminal gets resized. I was thinking a lot about how this could be avoided, but I don't see a solution. So every once in a while we reuse the same IV, but it's guaranteed that the same file system block will be overwritten with it. If an attacker is able to see both the old and the new encrypted blocks, chances are they can just grab the key from vte's memory. I think the real use case we have to protect against is a hard drive getting stolen after vte has quit, and in that case there's only one file system block for each IV. I'm curious to hear your opinion on whether this is acceptable.

I also read that GCM is very bad if you reuse the IV; maybe another algorithm would be more desirable?

Thanks a lot!
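Re the erase-the-key-copy concern above: something like this could keep gcc from optimizing the wipe away (just a sketch, not part of the patch):

#include <glib.h>

/* Going through a volatile pointer is a common way to keep the compiler
 * from eliding the clearing of the local key copy. */
static void
wipe_memory (void *p, gsize len)
{
        volatile unsigned char *vp = p;
        while (len--)
                *vp++ = 0;
}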
Review of attachment 291834 [details] [review]:

::: src/vtestream-file.h
@@ +574,3 @@
 VteFileStream *stream = (VteFileStream *) astream;
+ g_assert (len > 0);

"cat /dev/urandom" often trips this assertion.
Adam: you might have missed my response to your review as you were not cc'd. I've added you; please take a look at that response.

In the meantime, I implemented compression in bug 738121. I'm planning to swap the order of the two patches so that compression is submitted first, and encryption is added as a subsequent patch on top of that. This is so that the most critical part of the code can get a review in its final design.

I'm quite concerned about the IV reuse:
- I can't see how we could reasonably avoid the need to occasionally truncate a bit of data at the head of the stream.
- If we skip an IV there, I can't see how we could maintain that information in memory.
- So far I've lived under the assumption that if we only write 4kB blocks and at some point decide to overwrite an existing one, the same file system block will be overwritten. With compression in the game, this gets more complicated: the written size becomes an arbitrary amount up to 64kB, which is stored, and then a filesystem gap is left up to the next 64kB boundary. The actual net size won't stay the same on an overwrite. I'm afraid various modern file systems might have places where they store small blocks and places where they store large ones. So I no longer trust that the data will necessarily be overwritten in place.

A totally different approach: can we simply store the IV unencrypted at the beginning of each block? Now, I really wouldn't want to leak side data (e.g. about the age of the vte widget, or the order of the blocks). Some leakage is probably inevitable, but let's try to keep it to a minimum. I'm thinking about storing an "overwrite_counter" in each block, initially 0, incremented each time that block is overwritten. (It's a rare event to overwrite a block, so I don't mind the additional cost of having to read it back first.) This value would be added to or xored with the rest of the IV (the offset and the reset counter). (At this point it'd be convenient to have 16 bytes for the IV, but we'll squeeze into 12 somehow.) The only leakage is that an attacker can see that somewhere an undo-redo (window resize) happened.

What do you guys think about it?
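To make this a bit more concrete, the IV could be laid out roughly like this (just a sketch; the field sizes and how we squeeze it into GCM's 12 bytes are not final):

#include <glib.h>

/* The overwrite_counter is stored in the clear at the start of each block
 * and mixed into the IV, so rewriting a block never reuses the exact same
 * IV even though offset and reset_counter stay the same. */
typedef struct {
        guint64 offset;             /* logical offset of the block */
        guint32 reset_counter;      /* bumped on a stream reset */
        guint32 overwrite_counter;  /* plaintext, per block, bumped on overwrite */
} ProposedIv;                       /* 16 bytes; needs squeezing into 12 */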
That might work. I wish it was simpler though. :)
Me too :) Let's keep thinking about it...
http://crypto.stackexchange.com/questions/14546/iv-nonce-in-ctrgcm-mode-of-operation says: "The other requirement for nonces specific to GCM is that it is not all 0 bits, as that is used for GHASH, and using it can possibly allow forgeries of all authentication tags."
(In reply to comment #82)
> GNUTLS_RND_KEY's behaviour isn't fully documented, but it might be that it does
> a blocking read from /dev/random. That's probably going to be a problem for
> some people—they won't want their terminal hanging like that.
>
> Opinions vary, but I tend to go with "just read /dev/urandom"; however that
> translates into GnuTLS calls.

strace shows me the following: when gnutls_global_init() is first called, it reads 32 bytes from /dev/urandom. It leaves the file open but doesn't read from it anymore; instead, it calls getrusage() whenever I ask for a new key. I hope it's clever enough to reach out to urandom again when appropriate, e.g. if rusage isn't random enough. Anyway, /dev/random is not read at all.
Created attachment 292261 [details] [review] Implement stream in one fd, rc2
Created attachment 292262 [details] [review]
Compression rc2

Compression on top of snake, without encryption at this point.

TODOs:
- Write unittest for the rare case when the data couldn't be compressed.
- Magic constants instead of hardcoded 4 for the bytes storing the length.
Created attachment 292263 [details] [review]
Encryption, rc2

Encryption, this time as the last patch on top of snake + compression.

TODOs:
- Stop reusing IV, as per comment 85.
- Magic constants instead of hardcoded 4 and 8 for the bytes storing the length.
Created attachment 292275 [details] [review]
Clever reset

With this patch, we never reset the logical offset within a stream. Even at a full reset, the logical offsets (insert_delta and such) remain at their previous values. This means we no longer need the reset_counter to maintain a unique IV, conveniently leaving room for the soon-to-be-implemented overwrite_counter.
The encrypt function must never be called with a duplicate nonce, even if you believe that the old data is overridden. #92 references #85 which suggests storing the nonce in the block. That's fine. You can have a per-key, strictly monotonic counter as the nonce and can write it at the start of the block. However, I don't see that the code in #92 actually does that. Also, it appears to be rounding to the AES block-size in _vte_boa_write, but that block size is irrelevant to using GCM.
(In reply to comment #94)
> The encrypt function must never be called with a duplicate nonce, even if you
> believe that the old data is overridden.

Purely out of curiosity: could you please explain why that would be problematic (if the old data was actually overwritten)?

> #92 references #85 which suggests storing the nonce in the block. That's fine.
> You can have a per-key, strictly monotonic counter as the nonce and can write
> it at the start of the block. However, I don't see that the code in #92
> actually does that.

Yup, it's still mentioned as a TODO :) I'll implement it in the next version. It's great to hear that the approach sounds good! (I indeed shouldn't have called that version "rc"2 because it was not a release candidate.)

> Also, it appears to be rounding to the AES block-size in
> _vte_boa_write, but that block size is irrelevant to using GCM.

Thanks for catching this. Should I figure out the GCM block size somehow and round up to that one? Or can I safely {en,de}crypt an arbitrary amount of data, even if that's not a multiple of the GCM block size?

Please expect an rc3 (a real release candidate) addressing all your concerns in a couple of days.
(In reply to comment #95)
> Purely out of curiosity: could you please explain why that would be problematic
> (if the old data was actually overwritten)?

Because I don't believe that the data would be overridden :) With bad-block remapping, log-structured file systems, SSD wear-leveling, journaling etc, there's just so many ways in which data doesn't get deleted.

> > Also, it appears to be rounding to the AES block-size in
> > _vte_boa_write, but that block size is irrelevant to using GCM.
>
> Thanks for catching this. Should I figure out the GCM block size somehow and
> round up to that one? Or can I safely {en,de}crypt an arbitrary amount of data,
> even if that's not a multiple of the GCM block size?

GCM builds a stream-cipher from a block-cipher. You don't need any padding.
(In reply to comment #96)
> Because I don't believe that the data would be overridden :) With bad-block
> remapping, log-structured file systems, SSD wear-leveling, journaling etc,
> there's just so many ways in which data doesn't get deleted.

Fair enough, thanks! I had the software mapping in mind (including the Linux kernel, to the extent I have a clue about what it's doing), but didn't think about the hardware side.

> GCM builds a stream-cipher from a block-cipher. You don't need any padding.

So can I safely ask it to encrypt 1 byte and it will encrypt it to 1 byte? Encrypt 1025 bytes to 1025 bytes? Okay then :) It simplifies the implementation, so I'm totally fine with it. It has the drawback that you can't switch to another chaining scheme just by altering the #define's, but I don't care too much about that.
(In reply to comment #97)
> So can I safely ask it to encrypt 1 byte and it will encrypt it to 1 byte?

You can ask to encrypt 1 byte and you'll get 1 byte back, plus 16 bytes of tag.
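For illustration, a minimal standalone snippet (not the vte code) that does exactly that with the gnutls_cipher_* API:

#include <gnutls/gnutls.h>
#include <gnutls/crypto.h>
#include <assert.h>

int
main (void)
{
        gnutls_cipher_hd_t hd;
        unsigned char key[32], iv[12] = { 1 };   /* must be unique per message, never all-zero */
        unsigned char plaintext = 'x', ciphertext, tag[16];
        gnutls_datum_t dkey = { key, sizeof (key) };

        gnutls_global_init ();
        gnutls_rnd (GNUTLS_RND_KEY, key, sizeof (key));

        assert (gnutls_cipher_init (&hd, GNUTLS_CIPHER_AES_256_GCM, &dkey, NULL) == 0);
        gnutls_cipher_set_iv (hd, iv, sizeof (iv));

        /* 1 byte in, 1 byte out; the 16-byte tag is retrieved separately. */
        assert (gnutls_cipher_encrypt2 (hd, &plaintext, 1, &ciphertext, 1) == 0);
        gnutls_cipher_tag (hd, tag, sizeof (tag));

        gnutls_cipher_deinit (hd);
        gnutls_global_deinit ();
        return 0;
}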
Created attachment 292565 [details] [review] Clever reset v2 (git merge, and a relevant bugfix: initialize cursor.row to the correct value)
rc3 is almost ready with all the aforementioned requirements; I will post it soon, but I found one more thing to address. (See bug 741480, which is relevant.)

Each stream uses 3*64kB right now (write batching buffer, read cache, tmp buffer for compression), which is 3*3*64kB ≈ half a meg per ring; with both rings that's more than a MB per vte. The other bug will drop it for the alternate screen, still leaving us half a meg.

We have the ensure_file thing now so that opening the fd is deferred until we actually need it. Creating these buffers, getting an encryption key etc. should also be deferred until the first write to boa/snake or to the stream, as relevant.

Also I'm wondering if the 64kB tmp buffer could be shared among the 3 streams of a vte (it's safe, but the design would be very ugly), or even made one single global buffer for vte (we'd not be thread safe then, but we're not that now either: bug 730407). Or could it reside on the stack somehow? Or is it safe, fast and free of memory fragmentation if we malloc and then free it in the same method? Probably the last one sounds the best, unless it imposes a noticeable performance regression.

Hopefully we'd end up only requiring 3*2*64kB = 384kB for each vte's read/write caching buffers, and deferring those until the terminal contents actually begin to scroll out.
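The deferral itself could be as trivial as something like this (a sketch with made-up names, not the patch):

#include <glib.h>

#define STREAM_BUFFER_SIZE 65536

typedef struct {
        char *wbuf;   /* 64kB write-batching buffer, NULL until first needed */
} LazyStream;

/* Allocate the buffer only when the first write actually happens,
 * mirroring what ensure_file() already does for the fd. */
static void
lazy_stream_ensure_wbuf (LazyStream *stream)
{
        if (G_UNLIKELY (stream->wbuf == NULL))
                stream->wbuf = g_malloc (STREAM_BUFFER_SIZE);
}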
Created attachment 292676 [details] [review] Implement stream in one fd, rc3
Created attachment 292677 [details] [review] Compression, rc3
Created attachment 292678 [details] [review] Clever reset, rc3
Created attachment 292679 [details] [review] Encryption, rc3
Please do a sanity check of one-fd + compression ASAP; I'd like to commit them (but not the encryption yet) to this Monday's release so they get broader testing in Rawhide. Adam, could you please look at the encryption bits? I'm hoping for this one to be the final version, making it into the following release in about a month.
I can't do any meaningful testing of this between now and Monday. It might be good to have a way to disable compression and/or encryption via command-line arguments for testing, though long term that's not a good idea. Another alternative would be to land only the snake for one release, then the compression, then the encryption, so that they get tested gradually and not two at a time. Your call. Thanks for working on this!
Created attachment 292709 [details] [review] Clever reset, rc3b (just a simple git merge)
Review of attachment 292677 [details] [review]:

::: src/vtestream-file.h
@@ +604,3 @@
+#ifndef VTESTREAM_MAIN
+ uLongf dstlen_ulongf = dstlen;
+ g_assert_cmpuint (compress2 ((Bytef *) dst, &dstlen_ulongf, (const Bytef *) src, srclen, 1), ==, Z_OK);

Just realized this is a terrible pattern: it breaks the code if compiled with -DG_DISABLE_ASSERT. Will fix. I also happened to find this: http://osdir.com/ml/gtk-list/2010-01/msg00121.html :D
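The fix could be something along these lines (a sketch with a made-up wrapper name): do the compression unconditionally and only assert on the return value.

#include <glib.h>
#include <zlib.h>

/* Compression happens even under -DG_DISABLE_ASSERT; only the check
 * on the return value is compiled out. */
static gsize
do_compress (char *dst, gsize dstlen, const char *src, gsize srclen)
{
        uLongf dstlen_ulongf = dstlen;
        int ret = compress2 ((Bytef *) dst, &dstlen_ulongf,
                             (const Bytef *) src, srclen, 1);
        g_assert_cmpint (ret, ==, Z_OK);
        return dstlen_ulongf;
}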
Created attachment 293011 [details] [review]
Encryption, rc3b

Just a git merge, after fixing the previous issue. (FYI: bug 741732)
@Adam: friendly reminder, could you please take a (hopefully final) look at the encryption? I'd like to submit it on the 18th at the latest, to make it into the next development release. Thanks in advance!
No response from AGL, but I believe I have addressed all of his concerns. Submitted; marking as fixed. Please reopen or file a new bug in case of problems.