After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 673012 - Stable byte-level specification for normal form
Stable byte-level specification for normal form
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gvariant
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Allison Karlitskaya (desrt)
gtkdev
Depends on:
Blocks:
 
 
Reported: 2012-03-28 18:03 UTC by Colin Walters
Modified: 2012-08-27 21:03 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gvariant: Add tests for exact SHA256 checksums (4.20 KB, patch)
2012-03-28 18:04 UTC, Colin Walters
none Details | Review
gvariant: Add tests for exact SHA256 checksums (3.78 KB, patch)
2012-08-27 20:22 UTC, Colin Walters
committed Details | Review

Description Colin Walters 2012-03-28 18:03:42 UTC
Background: I'm implementing a user-level filesystem, similar to git.  In this project I'm heavily using GVariant where traditional kernel-like code would use C structures.

Also like git, this filesystem uses cryptographic checksums of content.  In order for this to work, the guarantee I need is that for a given GVariant I create with specific values, the exact same byte-level representation must result.

If we agree this should be possible, then it means that any future changes to the GVariant serialization must be opt-in.  Something like:

g_variant_new_with_flags (G_VARIANT_OPT_HASH_TABLE,
                          "a{sv}", ...)
Comment 1 Colin Walters 2012-03-28 18:04:46 UTC
Created attachment 210798 [details] [review]
gvariant: Add tests for exact SHA256 checksums

I'd like to use GVariant as a data format in my userspace filesystem,
and having the actual bits be stable means I can reliably compute
cryptographic checksums.
Comment 2 Colin Walters 2012-03-28 18:08:01 UTC
Note if we agree this makes sense, we should probably document the byte-level format too.
Comment 3 Allison Karlitskaya (desrt) 2012-03-28 18:31:48 UTC
From the very beginning, it has never been the case that a particular value, as represented in GVariant has had only one valid serialisation.  One of the primary conclusions reached during the design phase ensures that this cannot be the case.  The criteria that GVariant accessors never return errors leads to the fact that we have to deal with any possible input in a graceful way.  That means that many possible byte sequences can result in the same value (as returned by the API).

It's been a plan of mine for a while to use this freedom to introduce hashing to dictionary types.  I intentionally left room in the encoding of arrays to introduce this in a backwards-compatible way (although not forwards-compatible, admittedly).

Of course, there is a piece of sanity introduced in GVariant: the concept of "normal form".  For any given value there is only one normal form.  This is what you get from g_variant_get_normal_form().  This is also guaranteed to be the one that you get when you use the GVariant API to construct values "from scratch".

Clearly you'd want the guarantee about byte values to only apply in the case of normal forms.  So in terms of that, what would change in the future is that the normal form for dictionaries would become different.

What would never change, however, is GVariant's ability to interpret values that were previously stored using GVariant.  If you encode a value as a GVariant dictionary then take the serialised form then it will have a particular value.  If you take the same byte sequence and load it back into any future version of GVariant, it will continue to work.  It will obviously also checksum to the same value -- so your pack format will remain valid for all future versions.

All that would change is that re-encoding the same dictionary and storing it again may result in a different value being produced.  That's not really that bad anyway when you consider that it's already the case that this dictionary:

 { 'a': 1, 'b': 2 }

is already encoded differently than this one:

 { 'b': 2, 'a': 1 }

even though most people would argue that they are equal.
Comment 4 Colin Walters 2012-03-28 19:18:52 UTC
(In reply to comment #3)
> From the very beginning, it has never been the case that a particular value, as
> represented in GVariant has had only one valid serialisation. 

Ok, right; so let's be clear this discussion applies to the "normal form".

> Clearly you'd want the guarantee about byte values to only apply in the case of
> normal forms.  So in terms of that, what would change in the future is that the
> normal form for dictionaries would become different.

What I actually need is the full cycle of:

GVariant normal form <-> in-memory unpacked

If there is a new normal form introduced (let's call the current one "normal form version 0" and the new one "normal form version 1"), then what we have is:

normal form 0 <-> in-memory unpacked <-> normal form 1

And while all of those conversions are possible without data loss, the byte level representation of version 0 and 1 will be different, and hence they'll have different checksums.

Concretely, let's say I have a GVariant a{sv} that basically represents bits of "struct stat" with keys like uid, gid, rdev, mode.

Now let's say I want my "fsck" command to verify integrity.  The way I do this is to compute the checksum over the GVariant.  If the old in-database gvariant is in normal form 0, but my process is running a newer glib which generates a{sv} in normal form 1, then they'll have different checksums.

And that's precisely the issue.  Now, we could say that I should be checksumming "manually" by unpacking the variant and going over the individual members in a well-defined order.  And I might be willing to accept that, but the reason I filed this bug is because it would be *really convenient* for me to know my program only generates "normal form version 0", forever.

>  { 'a': 1, 'b': 2 }
> 
> is already encoded differently than this one:
> 
>  { 'b': 2, 'a': 1 }
> 
> even though most people would argue that they are equal.

They're only dictionaries by convention - the fact that they're actually arrays is precisely what I want.

(However, we could use more convenience utilities for storing vardicts that e.g. include associated perfect hash tables).
Comment 5 Allison Karlitskaya (desrt) 2012-03-29 00:34:05 UTC
(In reply to comment #4)
> What I actually need is the full cycle of:
> 
> GVariant normal form <-> in-memory unpacked
 
> If there is a new normal form introduced (let's call the current one "normal
> form version 0" and the new one "normal form version 1"), then what we have is:
> 
> normal form 0 <-> in-memory unpacked <-> normal form 1
> 
> And while all of those conversions are possible without data loss, the byte
> level representation of version 0 and 1 will be different, and hence they'll
> have different checksums.

The "in-memory unpacked" form *is* the serialised form.  There is no unpacking to speak of.

If you load a GVariant from data then save it back again (in "form 0") and save it back again, it will still be in "form 0".  In this case, "form 0" can be anything -- the existing GVariant format, the new one, some corrupt bytes -- anything.  GVariant is very little more than a buffer.

If you wanted "form 1" from "form 0" you'd either need to call g_variant_get_normal_form() or manually unpack/repack all the data (ie: iterate over the array, pulling each child out and repack it into a new GVariant array).

> Now let's say I want my "fsck" command to verify integrity.  The way I do this
> is to compute the checksum over the GVariant.  If the old in-database gvariant
> is in normal form 0, but my process is running a newer glib which generates
> a{sv} in normal form 1, then they'll have different checksums.

Only if you renormalise, as above.
Comment 6 Colin Walters 2012-03-29 14:47:10 UTC
(In reply to comment #5)
>
> The "in-memory unpacked" form *is* the serialised form.  There is no unpacking
> to speak of.

Sorry, what I meant by "in-memory unpacked" is:

guint32 uid;
guint32 gid;
...

g_variant_get (variant, "(uu)", &uid, &gid);
/* Data is now "unpacked" */
newvariant = g_variant_new ("(uu)", uid, gid);

I could have phrased this much more clearly, admittedly.
 
> If you load a GVariant from data then save it back again (in "form 0") and save
> it back again, it will still be in "form 0".  

Yes, but I'm defining "load" here as actually extracting the data.  The reason this is important is because I have the data in a non-GVariant form (e.g. "struct stat", or the contents of a file), and I need to be sure that going back and forth between those results in the same byte-level variants.

> If you wanted "form 1" from "form 0" you'd either need to call
> g_variant_get_normal_form() or manually unpack/repack all the data (ie: iterate
> over the array, pulling each child out and repack it into a new GVariant
> array).

Yes, that - or simply getting the data from a non-GVariant source.  That's the core concern.

The patch I attached here I think illustrates the effect I need, which is that for a given set of inputs to g_variant_new(), the resulting checksum is stable.

Right?  Does this all make sense?
Comment 7 Allison Karlitskaya (desrt) 2012-03-29 17:36:15 UTC
It makes sense, but I'd have to say that the answer is "no", at least for dictionaries.

I don't anticipate changes in any of the other types and I'd probably be willing to commit to that.
Comment 8 Colin Walters 2012-03-30 17:03:57 UTC
(In reply to comment #7)
> It makes sense, but I'd have to say that the answer is "no", at least for
> dictionaries.
> 
> I don't anticipate changes in any of the other types and I'd probably be
> willing to commit to that.

Interesting, that's a middle ground answer I hadn't expected.  Hmm.  I think I could get away without using dictionaries.

As far as improving dictionaries...it's quite possible to use GVariant for very efficient lookup tables as it is today, but it requires some manual coding.  For example, one can have attached perfect hashing (e.g. http://cmph.sourceforge.net/ ) tables.  So a serialized a{sv} would then be (aya(sv)) where the first ay is a lookup table which gives you an offset into the second array.

This is obviously only worthwhile once one approaches a certain data size.
Comment 9 Colin Walters 2012-08-27 20:22:40 UTC
Created attachment 222577 [details] [review]
gvariant: Add tests for exact SHA256 checksums

This update drops the vardict checksums, per discussion.
Comment 10 Allison Karlitskaya (desrt) 2012-08-27 20:35:45 UTC
Review of attachment 222577 [details] [review]:

This is just about the most ridiculous testcase that I've ever seen, but if it will close this bug then please commit it.
Comment 11 Colin Walters 2012-08-27 21:03:30 UTC
Attachment 222577 [details] pushed as d54e106 - gvariant: Add tests for exact SHA256 checksums