GNOME Bugzilla – Bug 797086
matroskamux: don't store used UIDs
Last modified: 2018-09-06 18:17:14 UTC
Created attachment 373551 [details] [review] 0001-matroskamux-don-t-store-used-UIDs.patch Currently, whenever we generate a 128-bit UID, we store it in a list and return 0 if we ever encounter a collision. This is so mathematically improbable that it's not worth checking for, so we can save memory and time by not tracking the UID. Even if a collision happened, a list of only 10 UIDs would be unlikely to detect it. This article has a good description of how improbable a collision is: https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
Before this can be merged, the UID generator should be updated to include a timestamp combined with the random data. This is best practices for UID generator and it really makes the collision only possible after a very long amount of time.
Could you clarify what you mean by this? Do you mean, for instance, that the first N bits of the UID would be a timestamp? Are you concerned about a buggy random number generator creating a collision? Absent a random number generator bug, I'm not sure how collisions could happen at all... to reach a 50% collision probability, you would need to generate "1 billion UIDs per second for about 85 years" (https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions). The only practical way I think we should see collisions is if the random number generator is cyclical or buggy. Adding a timestamp could theoretically help in such a case, but if we are speculating, the clock could be buggy too.
If we add a timestamp, couldn't we arguably be in violation of the matroska spec? The wording I see for "SegmentUID" is: A randomly generated unique ID to identify the current Segment between many others (128 bits). If we add a timestamp, this is no longer fully random. https://matroska.org/technical/specs/index.html
Timestamp + random number will give you a random number. For the rest, read: https://tools.ietf.org/html/rfc4122 https://en.wikipedia.org/wiki/Universally_unique_identifier It's not a official standard, just a suggested one, but the one being implemented by the Linux kernel and many other libraries.
Could you clarify which UUID version you're proposing? Version 1 uses date-time + MAC address while version 4 is "truly random". I don't see a version that uses just a time and a random number. As far as I can tell, the Linux kernel just uses version 4 ("truly random"): https://github.com/torvalds/linux/blob/master/lib/uuid.c#L41
Apparently things have change in the kernel, I bet the by truly random, they use a RNG that has been validated. Anyway, feel free to go ahead, I don't disagree, but I don't have a strong trust on glib RNG.
I agree that checking the GLib implementation to see if we should trust it is a good idea. I did a bit of code reading as this is quite an interesting subject. The short answer is that every library I can find generating random numbers on Linux relies on /dev/urandom. This includes glib, libossp-uuid (which backs the uuid -v4 command), the uuid function in the Linux kernel (which uses get_random_bytes(), exported by /dev/urandom), and even an strace of openssh (I didn't check the code in that case). In any case, it looks like all our random number generation will be as good/bad as /dev/urandom. That said, I'm not a crytographer, but it looks like /dev/urandom should be good enough for practical purposes: https://www.2uo.de/myths-about-urandom. On Windows, glib uses rand_s, which the documentation says is "cryptographically secure" (https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rand-s). If rand_s doesn't exist (Windows XP!), it uses the time as a seed and then applies a Mersenne Twister algorithm to generate random numbers from that. I think we should be OK in preventing collisions.
GLib have it's own RNG, which is documented not recommended for crypto. It's seeded from urandom on Linux, rand_s on Win32 and wallclock time otherwise. But this is not crypto, we just want to avoid collision, we don't care if one can predict the values. That UID cache was initially global, so it's not the first time it becomes a bottleneck. I think the explanation for the change are fair enough, and will improve performance over long period of time. https://gitlab.gnome.org/GNOME/glib/blob/master/glib/grand.c#L62
Pushed as be05515da797115739bf352d6efdbc34eda511bb
Thanks, this will appear in the next release of GStreamer.