GNOME Bugzilla – Bug 751621
Please make the genered IDs deterministic
Last modified: 2021-07-05 11:01:09 UTC
Created attachment 306252 [details] [review] Make-generate-id-deterministic.patch generate-id() used to return identifiers based on the memory address of the node object. This unfortunately prevents documentation to be built reproducible. In the proposed patch we now increment a static counter and store its value in the node _private on the first invocation of generate-id(). This helps carry on the Debian Reproducible Builds project [0], where this patch got written and widely tested, and the the Fedora folks where they need to strip those IDs for multi-arch reasons [1]. [0] https://wiki.debian.org/ReproducibleBuilds [1] basically Fedora compiles all binary packages, including noarch ones; even if all the copies but one are discarded, they are compared, so they must be the same. To do so they pass the output of whatever xslt based through sed to directly drop that ID => ugly.
Obvious problem, you leak, all those blocks of memory allocated are never freed. If you allocate for __private, you must keep track of it and free, because libxml2 can't as the object was allocated outside libxml2. __private is also used for other things like keys in libxslt, so if you use a key on a node where you generate an id you very likely break the key or break the uniqueness of the id. So no that doesn't work there is at least 2 major issues with your patch. Also a static global variable won't do what you want, it may work for xsltproc called with a single source but if you run multiple transformations from the same process it's all random again. So it's forbidden to use a random generator to generate anything in a build ? Also timestamps for the time fo the document generation would have to be banned ... That sounds like a bug introducing attempt to fix a broken rule, not good at all, Daniel
Review of attachment 306252 [details] [review]: Seriously broken for reasons exposed
With my fedora packager hat on, we too have often hit the problem of multilib conflicts in the case of non-deterministic output documented here.
The problem: - generate-id() must be unique across all nodes ever used by the XSLT transformation - it must be really fast, people use it to test node equality in various XSLT libraries the placeholder __private is already used by something else (the keys) there is no requirement in XSLT of the id being deterministic, that's something you guys are adding on top of the spec, the only requirements is uniqueness. A lookaside index would be too slow, testing node equality would mean get from pointer A to index a and then from pointer B to index, if the document grow large that really not practical. Plus since the docs are generated by upstream makefiles you just can't add an argument to xsltproc like --slow-id which would change the behaviour when building the docs for you. I don't see a simple way forward, the problem is that you are asking for something that is not part of the standard. I assume you have the same "non-reproductible" issue for everything generated on object code, tarballs, etc ... where timestamps are obviously embedded. Daniel
Just to enlighten you, yeah, there are quite some stuff embedding timestamps. We're just rooting them out. Take a look at https://reproducible.debian.net/index_issues.html that's a list of issues we have and we are trying to fix. We manage to remove quite some timestamps. Tarballs does not embed timestamps on their own, rather the gzip does unless -n is used; object code does not either. In Debian we're currently continuously double-building our entire archive in different conditions https://reproducible.debian.net/reproducible.html#variation , and out of >22k package in the debian archive more than 80% are reproducible. Clearly we have our own toolchain that did not make its way to the archive yet, usually because their maintainers are not ok with the implementation (like this case) or we're still debating which package should fix what. Btw, i/we'll try to come up with a better patch also following your suggestions/observation/complains/.... :) Thanks for your quick review! [0] https://reproducible.debian.net/reproducible.html#variation
Created attachment 306429 [details] [review] Implements deterministic generate-id with a lookaside index Here's another patch which implements deterministic generate-id by maintaining a lookaside index. It's only enabled if --deterministic is passed to xsltproc, though we would prefer this to be the default, to avoid having to patch lots of upstream makefiles. To try to determine the overhead, I ran some tests on an XML document with 1.6 million elements[1]. The first test[2] calls generate-id on the root node 1,000 times. There is no time or memory overhead when --deterministic is enabled. The second test[3] calls generate-id() on every single element. Runtime increases by 538%, and peak memory consumption increases by 24% with --deterministic enabled. The third test[4] is the same as the second test but repeated 5 times (without reparsing the document). Runtime overhead is 740%, and memory overhead is 14%. So there is real overhead with this method, but these are *extremely* pathological test cases. Do you know of any XSLT benchmarks that try to mimic real world use cases? I'm definitely concerned about how often generate-id(foo) = generate-id(bar) is used in practice to compare node equality. As one data point, I looked through all of my XSLT stylesheets, and asked a colleague who uses XSLT extensively to do the same, and neither of us use this idiom much (consequentially, there was no noticeable overhead from --deterministic). Keep in mind that what really matters is how many distinct nodes generate-id is called on - a huge document won't see any overhead if generate-id is only called on a small number of nodes. Andrew [1] https://www.cloudmutt.com/s/bigxml.xml.gz [2] http://paste.debian.net/270667 [3] http://paste.debian.net/270668 [4] http://paste.debian.net/270669
Created attachment 306475 [details] [review] A patch with no performance loss A patch with no performance degradation but assuming the context node used for the ID generation is an element.
Thanks Andrew for the patch, see above for a different one. It relies on the fact that ID generated by docbook stylesheets and similar compute the id in the context of an element node input, and on those nodes I can save a numbering, actually the input elements already have a number here to speed up XPath queries. Could you test that patch and confirm that at least for the majority of the stylesheet employed you now get a stable ID in the output. I have no speed degradation for the 3 stylesheets you provide on that input, the first one calling generate-id() in the context of the document node which is not an element still get you the address based output ids but the other ones which actually go through the element set do generate the new stable ids. Can you double check and give feedback ? Rex, your input would be appreciated too, thanks, Daniel
Daniel, thanks for this patch! We are going to test all Debian packages which were reproducible with the first patch sent by Mattia with your new patch. We'll see how it goes. :)
We have been running builds of Debian packages for the past months using this patch. We have not seen any breakage and it seems to work in making packages reproducible. :)
Okay good, thanks for the feedback. I need to roll a new release, hopefully this month ! Daniel
Did this patch ever make it into a libxslt release? I can't seem to find the changes in libxslt git. I see Debian is still patching this into 1.1.29 locally.
Here are the commits in git: https://git.gnome.org/browse/libxslt/commit/?id=e57df303eca25a2a3f9e0625c29f4b20177858cc https://git.gnome.org/browse/libxslt/commit/?id=8f9303f3b2a73523767b158f9be8dbf2a89ba3a2 Apparently it didn't make it in 1.1.29 despite being before the tag in git(?), or something weird happened when YunQiang Su imported the 1.1.29 in Debian, I haven't actually checked myself. I suppose I could try importing 1.1.31 and see for myself.
Oh, you are right, https://bugzilla.gnome.org/attachment.cgi?id=306475&action=diff was never committed apparently, I was confused by a related bug. Reopening the bug, then :)
(In reply to Daniel Veillard from comment #7) > Created attachment 306475 [details] [review] [review] > A patch with no performance loss > > A patch with no performance degradation but assuming the context node used > for the ID generation is an element. Hm, I think there's a problem with this patch. With patched libxslt, when referring to nodes using document() the generated IDs appear to sometimes be the same even for nodes in different documents. I believe this is at odds with the XSLT specification. Without the patch the IDs are different (but random), as expected. Example (call this file test.xsl): <?xml version="1.0" encoding="UTF-8" ?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:strip-space elements="*" /> <xsl:output indent="yes" /> <xsl:template match="/"> <ids> <a> <xsl:value-of select='generate-id(document("a.xml")/*)' /> </a> <b> <xsl:value-of select='generate-id(document("b.xml")/*)' /> </b> </ids> </xsl:template> </xsl:stylesheet> Put anything you want in a.xml / b.xml (these files need at least one element) Unpatched libxslt-v1.1.32: % xsltproc test.xsl test.xsl <?xml version="1.0"?> <ids> <a>idm45891049150288</a> <b>idm45891049100304</b> </ids> (note that the generated IDs are different, as expected). With the patch: % xsltproc-patched test.xsl test.xsl <?xml version="1.0"?> <ids> <a>idm1</a> <b>idm1</b> </ids> This problem can be observed on Debian systems because their libxslt package has this patch applied :(
(In reply to Nick Bowler from comment #15) > Hm, I think there's a problem with this patch. With patched libxslt, > when referring to nodes using document() the generated IDs appear to > sometimes be the same even for nodes in different documents. I believe > this is at odds with the XSLT specification. Mh, I not sure, why would anyone expect IDs are different between different documents? I don't recall any word about that, and I consider such idea totally unrealistic. > Without the patch the IDs are different (but random), as expected. Indeed they are random, which is what this very bug is about, we don't want them to be random. Besides, if they are random you have no assurances that they won't be different between documents, which defeats your point.
(In reply to Mattia Rizzolo from comment #16) > (In reply to Nick Bowler from comment #15) > > Hm, I think there's a problem with this patch. With patched libxslt, > > when referring to nodes using document() the generated IDs appear to > > sometimes be the same even for nodes in different documents. I believe > > this is at odds with the XSLT specification. > > Mh, I not sure, why would anyone expect IDs are different between different > documents? Because the spec says they are? 12.4 Miscellaneous additional functions Function string generate-id(node-set?) "... different identifiers are always generated from different nodes" The nodes resulting from the document() function on different documents are clearly different nodes. Yet with the patch we have two different nodes for which generate-id creates the same identifier. > I don't recall any word about that, and I consider such idea totally > unrealistic. But the unpatched (upstream) libxslt works correctly, so correct behaviour seems perfectly realistic.
Hi all, Can someone clarify the latest status of this issue? AIUI Debian has applied a patch: https://sources.debian.org/src/libxslt/1.1.32-2/debian/patches/0004-Make-generate-id-deterministic.patch/ … which has problems? Moving forward from here, shall we move over to Gitlab? :) —lamby
I just spent some time debugging this, and yes, the Debian patch has problems. The patch relies on the assumption that the xmlNode struct's content field is unused, and normally NULL, for element nodes. The patch (ab)uses this field to store a deterministic ID, using the following logic: - If content is NULL, generate a deterministic ID and store it in content. - Otherwise, use the deterministic ID already stored in content. Unfortunately, the assumption is no longer true. The content field is never NULL, and consequentially the patch is using whatever value is in content as the ID, instead of generating its own. This value is not unique across different documents, and for all we know might not even be unique within a single document. I haven't been able to figure out what the value means or who is setting it. I also noticed a couple of other minor problems. First, the patch prints a debug message to stderr. That's not very nice for a library to do. Second, there is no distinction between IDs generated deterministically and IDs that are generated using pointer arithmetic. Consequentially, there's a small chance of a collision. The deterministic IDs should be prefixed with a different string, such as "idd", to distinguish them from pointer arithmetic IDs. This patch is actually pretty busted and could be breaking people's stylesheets in hard-to-detect ways. It should be backed out of Debian in a stable point release until it can be fixed. I'll open a Debian bug about this. Andrew
> I'll open a Debian bug about this. (This now exists as https://bugs.debian.org/902051 - thanks!)
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxslt/-/issues/ Thank you for your understanding and your help.