GNOME Bugzilla – Bug 751621
Please make the genered IDs deterministic
Last modified: 2021-07-05 11:01:09 UTC
Created attachment 306252 [details] [review]
generate-id() used to return identifiers based on the memory address of the node object. This unfortunately prevents documentation to be built reproducible. In the proposed patch we now increment a static counter and store its value in the node _private on the first invocation of generate-id().
This helps carry on the Debian Reproducible Builds project , where this patch got written and widely tested, and the the Fedora folks where they need to strip those IDs for multi-arch reasons .
 basically Fedora compiles all binary packages, including noarch ones; even if all the copies but one are discarded, they are compared, so they must be the same. To do so they pass the output of whatever xslt based through sed to directly drop that ID => ugly.
Obvious problem, you leak, all those blocks of memory allocated
are never freed. If you allocate for __private, you must keep track
of it and free, because libxml2 can't as the object was allocated
__private is also used for other things like keys in libxslt,
so if you use a key on a node where you generate an id you very likely
break the key or break the uniqueness of the id.
So no that doesn't work there is at least 2 major issues with your
Also a static global variable won't do what you want, it may work
for xsltproc called with a single source but if you run multiple
transformations from the same process it's all random again.
So it's forbidden to use a random generator to generate anything
in a build ? Also timestamps for the time fo the document generation
would have to be banned ...
That sounds like a bug introducing attempt to fix a broken rule,
not good at all,
Review of attachment 306252 [details] [review]:
Seriously broken for reasons exposed
With my fedora packager hat on, we too have often hit the problem of multilib conflicts in the case of non-deterministic output documented here.
- generate-id() must be unique across all nodes ever used by
the XSLT transformation
- it must be really fast, people use it to test node equality
in various XSLT libraries
the placeholder __private is already used by something else (the keys)
there is no requirement in XSLT of the id being deterministic, that's
something you guys are adding on top of the spec, the only requirements
A lookaside index would be too slow, testing node equality would mean get
from pointer A to index a and then from pointer B to index, if the document
grow large that really not practical.
Plus since the docs are generated by upstream makefiles you just can't
add an argument to xsltproc like --slow-id which would change the behaviour
when building the docs for you.
I don't see a simple way forward, the problem is that you are asking
for something that is not part of the standard. I assume you have the
same "non-reproductible" issue for everything generated on object code,
tarballs, etc ... where timestamps are obviously embedded.
Just to enlighten you, yeah, there are quite some stuff embedding timestamps. We're just rooting them out.
Take a look at https://reproducible.debian.net/index_issues.html that's a list of issues we have and we are trying to fix.
We manage to remove quite some timestamps.
Tarballs does not embed timestamps on their own, rather the gzip does unless -n is used; object code does not either.
In Debian we're currently continuously double-building our entire archive in different conditions https://reproducible.debian.net/reproducible.html#variation , and out of >22k package in the debian archive more than 80% are reproducible.
Clearly we have our own toolchain that did not make its way to the archive yet, usually because their maintainers are not ok with the implementation (like this case) or we're still debating which package should fix what.
Btw, i/we'll try to come up with a better patch also following your suggestions/observation/complains/.... :)
Thanks for your quick review!
Created attachment 306429 [details] [review]
Implements deterministic generate-id with a lookaside index
Here's another patch which implements deterministic generate-id by
maintaining a lookaside index. It's only enabled if --deterministic
is passed to xsltproc, though we would prefer this to be the default,
to avoid having to patch lots of upstream makefiles.
To try to determine the overhead, I ran some tests on an XML document
with 1.6 million elements. The first test calls generate-id on
the root node 1,000 times. There is no time or memory overhead when
--deterministic is enabled. The second test calls generate-id()
on every single element. Runtime increases by 538%, and peak memory
consumption increases by 24% with --deterministic enabled. The third
test is the same as the second test but repeated 5 times (without
reparsing the document). Runtime overhead is 740%, and memory overhead
So there is real overhead with this method, but these are *extremely*
pathological test cases. Do you know of any XSLT benchmarks that try
to mimic real world use cases? I'm definitely concerned about how often
generate-id(foo) = generate-id(bar) is used in practice to compare node
equality. As one data point, I looked through all of my XSLT stylesheets,
and asked a colleague who uses XSLT extensively to do the same, and
neither of us use this idiom much (consequentially, there was no
noticeable overhead from --deterministic). Keep in mind that what really
matters is how many distinct nodes generate-id is called on - a huge
document won't see any overhead if generate-id is only called on a small
number of nodes.
Created attachment 306475 [details] [review]
A patch with no performance loss
A patch with no performance degradation but assuming the context node used
for the ID generation is an element.
Thanks Andrew for the patch, see above for a different one.
It relies on the fact that ID generated by docbook stylesheets and
similar compute the id in the context of an element node input,
and on those nodes I can save a numbering, actually the input elements
already have a number here to speed up XPath queries.
Could you test that patch and confirm that at least for the majority
of the stylesheet employed you now get a stable ID in the output.
I have no speed degradation for the 3 stylesheets you provide on that
input, the first one calling generate-id() in the context of the document
node which is not an element still get you the address based output ids
but the other ones which actually go through the element set do generate the
new stable ids.
Can you double check and give feedback ? Rex, your input would be
Daniel, thanks for this patch!
We are going to test all Debian packages which were reproducible with the first patch sent by Mattia with your new patch. We'll see how it goes. :)
We have been running builds of Debian packages for the past months using this patch. We have not seen any breakage and it seems to work in making packages reproducible. :)
Okay good, thanks for the feedback. I need to roll a new release, hopefully
this month !
Did this patch ever make it into a libxslt release? I can't seem to find the changes in libxslt git.
I see Debian is still patching this into 1.1.29 locally.
Here are the commits in git:
Apparently it didn't make it in 1.1.29 despite being before the tag in git(?), or something weird happened when YunQiang Su imported the 1.1.29 in Debian, I haven't actually checked myself.
I suppose I could try importing 1.1.31 and see for myself.
Oh, you are right, https://bugzilla.gnome.org/attachment.cgi?id=306475&action=diff was never committed apparently, I was confused by a related bug.
Reopening the bug, then :)
(In reply to Daniel Veillard from comment #7)
> Created attachment 306475 [details] [review] [review]
> A patch with no performance loss
> A patch with no performance degradation but assuming the context node used
> for the ID generation is an element.
Hm, I think there's a problem with this patch. With patched libxslt,
when referring to nodes using document() the generated IDs appear to
sometimes be the same even for nodes in different documents. I believe
this is at odds with the XSLT specification.
Without the patch the IDs are different (but random), as expected.
Example (call this file test.xsl):
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*" />
<xsl:output indent="yes" />
<xsl:value-of select='generate-id(document("a.xml")/*)' />
<xsl:value-of select='generate-id(document("b.xml")/*)' />
Put anything you want in a.xml / b.xml (these files need at least one element)
% xsltproc test.xsl test.xsl
(note that the generated IDs are different, as expected). With the patch:
% xsltproc-patched test.xsl test.xsl
This problem can be observed on Debian systems because their libxslt package has this patch applied :(
(In reply to Nick Bowler from comment #15)
> Hm, I think there's a problem with this patch. With patched libxslt,
> when referring to nodes using document() the generated IDs appear to
> sometimes be the same even for nodes in different documents. I believe
> this is at odds with the XSLT specification.
Mh, I not sure, why would anyone expect IDs are different between different documents?
I don't recall any word about that, and I consider such idea totally unrealistic.
> Without the patch the IDs are different (but random), as expected.
Indeed they are random, which is what this very bug is about, we don't want them to be random. Besides, if they are random you have no assurances that they won't be different between documents, which defeats your point.
(In reply to Mattia Rizzolo from comment #16)
> (In reply to Nick Bowler from comment #15)
> > Hm, I think there's a problem with this patch. With patched libxslt,
> > when referring to nodes using document() the generated IDs appear to
> > sometimes be the same even for nodes in different documents. I believe
> > this is at odds with the XSLT specification.
> Mh, I not sure, why would anyone expect IDs are different between different
Because the spec says they are?
12.4 Miscellaneous additional functions
Function string generate-id(node-set?)
"... different identifiers are always generated from different nodes"
The nodes resulting from the document() function on different documents are clearly different nodes. Yet with the patch we have two different nodes for which generate-id creates the same identifier.
> I don't recall any word about that, and I consider such idea totally
But the unpatched (upstream) libxslt works correctly, so correct behaviour seems perfectly realistic.
Can someone clarify the latest status of this issue? AIUI Debian has applied a patch:
… which has problems?
Moving forward from here, shall we move over to Gitlab? :)
I just spent some time debugging this, and yes, the Debian patch has problems.
The patch relies on the assumption that the xmlNode struct's content field is unused, and normally NULL, for element nodes. The patch (ab)uses this field to store a deterministic ID, using the following logic:
- If content is NULL, generate a deterministic ID and store it in content.
- Otherwise, use the deterministic ID already stored in content.
Unfortunately, the assumption is no longer true. The content field is never NULL, and consequentially the patch is using whatever value is in content as the ID, instead of generating its own. This value is not unique across different documents, and for all we know might not even be unique within a single document. I haven't been able to figure out what the value means or who is setting it.
I also noticed a couple of other minor problems.
First, the patch prints a debug message to stderr. That's not very nice for a library to do.
Second, there is no distinction between IDs generated deterministically and IDs that are generated using pointer arithmetic. Consequentially, there's a small chance of a collision. The deterministic IDs should be prefixed with a different string, such as "idd", to distinguish them from pointer arithmetic IDs.
This patch is actually pretty busted and could be breaking people's stylesheets in hard-to-detect ways. It should be backed out of Debian in a stable point release until it can be fixed. I'll open a Debian bug about this.
> I'll open a Debian bug about this.
(This now exists as https://bugs.debian.org/902051 - thanks!)
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).
If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
and create a new ticket at
Thank you for your understanding and your help.