Bug 751621 – Please make the genered IDs deterministic

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 751621 - Please make the genered IDs deterministic


Summary:	Please make the genered IDs deterministic


Status:	RESOLVED OBSOLETE

Product:	libxslt
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2015-06-28 22:22 UTC by Mattia Rizzolo
Modified:	2021-07-05 11:01 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Make-generate-id-deterministic.patch (2.04 KB, patch) 2015-06-28 22:22 UTC, Mattia Rizzolo	none	Details \| Review
Implements deterministic generate-id with a lookaside index (4.54 KB, patch) 2015-06-30 18:50 UTC, Andrew Ayer	none	Details \| Review
A patch with no performance loss (4.62 KB, patch) 2015-07-01 09:04 UTC, Daniel Veillard	none	Details \| Review

Description Mattia Rizzolo 2015-06-28 22:22:08 UTC

Created attachment 306252 [details] [review]
Make-generate-id-deterministic.patch

generate-id() used to return identifiers based on the memory address of the node object. This unfortunately prevents documentation to be built reproducible. In the proposed patch we now increment a static counter and store its value in the node _private on the first invocation of generate-id().


This helps carry on the Debian Reproducible Builds project [0], where this patch got written and widely tested, and the the Fedora folks where they need to strip those IDs for multi-arch reasons [1].



[0] https://wiki.debian.org/ReproducibleBuilds
[1] basically Fedora compiles all binary packages, including noarch ones; even if all the copies but one are discarded, they are compared, so they must be the same. To do so they pass the output of whatever xslt based through sed to directly drop that ID => ugly.

Comment 1 Daniel Veillard 2015-06-29 01:49:40 UTC

Obvious problem, you leak, all those blocks of memory allocated
are never freed. If you allocate for __private, you must keep track
of it and free, because libxml2 can't as the object was allocated
outside libxml2.
__private is also used for other things like keys in libxslt,
so if you use a key on a node where you generate an id you very likely
break the key or break the uniqueness of the id.

So no that doesn't work there is at least 2 major issues with your
patch.

Also a static global variable won't do what you want, it may work
for xsltproc called with a single source but if you run multiple
transformations from the same process it's all random again.

So it's forbidden to use a random generator to generate anything
in a build ? Also timestamps for the time fo the document generation
would have to be banned ...
That sounds like a bug introducing attempt to fix a broken rule,
not good at all,

Daniel

Comment 2 Daniel Veillard 2015-06-29 01:50:16 UTC

Review of attachment 306252 [details] [review]:

Seriously broken for reasons exposed

Comment 3 Rex Dieter 2015-06-30 15:29:37 UTC

With my fedora packager hat on, we too have often hit the problem of multilib conflicts in the case of non-deterministic output documented here.

Comment 4 Daniel Veillard 2015-06-30 15:57:55 UTC

The problem:
  - generate-id() must be unique across all nodes ever used by
    the XSLT transformation
  - it must be really fast, people use it to test node equality
    in various XSLT libraries
the placeholder __private is already used by something else (the keys)
there is no requirement in XSLT of the id being deterministic, that's
something you guys are adding on top of the spec, the only requirements
is uniqueness.

A lookaside index would be too slow, testing node equality would mean get
from pointer A to index a and then from pointer B to index, if the document
grow large that really not practical.

Plus since the docs are generated by upstream makefiles you just can't
add an argument to xsltproc like --slow-id which would change the behaviour
when building the docs for you.

  I don't see a simple way forward, the problem is that you are asking
for something that is not part of the standard. I assume you have the
same "non-reproductible" issue for everything generated on object code,
tarballs, etc ... where timestamps are obviously embedded.

Daniel

Comment 5 Mattia Rizzolo 2015-06-30 16:39:25 UTC

Just to enlighten you, yeah, there are quite some stuff embedding timestamps. We're just rooting them out.

Take a look at https://reproducible.debian.net/index_issues.html that's a list of issues we have and we are trying to fix.
We manage to remove quite some timestamps.
Tarballs does not embed timestamps on their own, rather the gzip does unless -n is used; object code does not either.
In Debian we're currently continuously double-building our entire archive in different conditions https://reproducible.debian.net/reproducible.html#variation , and out of >22k package in the debian archive more than 80% are reproducible.

Clearly we have our own toolchain that did not make its way to the archive yet, usually because their maintainers are not ok with the implementation (like this case) or we're still debating which package should fix what.



Btw, i/we'll try to come up with a better patch also following your suggestions/observation/complains/.... :)

Thanks for your quick review!

[0] https://reproducible.debian.net/reproducible.html#variation

Comment 6 Andrew Ayer 2015-06-30 18:50:32 UTC

Created attachment 306429 [details] [review]
Implements deterministic generate-id with a lookaside index

Here's another patch which implements deterministic generate-id by
maintaining a lookaside index.  It's only enabled if --deterministic
is passed to xsltproc, though we would prefer this to be the default,
to avoid having to patch lots of upstream makefiles.

To try to determine the overhead, I ran some tests on an XML document
with 1.6 million elements[1].  The first test[2] calls generate-id on
the root node 1,000 times.  There is no time or memory overhead when
--deterministic is enabled.  The second test[3] calls generate-id()
on every single element.  Runtime increases by 538%, and peak memory
consumption increases by 24% with --deterministic enabled.  The third
test[4] is the same as the second test but repeated 5 times (without
reparsing the document).  Runtime overhead is 740%, and memory overhead
is 14%.

So there is real overhead with this method, but these are *extremely*
pathological test cases.  Do you know of any XSLT benchmarks that try
to mimic real world use cases?  I'm definitely concerned about how often
generate-id(foo) = generate-id(bar) is used in practice to compare node
equality.  As one data point, I looked through all of my XSLT stylesheets,
and asked a colleague who uses XSLT extensively to do the same, and
neither of us use this idiom much (consequentially, there was no
noticeable overhead from --deterministic). Keep in mind that what really
matters is how many distinct nodes generate-id is called on - a huge
document won't see any overhead if generate-id is only called on a small
number of nodes.

Andrew

[1] https://www.cloudmutt.com/s/bigxml.xml.gz
[2] http://paste.debian.net/270667
[3] http://paste.debian.net/270668
[4] http://paste.debian.net/270669

Comment 7 Daniel Veillard 2015-07-01 09:04:56 UTC

Created attachment 306475 [details] [review]
A patch with no performance loss

A patch with no performance degradation but assuming the context node used
for the ID generation is an element.

Comment 8 Daniel Veillard 2015-07-01 09:11:55 UTC

Thanks Andrew for the patch, see above for a different one.
It relies on the fact that ID generated by docbook stylesheets and
similar compute the id in the context of an element node input,
and on those nodes I can save a numbering, actually the input elements
already have a number here to speed up XPath queries.
Could you test that patch and confirm that at least for the majority
of the stylesheet employed you now get a stable ID in the output.

I have no speed degradation for the 3 stylesheets you provide on that
input, the first one calling generate-id() in the context of the document
node which is not an element still get you the address based output ids
but the other ones which actually go through the element set do generate the
new stable ids.

Can you double check and give feedback ? Rex, your input would be
appreciated too,

 thanks,

Daniel

Comment 9 Jérémy Bobbio 2015-07-01 09:57:56 UTC

Daniel, thanks for this patch!

We are going to test all Debian packages which were reproducible with the first patch sent by Mattia with your new patch. We'll see how it goes. :)

Comment 10 Jérémy Bobbio 2015-09-15 18:21:07 UTC

We have been running builds of Debian packages for the past months using this patch. We have not seen any breakage and it seems to work in making packages reproducible. :)

Comment 11 Daniel Veillard 2015-09-16 03:03:27 UTC

Okay good, thanks for the feedback. I need to roll a new release, hopefully
this month !

Daniel

Comment 12 Nick Bowler 2017-10-07 19:06:44 UTC

Did this patch ever make it into a libxslt release?  I can't seem to find the changes in libxslt git.

I see Debian is still patching this into 1.1.29 locally.

Comment 13 Mattia Rizzolo 2017-10-08 11:18:54 UTC

Here are the commits in git:
https://git.gnome.org/browse/libxslt/commit/?id=e57df303eca25a2a3f9e0625c29f4b20177858cc
https://git.gnome.org/browse/libxslt/commit/?id=8f9303f3b2a73523767b158f9be8dbf2a89ba3a2

Apparently it didn't make it in 1.1.29 despite being before the tag in git(?), or something weird happened when YunQiang Su imported the 1.1.29 in Debian, I haven't actually checked myself.
I suppose I could try importing 1.1.31 and see for myself.

Comment 14 Mattia Rizzolo 2017-10-08 13:20:23 UTC

Oh, you are right, https://bugzilla.gnome.org/attachment.cgi?id=306475&action=diff was never committed apparently, I was confused by a related bug.

Reopening the bug, then :)

Comment 15 Nick Bowler 2018-05-02 17:30:49 UTC

(In reply to Daniel Veillard from comment #7)
> Created attachment 306475 [details] [review] [review]
> A patch with no performance loss
> 
> A patch with no performance degradation but assuming the context node used
> for the ID generation is an element.

Hm, I think there's a problem with this patch.  With patched libxslt,
when referring to nodes using document() the generated IDs appear to
sometimes be the same even for nodes in different documents.  I believe
this is at odds with the XSLT specification.

Without the patch the IDs are different (but random), as expected.

Example (call this file test.xsl):

  <?xml version="1.0" encoding="UTF-8" ?>
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*" />
  <xsl:output indent="yes" />
  
  <xsl:template match="/">
    <ids>
      <a>
        <xsl:value-of select='generate-id(document("a.xml")/*)' />
      </a>
      <b>
        <xsl:value-of select='generate-id(document("b.xml")/*)' />
      </b>
    </ids>
  </xsl:template>
  
  </xsl:stylesheet>

Put anything you want in a.xml / b.xml (these files need at least one element)

Unpatched libxslt-v1.1.32:

  % xsltproc test.xsl test.xsl
  <?xml version="1.0"?>
  <ids>
    <a>idm45891049150288</a>
    <b>idm45891049100304</b>
  </ids>

(note that the generated IDs are different, as expected).  With the patch:

  % xsltproc-patched test.xsl test.xsl
  <?xml version="1.0"?>
  <ids>
    <a>idm1</a>
    <b>idm1</b>
  </ids>

This problem can be observed on Debian systems because their libxslt package has this patch applied :(

Comment 16 Mattia Rizzolo 2018-05-21 12:10:56 UTC

(In reply to Nick Bowler from comment #15)
> Hm, I think there's a problem with this patch.  With patched libxslt,
> when referring to nodes using document() the generated IDs appear to
> sometimes be the same even for nodes in different documents.  I believe
> this is at odds with the XSLT specification.

Mh, I not sure, why would anyone expect IDs are different between different documents?
I don't recall any word about that, and I consider such idea totally unrealistic.

> Without the patch the IDs are different (but random), as expected.

Indeed they are random, which is what this very bug is about, we don't want them to be random.  Besides, if they are random you have no assurances that they won't be different between documents, which defeats your point.

Comment 17 Nick Bowler 2018-05-21 16:18:31 UTC

(In reply to Mattia Rizzolo from comment #16)
> (In reply to Nick Bowler from comment #15)
> > Hm, I think there's a problem with this patch.  With patched libxslt,
> > when referring to nodes using document() the generated IDs appear to
> > sometimes be the same even for nodes in different documents.  I believe
> > this is at odds with the XSLT specification.
> 
> Mh, I not sure, why would anyone expect IDs are different between different
> documents?

Because the spec says they are?

  12.4 Miscellaneous additional functions

  Function string generate-id(node-set?)
  "... different identifiers are always generated from different nodes"

The nodes resulting from the document() function on different documents are clearly different nodes.  Yet with the patch we have two different nodes for which generate-id creates the same identifier.

> I don't recall any word about that, and I consider such idea totally
> unrealistic.

But the unpatched (upstream) libxslt works correctly, so correct behaviour seems perfectly realistic.

Comment 18 Chris Lamb 2018-06-21 21:41:50 UTC

Hi all,

Can someone clarify the latest status of this issue? AIUI Debian has applied a patch:

  https://sources.debian.org/src/libxslt/1.1.32-2/debian/patches/0004-Make-generate-id-deterministic.patch/

… which has problems?

Moving forward from here, shall we move over to Gitlab? :)

—lamby

Comment 19 Andrew Ayer 2018-06-21 23:07:02 UTC

I just spent some time debugging this, and yes, the Debian patch has problems.

The patch relies on the assumption that the xmlNode struct's content field is unused, and normally NULL, for element nodes. The patch (ab)uses this field to store a deterministic ID, using the following logic:

- If content is NULL, generate a deterministic ID and store it in content.
- Otherwise, use the deterministic ID already stored in content.

Unfortunately, the assumption is no longer true. The content field is never NULL, and consequentially the patch is using whatever value is in content as the ID, instead of generating its own. This value is not unique across different documents, and for all we know might not even be unique within a single document. I haven't been able to figure out what the value means or who is setting it.

I also noticed a couple of other minor problems.

First, the patch prints a debug message to stderr. That's not very nice for a library to do.

Second, there is no distinction between IDs generated deterministically and IDs that are generated using pointer arithmetic. Consequentially, there's a small chance of a collision. The deterministic IDs should be prefixed with a different string, such as "idd", to distinguish them from pointer arithmetic IDs.

This patch is actually pretty busted and could be breaking people's stylesheets in hard-to-detect ways. It should be backed out of Debian in a stable point release until it can be fixed. I'll open a Debian bug about this.

Andrew

Comment 20 Chris Lamb 2018-06-22 07:26:12 UTC

> I'll open a Debian bug about this.

(This now exists as https://bugs.debian.org/902051 - thanks!)

Comment 21 GNOME Infrastructure Team 2021-07-05 11:01:09 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxslt/-/issues/

Thank you for your understanding and your help.