GNOME Bugzilla – Bug 620190
xpointer/string-range() miscalculates ranges near string edges
Last modified: 2021-07-05 13:23:40 UTC
Created attachment 162411
test case

This is an attempt to take a closer look at what happens when you use the XPointer string-range() function in xmllint.

The trick is to match the empty string before the first character of the given text node and then request the following X characters. This method has been advocated in several linguistic-corpus-architecture papers, notably those by Nancy Ide and Laurent Romary. I've never seen it work, because tools supporting XPointer properly were lacking back when those articles were written. libxml2 seems to be the only tool that may have such functionality, if the wrinkles are smoothed out.

I'm attaching a test case containing the source, the xinclude/xpointer directives and the output. It is to be run as follows:

$ xmllint --xinclude xpointer-near_edge.xml > output.xml
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
[the same message is printed many more times]

The "internal error" messages are the topic of bug #562541. They may be orthogonal to the issue at hand. The output file follows below.

My guess at what is happening is that the edge points are not handled properly. By edge points, I mean the points before and after each string.

     s t r i n g
    ^ ^ ^ ^ ^ ^ ^
    0 1 2 3 4 5 6

Per the W3C draft, "an empty string is defined to match before each character of the string-value and after the final character."
http://www.w3.org/TR/xptr-xpointer/#stringrange

To make things funnier, and the XPointer draft harder to implement, the string-range() function uses a different numbering on the surface: the first character to be matched is always designated by "1" (on the understanding that point 0 is the point before character 1). I'm afraid that the two types of calculation (points and segments) may have gotten confused in the current XPointer implementation in libxml2.
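To make the two numberings concrete, here is a minimal Python sketch of one reading of the draft for the empty-string case used in this report. It assumes that the first match of '' sits at point 0 and that position 1 names the character right after that point, so string-range(node,'',start,length)[1] should cover the characters between points start-1 and start-1+length. The function name expected_string_range is made up for illustration; this is not libxml2's code.

```python
def expected_string_range(text, start, length):
    """Characters that string-range(node, '', start, length)[1] should select.

    Assumption: the first match of the empty string is at point 0, and
    position `start` (1-based) names the character that follows point
    start - 1, so the range spans points start - 1 .. start - 1 + length.
    """
    if start < 1 or length < 0:
        raise ValueError("this sketch only covers start >= 1 and length >= 0")
    begin = start - 1  # 1-based character position -> 0-based point
    return text[begin:begin + length]


# String-value of the first, dot-less <p> in the test case.
P1 = "XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW"

for start in (1, 12, 23, 34):
    print(start, repr(expected_string_range(P1, start, 10)))
```

Under this reading, each of the four calls yields exactly one 10-character chunk, with nothing taken from beyond the end of the text node.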
I'm pasting the output file from the attached test case:

<?xml version="1.0" encoding="UTF8"?>
<!-- Each capitalized chunk is 10 characters long; I'm trying to pull them
     into separate segments, which is crucial for my corpus-based work -->
<body xmlns="http://example.org/">

  <!-- case one, where *proper addressing* makes the first segment too long
       by 1, and last segment reach across the closing tag of its <p> and
       grab a newline (see below for a hint of why the <p> appears here)-->
  <div>
    <seg>XXAAXXAAXX </seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg><p xmlns="http://example.org/">WWDDWWDDWW</p> </seg>
  </div>

  <!-- case two again has proper addressing and the last chunk is handled
       well: it doesn't reach beyond the <p> thanks to an extra character
       on the edge -->
  <div>
    <seg>XXAAXXAAXX </seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDWW</seg>
  </div>

  <!-- case three: proper addressing begins at the second segment this time
       (note that the 1st segment has length=0, which is obviously not
       right); the right edge is not 'protected' by an extra character, so
       again the closing <p> tag is selected and we see XInclude, properly
       handling "partially selected ranges" (see my explanation in
       bug #306081: XInclude works alright, but in the case at hand it's
       misinformed by XPointer trying to reach too far) -->
  <div>
    <seg>.</seg>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg><p xmlns="http://example.org/">WWDDWWDDWW</p> </seg>
  </div>

  <!-- case four: thanks to the extra dots "protecting" the edges, the
       offsets and lengths work alright.
       BUT this is not a real workaround: recall that in most cases, we have
       no control over the source, so we can't add "protective" dots -
       a bugfix is needed -->
  <div>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDWW</seg>
  </div>

  <!-- case five looks at the first <p> of the source again, with the
       "unprotected" string:
       <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>
       you'd think that if we can kludge around the first segment, faking
       its length to be 9, we should be able to pull the same trick at the
       end, but, alas, "34,9" doesn't work - this is the real hurdle -->
  <div>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDW</seg>
  </div>

</body>
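A quick way to spot the bad segments in such output is to flag every <seg> that is not plain text of exactly 10 characters. The sketch below does this for case one of the output above, using Python's standard xml.etree.ElementTree. The helper name check_segments is hypothetical, and the whitespace inside each <seg> is copied as it appears in this report (where the grabbed newline shows up as a space).

```python
import xml.etree.ElementTree as ET

NS = "{http://example.org/}"

# Case one of the output, with the content of each <seg> kept as pasted.
CASE_ONE = (
    '<div xmlns="http://example.org/">'
    '<seg>XXAAXXAAXX </seg>'
    '<seg>YYBBYYBBYY</seg>'
    '<seg>ZZCCZZCCZZ</seg>'
    '<seg><p xmlns="http://example.org/">WWDDWWDDWW</p> </seg>'
    '</div>'
)


def check_segments(xml_text, expected_len=10):
    """Return (text, ok) per <seg>; ok means no child elements and the expected length."""
    root = ET.fromstring(xml_text)
    report = []
    for seg in root.iter(NS + "seg"):
        text = "".join(seg.itertext())   # text of the segment, including nested elements
        ok = len(seg) == 0 and len(text) == expected_len
        report.append((text, ok))
    return report


for text, ok in check_segments(CASE_ONE):
    print(repr(text), "ok" if ok else "WRONG")
```

For case one this flags the first segment (11 characters) and the last one (a whole <p> element plus trailing whitespace), which are the two edge segments.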
A somewhat related issue concerning the handling of embedded nodes is reported in bug #620195.
Let me also paste the source and the stylesheet, as a convenience for those who'd rather not download the test case. Note the extra dots in the source, they are rather crucial. Note also that the first, dot-less, <p> is referenced twice (in "case one" and "case five"). The desired output should be as in case four (consisting of 10-character segments).

SOURCE:

<!-- Each segment is 10 characters long -->
<!-- p[2] ends in a dot -->
<!-- p[3] begins in a dot -->
<!-- p[4] has dots on both edges -->
<div xmlns="http://example.org/">
  <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>
  <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW.</p>
  <p>.XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>
  <p>.XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW.</p>
</div>

STYLESHEET:

<body xmlns="http://example.org/">

  <!-- case one, proper addressing that doesn't work -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',1,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',34,10)[1])"/></seg>
  </div>

  <!-- case two -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',1,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',34,10)[1])"/></seg>
  </div>

  <!-- case three -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',1,0)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',2,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',13,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',24,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',35,10)[1])"/></seg>
  </div>

  <!-- case four -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',2,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',13,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',24,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',35,10)[1])"/></seg>
  </div>

  <!-- case five looks at the first <p> of the source again -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',1,9)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml"
        xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',34,9)[1])"/></seg>
  </div>

</body>
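The start positions used in the "proper addressing" cases (1, 12, 23, 34 for the dot-less <p>) follow mechanically from the chunk lengths and the single spaces between them. A small Python sketch of that arithmetic is given below. The function name segment_pointers is hypothetical, and the sketch assumes chunks are separated by exactly one space and that the target is /ex:div/ex:p[n] as in the test case.

```python
def segment_pointers(text, p_index, ns_uri="http://example.org/"):
    """Build one xpointer attribute value per space-separated chunk of `text`."""
    pointers = []
    position = 1  # string-range() counts characters from 1
    for chunk in text.split(" "):
        if chunk:
            pointers.append(
                "xmlns(ex=%s) xpointer(string-range(/ex:div/ex:p[%d],'',%d,%d)[1])"
                % (ns_uri, p_index, position, len(chunk))
            )
        position += len(chunk) + 1  # skip the chunk and the space after it
    return pointers


# String-value of the first, dot-less <p> in the source.
P1 = "XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW"

for pointer in segment_pointers(P1, 1):
    print(pointer)
```

For the first <p> this reproduces the four xpointer values of case one. It does not treat the "protective" dots specially, so for p[2] to p[4] the dots would be counted as part of the neighbouring chunk.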
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.