Bug 620190 - xpointer/string-range() miscalculates ranges near string edges
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: xpointer
Version: 2.7.6
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2010-05-31 21:54 UTC by Piotr Banski
Modified: 2021-07-05 13:23 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test case (1.54 KB, application/x-compressed)
2010-05-31 21:54 UTC, Piotr Banski

Description Piotr Banski 2010-05-31 21:54:45 UTC
Created attachment 162411
test case

This is an attempt to take a closer look at what happens when you use the XPointer string-range() function in xmllint. The trick is to match the empty string before the first character of the given text node and then request the following X characters. This method has been advocated in several papers on linguistic corpus architecture, notably those by Nancy Ide and Laurent Romary. I have never seen it work, because tools supporting XPointer properly were lacking when those articles were written. libxml2 seems to be the only tool that may offer such functionality, if the wrinkles are smoothed out.

I'm attaching a test case containing the source, the xinclude/xpointer directives and the output.

This is to be run as follows:

$ xmllint --xinclude xpointer-near_edge.xml > output.xml
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418
Internal error at /usr/src/ports/libs/libxml2/libxml2-2.7.6-1/src/libxml2-2.7.6/xpointer.c:2418

The "internal error" messages are the topic of bug #562541. They may be orthogonal to the issue at hand.

The output file follows below. My guess at what is happening is that, possibly, the edge points are not handled properly. By edge points, I mean points before and after each string.

 s t r i n g
^ ^ ^ ^ ^ ^ ^
0 1 2 3 4 5 6

Per the W3C draft, "an empty string is defined to match before each character of the string-value and after the final character."

http://www.w3.org/TR/xptr-xpointer/#stringrange

To make things funnier, and the XPointer draft harder to implement, the string-range() function counts differently on the surface: the first character to be matched is always designated by "1" (on the understanding that point 0 is the point before character 1).

I'm afraid that the two types of calculation (points and segments) may have gotten confused in the current xpointer implementation in libxml2.
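To make the intended arithmetic concrete, here is a minimal Python sketch of how I read the draft for the empty-string case. It is not libxml2 code: the function name `string_range_empty` is made up for illustration, and the clamping at the string edges is my assumption about how out-of-range portions should be treated.

```python
def string_range_empty(text: str, position: int, length: int) -> str:
    """Expected result of string-range(node, '', position, length)[1].

    Assumption: the first match of the empty string sits at point 0
    (before the first character), and position 1 names the character
    immediately after that point.
    """
    start = position - 1        # 1-based character position -> 0-based point
    end = start + length        # point just after the last selected character
    start = max(start, 0)       # assumed: clamp to the node's string-value
    end = min(end, len(text))
    return text[start:end]


# The first <p> of the test source: four 10-character chunks, space-separated.
source = "XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW"
segments = [string_range_empty(source, pos, 10) for pos in (1, 12, 23, 34)]
print(segments)
```

Under this reading, a segment starting at position 1 and a segment ending at the last character need no special treatment, which is why the "XXAAXXAAXX " and trailing `<p>` results in the output below look like a mix-up between points and character positions.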

I'm pasting the output file from the attached test case:

<?xml version="1.0" encoding="UTF8"?>
<!-- Each capitalized chunk is 10 characters long; 
     I'm trying to pull them into separate segments, 
     which is crucial for my corpus-based work -->
<body xmlns="http://example.org/">

  <!-- case one, where *proper addressing* makes the first segment too long 
       by 1, 
       and last segment reach across the closing tag of its <p> and grab a 
       newline 
       (see below for a hint of why the <p> appears here)-->
  <div>
    <seg>XXAAXXAAXX </seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg><p xmlns="http://example.org/">WWDDWWDDWW</p>
</seg>
  </div>

  <!-- case two again has proper addressing and the last chunk is handled well:
      it doesn't reach beyond the <p> thanks to an extra character on the edge
   -->
  <div>
    <seg>XXAAXXAAXX </seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDWW</seg>
  </div>

  <!-- case three:  proper addressing begins at the second segment this time 
       (note that the 1st segment has length=0, which is obviously not right); 
       the right edge is not 'protected' by an extra character, so again 
       the closing <p> tag is selected and we see XInclude, properly handling 
       "partially selected ranges" (see my explanation in bug #306081 : 
       XInclude works alright, 
       but in the case at hand it's misinformed by XPointer trying to reach too 
       far) -->
  <div>
    <seg>.</seg>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg><p xmlns="http://example.org/">WWDDWWDDWW</p>
</seg>
  </div>

  <!-- case four: thanks to the extra dots "protecting" the edges, the offsets 
       and lengths work alright.
       BUT this is not a real workaround: recall that in most cases, we have no 
       control over the source, 
       so we can't add "protective" dots - a bugfix is needed -->
  <div>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDWW</seg>
  </div>

  <!-- case five looks at the first <p> of the source again, with the 
       "unprotected" string:
    <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>

       you'd think that if we can kludge around the first segment, faking its 
       length to be 9,
       we should be able to pull the same trick at the end, but, alas, "34,9" 
       doesn't work - this is the real hurdle -->
  <div>
    <seg>XXAAXXAAXX</seg>
    <seg>YYBBYYBBYY</seg>
    <seg>ZZCCZZCCZZ</seg>
    <seg>WWDDWWDDW</seg>
  </div>
</body>
Comment 1 Piotr Banski 2010-05-31 22:54:40 UTC
A somewhat related issue concerning the handling of embedded nodes is reported in bug #620195.
Comment 2 Piotr Banski 2010-06-01 09:17:52 UTC
Let me also paste the source and the stylesheet, as a convenience for those who'd rather not download the test case. Note the extra dots in the source; they are rather crucial. Note also that the first, dot-less, <p> is referenced twice (in "case one" and "case five").

The desired output should be as in case four (consisting of 10-character segments).

SOURCE:

<!-- Each segment is 10 characters long -->
<!-- p[2] ends in a dot -->
<!-- p[3] begins in a dot -->
<!-- p[4] has dots on both edges -->
<div xmlns="http://example.org/">
  <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>
  <p>XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW.</p>
  <p>.XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW</p>
  <p>.XXAAXXAAXX YYBBYYBBYY ZZCCZZCCZZ WWDDWWDDWW.</p>
</div>

STYLESHEET:

<body xmlns="http://example.org/">
  <!-- case one, proper addressing that doesn't work -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',1,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',34,10)[1])"/></seg>
  </div>

  <!-- case two -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',1,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[2],'',34,10)[1])"/></seg>
  </div>

  <!-- case three  -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',1,0)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',2,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',13,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',24,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[3],'',35,10)[1])"/></seg>
  </div>

  <!-- case four -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',2,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',13,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',24,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[4],'',35,10)[1])"/></seg>
  </div>

  <!-- case five looks at the first <p> of the source again -->
  <div>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',1,9)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',12,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',23,10)[1])"/></seg>
    <seg><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="source-near_edge.xml" xpointer="xmlns(ex=http://example.org/) xpointer(string-range(/ex:div/ex:p[1],'',34,9)[1])"/></seg>
  </div>
</body>
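
The offsets used above follow a simple pattern, sketched here in Python for anyone who wants to generate similar directives for their own data. The helper names `segment_positions` and `xpointer_for` are hypothetical; only the resulting xpointer strings correspond to what the stylesheet contains.

```python
def segment_positions(count: int, width: int, lead: int = 0, gap: int = 1) -> list[int]:
    """1-based start positions of `count` chunks of `width` characters.

    `lead` is the number of characters before the first chunk (1 for the
    paragraphs that begin with a dot), `gap` the separator width.
    """
    return [lead + 1 + i * (width + gap) for i in range(count)]


def xpointer_for(paragraph: int, position: int, length: int) -> str:
    """Build the xpointer attribute value in the form used by the stylesheet."""
    return ("xmlns(ex=http://example.org/) "
            f"xpointer(string-range(/ex:div/ex:p[{paragraph}],'',{position},{length})[1])")


plain = segment_positions(4, 10)            # p[1] and p[2]: no leading dot
dotted = segment_positions(4, 10, lead=1)   # p[3] and p[4]: one leading dot
print(plain, dotted)
print(xpointer_for(1, plain[3], 10))
```

The last line reproduces the fourth directive of case one, the one whose result reaches past the closing tag of the <p>.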
Comment 3 GNOME Infrastructure Team 2021-07-05 13:23:40 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.