GNOME Bugzilla – Bug 303290
xml catalog prefer="public" not supported
Last modified: 2021-07-05 13:26:53 UTC
When XML catalogs are used, and the catalog has a prefer="public" on the catalog element or the group element, it doesn't work. That is, the catalog is not consulted when a System ID works and there is a catalog entry for the PUBLIC id. Also, if the System ID does not work, if the catalog is consulted, the match on system ID is preferred.

Here is a catalogtest.xml:

    <?xml version="1.0"?>
    <!DOCTYPE catalog PUBLIC "-//OASIS/DTD Entity Resolution XML Catalog V1.0//EN"
      "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
    <catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <group prefer="public">
        <public publicId="-//SAGEHILL//General Entities//EN" uri="mysection2.ent"/>
        <system systemId="bogus.ent" uri="mysection2.ent"/>
        <system systemId="bogus1.ent" uri="mysection1.ent"/>
      </group>
    </catalog>

Here is a main test document entitytest.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE article [
    <!ELEMENT article (section*)>
    <!ELEMENT section (title?)>
    <!ELEMENT title (#PCDATA)>
    <!ENTITY good PUBLIC "-//SAGEHILL//General Entities//EN" "missing.ent">
    <!ENTITY good2 PUBLIC "-//SAGEHILL//General Entities//EN" "bogus.ent">
    <!ENTITY bad PUBLIC "-//SAGEHILL//General Entities//EN" "mysection1.ent">
    <!ENTITY bad2 PUBLIC "-//SAGEHILL//General Entities//EN" "bogus1.ent">
    ]>
    <article>
    &good;
    &good2;
    &bad;
    &bad2;
    </article>

Here is one system entity file mysection1.ent:

    <?xml version="1.0" encoding="utf-8"?>
    <section>
      <title>section 1 title</title>
    </section>

Here is a second system entity file mysection2.ent:

    <?xml version="1.0" encoding="utf-8"?>
    <section>
      <title>section 2 title</title>
    </section>

Here is the xmllint version (on Windows XP):

    c:\xml\libxml\xmllint.exe: using libxml version 20619CVS2407
    compiled with: DTDValid FTP HTTP HTML C14N Catalog XPath XPointer XInclude Iconv Unicode Regexps Automata Schemas Modules

Here is the command I tested with:

    XML_DEBUG_CATALOG=1 \
    XML_CATALOG_FILES="catalogtest.xml" \
    xmllint --noent --valid entitytest.xml > result.xml

Here is the result.xml output:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE article [
    <!ELEMENT article (section)*>
    <!ELEMENT section (title)?>
    <!ELEMENT title (#PCDATA)>
    <!ENTITY good PUBLIC "-//SAGEHILL//General Entities//EN" "missing.ent">
    <!ENTITY good2 PUBLIC "-//SAGEHILL//General Entities//EN" "bogus.ent">
    <!ENTITY bad PUBLIC "-//SAGEHILL//General Entities//EN" "mysection1.ent">
    <!ENTITY bad2 PUBLIC "-//SAGEHILL//General Entities//EN" "bogus1.ent">
    ]>
    <article>
    <section>
      <title>section 2 title</title>
    </section>
    <section>
      <title>section 2 title</title>
    </section>
    <section>
      <title>section 1 title</title>
    </section>
    <section>
      <title>section 1 title</title>
    </section>
    </article>

If prefer="public" were working, then all of these should say "section 2 title".
Hello,

As far as I understand this, libxml2 tries to implement the resolution algorithm specified in:
http://www.oasis-open.org/committees/entity/spec-2001-08-06.html

In your example, a publicId and a systemId are provided in each case, so the algorithm in "7.1.2. Resolution of External Identifiers" has to be applied:

1. Initial catalog setup: clear.
2. For the cases "good2" and "bad2", the system ID exists in the catalog: "bogus.ent" and "bogus1.ent", which redirect to "mysection2.ent" and "mysection1.ent".
3. The catalog does not contain rewriteSystem.
4. The catalog does not contain delegateSystem.
5. A public ID is provided for &good; and &bad;, thus in both cases the rule for <public> matches and the resolution result should be "mysection2.ent" in *both* cases.

The real problem is that this is not the case for &bad;. The output of XML_DEBUG_CATALOG shows that the resolution algorithm is never called for this entity. The reason is that xmlIO.c's xmlDefaultExternalEntityLoader does not apply the catalog lookup for a system identifier that exists:

    /*
     * If the resource doesn't exists as a file,
     * try to load it from the resource pointed in the catalogs
     */

Thus one of the purposes and advantages of using catalogs is gone: to redirect or rewrite the location of a resource by means of a catalog file. I therefore consider this the real bug, located in the entity loader functions of xmlIO.c, that your example files reveal.

The "prefer" attribute, however, is only used for delegatePublic in step 6, which is never reached in this scenario because the resolution was successful in previous steps. Thus the value of "prefer" does not matter here. See also "4.1.1. The prefer attribute" of the specification.

Yours sincerely
Heiko <oberdiek@uni-freiburg.de>
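The walk-through above can be sketched as a small model. This is a simplified Python illustration, not libxml2 code: it covers only the system match (step 2) and the public match (step 5) that the comment applies, it ignores "prefer" entirely, and the names CATALOG, resolve_external_id and default_loader_model are hypothetical. The second function models the file-exists shortcut described by the xmlIO.c comment quoted above.

```python
# Simplified model of the lookups discussed above; not libxml2 code.
# Entries mirror catalogtest.xml from the original report.
CATALOG = {
    "system": {
        "bogus.ent": "mysection2.ent",
        "bogus1.ent": "mysection1.ent",
    },
    "public": {
        "-//SAGEHILL//General Entities//EN": "mysection2.ent",
    },
}

PUBLIC_ID = "-//SAGEHILL//General Entities//EN"


def resolve_external_id(public_id, system_id, catalog=CATALOG):
    """Steps 2 and 5 only: a system entry wins, then a public entry."""
    if system_id in catalog["system"]:
        return catalog["system"][system_id]
    if public_id in catalog["public"]:
        return catalog["public"][public_id]
    return None  # no catalog match; the caller falls back to system_id


def default_loader_model(public_id, system_id, existing_files, catalog=CATALOG):
    """Model of the shortcut: a system ID that exists bypasses the catalog."""
    if system_id in existing_files:
        return system_id
    return resolve_external_id(public_id, system_id, catalog) or system_id


if __name__ == "__main__":
    on_disk = {"mysection1.ent", "mysection2.ent"}
    entities = [("good", "missing.ent"), ("good2", "bogus.ent"),
                ("bad", "mysection1.ent"), ("bad2", "bogus1.ent")]
    for name, system_id in entities:
        print(name,
              resolve_external_id(PUBLIC_ID, system_id),
              default_loader_model(PUBLIC_ID, system_id, on_disk))
```

In this model only "bad" differs between the two functions: the catalog lookup alone yields mysection2.ent, while the shortcut returns the existing mysection1.ent without consulting the catalog.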
Thanks for this analysis. I don't think I will change libxml2 behaviour. Forcing a round trip through the catalog for a local resource directly referenced by a system URI is more likely to cause trouble than I think makes sense for the general default case, because it would:
- be unexpected behaviour
- add catalog parsing cost to every single file reference (any time one parses a single file with libxml2, this would require parsing the catalog first)

The entity resolver in xmlDefaultExternalEntityLoader can be overridden trivially with a single API call, and applications can very easily force a different behaviour. I'm not 100% sure the current behaviour can be considered an XML Catalog failure, based on the Abstract of the spec, which defines the 3 main intended use cases. There isn't really a conformance section in the spec.
http://www.oasis-open.org/committees/entity/spec-2001-08-06.html

Daniel
I think this bug should be reopened. I respectfully disagree with your assessment of the expected default behavior. If a catalog file is specified, it must be possible for the catalog to override a hardcoded reference specified in a file, even if it is to a local resource that exists. That means every reference goes through the catalog first to see if it should be remapped to a new location. I asked Norman Walsh, the author of the XML Catalog specification, about it, and he agrees that all references should go through the catalog first. That is the behavior of the Apache Java resolver classes (which is not surprising since he wrote them). I agree that the specification could be more clear on its conformance standards. For example, I have another question in to him about how the spec describes the resolution of the "prefer" attribute.
"If a catalog file is specified" On Linux a catalog file is *always* present. This was done on the assumption that this would not incur a penalty for simple local file access. If this is the case I can't revisit that decision, but I'm not ready to inflict a double parsing cost because of this while the limited change in behaviour can be trivially overridden using the public API, for the specific applications which may really need this. Lots of applications use libxml2 to just parse a single config file. Forcing all of them to parse 2 files even if they don't need a catalog sounds like far too much of a cost for something which has never raised a real error reported by any application. This does not sound reasonable to me. What application has trouble with this? Why can't that application be modified in a trivial way if it really needs that support? If I had known about this aspect of the spec before implementing and pushing for it, I would first have objected to this aspect of the spec and then would also have made catalog support in libxml2 a user option and not the default. I stand by my position unless you can provide a reasonable justification for the inherent cost of what you're suggesting. Daniel
I agree that not all libxml2 applications should be forced to use a catalog if it is simply present on a system. I would like to change this request so that the default behavior of libxml2 with regards to catalogs is not changed, but that the two applications xmllint and xsltproc adopt this behavior. You are correct that the presence of a catalog file on a system (like Linux's /etc/catalog) should not be the determining factor of whether the catalog is used. Rather, if either application specifies a catalog file through the XML_CATALOG_FILES environment variable, then it should be used for all lookups needed by that application. I believe that is the expected behavior of users of those two applications.
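The policy requested here can be sketched as follows. This is only an illustration of the proposal in Python, not code from xmllint or xsltproc; catalog_requested and pick_resource are hypothetical names, and catalog_uri stands for whatever a catalog lookup returned (None when nothing matched).

```python
import os


def catalog_requested(environ=None):
    """True when the user explicitly named catalogs via XML_CATALOG_FILES."""
    env = os.environ if environ is None else environ
    return bool(env.get("XML_CATALOG_FILES", "").strip())


def pick_resource(system_id, catalog_uri, file_exists, environ=None):
    """Choose which resource to load under the proposed policy.

    system_id   -- the system identifier written in the document
    catalog_uri -- result of a catalog lookup, or None if nothing matched
    file_exists -- whether system_id exists as a local file
    """
    # Proposed: an explicitly requested catalog is consulted first.
    if catalog_requested(environ) and catalog_uri is not None:
        return catalog_uri
    # Behaviour described earlier in the thread: an existing file wins.
    if file_exists:
        return system_id
    # Otherwise fall back to the catalog, then to the raw system ID.
    return catalog_uri if catalog_uri is not None else system_id
```

With XML_CATALOG_FILES set, an existing local file no longer hides a matching catalog entry; without it, the behaviour described earlier in the thread is left unchanged.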
I thought perhaps a specific use case would make this request more clear. The DocBook XML DTD has several modules. One of the modules is a placeholder designed to contain user-defined entities. Here is how it is declared in the docbookx.dtd file: <!ENTITY % dbgenent PUBLIC "-//OASIS//ENTITIES DocBook Additional General Entities V4.4//EN" "dbgenent.mod"> %dbgenent; The dbgenent.mod file that ships with the DTD is empty, because it is just a placeholder. One could edit the original file, but only if one has write access to the file on the system. But that still leaves you with just one collection of user-defined entities. My application requires reusing the same DocBook files with different conditional text. One way to implement conditional text is with general entities. For example, one could define a companyname entity, with the idea of substituting the actual company name at runtime. So I would like to be able to define several collections of entities, with each collection containing the same entity names but with different expansion text. I would like to be able to select the collection at runtime. So my runtime specifies a catalog file that maps the PUBLIC id "-//OASIS//ENTITIES DocBook Additional General Entities V4.4//EN" to one of my entity collection files. By choosing a different catalog at runtime, I can choose a different collection of entity values. Unfortunately, this doesn't work in xmllint and xsltproc. Because the default dbgenent.mod file exists, the catalog is not consulted, and I don't get my entities. As I stated in my earlier message, I believe that users of these two applications expect all of their system references to be looked up in their catalog, and falling back to the default resource only if no catalog entry matches. Even if you don't want to make this the default behavior for these two applications, I believe there should be a command line option to specify this behavior.
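The per-run catalog in this use case could be generated with a few lines of Python. This is a sketch under assumptions: build_catalog and the file name entities-acme.ent are hypothetical, and only the public identifier and the catalog namespace are taken from this thread.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
DBGENENT_PUBLIC_ID = (
    "-//OASIS//ENTITIES DocBook Additional General Entities V4.4//EN"
)


def build_catalog(entity_file):
    """Return catalog XML mapping the dbgenent public ID to entity_file."""
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS, {"prefer": "public"})
    ET.SubElement(
        root,
        "{%s}public" % CATALOG_NS,
        {"publicId": DBGENENT_PUBLIC_ID, "uri": entity_file},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical entity collection chosen for this run.
    print(build_catalog("entities-acme.ent"))
```

Writing the result to a file and naming that file in XML_CATALOG_FILES selects the collection for a run. As reported above, the xmllint and xsltproc versions discussed here would still load the existing dbgenent.mod instead, which is the behaviour this request asks to change.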
I see that the status of this bug is still NEEDINFO. Is there any other information that I can supply in order to get some resolution? As I said, I'm only looking to change the behavior of the applications xmllint and xsltproc.
Hum, no, it should be switched to NEW, Daniel
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.