Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inconsistent treatment of prefixed attribute names in range index definitions #5189

Open
amclark42 opened this issue Jan 12, 2024 · 5 comments
Labels
bug issue confirmed as bug
Milestone

Comments

@amclark42
Copy link

Describe the bug

I created a collection configuration file with some range index fields defined. To make sure that only the elements I intend are indexed, I introduced a prefixed attribute, @wwp:field. The range index is set to index only those elements which has @wwp:field with the correct value. Here’s an example of one of my range index settings:

<create qname="tei:title">
  <condition attribute="wwp:field" value="title"/>
  <field name="title-text" type="xs:string" case="no"/>
</create>

This index is working! However, when I opened the configuration file in eXide, it informed me that the value of @attribute was invalid:

cvc-datatype-valid.1.2.1: 'wwp:field' is not a valid value for 'NCName'.cvc-attribute.3: The value 'wwp:field' of attribute 'attribute' on element 'condition' is not valid with respect to its type, 'NCName'.

It is unclear to me whether the schema is wrong, or if the prefixed attribute should not actually be working for me.

When I went looking, I found that eXist has a schema for the configuration file, collection.xconf.xsd. eXide has a four-year-old copy of that schema. Both appear to define @attribute as an xs:NCName.

Expected behavior
The collection.xconf.xsd schema and eXist's range index should both refuse or accept prefixed attribute names in @attribute.

To Reproduce

xquery version "3.1";

  module namespace t="http://exist-db.org/xquery/test";
(:  LIBRARIES  :)
  declare namespace test="http://exist-db.org/xquery/xqsuite";
(:  NAMESPACES  :)
  declare namespace range="http://exist-db.org/xquery/range";
  declare namespace xdb="http://exist-db.org/xquery/xmldb";

(:~
  http://exist-db.org/exist/apps/doc/xqsuite.xml
  
  @author Ash Clark
  2024
 :)
 
(:  VARIABLES  :)
  
  declare variable $t:xconf := 
      <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:wwp="http://www.wwp.northeastern.edu/ns/textbase"
               xmlns:tei="http://www.tei-c.org/ns/1.0"
               xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <!-- Range indexing -->
          <range>
            <!-- Only match <title> when @wwp:field="title". -->
            <create qname="tei:title">
              <condition attribute="wwp:field" value="title"/>
              <field name="wwp-attr" type="xs:string"/>
            </create>
            <!-- Only match <title> when an attribute with an NCName of "field" equals "title". -->
            <create qname="tei:title">
              <condition attribute="field" value="title"/>
              <field name="ncname" type="xs:string"/>
            </create>
          </range>
        </index>
      </collection>;
  
  declare variable $t:XML :=
    <biblFull xmlns="http://www.tei-c.org/ns/1.0" xmlns:wwp="http://www.wwp.northeastern.edu/ns/textbase">
      <fileDesc>
        <titleStmt>
          <title type="main">A Peep at the Pilgrims, 1824</title>
          <title type="display" wwp:field="title">A Peep at the Pilgrims</title>
          <author>
            <persName key="hcheney.nei" type="person-female">Cheney, Harriet Vaughan (Foster)</persName>
          </author>
          <sponsor>Northeastern University</sponsor>
          <funder>U.S. National Endowment for the Humanities</funder>
        </titleStmt>
        <publicationStmt>
          <publisher>Northeastern University Women Writers Project</publisher>
        </publicationStmt>
        <sourceDesc n="OT00304">
          <bibl>
            <author>
              <persName key="hcheney.nei" type="titlePage">The Author of Divers Unfinished Manuscripts, &amp;c.</persName>
              <persName key="hcheney.nei" type="regularized">Cheney, Harriet Vaughan (Foster)</persName>
            </author>
            <title wwp:field="title">A peep at the pilgrims in sixteen hundred thirty-six. A tale of olden times.</title>
            <date when="1824">1824</date>
          </bibl>
        </sourceDesc>
      </fileDesc>
    </biblFull>
    ;

(:  FUNCTIONS  :)
  
  declare
    %test:setUp
  function t:setup() {
    let $testCol := xdb:create-collection("/db", "test")
    let $indexCol := xdb:create-collection("/db/system/config/db", "test")
    return (
        xdb:store("/db/test", "test.xml", $t:XML),
        xdb:store("/db/system/config/db/test", "collection.xconf", $t:xconf),
        xdb:reindex("/db/test")
      )
  };
  
  declare
    %test:tearDown
  function t:tearDown() {
    xdb:remove("/db/test"),
    xdb:remove("/db/system/config/db/test")
  };
  
  declare
    %test:assertExists
    %test:assertEquals('A Peep at the Pilgrims', 'A peep at the pilgrims in sixteen hundred thirty-six. A tale of olden times.')
  function t:prefixed-attribute-indexed() {
    let $test-data := collection('/db/test')
    return collection('/db/test')/range:index-keys-for-field('wwp-attr', '', 
      function ($i, $d) { $i }, 10)
  };
  
  declare
    %test:assertExists
    %test:assertEquals('A Peep at the Pilgrims', 'A peep at the pilgrims in sixteen hundred thirty-six. A tale of olden times.')
  function t:ncname-attribute-indexed() {
    let $test-data := collection('/db/test')
    return collection('/db/test')/range:index-keys-for-field('ncname', '', 
      function ($i, $d) { $i }, 10)
  };

According to the above test, the conditional test for @wwp:field finds and indexes two matching <title>s, but the conditional test for the NCName field finds nothing to index.

Context (please always complete the following information)

  • Build: eXist-6.2.0
  • Java: openjdk 21.0.1
  • OS: Mac OS X 12.6.1

Additional context

  • How is eXist-db installed? JAR installer
  • Any custom changes in e.g. conf.xml? whitespace-preservation in serialization
@joewiz
Copy link
Member

joewiz commented Jan 12, 2024

@amclark42 If we can determine what the appropriate datatype would be to accommodate namespace prefixes, I'd gladly support merging a PR and would get a release of eXide out with an updated bundled collection.xconf.xsd file.

@joewiz joewiz added the bug issue confirmed as bug label Jan 12, 2024
@joewiz joewiz modified the milestones: eXist-7.0.0, eXist-6.2.1 Jan 12, 2024
@amclark42
Copy link
Author

@joewiz Thank you! Hm... xs:QName should be able to handle the prefixed name as part of a query. I'm not sure how that datatype would work in a schema context, though; I can try it sometime this weekend.

@joewiz
Copy link
Member

joewiz commented Jan 13, 2024

@amclark42 Yes, that sounds like the right route! I thought I'd see if ChatGPT (GPT-4) could explain the difference between these two data types:

In XML Schema Definition (XSD), QName and NCName are two different data types, each with its own purpose and rules:

QName: It stands for Qualified Name. A QName is used to define an element or attribute name that can potentially have a namespace prefix. For example, in xsd:element, xsd is the namespace prefix, and element is the local part. A QName has two parts: the local part and the namespace prefix, which are separated by a colon. The namespace prefix is optional, but if it’s there, it must be defined in the XML document.
NCName: NCName stands for Non-Colonized Name. It’s a string that cannot contain colons, which means it cannot have a namespace prefix. This type is used for names that are guaranteed not to be associated with any namespace. An NCName is essentially a simpler form of a QName, used in situations where a namespace is not needed or not allowed.

In summary, the key difference is that QNames can include a namespace prefix, while NCNames cannot. This distinction is important in XML documents and schemas where namespace management is a crucial aspect.

@adamretter
Copy link
Member

adamretter commented Jan 19, 2024

The Java code that processes: attribute="wwp:field" looks quite questionable: https://github.com/eXist-db/exist/blob/eXist-6.2.0/extensions/indexes/range/src/main/java/org/exist/indexing/range/RangeIndexConfigAttributeCondition.java#L70 i.e.:

this.attributeName = elem.getAttribute("attribute");
if (this.attributeName == null || this.attributeName.isEmpty()) {
    throw new DatabaseConfigurationException("Range index module: Empty or no attribute qname in condition");
}

try {
    this.attribute = new QName(QName.extractLocalName(this.attributeName), XMLConstants.NULL_NS_URI, QName.extractPrefix(this.attributeName), ElementValue.ATTRIBUTE);
} catch (final QName.IllegalQNameException e) {
    throw new DatabaseConfigurationException("Rand index module error: " + e.getMessage(), e);
}

This code initially treats wwp:field as a string, and then tries to parse it as a QName but at that point it ignores the namespace and only records the prefix and local-part of the QName. I would have thought that at this point it should attempt to resolve the namespace. What it is doing is akin to binding wwp to the null (or default) empty namespace.

Looking at how these values are used by the indexer, we can see further issues - if you had the prefix wwp which is (legitimately) bound to two different namespaces at different points, you will likely find that the indexer thinks that they are identical, when of course they are not! Additionally, I have not tested this, but I suspect the indexer will also confuse wwp:field with just field (from the default namespace).

I wonder if the schema may have intentionally automatically limited people to using an NCName to avoid the code's bug in QName handling? Regardless it looks like the intention of the original author of the Java code was to try and correctly handle QName's - and from what I understand you are saying @amclark42 that would be desirable behaviour.

So we just need someone to send a PR to fix the Java code, and also then to update the collection.xconf.xsd to allow QName instead of NCName. I would not suggest fixing one without the other!

@amclark42
Copy link
Author

Oh dang! Well sleuthed, @adamretter!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug issue confirmed as bug
Projects
None yet
Development

No branches or pull requests

3 participants