Full-Text Indexing: Mixed Content #2079

Open
ChristianGruen opened this issue Mar 5, 2022 · 6 comments
@ChristianGruen
Member

ChristianGruen commented Mar 5, 2022

Presently, only text nodes and attribute values end up in the BaseX indexes. Whenever a path expression points to a text node (or an element that only has text nodes as children), it can be rewritten for index access, no matter what the full path looks like. This design decision turned out to be powerful for exact searches and for full-text queries on arbitrary text nodes, but it is too inflexible for mixed-content data.

A few years ago, we added features to restrict indexing to the text nodes of specific element names. We could enhance this approach for full-text queries:

  1. Index the string value of specific elements, which are specified via FTINCLUDE, and
  2. rewrite for index access only those paths that do not address descendants of the indexed elements.

As an example, a user might want to query the head and p elements of a TEI document:

<div>
  <head>No. 2, September 2006</head>
  <p>It was clearly popular, for it appears in Peter Stent’s
advertisements of 1654 and 1662, and is still listed in his successor
John Overton’s catalogue of 1673,<note>Alexander Globe, <title
level="m">Peter Stent, London Printseller, c.</title> 1642-65
(Vancouver, 1985), p. 123 (no.*448).</note> yet only the unique
impression in the British Museum's Department of Prints and Drawings
survives - testimony to the great rarity of such popular material.</p>
</div>
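
For reference, the existing index options can be set when a database is created. The following is a minimal sketch, assuming a BaseX version that provides db:create in the Database Module; the database name tei and the input file tei.xml are placeholders that do not come from this issue. Today these options restrict the full-text index to the text nodes of head and p. Indexing their string values is what is proposed here, not current behavior.

```xquery
(: Sketch only: 'tei' and 'tei.xml' are placeholder names. :)
db:create(
  'tei',      (: name of the new database :)
  'tei.xml',  (: input document :)
  (),         (: no explicit target paths :)
  map { 'ftindex': true(), 'ftinclude': 'head,p' }
)
```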

The following queries could then be evaluated via the index:

/div[head contains text '2006']
//p[. contains text 'popular']

A query such as the following would no longer be rewritten for index access:

//p[text() contains text 'popular']
ChristianGruen added this to the 10 milestone Mar 5, 2022
@liamquin

Why not rewrite //p[text() contains text 'popular'] as
//p[text()[. contains text 'popular']]
Would it then use the index?

Or, maybe better,
text()[. contains text 'popular']/..[self::p]
?

@ChristianGruen
Member Author

If we can assess at compile time that all p elements in a database are leaf elements (i.e., have a single text node as child), we could indeed rewrite //p[text() contains text 'popular'] for index access, too.

Otherwise, if p elements have child elements, we don’t know which substring of the indexed text occurs in that text node. The following two expressions will yield different results:

<p>popular<suffix>s</suffix></p>        contains text 'popular',
<p>popular<suffix>s</suffix></p>/text() contains text 'popular'

@graydon2014

Without regard to practicality of indexing (because I have no idea!),

//p[normalize-space(.) contains text 'popular']

is what I'm usually after -- where is this phrase in the document? There can be a lot of inline markup and for "where's the phrase?" purposes I want to know the nearest common ancestor of all the text nodes in the phrase.

@ChristianGruen
Member Author

For finding the nearest common ancestor elements, it is still advisable to search at the text node level:

let $xml := document {
  <p>
    There’s a <b>popular</b> saying …
  </p>  
}
return $xml//p//text()[. contains text 'popular']/ancestor::*[1]  (:  → <b>...</b> :)

If nodes are atomized, things get complicated because the found tokens may appear on different node levels. The token in the following query is assembled from the child text nodes of p and b:

let $xml := document {
  <p>There’s a <b>p</b>opular saying …</p>
}
return $xml//p[. contains text 'popular']  (:  → <p>...</p> :)

As for normalize-space(.): full-text tokenization already normalizes whitespace (so you can replace normalize-space(.) with .), and it additionally removes diacritics, normalizes case, etc. The behavior can be made explicit by calling ft:tokenize.
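
To see what tokenization does to a string, ft:tokenize from the BaseX Full-Text Module can be called directly. A small sketch; the input string is made up for illustration:

```xquery
(: Returns the tokens that the full-text processor derives from this string,
   using the default full-text options. :)
ft:tokenize('  A   POPULAR  saying  ')
```

With the default options this should return lower-cased tokens without any surrounding whitespace, which is why an explicit normalize-space(.) adds nothing in front of contains text.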

@graydon2014

graydon2014 commented Apr 25, 2022

I managed to express the use case in a muddled way; apologies!

let $xml := document {
<bucket>
<title>Complex Reference</title>
<p>There's a <i>complex <link>reference</link></i> to this document.</p>
</bucket>
}
return $xml//*[. contains text 'complex reference']

This is the kind of search I want to run against a relatively large amount of content (e.g., a national legal code) where the specific element is not known in advance. In principle it could be one of a number of elements; in practice it is a variety of elements expressing different semantics. In some cases you want the titles and in other cases the references, but the first step is to find everywhere the phrase occurs. The goal is to get the closest containing element with all the text nodes of the searched phrase.

The query above returns all the elements of which it is true, which is what it's supposed to do:
<bucket>
<title>Complex Reference</title>
<p>There's a <i>complex <link>reference</link>
</i> to this document.</p>
</bucket>
<title>Complex Reference</title>
<p>There's a <i>complex <link>reference</link>
</i> to this document.</p>
<i>complex <link>reference</link>
</i>

But ideally there'd be a way to get the "closest ancestor" result for a multi-word phrase whose components are in different text nodes. My (probably naive) thought is that maybe there could be an index of string properties of elements, which would allow returning the closest containing element of the full-text match.
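
Without any new index, the "closest containing element" can be approximated in plain XQuery by keeping only those matching elements that have no matching child element. This is a sketch under that assumption, not an index-backed solution: it evaluates the full-text predicate on every element, so it may be slow on large databases. The variable $phrase is introduced here for illustration.

```xquery
let $xml := document {
  <bucket>
    <title>Complex Reference</title>
    <p>There's a <i>complex <link>reference</link></i> to this document.</p>
  </bucket>
}
let $phrase := 'complex reference'
(: keep elements whose string value matches the phrase
   and that have no child element that matches as well :)
return $xml//*[. contains text { $phrase }]
             [not(*[. contains text { $phrase }])]
```

For the sample document this keeps title and i and drops bucket and p, because each of the latter has a child element that already contains the whole phrase.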

ChristianGruen removed this from the 10 milestone Jul 31, 2022
@ChristianGruen
Member Author

Postponed to a later version.

ChristianGruen changed the title from "Full-Text Indexing: Index specific full-text elements" to "Full-Text Indexing: Mixed Content" Jul 5, 2023

3 participants