Skip to content

DATAWAVE Query

Ivan Bella edited this page Jun 25, 2020 · 11 revisions

Table Structure

The main query logic in DATAWAVE makes use of 4 main tables in Accumulo: shard, shardIndex, shardReverseIndex, and DatawaveMetadata. The format of these tables are as follows:

Table Purpose Row Column Family Column Qualifier Value
Shard Fields ShardId (yyyymmdd_#) Datatype \0 UID Fieldname \0 Fieldvalue n/a
Field Index ShardId fi \0 Fieldname Fieldvalue’ \0 Datatype \0 UID n/a
Term Frequencies ShardId tf Datatype \0 UID \0 Fieldname Term Frequencies (protobuf)
Document ShardId d Datatype \0 UID \0 Viewname Document content
ShardIndex Global Index Fieldvalue’ Fieldname Shardid \0 Datatype UID List (up to 20)
Shard Reverse Index Global Index Reverse-Fieldvalue’ Fieldname Shardid \0 Datatype UID List (up to 20)
DatawaveMetadata Metadata Fieldname MetadataType Metadata Metadata

Query

The basic components that uniquely identify a document is the ShardId + Datatype + UID. So the goal of a query is to derive those components and match all of the fields in the document. A query is composed of a JEXL query or LUCENE like query, a date range, the query logic name, and potentially other query parameters. The general process for the main query logic (EventQuery) is as follows:

  1. Query Planning
  2. Scan global index
  • Fieldvalue’ + Fieldname → Shardid + Datatype + UID*
  • Note that we may not have a UID if the UID list is empty because there are more than 20 documents with the same fieldname, fieldvalue’, shardid, and datatype.
  1. If we do not have a UID, then scan the field index
  • Shardid + Datatype + Fieldvalue’ + Fieldname → UID
  1. Pull out the document
  • Shardid + Datatype + UID → Fieldnames + Fieldvalues
  1. Evaluate the documents
  • Fieldnames + Fieldvalues + Query → match (or not)

Query Planning

The first thing that happens is query planning which means turning the query into one that can be run against the index is composed of these main steps:

  1. apply query model (found in DatawaveMetadata table)
  • FIELD=value → (FIELD1==value || FIELD2==value)
  1. expand functions: this will add a query suitable to find the ranges needed to match the function.
  • function() → (function() && (query))
  1. unfielded expansion: this will expand the terms where the fieldnames were not specified
  • value → (FIELD1==value || FIELD2==value)
  • regex → (FIELD1==value1 || FIELD2==value2)
  • regex → ((ExceededValueThreshold=true) && (FIELD1=~regex)) (NOTE: This is the query form that results in an Ivarator. This happens when the number of values matching the regex (or range below) exceeds a configured threshold.)
  1. normalize: this will normalize the values to match those put in the index.
  • FIELD=value → FIELD=value’
  1. pushdown: this will delay the terms (meaning DO NOT use the index) where the selectivity is too low or cannot be used (index-only)
  • FIELD=value → ((ASTDelayed=true) && (FIELD==value))
  • FIELD=~regex → ((ASTDelayed=true) && (FIELD=~regex))
  1. expand regex/ranges: this uses the global index to expand a regex or range into specific values. Additional Ivarators may be signaled here with ExceededValueThreshold markers.
  • FIELD=~regex → (FIELD==value1 || FIELD==value2)
  • (FIELD>value1 && FIELD<value2) → (FIELD==value1 || FIELD==value2)
  • FIELD=~regex → ((ExceededValueThreshold=true) && (FIELD=~regex))
  • (FIELD>value1 && FIELD<value2) → ((ExceededValueThreshold=true) && (FIELD>value1 && FIELD<value2))
  1. pullup: This will un-delay the terms (meaning use the index) as required to be able to appropriately execute the query.
  • ((ASTDelayed=true) && (FIELD==value)) → FIELD==value
  • ((ASTDelayed=true) && (FIELD=~regex)) → FIELD=~regex
  1. expand regex/ranges (same as above)

Scan Global Index

For all of the indexed fieldname=fieldvalue’ terms in the query which are not in a ASTDelayed or ExceededValueThreshold expression, scan the global index. Given the structure of the global index, this will get us shardid+datatypes (shard range) and optionally shardid+datatype+UID (document range). The boolean structure of those terms are used to determine the final ranges to scan in the shard table.
Given the following query:

(FIELD1==value1 && FIELD2==value2) || FIELD3==value3

This will result in three scans each resulting in a stream of shard and document ranges:

  1. FIELD1==value1 → shard1+dt1, shard1+dt2, shard2+dt1+uid1, shard2+dt1+uid1
  2. FIELD2==value2 → shard1+dt1, shard2+dt1
  3. FIELD3==value3 → shard1+dt1, shard3+dt1+uid3

If we intersect the first two terms (because of the && boolean) we get:

  • shard1+dt1, shard2+dt1+uid1, shard2+dt1+uid1 The shard2+dt1 range exists for FIELD2=value2, and we have UIDs therein matching the FIELD1=value1. So only those UIDs potentially contain both terms and hence the intersections produces the document ranges.

Then if we union with the third term (because of the || boolean) we get:

  • shard1+dt1, shard2+dt1+uid1, shard2+dt1+uid1, shard3+dt1+uid3

So this gives us one shard range and three document ranges. The shard range will require going to the field index to get the UIDs, but the document ranges can go directly to pulling out the documents for evaluation.

Scan the field index

Scanning the field index is basically identical to scanning the global index except that in this case we can handle the ExceededValueThreshold expressions (Ivarator). Given the following query:

(FIELD1==value1 && FIELD2==value2) || ((ExceededValueThreshold=true) && (FIELD3=~value3))

This will result in three scans which result in UID lists in sorted order:

  1. FIELD1==value1 → uid1, uid2, uid3
  2. FIELD2==value2 → uid1, uid3
  3. FIELD3==value3 → uid2, uid4, uid5 (see Ivarator section)

If we intersect the first two terms (because of the && boolean) we get:

  • uid1, uid3

Then if we union with the third term (because of the || boolean) we get:

  • uid1, uid2, uid3, uid4, uid5

Since we are already scanning the shard, we can go directly to those documents within the same iterator/scan for the next step

Ivarator

The reason an Ivarator is required to scan the field index is that when the expression can match multiple values, then the UIDs being pulled out of the field index will not be in sorted order given the structure of the field index. In order to do the intersections and unions as required by the boolean logic, each of the streams of uids being pulled out must be in sorted order. The field index is constructed as follows:

shardId fi \0 Fieldname Fieldvalue’ \0 Datatype \0 UID

So the scan of the field index may result in the following stream of UIDs: uid2, uid5, uid3, uid10, … These uids are pulled into a buffer and are sorted. As the buffer is filled up with uids, they are stored in separate files either on disk (i.e. local) or in hdfs. Note that this all has to be done before the first union or intersection is performed. Once the scan is complete, then the files are read back out doing a merge sort in the process:

  • file1: uid1, uid5, uid3, uid10, uid2
  • file2: uid3, uid8, uid9, uid13
  • final merge sort: uid1, uid2, uid3, uid5, uid8, uid9, uid10, uid13

Pull out the document

Given the shardid, datatype, and uid we can directly scan for all of the fieldname, fieldvalue pairs that correspond to the document. Note that if we scanned the field index first then the document will already be filled with the “index only” fieldname, fieldvalue pairs. If we did not scan the field index first and we have “index only “ fields in the query, then we will pull those values directly out of the field index as well. Everything exists in the same shard, and hence is located in the same table so we can bounce between the field index and the document in the same scan. In then end we will have a set of fieldname, fieldvalue pairs that can be used for the final evaluation

Evaluate the document

Evaluating the document is running the full set of boolean logic, functions, etc against the set of fieldname, fieldvalue pairs. In the end we will either have a match or we will not. If we have a match, then the document as a set of fieldname, fieldvalue pairs are returned back to the client.