This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Basic Queries

The SELECT statement is the most common SQL query and retrieves data from an Elasticsearch index. This document covers only simple SELECT statements that involve a single index and a single query. A SELECT statement consists of SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT clauses. Of these, SELECT and FROM are the foundation: they specify which fields to fetch and which index to fetch them from. All other clauses are optional and can be used as needed. The sections below describe the syntax and use cases of each clause in detail.

The syntax of SELECT statement is as follows:

SELECT [ALL | DISTINCT] (* | expression) [[AS] alias] [, ...]
FROM index_name
[WHERE predicates]
[GROUP BY expression [, ...]
 [HAVING predicates]]
[ORDER BY expression [ASC | DESC] [NULLS {FIRST | LAST}] [, ...]]
[LIMIT [offset, ] size]

Executing multiple query statements in a batch is not supported, but ending a statement with a semicolon ; is still allowed. For example, you can run SELECT * FROM accounts; without issue. This helps support queries generated by other tools, such as Microsoft Excel or BI tools.
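As a minimal sketch of how a client could prepare such a request, the Python snippet below builds the JSON body for a single statement. The endpoint path is the one used by the examples in this document; the helper name build_sql_request is hypothetical, and actually sending the request is left to whatever HTTP client you use.

```python
import json

# Endpoint taken from the examples in this document.
SQL_ENDPOINT = "/_opendistro/_sql"


def build_sql_request(query: str) -> str:
    """Return the JSON request body for a single SQL statement.

    json.dumps escapes quotes and newlines, so a multi-line statement
    is sent as one ordinary JSON string.
    """
    return json.dumps({"query": query})


# A trailing semicolon is passed through unchanged.
body = build_sql_request("SELECT * FROM accounts;")
print("POST", SQL_ENDPOINT)
print(body)
```

Serializing with json.dumps rather than formatting the string by hand avoids broken requests when the query itself contains quotes or line breaks.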

Apart from the predefined keywords of the SQL language, the most basic elements are literals and identifiers. A literal is a numeric, string, date or boolean constant. An identifier represents an Elasticsearch index or field name. By applying arithmetic operators and SQL functions, basic literals and identifiers can be built into complex expressions.

Rule expressionAtom:

/docs/user/img/rdd/expressionAtom.png

Expressions can in turn be combined into a predicate with logical operators. Typically, a predicate is used in the WHERE and HAVING clauses to filter data by the specified conditions.

Rule expression:

/docs/user/img/rdd/expression.png

Rule predicate:

/docs/user/img/rdd/predicate.png

The actual order of execution is very different from the order in which the clauses are written:

FROM index
 WHERE predicates
  GROUP BY expressions
   HAVING predicates
    SELECT expressions
     ORDER BY expressions
      LIMIT size
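
The execution order can be illustrated with a small in-memory emulation in Python. This is only a sketch of the logical order, not how the plugin is implemented: the rows are a reduced copy of the sample accounts data, and the emulated query is SELECT age, MAX(balance) FROM accounts GROUP BY age HAVING MIN(balance) > 10000 ORDER BY age LIMIT 10.

```python
from collections import defaultdict

# Sample rows mirroring the accounts data used in this document.
accounts = [
    {"account_number": 1, "age": 32, "balance": 39225},
    {"account_number": 6, "age": 36, "balance": 5686},
    {"account_number": 13, "age": 28, "balance": 32838},
    {"account_number": 18, "age": 33, "balance": 4180},
]

# FROM: pick the data source.
rows = accounts

# WHERE: filter individual rows (this query has no WHERE, so keep all).
rows = [r for r in rows if True]

# GROUP BY age: collect rows into buckets keyed by age.
groups = defaultdict(list)
for r in rows:
    groups[r["age"]].append(r)

# HAVING MIN(balance) > 10000: filter whole buckets, not single rows.
groups = {
    age: bucket
    for age, bucket in groups.items()
    if min(r["balance"] for r in bucket) > 10000
}

# SELECT age, MAX(balance): compute the output columns per bucket.
result = [
    (age, max(r["balance"] for r in bucket)) for age, bucket in groups.items()
]

# ORDER BY age: sort the projected rows.
result.sort(key=lambda row: row[0])

# LIMIT 10: truncate last.
result = result[:10]

print(result)
```

Because HAVING runs before SELECT, it can filter on MIN(balance) even though only MAX(balance) appears in the output.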

The SELECT clause specifies which fields in the Elasticsearch index should be retrieved.

Rule selectElements:

/docs/user/img/rdd/selectElements.png

Rule selectElement:

/docs/user/img/rdd/selectElement.png

You can use * to fetch all fields in the index, which is convenient when you just want a quick look at your data.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT * FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200
}

Result set:

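+----------------+-----------+--------+--------+---------+----------+-------+--------------------------+----------------------+----------+-----+
| account_number | firstname | gender | city   | balance | employer | state | email                    | address              | lastname | age |
|----------------+-----------+--------+--------+---------+----------+-------+--------------------------+----------------------+----------+-----|
| 1              | Amber     | M      | Brogan | 39225   | Pyrami   | IL    | amberduke@pyrami.com     | 880 Holmes Lane      | Duke     | 32  |
| 6              | Hattie    | M      | Dante  | 5686    | Netagy   | TN    | hattiebond@netagy.com    | 671 Bristol Street   | Bond     | 36  |
| 13             | Nanette   | F      | Nogal  | 32838   | Quility  | VA    | nanettebates@quility.com | 789 Madison Street   | Bates    | 28  |
| 18             | Dale      | M      | Orick  | 4180    | null     | MD    | daleadams@boink.com      | 467 Hutchinson Court | Adams    | 33  |
+----------------+-----------+--------+--------+---------+----------+-------+--------------------------+----------------------+----------+-----+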

More often, you would list specific field names in the SELECT clause to avoid retrieving large amounts of unnecessary data.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT firstname, lastname FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "firstname",
      "lastname"
    ],
    "excludes" : [ ]
  }
}

Result set:

firstname lastname
Amber Duke
Dale Adams
Hattie Bond
Nanette Bates

An alias is often used to make a query more readable by giving a field a shorter name.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number AS num FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  }
}

Result set:

num
1
6
13
18

By default, SELECT ALL takes effect and returns all rows. DISTINCT is useful when you want to de-duplicate rows and get unique field values. You can provide one or more field names ('DISTINCT *' is not supported yet).

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT DISTINCT age FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

In fact, you can use any expression in a DISTINCT clause, as follows:

od> SELECT DISTINCT SUBSTRING(lastname, 1, 1) FROM accounts;
fetched rows / total rows = 3/3
+-----------------------------+
| SUBSTRING(lastname, 1, 1)   |
|-----------------------------|
| A                           |
| B                           |
| D                           |
+-----------------------------+
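
To make the semantics concrete, here is a rough Python emulation of DISTINCT applied to an expression. It assumes the four sample last names from the accounts data and only mirrors the logic (evaluate the expression per row, then drop duplicates); it says nothing about how the plugin executes the query.

```python
# Last names from the sample accounts data.
lastnames = ["Duke", "Bond", "Bates", "Adams"]

# SUBSTRING(lastname, 1, 1): SQL positions are 1-based,
# so this is the first character of each value.
first_letters = [name[0:1] for name in lastnames]

# DISTINCT: drop duplicates. Sorting is only there to make
# the printed output deterministic.
distinct_letters = sorted(set(first_letters))

print(distinct_letters)
```

Four rows go in and three come out, because Bond and Bates share the same first letter.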

The FROM clause specifies the Elasticsearch index from which data should be retrieved. The previous section showed how to specify a single index in the FROM clause. The examples here cover more use cases.

A subquery in the FROM clause is also supported. Please check out the documentation for more details.

Rule tableName:

/docs/user/img/rdd/tableName.png

Similarly, you can give the index in the FROM clause an alias and use it across the clauses of the query.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT acc.account_number FROM accounts acc"
}

Alternatively, you can query multiple indices with similar names by using an index pattern. This is convenient for indices created by a Logstash index template with a date suffix.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM account*"
}
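
As a rough illustration of which indices a pattern such as account* would cover, the Python sketch below matches a list of hypothetical index names against the pattern. It uses fnmatchcase from the standard library as an approximation of the * wildcard only; the index names are invented for the example, and the real resolution is done by Elasticsearch.

```python
from fnmatch import fnmatchcase

# Hypothetical index names, e.g. daily indices with a date suffix.
index_names = [
    "accounts",
    "account-2020.01.01",
    "account-2020.01.02",
    "logs-2020.01.01",
]

# Pattern from the FROM clause of the example query.
pattern = "account*"

# Keep every index whose name matches the wildcard pattern.
matched = [name for name in index_names if fnmatchcase(name, pattern)]

print(matched)
```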

You can also specify the type name explicitly, though this has been deprecated in later Elasticsearch versions.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM accounts/account"
}

The WHERE clause specifies that only Elasticsearch documents meeting the criteria are affected. It consists of predicates that use =, <>, >, >=, <, <=, IN, BETWEEN, LIKE, IS NULL or IS NOT NULL. These predicates can be combined with the logical operators NOT, AND and OR to build more complex expressions.

For LIKE and other full text search topics, please refer to the Full Text Search documentation.

Besides SQL queries, the WHERE clause can also be used in other SQL statements such as DELETE. Please refer to the Data Manipulation Language documentation for details.

Basic comparison operators, such as =, <>, >, >=, <, <=, work for numbers, strings and dates. IN and BETWEEN are convenient for comparison with multiple values or a range.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        WHERE account_number = 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "bool" : {
            "must" : [
              {
                "term" : {
                  "account_number" : {
                    "value" : 1,
                    "boost" : 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  }
}

Result set:

account_number
1

As a NoSQL database, Elasticsearch allows a flexible schema, so documents in an index may have different fields. In this case, you can use IS NULL or IS NOT NULL to retrieve only the documents in which a field is missing or present.

Note that, for now, a missing field and a field explicitly set to NULL are not differentiated.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number, employer
        FROM accounts
        WHERE employer IS NULL
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "bool" : {
            "must" : [
              {
                "bool" : {
                  "must_not" : [
                    {
                      "exists" : {
                        "field" : "employer.keyword",
                        "boost" : 1.0
                      }
                    }
                  ],
                  "adjust_pure_negative" : true,
                  "boost" : 1.0
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "account_number",
      "employer"
    ],
    "excludes" : [ ]
  }
}

Result set:

account_number employer
18 null
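
The following Python sketch mimics the IS NULL behavior described in this section, where a missing field and an explicit null are treated the same way. The documents are a reduced copy of the sample data plus one hypothetical document (account_number 99) that sets employer to null explicitly; the helper is_null is illustrative only.

```python
# Documents with a flexible schema: account 18 has no employer field,
# and the hypothetical account 99 sets it to null explicitly.
documents = [
    {"account_number": 1, "employer": "Pyrami"},
    {"account_number": 6, "employer": "Netagy"},
    {"account_number": 13, "employer": "Quility"},
    {"account_number": 18},
    {"account_number": 99, "employer": None},
]


def is_null(doc: dict, field: str) -> bool:
    """True if the field is missing or explicitly null.

    dict.get returns None in both cases, so the two are
    indistinguishable, as in the behavior described above.
    """
    return doc.get(field) is None


# WHERE employer IS NULL
null_rows = [d["account_number"] for d in documents if is_null(d, "employer")]

# WHERE employer IS NOT NULL
not_null_rows = [d["account_number"] for d in documents if not is_null(d, "employer")]

print(null_rows, not_null_rows)
```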

GROUP BY groups documents with the same field value into buckets. It is often used along with aggregation functions to aggregate inside each bucket. Please refer to the SQL Functions documentation for more details.

Note that the WHERE clause is applied before the GROUP BY clause.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age
        FROM accounts
        GROUP BY age
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

A field alias is accessible in the GROUP BY clause.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number AS num
        FROM accounts
        GROUP BY num
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "account_number",
  "aggregations" : {
    "num" : {
      "terms" : {
        "field" : "account_number",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

num
1
6
13
18

Alternatively, the ordinal of a field in the SELECT clause can be used. However, this is not recommended, because the GROUP BY clause then depends on the order of the fields in the SELECT clause and must be changed whenever that order changes.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age
        FROM accounts
        GROUP BY 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

A scalar function can be used in the GROUP BY clause, in which case it must also be present in the SELECT clause.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT ABS(age) AS a
        FROM accounts
        GROUP BY ABS(age)
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "script"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "abs(age)",
  "script_fields" : {
    "abs(age)" : {
      "script" : {
        "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;",
        "lang" : "painless"
      },
      "ignore_failure" : false
    }
  },
  "aggregations" : {
    "abs(age)" : {
      "terms" : {
        "script" : {
          "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;",
          "lang" : "painless"
        },
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

a
28.0
32.0
33.0
36.0

The HAVING clause filters the results of the GROUP BY clause by predicate(s). Because it is applied to grouped results, its predicates can use aggregation functions, even ones that differ from those in the SELECT clause.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age, MAX(balance)
        FROM accounts
        GROUP BY age
        HAVING MIN(balance) > 10000
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age",
      "MAX"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      },
      "aggregations" : {
        "MAX_0" : {
          "max" : {
            "field" : "balance"
          }
        },
        "min_0" : {
          "min" : {
            "field" : "balance"
          }
        },
        "bucket_filter" : {
          "bucket_selector" : {
            "buckets_path" : {
              "min_0" : "min_0",
              "MAX_0" : "MAX_0"
            },
            "script" : {
              "source" : "params.min_0 > 10000",
              "lang" : "painless"
            },
            "gap_policy" : "skip"
          }
        }
      }
    }
  }
}

Result set:

age MAX(balance)
28 32838
32 39225

The ORDER BY clause specifies which fields are used to sort the result and in which direction.

Besides regular field names, an ordinal, alias or scalar function can also be used, just as in GROUP BY. ASC (the default) or DESC can be appended to sort in ascending or descending order.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM accounts ORDER BY account_number DESC"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "desc"
      }
    }
  ]
}

Result set:

account_number
18
13
6
1

Additionally, you can specify whether documents with a missing field are put first or last. The default behavior of Elasticsearch is to return nulls or missing values last. You can make them appear before non-nulls by using IS NOT NULL.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT employer
        FROM accounts
        ORDER BY employer IS NOT NULL
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "employer"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "employer.keyword" : {
        "order" : "asc",
        "missing" : "_first"
      }
    }
  ]
}

Result set:

employer
null
Netagy
Pyrami
Quility

Note that the example above is essentially sorting on a predicate expression. Nulls are put first because the predicate evaluates to false (0) for them, while all other rows evaluate to true and remain in arbitrary order. If you want to specify the order for both nulls and non-nulls, the NULLS FIRST or NULLS LAST syntax from the SQL standard can help. It lets you specify an independent order for nulls along with the ASC or DESC keyword:

od> SELECT employer FROM accounts ORDER BY employer ASC NULLS LAST;
fetched rows / total rows = 4/4
+------------+
| employer   |
|------------|
| Netagy     |
| Pyrami     |
| Quility    |
| null       |
+------------+

The sorting rule can be summarized as follows:

  • Without NULLS clause
    • ASC: sort non-nulls in ascending order and put nulls first
    • DESC: sort non-nulls in descending order and put nulls last
  • With NULLS clause: just use the nulls order given

Here is another example that sorts in descending order without a NULLS clause:

od> SELECT employer FROM accounts ORDER BY employer DESC;
fetched rows / total rows = 4/4
+------------+
| employer   |
|------------|
| Quility    |
| Pyrami     |
| Netagy     |
| null       |
+------------+
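
The sorting rule summarized above can be sketched in Python as follows. This is an in-memory illustration of the rule as stated in this section, not the plugin's implementation; the function name order_by and its parameters are hypothetical, and None stands in for null.

```python
from typing import List, Optional


def order_by(
    values: List[Optional[str]],
    descending: bool = False,
    nulls: Optional[str] = None,
) -> List[Optional[str]]:
    """Sort values, placing nulls (None) according to the rule above.

    nulls is "first", "last", or None when no NULLS clause is given.
    """
    if nulls is None:
        # Without a NULLS clause: ASC puts nulls first, DESC puts them last.
        nulls = "last" if descending else "first"

    # Nulls are sorted independently of the non-null values.
    non_nulls = sorted((v for v in values if v is not None), reverse=descending)
    null_values = [v for v in values if v is None]

    if nulls == "first":
        return null_values + non_nulls
    return non_nulls + null_values


employers = ["Pyrami", "Netagy", "Quility", None]

# ORDER BY employer ASC NULLS LAST
print(order_by(employers, nulls="last"))

# ORDER BY employer DESC
print(order_by(employers, descending=True))
```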

Aggregate functions can be used in the ORDER BY clause. You can reference an aggregate by the same function call, by its alias, or by its ordinal in the select list:

od> SELECT gender, MAX(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC;
fetched rows / total rows = 2/2
+----------+------------+
| gender   | MAX(age)   |
|----------+------------|
| M        | 36         |
| F        | 28         |
+----------+------------+

An aggregate function can be used in ORDER BY even if it is not present in the SELECT clause:

od> SELECT gender, MIN(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC;
fetched rows / total rows = 2/2
+----------+------------+
| gender   | MIN(age)   |
|----------+------------|
| M        | 32         |
| F        | 28         |
+----------+------------+
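
One way to picture why this works is that every aggregate mentioned anywhere in the query is computed per group, and the select list is projected only afterwards. The Python sketch below follows that idea using the gender and age values of the sample accounts; it is an illustration under that assumption, not a description of the engine's internals.

```python
from collections import defaultdict

# gender and age of the four sample accounts.
accounts = [
    {"gender": "M", "age": 32},
    {"gender": "M", "age": 36},
    {"gender": "F", "age": 28},
    {"gender": "M", "age": 33},
]

# GROUP BY gender
ages_by_gender = defaultdict(list)
for row in accounts:
    ages_by_gender[row["gender"]].append(row["age"])

# Compute every aggregate the query refers to, including
# MAX(age), which is used only for ordering.
aggregated = [
    {"gender": gender, "min_age": min(ages), "max_age": max(ages)}
    for gender, ages in ages_by_gender.items()
]

# ORDER BY MAX(age) DESC
aggregated.sort(key=lambda group: group["max_age"], reverse=True)

# SELECT gender, MIN(age): project only the requested columns.
result = [(group["gender"], group["min_age"]) for group in aggregated]

print(result)
```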

In most cases, specifying the maximum number of documents returned is necessary to prevent fetching a large amount of data into memory. The LIMIT clause helps in this case. Because the limit is applied during query planning, different LIMIT and OFFSET values may return an unpredictable subset of the results. It is therefore recommended to use ORDER BY together with LIMIT to enforce a fixed ordering in the result set.

Given a positive number, LIMIT uses it as the page size and fetches at most that many results.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        ORDER BY account_number LIMIT 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 1,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "asc"
      }
    }
  ]
}

Result set:

account_number
1

An offset can be given as the first argument to indicate where to start fetching. This can serve as a simple pagination solution, though it is inefficient on a large index. ORDER BY is generally required in this case to ensure the same order between pages.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        ORDER BY account_number LIMIT 1, 1
        """
}

Explain:

{
  "from" : 1,
  "size" : 1,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "asc"
      }
    }
  ]
}

Result set:

account_number
6

The offset can also be given after the OFFSET keyword. Here is an example:

od> SELECT age FROM accounts ORDER BY age LIMIT 2 OFFSET 1
fetched rows / total rows = 2/2
+-------+
| age   |
|-------|
| 32    |
| 33    |
+-------+
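
The relationship between the two LIMIT forms and the from and size values shown in the explain output can be sketched in Python as below. The helper names are hypothetical, and the slicing only emulates the paging logic on an already sorted list.

```python
from typing import Dict, List


def limit_to_from_size(size: int, offset: int = 0) -> Dict[str, int]:
    """Map LIMIT [offset, ] size (or LIMIT size OFFSET offset)
    to the from/size pair seen in the explain output."""
    return {"from": offset, "size": size}


def apply_limit(rows: List[int], size: int, offset: int = 0) -> List[int]:
    """Skip `offset` rows, then return at most `size` rows."""
    return rows[offset:offset + size]


# ORDER BY account_number LIMIT 1, 1
account_numbers = sorted([13, 1, 18, 6])
print(limit_to_from_size(size=1, offset=1))
print(apply_limit(account_numbers, size=1, offset=1))

# ORDER BY age LIMIT 2 OFFSET 1
ages = sorted([32, 36, 28, 33])
print(apply_limit(ages, size=2, offset=1))
```

Sorting before slicing is what keeps pages stable; without ORDER BY the same offset could point at different rows on each request.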

Generally, the sort plan is pushed down into the Elasticsearch DSL during plan optimization. However, if a query has complex sorting, such as sorting by an expression, the sort is not pushed down during optimization (see Optimizations for details) but is computed in local memory. In that case, the engine fetches only a default number of rows from the index, as set in the plugin setting opendistro.query.size_limit (see Settings <../admin/settings.rst> for details). The result might therefore be incorrect if the index is larger than the default index scan size. For example, suppose the engine has an index scan size of 200 and the index contains 500 rows. A query with LIMIT 300 can then fetch only 200 rows of the index, so it computes and returns a sorted result of 200 rows, while the remaining 300 rows are ignored and never fetched into the engine. To get a correct result, set the query size limit to a larger value before running the query.
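
The effect can be reproduced with a small Python simulation of the numbers in the example above. It assumes, purely for illustration, that the engine fetches the first scan_size rows in index order and then sorts them locally by a descending expression; the function name and the row values are hypothetical.

```python
from typing import List

# An index of 500 rows; each row is represented by a single numeric value.
index_rows = list(range(500))


def sort_in_memory(rows: List[int], limit: int, scan_size: int) -> List[int]:
    """Emulate a sort that is not pushed down to Elasticsearch."""
    # Only scan_size rows are fetched into the engine
    # (assumed here to be the first ones in index order).
    fetched = rows[:scan_size]
    # The sort expression (-value, i.e. descending) is evaluated locally.
    ordered = sorted(fetched, key=lambda value: -value)
    # LIMIT is applied to whatever was fetched.
    return ordered[:limit]


# Scan size 200: only 200 rows come back, and the top value is wrong.
truncated = sort_in_memory(index_rows, limit=300, scan_size=200)
print(len(truncated), truncated[0])

# Scan size raised to cover the whole index: the result is complete.
complete = sort_in_memory(index_rows, limit=300, scan_size=500)
print(len(complete), complete[0])
```

With the smaller scan size, the query returns fewer rows than requested and misses the rows that should have sorted first, which is why raising the size limit matters for such queries.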