Basic Queries

Table of contents

Introduction
SELECT
FROM
WHERE
GROUP BY
HAVING
- Description
- Example
ORDER BY
LIMIT

Introduction

SELECT statement in SQL is the most common query that retrieves data from Elasticsearch index. In this doc, only simple SELECT statement with single index and query involved is covered. A SELECT statement includes SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT clause. Among these clauses, SELECT and FROM are the foundation to specify which fields to be fetched and which index they should be fetched from. All others are optional and used according to your needs. Please read on for their description, syntax and use cases in details.

Syntax

The syntax of SELECT statement is as follows:

SELECT [ALL | DISTINCT] (* | expression) [[AS] alias] [, ...]
FROM index_name
[WHERE predicates]
[GROUP BY expression [, ...]
 [HAVING predicates]]
[ORDER BY expression [ASC | DESC] [NULLS {FIRST | LAST}] [, ...]]
[LIMIT [offset, ] size]

Although multiple query statements to execute in batch is not supported, ending with semicolon ; is still allowed. For example, you can run SELECT * FROM accounts; without issue. This is useful to support queries generated by other tool, such as Microsoft Excel or BI tool.

Fundamentals

Apart from predefined keyword of SQL language, the most basic element is literal and identifier. Literal is numeric, string, date or boolean constant. Identifier represents Elasticsearch index or field name. With arithmetic operators and SQL functions applied, the basic literals and identifiers can be built into complex expression.

Rule expressionAtom:

The expression in turn can be combined into predicate with logical operator. Typically, predicate is used in WHERE and HAVING clause to filter out data by conditions specified.

Rule expression:

Rule predicate:

Execution Order

The actual order of execution is very different from its appearance:

FROM index
 WHERE predicates
  GROUP BY expressions
   HAVING predicates
    SELECT expressions
     ORDER BY expressions
      LIMIT size

SELECT

Description

SELECT clause specifies which fields in Elasticsearch index should be retrieved.

Syntax

Rule selectElements:

Rule selectElement:

Example 1: Selecting All Fields

You can use * to fetch all fields in the index which is very convenient when you just want to have a quick look at your data.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT * FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200
}

Result set:

account_number	firstname	gender	city	balance	employer	state	email	address	lastname	age
1	Amber	M	Brogan	39225	Pyrami	IL	amberduke@pyrami.com	880 Holmes Lane	Duke	32
6	Hattie	M	Dante	5686	Netagy	TN	hattiebond@netagy.com	671 Bristol Street	Bond	36
13	Nanette	F	Nogal	32838	Quility	VA	nanettebates@quility.com	789 Madison Street	Bates	28
18	Dale	M	Orick	4180	null	MD	daleadams@boink.com	467 Hutchinson Court	Adams	33

Example 2: Selecting Specific Fields

More often you would give specific field name(s) in SELECT clause to avoid large and unnecessary data retrieved.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT firstname, lastname FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "firstname",
      "lastname"
    ],
    "excludes" : [ ]
  }
}

Result set:

firstname	lastname
Amber	Duke
Dale	Adams
Hattie	Bond
Nanette	Bates

Example 3: Using Field Alias

Alias is often used to make your query more readable by giving your field a shorter name.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number AS num FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  }
}

Result set:

num
1
6
13
18

Example 4: Selecting Distinct Fields

By default, SELECT ALL takes effect to return all rows. DISTINCT is useful when you want to de-duplicate and get unique field value. You can provide one or more field names ('DISTINCT *' is not supported yet).

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT DISTINCT age FROM accounts"
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

In fact your can use any expression in a DISTINCT clause as follows:

od> SELECT DISTINCT SUBSTRING(lastname, 1, 1) FROM accounts;
fetched rows / total rows = 3/3
+-----------------------------+
| SUBSTRING(lastname, 1, 1)   |
|-----------------------------|
| A                           |
| B                           |
| D                           |
+-----------------------------+

FROM

Description

FROM clause specifies Elasticsearch index where the data should be retrieved from. You've seen how to specify a single index in FROM clause in last section. Here we provide examples for more use cases.

Subquery in FROM clause is also supported. Please check out the documentation for more details.

Syntax

Rule tableName:

Example 1: Using Index Alias

Similarly you can give index in FROM clause an alias and use it across clauses in query.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT acc.account_number FROM accounts acc"
}

Example 2: Selecting From Multiple Indices by Index Pattern

Alternatively you can query from multiple indices of similar names by index pattern. This is very convenient for indices created by Logstash index template with date as suffix.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM account*"
}

Example 3: [Deprecating] Selecting From Specific Index Type

You can also specify type name explicitly though this has been deprecated in later Elasticsearch version.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM accounts/account"
}

WHERE

Description

WHERE clause specifies only Elasticsearch documents that meet the criteria should be affected. It consists of predicates that uses =, <>, >, >=, <, <=, IN, BETWEEN, LIKE, IS NULL or IS NOT NULL. These predicates can be combined by logical operator NOT, AND or OR to build more complex expression.

For LIKE and other full text search topics, please refer to Full Text Search documentation.

Besides SQL query, WHERE clause can also be used in SQL statement such as DELETE. Please refer to Data Manipulation Language documentation for details.

Example 1: Comparison Operators

Basic comparison operators, such as =, <>, >, >=, <, <=, can work for number, string or date. IN and BETWEEN is convenient for comparison with multiple values or a range.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        WHERE account_number = 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "bool" : {
            "must" : [
              {
                "term" : {
                  "account_number" : {
                    "value" : 1,
                    "boost" : 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  }
}

Result set:

account_number
1

Example 2: Missing Fields

As NoSQL database, Elasticsearch allows for flexible schema that documents in an index may have different fields. In this case, you can use IS NULL or IS NOT NULL to retrieve missing fields or existing fields only.

Note that for now we don't differentiate missing field and field set to NULL explicitly.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number, employer
        FROM accounts
        WHERE employer IS NULL
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "bool" : {
            "must" : [
              {
                "bool" : {
                  "must_not" : [
                    {
                      "exists" : {
                        "field" : "employer.keyword",
                        "boost" : 1.0
                      }
                    }
                  ],
                  "adjust_pure_negative" : true,
                  "boost" : 1.0
                }
              }
            ],
            "adjust_pure_negative" : true,
            "boost" : 1.0
          }
        }
      ],
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "account_number",
      "employer"
    ],
    "excludes" : [ ]
  }
}

Result set:

account_number	employer
18	null

GROUP BY

Description

GROUP BY groups documents with same field value into buckets. It is often used along with aggregation functions to aggregate inside each bucket. Please refer to SQL Functions documentation for more details.

Note that WHERE clause is applied before GROUP BY clause.

Example 1: Grouping by Fields

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age
        FROM accounts
        GROUP BY age
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

Example 2: Grouping by Field Alias

Field alias is accessible in GROUP BY clause.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number AS num
        FROM accounts
        GROUP BY num
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "account_number",
  "aggregations" : {
    "num" : {
      "terms" : {
        "field" : "account_number",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

num
1
6
13
18

Example 3: Grouping by Ordinal

Alternatively field ordinal in SELECT clause can be used too. However this is not recommended because your GROUP BY clause depends on fields in SELECT clause and require to change accordingly.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age
        FROM accounts
        GROUP BY 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

age
28
32
33
36

Example 4: Grouping by Scalar Function

Scalar function can be used in GROUP BY clause and it's required to be present in SELECT clause too.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT ABS(age) AS a
        FROM accounts
        GROUP BY ABS(age)
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "script"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "abs(age)",
  "script_fields" : {
    "abs(age)" : {
      "script" : {
        "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;",
        "lang" : "painless"
      },
      "ignore_failure" : false
    }
  },
  "aggregations" : {
    "abs(age)" : {
      "terms" : {
        "script" : {
          "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;",
          "lang" : "painless"
        },
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      }
    }
  }
}

Result set:

a
28.0
32.0
33.0
36.0

HAVING

Description

HAVING clause filters result from GROUP BY clause by predicate(s). Because of this, aggregation function, even different from those on SELECT clause, can be used in predicate.

Example

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT age, MAX(balance)
        FROM accounts
        GROUP BY age
        HAVING MIN(balance) > 10000
        """
}

Explain:

{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "age",
      "MAX"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "age",
  "aggregations" : {
    "age" : {
      "terms" : {
        "field" : "age",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      },
      "aggregations" : {
        "MAX_0" : {
          "max" : {
            "field" : "balance"
          }
        },
        "min_0" : {
          "min" : {
            "field" : "balance"
          }
        },
        "bucket_filter" : {
          "bucket_selector" : {
            "buckets_path" : {
              "min_0" : "min_0",
              "MAX_0" : "MAX_0"
            },
            "script" : {
              "source" : "params.min_0 > 10000",
              "lang" : "painless"
            },
            "gap_policy" : "skip"
          }
        }
      }
    }
  }
}

Result set:

age	MAX(balance)
28	32838
32	39225

ORDER BY

Description

ORDER BY clause specifies which fields used to sort the result and in which direction.

Example 1: Ordering by Fields

Besides regular field names, ordinal, alias or scalar function can also be used similarly as in GROUP BY. ASC (by default) or DESC can be appended to indicate sorting in ascending or descending order.

SQL query:

POST /_opendistro/_sql
{
  "query" : "SELECT account_number FROM accounts ORDER BY account_number DESC"
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "desc"
      }
    }
  ]
}

Result set:

account_number
18
13
6
1

Example 2: Specifying Order for Null

Additionally you can specify if documents with missing field be put first or last. The default behavior of Elasticsearch is to return nulls or missing last. You can make them present before non-nulls by using IS NOT NULL.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT employer
        FROM accounts
        ORDER BY employer IS NOT NULL
        """
}

Explain:

{
  "from" : 0,
  "size" : 200,
  "_source" : {
    "includes" : [
      "employer"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "employer.keyword" : {
        "order" : "asc",
        "missing" : "_first"
      }
    }
  ]
}

Result set:

employer
null
Netagy
Pyrami
Quility

Note that the example above is essentially sorting on a predicate expression. In this case, nulls are put first because it's evaluated to false (0), though all the rest are evaluated to true and still in random order. If you want to specify order for both nulls and non-nulls, NULLS FIRST or NULLS LAST in SQL standard can help. Basically, it allows you to specify an independent order for nulls along with ASC or DESC keyword:

od> SELECT employer FROM accounts ORDER BY employer ASC NULLS LAST;
fetched rows / total rows = 4/4
+------------+
| employer   |
|------------|
| Netagy     |
| Pyrami     |
| Quility    |
| null       |
+------------+

The sorting rule can be summarized as follows:

Without NULLS clause
- ASC: sort non-nulls in ascending order and put nulls first
- DESC: sort non-nulls in descending order and put nulls last
With NULLS clause: just use the nulls order given

Here is another example for sort in descending order without NULLS clause:

od> SELECT employer FROM accounts ORDER BY employer DESC;
fetched rows / total rows = 4/4
+------------+
| employer   |
|------------|
| Quility    |
| Pyrami     |
| Netagy     |
| null       |
+------------+

Example 3: Ordering by Aggregate Functions

Aggregate functions are allowed to be used in ORDER BY clause. You can reference it by same function call or its alias or ordinal in select list:

od> SELECT gender, MAX(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC;
fetched rows / total rows = 2/2
+----------+------------+
| gender   | MAX(age)   |
|----------+------------|
| M        | 36         |
| F        | 28         |
+----------+------------+

Even if it's not present in SELECT clause, it can be also used as follows:

od> SELECT gender, MIN(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC;
fetched rows / total rows = 2/2
+----------+------------+
| gender   | MIN(age)   |
|----------+------------|
| M        | 32         |
| F        | 28         |
+----------+------------+

LIMIT

Description

Mostly specifying maximum number of documents returned is necessary to prevent fetching large amount of data into memory. LIMIT clause is helpful in this case. Basically the limit is set to the query planning, so different LIMIT and OFFSET might end up unpredictable subset in the results. Thus it is suggested to use order by in query with limit keyword to enforce a fixed ordering in the result set.

Example 1: Limiting Result Size

Given a positive number, LIMIT uses it as page size to fetch result of that size at most.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        ORDER BY account_number LIMIT 1
        """
}

Explain:

{
  "from" : 0,
  "size" : 1,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "asc"
      }
    }
  ]
}

Result set:

account_number
1

Example 2: Fetching at Offset

Offset position can be given as first argument to indicate where to start fetching. This can be used as simple pagination solution though it's inefficient on large index. Generally ORDER BY is required in this case to ensure the same order between pages.

SQL query:

POST /_opendistro/_sql
{
  "query" : """
        SELECT account_number
        FROM accounts
        ORDER BY account_number LIMIT 1, 1
        """
}

Explain:

{
  "from" : 1,
  "size" : 1,
  "_source" : {
    "includes" : [
      "account_number"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "account_number" : {
        "order" : "asc"
      }
    }
  ]
}

Result set:

account_number
6

Offset position can be given following the OFFSET keyword as well, here is an example:

>od SELECT age FROM accounts ORDER BY age LIMIT 2 OFFSET 1
fetched rows / total rows = 2/2
+-------+
| age   |
|-------|
| 32    |
| 33    |
+-------+

Limitation

Generally, sort plan is pushed down into the Elasticsearch DSL in plan optimization, but note that if a query has complex sorting, like sort expression, which would not be pushed down during optimization (see Optimizations for details), but computed in local memory. However, the engine fetches the index of a default size that is set in plugin setting (See Settings <../admin/settings.rst> opendistro.query.size_limit for details). Therefore, the result might not be absolutely correct if the index size is larger than the default size of index scan. For example, the engine has a index scan size of 200 and the index size is 500. Then a query with limit 300 can only fetch 200 rows of the index, compute and return the sorted result with 200 rows, while the rest 300 rows of the index are ignored and would not be fetched into the engine. To get an absolutely correct result, it is suggested to set the query size limit to a larger value before run the query.

Files

basics.rst

Latest commit

History

basics.rst

File metadata and controls

Basic Queries