Skip to content

Latest commit

 

History

History

sql-to-dbsp-compiler

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

SQL to DBSP compiler

This directory holds the source code for a compiler translating SQL view definitions into DBSP circuits. The SQL compiler is based on the Apache Calcite compiler infrastructure https://calcite.apache.org/

Dependencies

The code has been tested on Windows 10/11 and Linux.

You need Java 19 or later to build the compiler, and the maven (mvn) Java build program. Maven will take care of installing all required Java dependencies.

The code generated by the compiler is Rust. To run it you need a working installation of Rust (we recommend using rustup to install: https://www.rust-lang.org/tools/install). The compiler uses the current version of the DBSP library sources from github (https://github.com/feldera/feldera).

If you want to generate images of the query plans you need to also install graphviz as described here https://graphviz.org/download/.

Building

To build the compiler run:

mvn -DskipTests package

Rust compilation errors

If you get Rust compilation errors you should try to make sure you have the latest version of the Rust libraries and toolchain:

$ rustup update
$ cd temp
$ cargo update

(temp is the directory where the tests write the generated Rust code.)

Incremental view maintenance

The DBSP runtime is optimized for performing incremental view maintenance. In consequence, DBSP programs in SQL are expressed as VIEWS, or standing queries. A view is essentially a computation that describes a relation that is computed from other tables or views.

For example, the following query defines a view:

CREATE VIEW V AS SELECT T.COL1 FROM T WHERE T.age > 18

In order to interpret this query the compiler needs to have been given a definition of table (or view) T. The table T should be defined using a SQL Data Definition Language (DDL) statement, e.g.:

CREATE TABLE T
(
    name    VARCHAR,
    age     INT,
    present BOOLEAN
)

The compiler must be given the table definition first, and then the view definition. The compiler generates a Rust library which implements the query as a function: given the input data, it produces the output data.

The compiler can (optionally) generate a library which will incrementally maintain the view V when presented with changes to table T:

                                           table changes
                                                V
tables -----> SQL-to-DBSP compiler ------> DBSP circuit
views                                           V
                                           view changes

Using the compiler

See the documentation in docs/contributors/compiler.md in this repo, or online.

Compiler architecture

A presentation about the internals of the compiler implementation.

Compilation proceeds in several stages:

  • SQL statements are parsed using the calcite SQL parser generating an IR representation using the Calcite SqlNode data types
  • the SQL IR tree is validated, optimized, and converted to the Calcite RelNode representation
  • The result of this stage is a CalciteProgram data structure, which packages together all the views that are being compiled (multiple views can be maintained simultaneously)
  • CalciteToDBSPCompiler converts a CalciteProgram data structure into a DBSPCircuit data structure.
  • The CircuitOptimizer class can optimize the generated circuit, optionally converting it into an incremental circuit, which is expected to compute only on changes.
  • The circuit can be serialized as Rust using the ToRustString visitor.

Testing

Unit tests

Unit tests are written using JUnit and test pointwise parts of the compiler. They can be executed using mvn test. They also run as a side effect of mvn package (omitting -DskipTests).

The unit tests are implemented as individual Java files under sql-to-dbsp-compiler/SQL-compiler/src/test/java/org/dbsp/sqlCompiler/compiler/. Each unit test uses the SQL to DBSP compiler to generate Rust code under sql-to-dbsp-compiler/temp and compiles and runs that code.

If one of the unit tests fails, you may re-populate the Rust code for it into temp by telling mvn to run only that particular test. If an error message indicates that org.dbsp.sqlCompiler.compiler.postgres.PostgresTimestampTests failed, for example, you may rerun just that test, first with mvn test and then again with cargo test, via:

$ mvn test -Dtest=PostgresTimestampTests
$ cd ../temp
$ cargo test

SQL logic tests

One of the means of testing the compiler is using sqllogictests: https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki.

The sqllogictests includes more than 5 million tests. It takes weeks to run all of them. Most of the time is spent compiling Rust. We hope to speed that up at some point.

To start running these tests:

$ ./run-tests.sh

We have implemented a general-purpose testing framework in Java for running SqlLogicTest programs. The framework parses SqlLogicTest files and creates an internal representation of these files. The files are executed by "test executors".

The tests are run by a standalone executable (the executable is invoked by the run-tests.sh script). The executable supports the following command-line arguments:

slt [options] files_or_directories_with_tests
Executes the SQL Logic Tests using a SQL execution engine
See https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki
Options:
-h            Show this help message and exit
-x            Stop at the first encountered query error
-n            Do not execute, just parse the test files
-e executor   Executor to use
-b filename   Load a list of buggy commands to skip from this file
-v            Increase verbosity (can be repeated)
-inc          Incremental validation
-u username   Postgres user name
-p password   Postgres password
Registered executors:
        hybrid
        dbsp
        hsql
        psql
        none

We have multiple executors. Some executors are inherited from the hydromatic project.

The NoExecutor test executor

This executor is part of the base hydromatic package; it does not really run any tests. But it can still be used by the test loading mechanism to check that we correctly parse all SQL logic test files.

The DBSPExecutor

The model of SQLLogicTest has to be adapted for testing using DBSP. Since DBSP is not a database, but a streaming system, some SQL statements are ignored (e.g., CREATE INDEX) and some other cannot be supported (e.g., CREATE UNIQUE INDEX).

  • The DDL statements to create tables are compiled into definitions of circuit inputs.
  • The DML INSERT statements are converted into input-generating functions. In the absence of a database we cannot really execute statements that are supposed to fail, so we ignore such statements.
  • Some DML statements like DELETE based on a WHERE clause cannot be compiled at all
  • The queries are converted into DDL VIEW create statements, which are compiled into circuits.
  • The validation rules in SQLLogicTest are compiled into Rust functions that compare the outputs of queries.

So a SqlLogicTest script is turned into multiple DBSP tests, each of which creates a circuit, feeds it one input, reads the output, and validates it, executing exactly one transaction. When testing incremental circuits the test code feeds multiple inputs and only checks the final output.

The hybrid executor

This executor is a combination of the DBSP executor and the JDBC executor, using a real database to store data in tables, but using DBSP as a query engine. It should be able to execute all SqlLogicTest queries that are supported by the underlying database.

SqlLogicTest Test results

The 'inc' column shows tests for incremental circuits, the other columns show tests for non-incremental (standard) implementations.

The matrix represents the current test results; the numbers indicate the passing/failing tests in each category. The failing test cases are detailed below.

Test group DBSP JDBC_DBSP JDBC_DBSP inc
random/select 1,120,329/0 1,120,329/0 1,120,329/0
random/groupby 118,757/0 118,757/0 118,757/0
random/expr 1,317,682/0 1,198,926/0 1,198,926/0
random/aggregates 1,172,825/2 1,172,825/0 1,172,825/0
select1 1,000/0 1,000/0 1,000/0
select2 1,000/0 1,000/0 1,000/0
select3 3,320/0 3,320/0 3,320/0
select4 2,832/0 2,832/0 2,832/0
select5 732/0 732/0 732/0
index/delete N/A 40,525/0 40,525/0
index/in N/A 130,065/0 130,065/0
index/commute N/A 507,514/0 507,514/0
index/between N/A 121,811/0 121,811/0
index/orderby_nosort N/A 490,986/0 490,986/0
index/view N/A 53,490/0 53,490/0
index/random N/A 188,449/0 188,449/0
index/orderby N/A 310,630/0 310,630/0
evidence N/A 153/25

The "index" tests cannot be executed with the DBSPExecutor since it does not support the "unique index" SQL statement.

Postgres tests

We are manually converting Postgres tests https://github.com/postgres/postgres/blob/master/src/test/regress/expected to run on DBSP, since the Postgres SQL syntax is too different from the Calcite syntax (especially in the functions supported). In some cases the semantics of Postgres SQL is different from Calcite; for example, in Postgres the days of the week are numbered from 0, where 0 is Sunday, but in Calcite the days of the week are numbered from 1, where 1 is Sunday. In such cases we have to change the tests.