Ceph s3 select

Overview

The purpose of the s3 select engine is to create an efficient pipe between the client and the storage node (the engine should be as close as possible to the storage).
It enables the user to define the exact portion of the data that should be received on their side.
It also enables higher-level analytic applications (such as Spark-SQL) to use this feature to improve their latency and throughput.
For example, given an s3-object of several GB (a CSV file), the user may need to extract a single column filtered by another column,
as in the following query:

select customer-id from s3Object where age>30 and age<65;

Currently the whole s3-object must be retrieved from the OSD via RGW before the data is filtered and extracted.
By "pushing down" the query into the OSD, it is possible to save a great deal of network traffic and CPU time (serialization/deserialization).
The bigger the object and the more accurate the query, the better the performance.

Basic workflow

An s3-select query is sent to RGW via the AWS CLI.
It passes the authentication and permission process as an incoming message (POST).
RGWSelectObj_ObjStore_S3::send_response_data is the "entry point"; it handles each fetched chunk according to the input object-key.
send_response_data first handles the input: it extracts the query and the other CLI parameters.
For each newly fetched chunk (~4 MB), it runs the s3-select query on that chunk.
The current implementation supports CSV objects, and since chunk boundaries randomly "cut" CSV rows in the middle, those broken lines (the first or last line of a chunk) are skipped while the query is processed.
Each "broken" line is stored and later merged with its complementary broken line (belonging to the next chunk), and finally processed.
For each processed chunk, an output message is formatted according to the AWS specification and sent back to the client.
For aggregation queries, the last chunk must be identified as the end of input; the s3-select engine then initiates end-of-process and produces the aggregated result.
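The broken-line handling can be illustrated with a short sketch; the class and member names below are illustrative stand-ins, not the actual RGW/s3select identifiers:

    #include <string>
    #include <string_view>
    #include <iostream>

    class chunk_processor {
      std::string pending; // tail of the previous chunk, cut mid-row
    public:
      // Feed one fetched chunk: complete rows are processed, and the
      // trailing partial row is saved and prepended to the next chunk.
      void process_chunk(std::string_view chunk) {
        std::string data = pending + std::string(chunk);
        size_t last_nl = data.rfind('\n');
        if (last_nl == std::string::npos) { // no complete row yet
          pending = std::move(data);
          return;
        }
        pending = data.substr(last_nl + 1); // store the broken tail
        process_rows(std::string_view(data).substr(0, last_nl + 1));
      }
      // End of input: whatever remains is the final (now complete) row.
      void finish() {
        if (!pending.empty()) process_rows(pending);
        pending.clear();
      }
    private:
      void process_rows(std::string_view rows) {
        std::cout << rows; // placeholder for running the query on the rows
      }
    };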

Design Concepts

AST- Abstract Syntax Tree

The s3-select main flow starts with parsing the input string (i.e. the user query) and building an AST (abstract syntax tree) as a result.
The execution phase is built upon the AST.
Base_statement is the base class for all object-nodes participating in the execution phase; it provides the eval() method, which returns a <value> object.
The value object handles the known basic types such as int, string, float and timestamp, and is able to perform comparison and basic arithmetic operations on those types.
The execution flow is actually a call to the eval() method on the root node (per projection); evaluation descends all the way down, and the actual result (a value object) is returned from the bottom nodes back up to the root node.
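As an illustration, here is a minimal sketch of that eval() recursion; the types below are simplified stand-ins, not the actual Base_statement and value declarations:

    #include <memory>

    struct value {   // simplified stand-in (numbers only)
      double num{};
    };

    struct base_statement {                // every AST node evaluates to a value
      virtual value eval() = 0;
      virtual ~base_statement() = default;
    };

    struct literal : base_statement {      // leaf node
      value v;
      explicit literal(double n) { v.num = n; }
      value eval() override { return v; }
    };

    // A binary node: eval() descends into both children and combines the
    // results, i.e. the bottom-up flow described above.
    struct addition : base_statement {
      std::unique_ptr<base_statement> left, right;
      addition(std::unique_ptr<base_statement> l,
               std::unique_ptr<base_statement> r)
          : left(std::move(l)), right(std::move(r)) {}
      value eval() override {
        return value{left->eval().num + right->eval().num};
      }
    };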
The alias programming construct is an essential part of the s3-select language; it enables much better programming, especially with objects containing many columns or in the case of complex queries.
Upon parsing a statement containing an alias construct, the alias is replaced with a reference to the correct AST node; at runtime that node is simply evaluated like any other node.
There is a risk that a self (or cyclic) reference may occur, causing a stack overflow (endless loop); to address this concern, upon evaluating an alias it is validated for cyclic references.
An alias also maintains a result cache: when the same alias is used more than once, the referenced node is not evaluated again (it would return the same result); instead the result is taken from the cache.
Of course, per each new row the cache is invalidated.
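A sketch of that alias behaviour (result cache plus cyclic-reference guard), reusing the illustrative stand-in types from the previous sketch:

    #include <stdexcept>

    struct alias_node : base_statement {
      base_statement* target = nullptr; // the AST node the alias refers to
      value cached{};
      bool cache_valid = false; // invalidated per each new row
      bool in_eval = false;     // set while evaluating; re-entry means a cycle

      value eval() override {
        if (cache_valid) return cached;   // reuse the per-row cached result
        if (in_eval) throw std::runtime_error("cyclic alias reference");
        in_eval = true;
        cached = target->eval();
        in_eval = false;
        cache_valid = true;
        return cached;
      }
      void new_row() { cache_valid = false; } // cache invalidation per row
    };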

S3 select parser definition

The implementation of s3-select uses boost::spirit; the definition of the s3-select command is according to AWS.
When parsing is initiated on the input text and a specific rule is identified, the action bound to that rule is executed.
Those actions build the AST; each action is unique (as is its rule), and at the end of the process they form a structure similar to a tree.
As mentioned, running eval() on the root node executes the s3-select statement (per projection).
The input stream is accessible to the execution tree through the scratch-area object, which is constantly updated per each new row.
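The rule/action mechanism can be sketched with a toy boost::spirit::qi grammar; this is not the actual s3select grammar, and in the real engine the actions construct AST nodes rather than print:

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    // The action bound to a rule; fired whenever the rule matches.
    static void on_column(const std::string& name) {
      std::cout << "projection column: " << name << "\n"; // would build an AST node
    }

    int main() {
      std::string input = "select _1 from stdin";
      auto it = input.begin();
      qi::rule<std::string::iterator, std::string(), qi::space_type> ident =
          qi::lexeme[+(qi::alnum | qi::char_('_'))];
      // toy rule: 'select' <identifier> 'from' <identifier>
      bool ok = qi::phrase_parse(it, input.end(),
          qi::lit("select") >> ident[&on_column] >> qi::lit("from") >> ident,
          qi::space);
      std::cout << (ok && it == input.end() ? "parsed\n" : "syntax error\n");
    }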

Basic functionalities

S3select has a definite set of functionalities that should be implemented (if we wish to stay compliant with AWS); currently only a portion of it is implemented.
The implemented software architecture supports basic arithmetic expressions and logical and compare expressions, including nested function calls and casting operators; that alone gives the user reasonable flexibility.
Review the feature table below.

Memory handling

S3select structures and objects are lockless and thread-safe. The implementation uses placement-new in order to reduce the intensive alloc/dealloc cycles, which could otherwise impact the main process hosting s3-select.
Once the AST is built, there is no need to allocate memory for the execution itself; the AST is "static" for the query-execution life cycle.
The execution itself is stream-oriented, meaning there is no pre-allocation before execution, and the object size has no impact on memory consumption.
The engine processes chunk after chunk, row after row; all the memory needed for processing resides in the AST.
The AST is similar to a stack in its behaviour: it consumes already-allocated memory and "releases" it upon completing its task.
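The placement-new pattern mentioned above can be sketched as follows (illustrative only; the real engine applies it inside its own value handling):

    #include <new>

    struct value { double num{}; explicit value(double n) : num(n) {} };

    class result_slot {
      alignas(value) unsigned char buf[sizeof(value)];
      value* live = nullptr;
    public:
      // Construct (or re-construct) the per-row result in place:
      // no heap traffic once the buffer exists.
      value& set(double n) {
        if (live) live->~value();   // destroy the previous row's result
        live = new (buf) value(n);  // placement-new into the same buffer
        return *live;
      }
      ~result_slot() { if (live) live->~value(); }
    };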

S3 Object different types

The processing of the input stream is decoupled from the s3-select engine, meaning each input type should have its own parser, converting the s3-object into columns.
The current implementation includes only a CSV reader; its parsing definitions are according to AWS.
The parser is implemented using a boost state machine.
The CSV parser handles NULL, quote and escape rules, the field delimiter and the row delimiter, and users may define all of those dynamically (via the AWS CLI).
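The quote/escape/delimiter rules can be illustrated with a small hand-rolled character state machine (the real parser is boost-based and takes its delimiters from the CLI input-serialization; they are hard-coded here):

    #include <string>
    #include <vector>

    std::vector<std::string> split_row(const std::string& row) {
      enum class state { normal, quoted, escaped };
      state st = state::normal;
      std::vector<std::string> tokens;
      std::string cur;
      for (char c : row) {
        switch (st) {
          case state::escaped:        // escape overrides any meta-character,
            cur += c;                 // and the escape itself is removed
            st = state::normal;
            break;
          case state::quoted:         // quote overrides the field delimiter
            cur += c;
            if (c == '"') st = state::normal;
            break;
          case state::normal:
            if (c == '\\')      st = state::escaped;
            else if (c == '"')  { cur += c; st = state::quoted; }
            else if (c == ',')  { tokens.push_back(cur); cur.clear(); }
            else                cur += c;
            break;
        }
      }
      tokens.push_back(cur); // last field (or an unterminated quoted field)
      return tokens;
    }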

Error Handling

An s3-select statement may be syntactically correct but semantically wrong; one example is select a * b from ..., where a is a number and b is a string.
The current implementation is for the CSV file type; CSV has no schema, so column types can only be evaluated at runtime.
This means that semantically wrong statements may only surface at runtime.
As for syntax errors, such as select x frm stdin;, the built-in parser fails on the first mismatch with the language definition and returns an error message to the client (AWS CLI).
The error message points to the location of the mismatch.
Fatal severity (attached to the exception) ends execution immediately; errors of other severities are counted, and upon reaching 100 the execution ends with an error message.
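That severity policy can be sketched as follows (the names and the threshold handling are illustrative):

    #include <stdexcept>
    #include <string>

    enum class severity { warning, error, fatal };

    struct query_error : std::runtime_error {
      severity sev;
      query_error(severity s, const std::string& msg)
          : std::runtime_error(msg), sev(s) {}
    };

    class error_counter {
      int count = 0;
    public:
      // Fatal errors abort at once; other severities are counted and
      // execution stops once 100 of them have accumulated.
      void on_error(const query_error& e) {
        if (e.sev == severity::fatal) throw e;
        if (++count >= 100)
          throw query_error(severity::fatal, "too many errors, aborting execution");
      }
    };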

AST demonstration

(AST diagram: a select root node with s3object, projections and where children; the projections subtree holds multiply and date nodes over leaves such as 0.3 and region, and the where subtree holds an and node over EQ and LT comparisons with leaves such as 30.)

Features Support

The following table describes the support for s3-select functionalities:

Feature               Detailed            Example
--------------------  ------------------  ------------------------------------------------------------
Arithmetic operators  ^ * / + - ( )       select (int(_1)+int(_2))*int(_9) from stdin;
                                          select ((1+2)*3.14) ^ 2 from stdin;
Compare operators     > < >= <= == !=     select _1,_2 from stdin where (int(_1)+int(_3))>int(_5);
Logical operators     AND OR              select count(*) from stdin where int(_1)>123 and int(_5)<200;
Casting operator      int(expression)     select int(_1),int(1.2 + 3.4) from stdin;
                      float(expression)
                      timestamp(...)      select timestamp("1999:10:10-12:23:44") from stdin;
Aggregation function  sum                 select sum(int(_1)) from stdin;
Aggregation function  min                 select min( int(_1) * int(_5) ) from stdin;
Aggregation function  max                 select max(float(_1)),min(int(_5)) from stdin;
Aggregation function  count               select count(*) from stdin where (int(_1)+int(_3))>int(_5);
Timestamp function    extract             select count(*) from stdin where extract("year",timestamp(_2)) > 1950 and extract("year",timestamp(_1)) < 1960;
Timestamp function    dateadd             select count(0) from stdin where datediff("year",timestamp(_1),dateadd("day",366,timestamp(_1))) == 1;
Timestamp function    datediff            select count(0) from stdin where datediff("month",timestamp(_1),timestamp(_2)) == 2;
Timestamp function    utcnow              select count(0) from stdin where datediff("hours",utcnow(),dateadd("day",1,utcnow())) == 24;
String function       substr              select count(0) from stdin where int(substr(_1,1,4))>1950 and int(substr(_1,1,4))<1960;
Alias support                             select int(_1) as a1, int(_2) as a2, (a1+a2) as a3 from stdin where a3>100 and a3<300;

Sending Query to RGW

Syntax

The CSV default definitions for the field-delimiter, row-delimiter, quote-char and escape-char are: { , \n " \ }

aws --endpoint-url http://localhost:8000 s3api select-object-content \
  --bucket {BUCKET-NAME} \
  --expression-type 'SQL' \
  --input-serialization \
  '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
  --output-serialization '{"CSV": {}}' \
  --key {OBJECT-NAME} \
  --expression "select count(0) from stdin where int(_1)<10;" output.csv

CSV parsing behavior

Feature          Description                                  input ==> tokens
---------------  -------------------------------------------  ----------------------------------------------------
NULL             successive field delimiters                  ,,1,,2, ==> {null}{null}{1}{null}{2}{null}
QUOTE            quote character overrides field delimiter    11,22,"a,b,c,d",last ==> {11}{22}{"a,b,c,d"}{last}
Escape           escape char overrides meta-character;        11,22,str=\"abcd\"\,str2=\"123\",last ==> {11}{22}{str="abcd",str2="123"}{last}
                 the escape itself is removed
row delimiter    with no closing quote, the row delimiter     11,22,a="str,44,55,66 ==> {11}{22}{a="str,44,55,66}
                 closes the line
csv header info  FileHeaderInfo tag: the "USE" value means
                 each token on the first line is a column
                 name; "IGNORE" means skip the first line