victorcouste/trino-s3-csv-generation-python

CSV file generation in S3 from Trino query with AWS Lambda

This repository contains two examples of Python AWS Lambda functions that generate a single CSV file in an S3 bucket from a Trino SQL query result.

Tests and examples use Starburst Galaxy, the Trino SaaS offering, but the same approach works with a self-hosted Trino or Starburst Enterprise cluster.

Two different options are used to generate the CSV file:

  • With a CTAS (CREATE TABLE AS SELECT) query
  • With a Python pandas DataFrame

To run these AWS Lambda functions, you also need to add this layer package so that the Trino Python client and pandas are available (see AWS Lambda layers).

You can also find Python code to test these functions and a Flask web application that uses them.

Architecture

Architecture of the interaction between Starburst Galaxy and the AWS environment (Glue, S3, Lambda).


Function with a CTAS query

The lambda_CTAS.py function uses a CTAS query to generate a single CSV file in an S3 bucket.

The CTAS query looks like:

CREATE TABLE s3_catalog.tmp.your_file WITH (csv_separator = ',', external_location = 's3://your_bucket/tmp/your_file', format = 'csv') AS SELECT ...

Where tmp is an existing schema in your Trino or Galaxy S3 catalog (backed by Glue or Hive), here named s3_catalog.
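As a sketch, the CTAS statement could be assembled like this. The helper below is hypothetical (not part of the repository), and the catalog, schema, bucket, and table names are the placeholders from the example above:

```python
# Hypothetical helper, not part of the repository: assembles the CTAS
# statement shown above from its parts.
def build_ctas_query(catalog: str, schema: str, table: str,
                     bucket: str, select_sql: str) -> str:
    """Build a Trino CTAS statement that writes CSV files to S3."""
    external_location = f"s3://{bucket}/{schema}/{table}"
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} "
        f"WITH (csv_separator = ',', "
        f"external_location = '{external_location}', "
        f"format = 'csv') "
        f"AS {select_sql}"
    )

print(build_ctas_query("s3_catalog", "tmp", "your_file",
                       "your_bucket", "SELECT * FROM your_table"))
```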

The extra steps in the function after the CTAS query runs are to:

  • Add a .csv suffix to the file name
  • Add the column names as a header row (column names are passed as function parameters)
  • Copy the file to your target bucket folder
  • Drop the temporary table created by the CTAS
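The steps above can be sketched as follows. `add_header` is a hypothetical pure helper; `finalize_csv` assumes boto3 (bundled in the AWS Lambda Python runtime), and the bucket and key names are placeholders, not the repository's actual code:

```python
def add_header(ctas_bytes: bytes, columns: list[str]) -> bytes:
    """Prepend the column names (passed as function parameters) as a CSV header."""
    return (",".join(columns) + "\n").encode("utf-8") + ctas_bytes

def finalize_csv(bucket: str, tmp_prefix: str, dest_key: str,
                 columns: list[str]) -> None:
    """Sketch of the post-CTAS steps: locate the single file written under the
    temporary prefix, prepend the header, copy it to the destination key
    (which carries the .csv suffix), and delete the temporary object.
    Dropping the temporary table is a separate DROP TABLE query on Trino."""
    import boto3  # available by default in the AWS Lambda Python runtime
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)["Contents"]
    tmp_key = objects[0]["Key"]  # one file, thanks to the session properties
    body = s3.get_object(Bucket=bucket, Key=tmp_key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=dest_key, Body=add_header(body, columns))
    s3.delete_object(Bucket=bucket, Key=tmp_key)
```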

Important: you also need to set some session properties so that a single, uncompressed file is generated.

session_properties:

  • scale_writers: true
  • writer_min_size: 1TB
  • task_writer_count: 1
  • s3_catalog.compression_codec: NONE
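These properties can be passed to the Trino Python client when opening the connection; a minimal sketch, where the host and user values are placeholders and the property values come from the list above:

```python
# Session properties that make the CTAS write exactly one uncompressed file.
SESSION_PROPERTIES = {
    "scale_writers": "true",
    "writer_min_size": "1TB",
    "task_writer_count": "1",
    "s3_catalog.compression_codec": "NONE",  # catalog-scoped property
}

def make_connection():
    """Open a Trino connection with the single-file, no-compression settings."""
    import trino  # provided by the Lambda layer mentioned above
    return trino.dbapi.connect(
        host="your-cluster.trino.galaxy.starburst.io",  # placeholder
        port=443,
        user="your_user",  # placeholder
        http_scheme="https",
        session_properties=SESSION_PROPERTIES,
    )
```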

Function with a Pandas DataFrame

For the lambda_pandas.py function, we simply load the SQL query output into a pandas DataFrame and then generate a CSV file in a bucket folder from that DataFrame.

No specific session properties need to be set, but the query result must fit in memory.

Flask application

In the s3-file-application folder you can find a Flask web application that uses the two Lambda functions.
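The web app's call into the Lambda functions might look like the following sketch; the payload fields and function name are assumptions for illustration, not the repository's actual contract:

```python
import json

def build_payload(sql: str, bucket: str, key: str, columns: list[str]) -> str:
    """Hypothetical event payload for the CSV-generating Lambda."""
    return json.dumps({"query": sql, "bucket": bucket,
                       "key": key, "columns": columns})

def invoke_csv_lambda(function_name: str, payload: str) -> dict:
    """Invoke the Lambda synchronously from the Flask request handler."""
    import boto3  # assumed configured with suitable AWS credentials
    client = boto3.client("lambda")
    return client.invoke(FunctionName=function_name,
                         InvocationType="RequestResponse",
                         Payload=payload)
```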



Finally, you can connect to another existing Trino or Starburst Enterprise (the enterprise edition of Trino) deployment or cluster for more scalability, security, and performance.

Have fun!
