Do you guys have benchmark result for Athena in the plan? #13

Open
syang opened this issue May 13, 2021 · 1 comment
Comments

@syang

syang commented May 13, 2021

Given that Athena is a major 'serverless' data query engine, it would be great if you guys could put it into perspective.

Any thoughts?

@mike-weinberg
Collaborator

Hey @syang, right now the focus is on converting the benchmark to run in DBT so that it is easier to

  1. re-run the benchmark yourself and make changes as you see fit for your purposes
  2. contribute new backends to the benchmark and get your name on an open source benchmark!

In truth, Athena's architecture means that it is certain to be slower than Redshift. In general, it may be better to think of Athena less as a serverless data warehouse and more as a serverless data lake processing engine for companies that originally built out their data infrastructure on HDFS or S3. As a result, I think the decision to use Athena vs. a more traditional cloud data warehouse should be based more on compatibility with existing infrastructure and less on performance. Athena is not really intended to have the same performance characteristics as Redshift, Snowflake, BQ, et al., because it is dramatically more dependent on upstream optimization decisions like file types, file sizes, Parquet block configuration, etc., which are entirely hidden from the user in traditional warehouse systems.
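
To make the "file size" point concrete, here is a minimal sketch of the kind of upstream decision a data lake user has to make themselves: planning how to compact many small files into fewer, larger ones before querying. Everything in it is an assumption for illustration. `plan_compaction` is a hypothetical helper, not part of Athena or this benchmark, and the 128 MB target is only a commonly used rule of thumb, not a figure from this thread.

```python
from typing import List, Tuple

# Assumed target output size; a rule of thumb, not a value from this thread.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(
    files: List[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs into batches.

    Files are visited largest first. A batch is closed as soon as adding
    the next file would push it past target_bytes. A single file larger
    than target_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be rewritten as one output file by whatever job does the compaction. In a traditional warehouse, this kind of layout decision is made for you by the storage layer.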

That being said, I don't want you to feel like I'm waving my hands to get away with not writing an Athena benchmark. As you said, Athena is fully serverless, and the closest equivalent to it is probably BigQuery. Fundamentally, BigQuery is just a really tightly controlled implementation of an architecture similar to Athena's, so we should expect a highly optimized Athena implementation to perform similarly to BigQuery. In fact, this is exactly what we see in a benchmark from the highly specialized data-lake-ingestion-optimization platform "Upsolver". In this benchmark, they find that after optimizing for storage concerns, Athena is basically equivalent to BigQuery for normal-looking SQL.

Given this, I think it's safe to assume that BigQuery acts as a proxy for best-case Athena performance, and so by reading Fivetran's benchmark you can implicitly compare purpose-built warehouses to so-called "lakehouses" like Athena, Presto, etc.

I really hope this helps! If you have further general questions about data warehouses, please find me on DBT Slack =)
