[Improvement] Difference in performance of STS & Kyuubi thrift server #6345

prathit06 opened this issue Apr 29, 2024 · 1 comment

prathit06 commented Apr 29, 2024

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

While trying to use Kyuubi with Tableau through the thrift server exposed by Kyuubi, we have noticed that everything works fine for small datasets. For larger datasets, however, when the collect operation is called, the data transfer fails because the driver goes OOM. For this reason Kyuubi exposes the kyuubi.operation.incremental.collect flag, which can be set to true to collect results incrementally. While this looks like an ideal solution (and it is), there are performance bottlenecks when using this flag.
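For context, here is a rough sketch in plain Spark of the trade-off between a full collect and an incremental, partition-at-a-time collect. This is only an illustration of the behavior; whether kyuubi.operation.incremental.collect maps exactly onto toLocalIterator internally is an assumption on my part.

import org.apache.spark.sql.SparkSession

object CollectModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-modes-sketch").getOrCreate()
    // Stand-in for the real ~5 GB result set
    val df = spark.range(0L, 100000000L).toDF("id")

    // Full collect: every row is materialized on the driver at once,
    // which is what causes the driver OOM for large result sets.
    // val allRows = df.collect()

    // Incremental-style collect: rows are pulled one partition at a time,
    // so driver memory stays bounded, but each partition becomes its own
    // round trip, which is usually much slower end to end.
    val it = df.toLocalIterator()
    var count = 0L
    while (it.hasNext) {
      it.next()
      count += 1
    }
    println(s"Streamed $count rows without holding them all on the driver")

    spark.stop()
  }
}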

I ran the same query through STS and the Kyuubi thrift server.

Query data size:

  • Data size: ~5 GB
  • Record count: 32854241

Results with spark.sql.shuffle.partitions = 50:

  • STS took 12-15 minutes to run the job and transfer the data to Tableau.
  • The Kyuubi thrift server was left running for 2+ hours, and the transferred data size was only ~600 MB.

As can be seen, there is a significant performance difference between the two.

Kyuubi version: 1.9
Spark version: 3.1.2
Running Kyuubi on AWS EMR (version 6.5.0), on the primary node only

kyuubi-defaults.conf config

kyuubi.ha.addresses ..compute.internal
kyuubi.operation.incremental.collect true
spark.submit.deployMode cluster # ( have tried with client as well )
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.scheduler.mode FAIR
spark.rdd.compress true
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.sql.catalogImplementation hive
spark.sql.shuffle.partitions 50
spark.kryoserializer.buffer.max 1g
spark.driver.maxResultSize 25g
spark.driver.memory 35g
spark.executor.memory 25g
spark.driver.memoryOverhead 4g
spark.executor.memoryOverhead 3g
spark.cleaner.periodicGC.interval 10min

How should we improve?

Upon looking at the STS and Kyuubi code, I could see a lot of similarities but also differences here and there. One major point I noticed is that some logs present in STS are missing from Kyuubi, and those logs would help show what is happening inside Kyuubi during data transfer.

For example, the logs below were printed by STS but not by Kyuubi:

24/04/24 06:49:51 INFO SparkExecuteStatementOperation: Received getNextRowSet request order=FETCH_NEXT and maxRowsL=10000 with a1d07d3a-d6bb-4706-99c0-728fa8115816
24/04/24 06:49:52 INFO SparkExecuteStatementOperation: Returning result set with 10000 rows from offsets [1320000, 1330000) with a1d07d3a-d6bb-4706-99c0-728fa8115816
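
For reference, the maxRows value in these FETCH_NEXT requests is essentially the fetch size the client asks for per round trip. A minimal JDBC client sketch of that fetch loop (host, port, and table name are placeholders, not taken from this setup):

import java.sql.DriverManager

object FetchSizeSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint; requires the Hive/Kyuubi JDBC driver on the classpath
    val conn = DriverManager.getConnection("jdbc:hive2://kyuubi-host:10009/default", "user", "")
    val stmt = conn.createStatement()
    stmt.setFetchSize(10000) // rows requested per FetchResults round trip

    val rs = stmt.executeQuery("SELECT * FROM some_large_table")
    var rows = 0L
    while (rs.next()) {
      rows += 1 // rows arrive batch by batch as the server streams them
    }
    println(s"Fetched $rows rows")

    rs.close(); stmt.close(); conn.close()
  }
}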

To investigate further, adding more logs is probably a good starting point to see what exactly is happening and where the bottleneck is.
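
A minimal sketch of the kind of logging that could be added on the Kyuubi side; the class and method names below are placeholders for illustration, not actual Kyuubi internals:

import org.slf4j.LoggerFactory

// Hypothetical fetch path; shows where per-request and per-batch timing logs could go.
class FetchLoggingSketch {
  private val logger = LoggerFactory.getLogger(getClass)

  // Imagined hook called once per FetchResults request from the Thrift client
  def onGetNextRowSet(maxRows: Int, operationId: String): Unit = {
    logger.info(s"Received getNextRowSet request maxRows=$maxRows for operation $operationId")
    val start = System.currentTimeMillis()
    // ... fetch up to maxRows rows from the result iterator here ...
    val rowsReturned = 0 // placeholder for the actual batch size
    logger.info(s"Returning $rowsReturned rows for operation $operationId in ${System.currentTimeMillis() - start} ms")
  }
}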

Please feel free to suggest or ask for any additional information if needed.

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.

Hello @prathit06,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.
