Speedup stats processing in Spark cluster #2871

amCap1712 · 2024-05-10T11:44:27Z

1. Write a copy of the listens to HDFS on import of a full dump. This speeds up filtering of listens and, in many cases, speeds up processing overall.
2. Remove Pydantic validation in places where it seemed redundant or of little use.
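
To illustrate why a partitioned copy of the listens can speed up filtering, here is a minimal standard-library sketch of partition pruning. It is not the code from this PR, which works with Spark and Parquet on HDFS. The (year, month) partitioning scheme, the function names, and the sample listens are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(listened_at):
    """Map a UNIX timestamp to a hypothetical (year, month) partition key."""
    ts = datetime.fromtimestamp(listened_at, tz=timezone.utc)
    return ts.year, ts.month


def write_partitioned(listens):
    """Group listens by partition key, mimicking one directory per partition."""
    partitions = defaultdict(list)
    for listen in listens:
        partitions[partition_key(listen["listened_at"])].append(listen)
    return dict(partitions)


def read_range(partitions, start, end):
    """Return listens in [start, end) and the number of rows actually scanned."""
    lo, hi = partition_key(start), partition_key(end)
    scanned = 0
    result = []
    for key, rows in partitions.items():
        # Partition pruning: skip whole partitions that cannot overlap the range.
        if lo <= key <= hi:
            scanned += len(rows)
            result.extend(r for r in rows if start <= r["listened_at"] < end)
    return result, scanned


# Made-up sample data (timestamps are UTC midnight on the dates shown).
listens = [
    {"user_id": 1, "listened_at": 1672531200},  # 2023-01-01
    {"user_id": 2, "listened_at": 1704067200},  # 2024-01-01
    {"user_id": 1, "listened_at": 1714521600},  # 2024-05-01
    {"user_id": 3, "listened_at": 1715299200},  # 2024-05-10
]

partitions = write_partitioned(listens)

# Ask for May 2024 only: 2024-05-01 (inclusive) to 2024-06-01 (exclusive).
may_listens, scanned = read_range(partitions, 1714521600, 1717200000)
print(len(may_listens), "matching listens,", scanned, "of", len(listens), "rows scanned")
```

With the data laid out this way, a query for a single month touches only the rows in the matching partition instead of every listen in the dump.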

Before this PR, an entire stats run took about 9 hours. With step 2 alone, it went down to 6.25 hours, and with step 1 on top of it, it goes down to 5.75 hours.
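
For reference, the relative gains implied by those timings can be worked out as below. The three durations are the ones quoted above; the variable and function names are only illustrative.

```python
# Durations quoted in the PR description, in hours.
baseline_hours = 9.0
after_step_2_hours = 6.25       # validation removed
after_both_steps_hours = 5.75   # validation removed + partitioned copy of listens


def reduction(before, after):
    """Fraction of run time saved when going from `before` to `after`."""
    return (before - after) / before


step_2_gain = reduction(baseline_hours, after_step_2_hours)
step_1_gain = reduction(after_step_2_hours, after_both_steps_hours)
total_gain = reduction(baseline_hours, after_both_steps_hours)
speedup = baseline_hours / after_both_steps_hours

print(f"step 2 alone:        {step_2_gain:.1%} less time")   # 30.6%
print(f"step 1 on top of it: {step_1_gain:.1%} less time")   # 8.0%
print(f"both steps:          {total_gain:.1%} less time")    # 36.1%
print(f"overall speedup:     {speedup:.2f}x")                # 1.57x
```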

mayhem

Great speedup!

listenbrainz_spark/stats/user/entity.py

amCap1712 added 4 commits May 26, 2024 17:29

Write listens as partitioned parquet on import

a98ca2b

Write a copy of the listens to HDFS on import of a full dump. This speeds up filtering of listens and, in many cases, speeds up processing overall.

Eliminate redundant or overzealous validation steps for speedup

a26061a

Fix tests

4bf0d6d

removed outdated upload function

f39a2ac

amCap1712 force-pushed the spark-speedup branch from 13a4a88 to f39a2ac Compare May 26, 2024 20:15

amCap1712 requested a review from mayhem May 26, 2024 20:16

amCap1712 marked this pull request as ready for review May 26, 2024 20:16

mayhem approved these changes May 27, 2024

amCap1712 merged commit ec77f69 into master May 28, 2024
4 checks passed

amCap1712 deleted the spark-speedup branch May 28, 2024 10:31
