Speedup stats processing in Spark cluster #2871

amCap1712 · 2024-05-10T11:44:27Z

1. Write a copy of the listens to HDFS as partitioned parquet when a full dump is imported. This speeds up filtering of listens and, in many cases, overall processing.
2. Remove Pydantic validation in places where it seemed redundant or of little use.
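
For readers unfamiliar with why a partitioned copy helps filtering, here is a minimal sketch of the idea. It is not the code in this PR: it uses only the Python standard library, JSON-lines files stand in for parquet, and the `year=`/`month=` partition columns, the `part-0.jsonl` file name, and the helper names are all assumptions made for illustration. The point it tries to show is that a filter on the partition column only has to open the matching directories.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(listens, root):
    """Append each listen to a year=/month= directory under root (Hive-style layout).

    Hypothetical helper: the real import writes parquet to HDFS via Spark.
    """
    for listen in listens:
        part_dir = Path(root) / f"year={listen['year']}" / f"month={listen['month']}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.jsonl", "a", encoding="utf-8") as fh:
            fh.write(json.dumps(listen) + "\n")


def read_year(root, year):
    """Read only the partitions of one year; files of other years are never opened."""
    rows, files_opened = [], 0
    for part_file in sorted(Path(root).glob(f"year={year}/month=*/part-0.jsonl")):
        files_opened += 1
        with open(part_file, encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh)
    return rows, files_opened


def demo():
    # Made-up sample data, only to exercise the layout.
    listens = [
        {"user_id": 1, "track": "a", "year": 2023, "month": 12},
        {"user_id": 2, "track": "b", "year": 2024, "month": 1},
        {"user_id": 1, "track": "c", "year": 2024, "month": 1},
        {"user_id": 3, "track": "d", "year": 2024, "month": 5},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(listens, root)
        return read_year(root, 2024)
```

Calling `demo()` reads the two 2024 partitions and skips the 2023 one entirely. Spark can apply the same kind of pruning to partitioned parquet when a query filters on the partition columns, which is presumably where part of the gain from step 1 comes from.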

Before this PR, an entire stats run took about 9 hours. With step 2 alone, it went down to 6.25 hours, and with step 1 on top of it, it went down to 5.75 hours.
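
To put the timings above in relative terms, a small calculation follows. The hours are the figures quoted in this description (the 9 hour baseline is approximate); the helper name `percent_saved` is made up for illustration.

```python
def percent_saved(before_hours, after_hours):
    """Relative reduction in run time, as a percentage rounded to one decimal."""
    return round(100 * (before_hours - after_hours) / before_hours, 1)


baseline = 9.0       # approximate run time before this PR
after_step_2 = 6.25  # with the validation changes only
after_both = 5.75    # with the partitioned parquet copy on top

print(percent_saved(baseline, after_step_2))    # 30.6
print(percent_saved(baseline, after_both))      # 36.1
print(percent_saved(after_step_2, after_both))  # 8.0
```

By these numbers, step 2 accounts for most of the saving, and step 1 trims roughly a further 8% off the already reduced run.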


amCap1712 added 2 commits May 10, 2024 17:12

Write listens as partitioned parquet on import

1fe9db0


Eliminate redundant or overzealous validation steps for speedup

13a4a88
