Improve parallel read #74

Open
wants to merge 9 commits into
base: main

Conversation

leo-schick
Member

@leo-schick leo-schick commented May 13, 2022

See #75

@leo-schick leo-schick changed the base branch from master to 3.2.x May 13, 2022 21:06
@leo-schick leo-schick marked this pull request as ready for review May 16, 2022 13:20
@leo-schick
Member Author

I have this now running in production without any issue.

@leo-schick leo-schick requested a review from jankatins May 25, 2022 11:22
@leo-schick leo-schick mentioned this pull request May 30, 2022
@leo-schick leo-schick changed the base branch from 3.2.x to main October 10, 2022 08:01
@leo-schick leo-schick force-pushed the improve-parallel-read branch 2 times, most recently from cf4880a to 57bf417 Compare January 31, 2023 11:03
…d a Worker node failed

add basic unit testing for FeedWorkerProcess logic
add unit test for when command queue is full
@leo-schick leo-schick mentioned this pull request Nov 23, 2023
@jankatins
Member

jankatins commented Nov 23, 2023

What's the actual problem here? That the reads run as Python code in threads and therefore run into the GIL? I always thought that, because we "run everything as subprocess", we never run into that problem?

This feels like a lot of complexity and I don't really see the gain here. Any chance to make that gain clearer to me?

@leo-schick
Member Author

@jankatins the problem I was trying to solve is that, when running a parallel task, the commands for the internal sub pipelines need to be evaluated before the pipeline starts working. I had a file bucket with millions of files which I had to process. In my case, the pipeline became so big that it was unable to start, probably either because of memory consumption or because the job was still reading the complete file list of the bucket after more than an hour.

This PR changes the parallel task behavior by moving the sub pipeline generation into a separate feed worker task. This PR is complex and I am not 100% sure it should be part of mara. It is a first attempt at implementing file-based micro-batch streaming via mara. I realized that it might not have been the best idea 💡 I have had in recent years 😉
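
To make the idea concrete, here is a minimal sketch of the feeder pattern described above. It is not the code in this PR and does not use mara's API: `feed`, `work`, `run_parallel` and `list_files` are hypothetical names, and it uses threads with a bounded `queue.Queue` for brevity, whereas the commits here refer to a `FeedWorkerProcess`. The point it illustrates is that commands are generated lazily while workers are already consuming them, so the full file list never has to be materialized before the first command runs.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()  # sentinel: tells a worker that no more commands will arrive


def feed(commands: Iterable[str], command_queue: queue.Queue, n_workers: int) -> None:
    """Hypothetical feeder: pushes commands one by one as they are generated."""
    for command in commands:
        # put() blocks while the queue is full, so the feeder never runs
        # far ahead of the workers and memory use stays bounded.
        command_queue.put(command)
    for _ in range(n_workers):
        command_queue.put(_DONE)  # one sentinel per worker


def work(command_queue: queue.Queue, handle: Callable[[str], None]) -> None:
    """Hypothetical worker: handles commands until it sees the sentinel."""
    while True:
        command = command_queue.get()
        if command is _DONE:
            break
        handle(command)  # note: this sketch does not deal with failing handlers


def run_parallel(commands: Iterable[str], handle: Callable[[str], None],
                 n_workers: int = 4, max_queue_size: int = 8) -> None:
    command_queue: queue.Queue = queue.Queue(maxsize=max_queue_size)
    workers = [threading.Thread(target=work, args=(command_queue, handle))
               for _ in range(n_workers)]
    feeder = threading.Thread(target=feed, args=(commands, command_queue, n_workers))
    for thread in workers + [feeder]:
        thread.start()
    for thread in [feeder] + workers:
        thread.join()


def list_files(n: int) -> Iterator[str]:
    """Stand-in for a slow bucket listing; yields one file name at a time."""
    for i in range(n):
        yield f"file_{i:07d}.csv"


if __name__ == "__main__":
    processed = []
    run_parallel(list_files(1000), processed.append)
    print(len(processed))
```

Because `list_files` is a generator and the queue is bounded, at most `max_queue_size` commands are waiting at any time, regardless of how many files the bucket contains. A real implementation would also have to handle a worker failing while the feeder is blocked on a full queue, which this sketch leaves out.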

2 participants