Improve parallel read #74

leo-schick · 2022-05-13T21:03:40Z

leo-schick · 2022-05-25T11:22:11Z

I have this now running in production without any issue.


jankatins · 2023-11-23T19:36:52Z

What's the actual problem here? Is it that the reads run as Python code in threads and therefore run into the GIL? I always thought that, because we "run everything as subprocess", we never run into that problem.

This feels like a lot of complexity and I don't really see the gain here. Any chance to make that gain clearer to me?

leo-schick · 2023-11-23T22:34:05Z

@jankatins the problem I was trying to solve is that, when running a parallel task, the commands for the internal sub pipelines need to be evaluated before the pipeline starts working. I had a file bucket with millions of files which I had to process. In my case, the pipeline became so big that it was unable to start, probably because of memory consumption, or because the job was still reading the complete file list of the bucket after more than 1 hour.

This PR changes the parallel task behavior by putting the sub pipeline generation into a separate feed worker task. This PR is complex and I am not 100% sure if it should be part of mara. It is a first try at implementing file-based micro-batch streaming via mara. I realized that it might not have been the best idea 💡 I have had in recent years 😉
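For readers following the thread, the idea described above can be illustrated with a small standalone sketch. It is not the code from this PR and does not use mara's API: `list_files`, `run_parallel`, and all parameter names are hypothetical, and it uses threads from the standard library for brevity, whereas the commit messages refer to a `FeedWorkerProcess`. It only shows the pattern: a feeder walks the file listing lazily and pushes commands into a bounded queue, so workers can start right away and memory stays limited by the queue size.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_DONE = object()  # sentinel: tells one worker that no more commands will arrive


def list_files(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a lazy bucket listing (yields one name at a time)."""
    for i in range(n):
        yield f"file_{i:07d}.csv"


def run_parallel(files: Iterable[str],
                 process: Callable[[str], str],
                 workers: int = 4,
                 max_queue_size: int = 100) -> List[str]:
    """Feed `files` to `workers` threads through a bounded command queue."""
    command_queue: queue.Queue = queue.Queue(maxsize=max_queue_size)
    results: List[str] = []
    errors: List[BaseException] = []
    lock = threading.Lock()

    def feed() -> None:
        try:
            for name in files:
                # put() blocks while the queue is full, so at most
                # max_queue_size pending commands are held in memory.
                command_queue.put(name)
        finally:
            # One sentinel per worker, even if the listing itself failed.
            for _ in range(workers):
                command_queue.put(_DONE)

    def work() -> None:
        while True:
            item = command_queue.get()
            if item is _DONE:
                return
            try:
                out = process(item)
            except Exception as exc:
                # Keep draining the queue so the feeder never blocks forever.
                with lock:
                    errors.append(exc)
                continue
            with lock:
                results.append(out)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise RuntimeError(f"{len(errors)} command(s) failed") from errors[0]
    return results
```

In this sketch a failing command is recorded and the worker keeps consuming, so the feeder cannot block forever on a full queue. That is one possible way to avoid the kind of hang mentioned in the commit messages, not necessarily the approach taken in the PR.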

leo-schick mentioned this pull request May 13, 2022

Issues with _ParallelRead / Redesign adding optional Worker nodes #75

Open

leo-schick changed the base branch from master to 3.2.x May 13, 2022 21:06

leo-schick force-pushed the improve-parallel-read branch from cf686d3 to 3efb63f Compare May 16, 2022 13:19

leo-schick marked this pull request as ready for review May 16, 2022 13:20

leo-schick requested a review from jankatins May 25, 2022 11:22

leo-schick mentioned this pull request May 30, 2022

Handle SIGTERM #40

Open

leo-schick force-pushed the improve-parallel-read branch from 7bde9b5 to fe367ed Compare October 10, 2022 08:01

leo-schick changed the base branch from 3.2.x to main October 10, 2022 08:01

leo-schick added 5 commits January 31, 2023 12:00

improve _ParallelRead not reading the whole folder into memory before starting to work

8068542

fix typing

0a4e549

small fixes

4f9e00d

limit command_queue to limit memory usage

eba32f6

do not add empty command chains into the queue

3e9be23

leo-schick force-pushed the improve-parallel-read branch 2 times, most recently from cf4880a to 57bf417 Compare January 31, 2023 11:03

fix endless loop when command queue is full from FeedWorkerProcess and a Worker node failed

add basic unit testing for FeedWorkerProcess logic

add unit test for when command queue is full

0c5c5ad

leo-schick force-pushed the improve-parallel-read branch from 57bf417 to 0c5c5ad Compare January 31, 2023 11:13

leo-schick added 3 commits January 31, 2023 12:46

restructuring code, fixing tests

e9a78a8

fix crash when the FeedWorkersProcess exists before the Worker process has started

ad0bdf6

fix endless running worker nodes when feed worker process fails

3f17514

leo-schick mentioned this pull request Nov 23, 2023

add dynamic Task #106

Merged
