
[WIP] Create separate Dask worker numbers for tax func estimation vs. model solution #566

Closed
rickecon wants to merge 6 commits into master from memory

Conversation

rickecon (Member) commented May 13, 2020

This PR addresses the high memory requirements of parallelized tax function estimation in OG-USA versus the much lower memory requirements of parallelization in the model solution. Some of these issues are described in Issue #562.

OG-USA's parallelization uses the Dask library. The dask.distributed scheduler imposes the following memory cutoffs on each worker.

distributed:
  worker:
    # Fractions of worker memory at which we take action to avoid memory blowup
    # Set any of the lower three values to False to turn off the behavior entirely
    memory:
      target: 0.60  # target fraction to stay below
      spill: 0.70  # fraction at which we spill to disk
      pause: 0.80  # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
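
For reference, these fractions live in Dask's configuration and can be overridden; a minimal sketch using dask.config.set (the values shown are just the defaults above, not settings from this PR):

# Sketch: overriding the dask.distributed worker memory fractions in code;
# the dotted keys mirror the YAML configuration above
import dask

dask.config.set({
    'distributed.worker.memory.target': 0.60,
    'distributed.worker.memory.spill': 0.70,
    'distributed.worker.memory.pause': 0.80,
    'distributed.worker.memory.terminate': 0.95,
})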

In the tax function estimation, the memory requirements on each worker range between 3.2 GB and 3.8 GB. This PR calculates separate numbers of workers for the tax function estimation (num_workers_txf) and for the model solution (num_workers_mod). These calculations are made in the OG-USA example run script (run_ogusa_example.py) and are passed as inputs to the runner() function in execute.py.

The optimal number of workers for a given process is calculated as:

num_workers = min(N_max, int(np.floor(0.7 * RAM_tot_GB / mem_per_wkr)))
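
As a worked example (a hypothetical machine, not a measured configuration), the sketch below applies the formula to 16 GB of RAM with N_max = 7 and 3.5 GB per worker:

# Worked example of the worker-count formula; machine specs are hypothetical
import numpy as np

N_max = 7          # processor-count cap on workers
RAM_tot_GB = 16.0  # total system RAM (GB)
mem_per_wkr = 3.5  # memory needed per worker (GB)
num_workers = min(N_max, int(np.floor(0.7 * RAM_tot_GB / mem_per_wkr)))
print(num_workers)  # 0.7 * 16 / 3.5 = 3.2, so min(7, 3) = 3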

cc: @jdebacker

rickecon (Member, Author) commented May 13, 2020

@jdebacker: I am currently running the new run_ogusa_example.py. I get the same error as described in Issue #562 when it reaches the tax function estimation for the reform, despite setting num_workers_txf=2.

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 3.60 GB -- Worker memory limit: 4.29 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 3.35 GB -- Worker memory limit: 4.29 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 4.06 GB -- Worker memory limit: 4.29 GB
...
distributed.scheduler.KilledWorker: ('taxcalc_advance-6ba9fc1a-b387-4911-80c6-1ab48caedec8', <Worker 'tcp://127.0.0.1:62856', name: 1, memory: 0, processing: 6>)
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker

I am going to continue working on this PR.

RAM_stats = psutil.virtual_memory()
RAM_total_bytes = RAM_stats.total
RAM_total_GB = RAM_total_bytes / 1073741824  # 1024 ** 3 bytes per GB
mem_per_wkr_txf = 3.5  # Memory per worker (GB) in tax function estimation
jdebacker (Member) commented on the diff:

We may want to put these amounts for the tax function estimation memory and the model solution memory in constants.py so they only need to be adjusted in one place.

rickecon (Member, Author) replied:

@jdebacker Agreed. Plus, I just guessed at the mem_per_wkr_mod=0.05 amount. But I know that it is much less than a gigabyte, so the processor constraint rarely binds. I'll go through and profile the memory footprint before I remove the WIP designation from this PR.
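
(One way to do that profiling, sketched here as an assumption rather than anything this PR implements: Dask's Client.run() can sample resident memory on every worker.)

# Sketch (not from this PR): sample each Dask worker's resident memory
from dask.distributed import Client

import psutil

def worker_rss_gb():
    # psutil.Process() with no argument is the calling (worker) process
    return psutil.Process().memory_info().rss / (1024 ** 3)

client = Client()  # local cluster, for illustration only
print(client.run(worker_rss_gb))  # {worker address: resident memory in GB}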

jdebacker (Member) commented:

@rickecon This is looking good! Thanks for doing this!

A reminder to add psutil to the environment file.

jdebacker (Member) commented:

@rickecon Re your issue with run_ogusa_example.py: it does surprise me. I'd proceed by making the changes you think are necessary, then running the unit tests. Pay particular attention to the modules/functions that have changed, especially test_get_micro_data.py. For that one, you might use the verbose option and save the log file so that you can review it for warnings like those above (which you wouldn't see if the tests pass). Once the unit tests check out, go back to the full model run and see how it works with different numbers of workers.

It might be worthwhile to put the following lines into their own function (e.g., in utils.py):

import multiprocessing

import numpy as np
import psutil
from dask.distributed import Client

RAM_stats = psutil.virtual_memory()
RAM_total_bytes = RAM_stats.total
RAM_total_GB = RAM_total_bytes / 1073741824  # 1024 ** 3 bytes per GB
mem_per_wkr_txf = 3.5  # Memory per worker (GB) in tax function estimation
mem_per_wkr_mod = 0.05  # Memory per worker (GB) in model solution
client = Client()
num_workers_max = int(min(multiprocessing.cpu_count(), 7))
print('Maximum number of workers = ', num_workers_max)
num_workers_txf = min(
    num_workers_max,
    int(np.floor((0.7 * RAM_total_GB) / mem_per_wkr_txf)))
num_workers_mod = min(
    num_workers_max,
    int(np.floor((0.7 * RAM_total_GB) / mem_per_wkr_mod)))

And then write a test for this function to be sure things are working as expected (though I'm not sure offhand how to test with some large data to see whether the limits are hit or not, but you might just use the Calculator object across a few workers).
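
Something along these lines, perhaps (a sketch under assumed names: get_num_workers and its defaults are hypothetical, not code from this PR):

# Sketch of the suggested utils.py helper; name and defaults are hypothetical
import multiprocessing

import numpy as np
import psutil

def get_num_workers(mem_per_wkr_gb, max_workers=7, ram_frac=0.7):
    '''Largest worker count whose combined memory stays below ram_frac of
    total system RAM, capped by the CPU count and max_workers.'''
    ram_total_gb = psutil.virtual_memory().total / (1024 ** 3)
    fit_in_ram = int(np.floor(ram_frac * ram_total_gb / mem_per_wkr_gb))
    return max(1, min(multiprocessing.cpu_count(), max_workers, fit_in_ram))

# A minimal unit test: the result always lies between 1 and max_workers
def test_get_num_workers():
    num = get_num_workers(mem_per_wkr_gb=3.5)
    assert 1 <= num <= 7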

codecov-io commented May 13, 2020

Codecov Report

Merging #566 into master will increase coverage by 0.01%.
The diff coverage is 81.25%.


@@            Coverage Diff             @@
##           master     #566      +/-   ##
==========================================
+ Coverage   84.18%   84.19%   +0.01%     
==========================================
  Files          47       47              
  Lines        6657     6664       +7     
==========================================
+ Hits         5604     5611       +7     
  Misses       1053     1053              
Impacted Files                    Coverage Δ
ogusa/SS.py                       88.15% <0.00%> (ø)
ogusa/TPI.py                      76.34% <ø> (ø)
ogusa/execute.py                  14.58% <ø> (ø)
ogusa/tests/test_basic.py         33.33% <ø> (ø)
ogusa/tests/test_execute.py       68.18% <ø> (ø)
ogusa/tests/test_run_ogusa.py     32.65% <ø> (ø)
ogusa/tests/test_startyears.py    50.00% <ø> (ø)
ogusa/tests/test_txfunc.py        65.07% <ø> (ø)
ogusa/tests/test_user_inputs.py   60.71% <ø> (ø)
ogusa/tests/test_TPI.py           59.81% <50.00%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd9b897...99857fa. Read the comment docs.

rickecon (Member, Author) commented Jul 7, 2020

The Compute.Studio parallelization for OG-USA is currently set to 5 workers for both tax function estimation and the model run (see line 160 of ./cs-config/cs_config/functions.py). However, @hdoupe added a TODO in lines 161-164 of ./cs-config/cs_config/functions.py saying to set num_workers_txf = 5 and num_workers_mod = 6. So I will make that change and see whether it passes the Compute.Studio tests.

rickecon (Member, Author) commented Jul 7, 2020

I also need to make sure that I close the Dask Client after each tax function estimation before I start a new Client for the baseline or reform simulation.
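
A minimal sketch of that pattern (the structure is assumed, not the PR's exact code):

# Sketch: one Client per stage, closed before the next stage begins
from dask.distributed import Client

num_workers_txf, num_workers_mod = 2, 6  # computed as above; illustrative

client = Client(n_workers=num_workers_txf)  # tax function estimation stage
# ... estimate tax functions ...
client.close()

client = Client(n_workers=num_workers_mod)  # model solution stage
# ... solve baseline or reform ...
client.close()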

codecov-commenter commented Aug 6, 2020

Codecov Report

Merging #566 into master will increase coverage by 0.18%.
The diff coverage is 81.25%.


@@            Coverage Diff             @@
##           master     #566      +/-   ##
==========================================
+ Coverage   84.00%   84.19%   +0.18%     
==========================================
  Files          47       47              
  Lines        6848     6664     -184     
==========================================
- Hits         5753     5611     -142     
+ Misses       1095     1053      -42     
Impacted Files                    Coverage Δ
ogusa/SS.py                       88.15% <0.00%> (ø)
ogusa/TPI.py                      76.34% <ø> (ø)
ogusa/execute.py                  14.58% <ø> (ø)
ogusa/tests/test_basic.py         33.33% <ø> (ø)
ogusa/tests/test_execute.py       68.18% <ø> (ø)
ogusa/tests/test_run_ogusa.py     32.65% <ø> (ø)
ogusa/tests/test_startyears.py    50.00% <ø> (ø)
ogusa/tests/test_txfunc.py        65.07% <ø> (-0.53%) ⬇️
ogusa/tests/test_user_inputs.py   60.71% <ø> (ø)
ogusa/tests/test_TPI.py           59.81% <50.00%> (-0.73%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7af1c19...0551707. Read the comment docs.

rickecon closed this Aug 7, 2020
rickecon deleted the memory branch Aug 7, 2020, 21:08