Default number of workers for multiprocessing #62

Open
ieivanov opened this issue Jul 21, 2023 · 6 comments
@ieivanov
Collaborator

@talonchandler and I found that on bruno multiprocessing.cpu_count() returns the total number of CPUs on the given machine, which is usually more than the user has reserved. In that sense, multiprocessing.cpu_count() is not a good default value on bruno (as @edyoshikun has also found). We should check whether bruno exposes an env variable that gives the number of CPUs reserved and use that as the default. If that variable doesn't exist (e.g. when not working on bruno), we can fall back to multiprocessing.cpu_count().

@talonchandler
Contributor

Thanks for summarizing @ieivanov.

@edyoshikun I think you've been using multiprocessing on slurm the most, so I'd appreciate your comments. I just took a quick look with srun --cpus-per-task=13 --pty bash followed by env | grep SLURM, and I found some good candidate environment variables: SLURM_CPUS_PER_TASK=13, SLURM_JOB_CPUS_PER_NODE=13, SLURM_CPUS_ON_NODE=13.
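
For reference, a minimal sketch of the fallback logic @ieivanov described, assuming we settle on SLURM_CPUS_PER_TASK (the helper name here is just illustrative):

import multiprocessing
import os

def default_num_workers() -> int:
    # Prefer the SLURM reservation if present, otherwise fall back to all visible CPUs
    reserved = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(reserved) if reserved else multiprocessing.cpu_count()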

@ieivanov
Collaborator Author

I think this now becomes a question for griznog: what are the differences between these variables?

@edyoshikun
Contributor

Thanks for finding this. When we run the bash script for slurm we use --cpus-per-task as the variable to reserve CPUs, so I think this is probably the best candidate.

# Get the number of CPUs per task from the job record
CPUS_PER_TASK=$(scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | awk -F= '/^CPUs\/Task/ {print $2}')

I think griznog or chatgpt would be good candidates to ask.

@edyoshikun
Contributor

Also related, though maybe part of a different conversation: I also found the sacct command, which I think Jordao mentioned at some point, for checking how many resources a job took (i.e. CPU and memory usage).

# View resource usage for a specific completed job (replace JOB_ID with the actual job ID)
sacct -j JOB_ID

@talonchandler
Contributor

Sounds good, I'll trial SLURM_CPUS_PER_TASK and see how it works. (@edyoshikun FYI this variable is directly available from within a SLURM job, so I don't think I'll need to use the scontrol and awk stuff?)


I also recently discovered seff JOB_ID, which directly solves at least one of the headaches we were dealing with: monitoring maximum memory utilization. @ieivanov

Job ID: 10022176
Cluster: cluster
User/Group: eduardo.hirata/eduardo.hirata.grp
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 00:09:02
CPU Efficiency: 12.01% of 01:15:12 core-walltime
Job Wall-clock time: 00:02:21
Memory Utilized: 6.17 GB
Memory Efficiency: 38.58% of 16.00 GB

@ieivanov
Collaborator Author

We should unify the --num-processes option across CLI calls using a decorator, similar to input_position_dirpaths for example. This will ensure we have the right number of workers for every CLI call.
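
As a rough sketch of what that decorator could look like, assuming the CLI is built on click (the option and helper names here are illustrative, not the existing API):

import multiprocessing
import os

import click

def _default_num_workers() -> int:
    # Prefer the SLURM reservation; fall back to all visible CPUs
    reserved = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(reserved) if reserved else multiprocessing.cpu_count()

def num_processes_option(func):
    # Reusable decorator so every CLI command exposes the same --num-processes option
    return click.option(
        "--num-processes",
        "-j",
        type=int,
        default=_default_num_workers,  # click accepts a callable default
        help="Number of worker processes.",
    )(func)

@click.command()
@num_processes_option
def example_command(num_processes):
    click.echo(f"Using {num_processes} workers")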
