I think it is doable to natively support training of ensemble networks. My idea would be the following:
- Create a new parameter `parallel_execution`, which defaults to 1.
- If the parameter is anything but 1, the `Parameters()` object will, upon initialization, try to initialize an MPI library and query the number of ranks and the rank of the current process.
- Should this not succeed, serial execution is assumed.
- If it succeeds, the same script is executed `nr_ranks` times in parallel, without communication between the ranks.
- Saving of parameters, network weights, etc. will be made parallel-safe, e.g. by including "_rank" in the file name (see the sketch below).
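A minimal sketch of how this could look, assuming `mpi4py` as the MPI backend; the class layout and the `_setup_parallel_execution` / `rank_safe_filename` helpers are purely illustrative and not existing MALA API:

```python
import os


class Parameters:
    """Illustrative stand-in for the MALA Parameters object."""

    def __init__(self):
        self.parallel_execution = 1  # default: ordinary serial run
        self._rank = 0
        self._size = 1

    def _setup_parallel_execution(self):
        # Only attempt MPI setup if more than one ensemble member is requested.
        if self.parallel_execution == 1:
            return
        try:
            from mpi4py import MPI  # assumed MPI backend
            comm = MPI.COMM_WORLD
            self._rank = comm.Get_rank()
            self._size = comm.Get_size()
        except ImportError:
            # MPI library not available: fall back to serial execution.
            self._rank, self._size = 0, 1

    def rank_safe_filename(self, filename):
        # Make saved parameter/network files parallel-safe via a rank suffix.
        if self._size > 1:
            root, ext = os.path.splitext(filename)
            return f"{root}_rank{self._rank}{ext}"
        return filename
```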
In the end, the user only has to write ONE script but can use it to train an ensemble of networks by simply requesting the resources from Slurm and doing:
mpirun -np nr_ranks training.py
and editing training.py to include something like
params.parallel_execution=5
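Put together, a hypothetical training.py could then look like the following, assuming the usual `mala.Parameters()` entry point; the rest of the workflow is only indicated in comments:

```python
# Hypothetical training.py for ensemble training via MPI.
import mala

params = mala.Parameters()
params.parallel_execution = 5  # request an ensemble of 5 networks

# ... set up data handler, network and trainer as usual ...
# Each of the 5 MPI ranks runs this same script independently; anything
# saved to disk would carry a "_rank" suffix so ensemble members do not
# overwrite each other.
```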
I am aware that the same can be done with bash scripts, but I would argue this way is more user-friendly. Also, I'd like to offer MALA-native solutions to problems where possible, and I believe that is possible here.