NERSC-specific installation documentation #122

Open · OliviaLynn opened this issue on Jan 3, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation)

@OliviaLynn (Member):

There are two items from #51 that will not necessarily be addressed for v1 but that we may still want to include:

  • instructions to get around the NERSC ceci path issue
  • add a link to NERSC-specific installation instructions
OliviaLynn added the documentation label on Jan 3, 2024
@sschmidt23 (Collaborator) commented on Jan 5, 2024:

What is the "ceci path issue" referring to? Is it the issue that there is an old copy of ceci somewhere in the base python path on NERSC? I usually get around that by creating a custom conda env from scratch, which seems to resolve that issue.

As for NERSC-specific instructions: when creating a custom conda environment, getting mpi4py and parallel hdf5 writing set up correctly at NERSC can be a bit of a pain. This may already be covered somewhere in the RAIL docs, but in case it's not, there's a NERSC page addressing it: https://docs.nersc.gov/development/languages/python/parallel-python/
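Before rebuilding everything, a quick way to check whether an existing environment already has a parallel-capable h5py would be something like this (a sketch; the srun line assumes an interactive allocation, e.g. via salloc):

```bash
# Does this h5py report an MPI (parallel HDF5) build?
python -c "import h5py; print('h5py MPI build:', h5py.get_config().mpi)"

# Do multiple MPI ranks come up? (requires an interactive allocation)
srun -n 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), 'of', c.Get_size())"
```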

I've gotten things working by installing both mpi4py and h5py from source with the following procedure (copy/pasted from a Slack message I sent to Josue a while back):

Following the directions on that Parallel Python page, I could not get the pre-built conda environments nersc-mpi4py or nersc-h5py to work correctly; either mpi4py or h5py would have problems. The solution that worked for me was to install both mpi4py and h5py myself in a new conda environment, following the instructions for that on the NERSC webpage. Here's the rough procedure for how I put together an environment to run rail_tpz in parallel two weeks ago (collected into a single script after the list):

  1. log in to NERSC
  2. do a `module load python` to load the module with a base conda (skip if you have a local conda at NERSC)
  3. run `conda create -n [envname] python=3.10 numpy scipy`
  4. run `conda activate [envname]`
  5. do a `module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu` to make sure that the PrgEnv-gnu module is loaded rather than the default one
  6. install mpi4py with `MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py`
  7. load the parallel HDF5 module at NERSC with `module load cray-hdf5-parallel`
  8. install the h5py build dependencies with `conda install -c defaults --override-channels numpy "cython<3"`
  9. install h5py with `HDF5_MPI=ON CC=cc pip install -v --force-reinstall --no-cache-dir --no-binary=h5py --no-build-isolation --no-deps h5py`
  10. clone whatever rail package you need, e.g. `git clone https://github.com/LSSTDESC/rail_tpz.git`
  11. install rail_tpz (or whichever package) with `pip install -e .` in the rail_tpz directory
We could probably set up a conda environment with steps 1-8 somewhere that users can clone to make things easier. In principle this should also work with the pre-built nersc-h5py and nersc-mpi4py environments, though I could not get those to work for me.
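If we did set up such a shared environment, users could clone it with something like the following (a sketch; the source path is purely illustrative):

```bash
# clone a shared environment into your own namespace;
# the path below is a made-up example, not a real location
conda create -n my-rail-env --clone /global/common/software/lsst/rail-nersc
conda activate my-rail-env
```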

ztq1996 self-assigned this on Jan 31, 2024
@ztq1996 (Contributor) commented on Jan 31, 2024:

I will try this on NERSC, and if it works out we can put it into the documentation and close this issue.

@sschmidt23 (Collaborator) commented on Jan 31, 2024:

Coincidentally, I just ran through the above set of instructions again today to set up a fresh environment to re-train a rail_tpz model, and things worked fine. I submitted a job to the debug queue using 5 processors and everything worked as intended (a sketch of such a submission is below).
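A debug-queue submission along those lines might look like the following (a sketch; the constraint, time limit, and driver script name are placeholders, not the exact job):

```bash
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --constraint=cpu        # Perlmutter CPU nodes (illustrative)
#SBATCH --nodes=1
#SBATCH --ntasks=5              # the 5 processors mentioned above
#SBATCH --time=00:30:00

module load python cray-hdf5-parallel
conda activate [envname]

# run_tpz_train.py stands in for whatever ceci/RAIL driver script you use
srun -n 5 python run_tpz_train.py
```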

Oh, and I missed the (hopefully obvious) step 3.5 in the above instructions: `conda activate [envname]` (now reflected as step 4 above).

@ztq1996 (Contributor) commented on Feb 7, 2024:

I was able to follow Sam's guide to run the tpz notebook on NERSC; we should include this in the installation documentation (of rail_tpz?).

@sschmidt23 (Collaborator) commented:

I think this is more general than rail_tpz; I follow the same procedure if I want to run rail_flexzboost in parallel at NERSC, for example. I'm not sure where the best place for this would be.

@OliviaLynn (Member, Author) commented on Feb 12, 2024:

Notes from the meeting:

  • these instructions could be added to the Installation page in RTD
  • we could benefit from asking Heather if this is the best way to do this
