Kestrel #405

Merged
merged 21 commits into from Nov 21, 2023

Conversation

nmerket
Member

@nmerket nmerket commented Oct 30, 2023

Fixes #313

A start at getting buildstockbatch working on Kestrel.

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on Checks -> BuildStockBatch Tests -> Artifacts)
  • Coverage has increased or at least not decreased. Update minimum_coverage in .github/workflows/ci.yml as necessary.
  • All other unit and integration tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run on Eagle to make sure it all works if you made changes that will affect Eagle
  • Run a small batch run on Kestrel to make sure it all works if you made changes that will affect Kestrel
  • Add to the changelog_dev.rst file and propose migration text in the pull request
  • Change from singularity to apptainer.

@nmerket nmerket self-assigned this Oct 31, 2023
Member Author

@nmerket nmerket left a comment

@rHorsey Here's my "working" Kestrel implementation. A couple of notes:

pre-commit and black

I'm now using pre-commit and black for auto-formatting. I recommend running

pre-commit install

after the pip install. If you don't, it's not the end of the world: CI will run black and fix things for you.

Avoid /shared-projects for now

They're still copying data over from Eagle, and the permissions are all messed up. The top of my testing project file looks like this:

schema_version: '0.3'
os_version: 3.6.1
os_sha: bb9481519e
buildstock_directory: ../ # Relative to this file or absolute
project_directory: project_national # Relative to buildstock_directory
output_directory: /scratch/nmerket/national_baseline2
# weather_files_url: https://data.nrel.gov/system/files/156/BuildStock_TMY3_FIPS.zip
weather_files_path: /scratch/nmerket/weather/BuildStock_TMY3_FIPS.zip
sys_image_dir: /scratch/nmerket/images

You'll need to copy those files to /scratch or /projects as necessary. Also, for now, create the Python environment on Kestrel somewhere other than /shared-projects (the default). Instructions for that are below.
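
A quick way to catch a project file that still points at /shared-projects is to scan its path settings before submitting. This is only a sketch: the three keys come from the project file excerpt above, while the function name and the list of prefixes to avoid are assumptions made for illustration, not part of buildstockbatch.

```python
from pathlib import PurePosixPath

# Keys taken from the project file excerpt; the prefix list is an assumption.
PATH_KEYS = ("output_directory", "weather_files_path", "sys_image_dir")
AVOID_PREFIXES = ("/shared-projects", "/kfs2/shared-projects")


def paths_under_avoided_prefix(cfg, prefixes=AVOID_PREFIXES):
    """Return {key: path} for config paths that sit under an avoided prefix."""
    flagged = {}
    for key in PATH_KEYS:
        value = cfg.get(key)
        if value is None:
            continue  # key not set in this project file
        path = PurePosixPath(value)
        # is_relative_to compares whole path components, so
        # "/shared-projects-old" does not match "/shared-projects".
        if any(path.is_relative_to(prefix) for prefix in prefixes):
            flagged[key] = value
    return flagged
```

An empty result means none of the checked settings live under the avoided locations.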

Member Author

Renamed this file from eagle.py ➡️ hpc.py.

@@ -54,11 +54,12 @@ def get_bool_env_var(varname):
    return os.environ.get(varname, "0").lower() in ("true", "t", "1", "y", "yes")


class EagleBatch(BuildStockBatchBase):
Member Author

Most of the code is the same between the Eagle and Kestrel implementations, so I separated it out into a base class, SlurmBatch, that EagleBatch and KestrelBatch both inherit from.
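
A minimal sketch of that arrangement, not the actual hpc.py code: the shared logic lives on the base class and each subclass only names its cluster. The class names come from the PR; `HPC_NAME` and `hpc_cfg` are hypothetical names used here for illustration.

```python
class SlurmBatch:
    """Shared Slurm logic; subclasses supply only cluster-specific values."""

    HPC_NAME = None  # hypothetical: key of this cluster's block in the project file

    def __init__(self, cfg):
        self.cfg = cfg

    @property
    def hpc_cfg(self):
        # Look up the cluster-specific block ("eagle" or "kestrel") by name,
        # so shared code never has to mention a particular cluster.
        return self.cfg[self.HPC_NAME]


class EagleBatch(SlurmBatch):
    HPC_NAME = "eagle"


class KestrelBatch(SlurmBatch):
    HPC_NAME = "kestrel"
```

With this shape, code written once on `SlurmBatch` reads `self.hpc_cfg` and works unchanged for either cluster.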

Comment on lines -468 to +506
- cores_per_node = 36
- minutes_per_sim = eagle_cfg["minutes_per_sim"]
- walltime = math.ceil(math.ceil(n_sims_per_job / cores_per_node) * minutes_per_sim)
+ minutes_per_sim = hpc_cfg["minutes_per_sim"]
+ walltime = math.ceil(math.ceil(n_sims_per_job / self.CORES_PER_NODE) * minutes_per_sim)
Member Author

Rather than hardcoding the number of cores and similar values, they're now constants on each of the subclasses.
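
To make the walltime arithmetic concrete, here is a small sketch using the same formula as the diff. The value 36 is the previously hardcoded Eagle core count; 104 for Kestrel is an assumption about its standard CPU nodes, and the method name `walltime_minutes` is hypothetical.

```python
import math


class SlurmBatch:
    CORES_PER_NODE = None  # set by each subclass

    def walltime_minutes(self, n_sims_per_job, minutes_per_sim):
        # One simulation runs per core, so a job needs this many rounds
        # of simulations, each taking minutes_per_sim.
        rounds = math.ceil(n_sims_per_job / self.CORES_PER_NODE)
        return math.ceil(rounds * minutes_per_sim)


class EagleBatch(SlurmBatch):
    CORES_PER_NODE = 36  # the value that used to be hardcoded


class KestrelBatch(SlurmBatch):
    CORES_PER_NODE = 104  # assumed core count for Kestrel nodes
```

For 100 simulations per job at 3 minutes each, the Eagle class needs three rounds (9 minutes) while the Kestrel class needs one (3 minutes), which is why the constant has to differ per cluster.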

@@ -677,6 +716,68 @@ def rerun_failed_jobs(self, hipri=False):
        self.queue_post_processing(job_ids, hipri=hipri)


class EagleBatch(SlurmBatch):
Member Author

This is the Eagle-specific implementation. It's mostly constants, defaults, and validation that only apply to Eagle.



class KestrelBatch(SlurmBatch):
    DEFAULT_SYS_IMAGE_DIR = "/kfs2/shared-projects/buildstock/singularity_images"
Member Author

They're supposed to have this fixed by next week, but /shared-projects isn't ready on Kestrel yet. They're still copying data over from Eagle, and the permissions are all messed up. I'd recommend doing your testing from /scratch or /projects for now.

pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "df -i; df -h"

$MY_PYTHON_ENV/bin/dask scheduler --scheduler-file $SCHEDULER_FILE &> $OUT_DIR/dask_scheduler.out &
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "$MY_PYTHON_ENV/bin/dask worker --scheduler-file $SCHEDULER_FILE --local-directory /tmp/scratch/dask --nworkers ${NPROCS} --nthreads 1 --memory-limit ${MEMORY}MB" &> $OUT_DIR/dask_workers.out &
Member Author

This wasn't working for me when the Python environment was on /kfs2/shared-projects/envs. Some permissions are messed up there; it looked like the groups weren't being passed down to the compute nodes, or something similar. Supposedly they're working on it. I recommend creating your virtualenv on /scratch or /projects for testing.

@@ -6,6 +6,7 @@ weather_files_url: str(required=False)
sampler: include('sampler-spec', required=True)
workflow_generator: include('workflow-generator-spec', required=True)
eagle: include('hpc-spec', required=False)
kestrel: include('hpc-spec', required=False)
Member Author

Just add a kestrel key to your project file, like the eagle one you already have. Adjust the number of jobs, file locations, and so on. The structure and format are the same, though.
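
Since both keys share the same hpc-spec, one way to picture the change is cloning the existing eagle block and overriding what differs. This sketch works on an already loaded project config as a plain dict; the function name is hypothetical, and `n_jobs` is assumed to be the setting for the number of jobs (`minutes_per_sim` appears in the diff above).

```python
import copy


def add_kestrel_block(cfg, **overrides):
    """Return a copy of cfg with a `kestrel` block cloned from `eagle`."""
    new_cfg = copy.deepcopy(cfg)
    # Start from the Eagle settings, then apply Kestrel-specific values.
    kestrel = copy.deepcopy(cfg["eagle"])
    kestrel.update(overrides)
    new_cfg["kestrel"] = kestrel
    return new_cfg
```

The original dict is left untouched, so the eagle block keeps working as before.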

Comment on lines +13 to +14
module load python apptainer
source "$MY_PYTHON_ENV/bin/activate"
Member Author

You'll notice I abandoned conda as our Python package and environment manager. It and pip caused too much trouble for each other when installing buildstockbatch. I opted to go with the system-installed Python (3.11) and use a venv.

Contributor

I was trying to figure out whether we still use native Ruby outside of the container, but it looks like we don't...

Member Author

By default the environment installs to /shared-projects, so you'll want to override that. Also, we're now using Python venv for environments instead of conda, so activating is a little different.

module load git # yes, really
git clone git@github.com:NREL/buildstockbatch.git
cd buildstockbatch
git checkout kestrel
mkdir -p /scratch/$USER/envs
./create_kestrel_env.sh -e /scratch/$USER/envs -d mybsb
source /scratch/$USER/envs/mybsb/bin/activate
buildstock_kestrel path/to/project_file.yml

Comment on lines +68 to +69
"buildstock_eagle=buildstockbatch.hpc:eagle_cli",
"buildstock_kestrel=buildstockbatch.hpc:kestrel_cli",
Member Author

This adds a separate buildstock_kestrel CLI.
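
For readers unfamiliar with console script entries, each string has the form `name=module:function`: the part before `=` becomes the command name, and the part after it names the function to call. The helper below is purely illustrative (it is not how setuptools is implemented) and just splits one of the strings above into its parts.

```python
def parse_console_script(spec):
    """Split a 'name=module:function' entry into its three parts."""
    name, _, target = spec.partition("=")
    module, _, func = target.partition(":")
    return name.strip(), module.strip(), func.strip()


command, module, func = parse_console_script(
    "buildstock_kestrel=buildstockbatch.hpc:kestrel_cli"
)
```

Here `command` is the name typed at the shell, and `module` and `func` identify the function it runs.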

github-actions bot commented Oct 31, 2023

File                                     Coverage
All files                                86%
base.py                                  90%
exc.py                                   57%
hpc.py                                   78%
local.py                                 70%
postprocessing.py                        84%
utils.py                                 91%
cloud/docker_base.py                     93%
sampler/base.py                          79%
sampler/downselect.py                    33%
sampler/precomputed.py                   93%
sampler/residential_quota.py             61%
test/shared_testing_stuff.py             85%
test/test_docker.py                      33%
test/test_validation.py                  97%
workflow_generator/base.py               90%
workflow_generator/commercial.py         53%
workflow_generator/residential_hpxml.py  86%

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against 3461316

@nmerket
Member Author

nmerket commented Nov 1, 2023

It just occurred to me that a venv created by one user might not be usable by another user (which was possible with conda). We should check whether that works and, if it doesn't, switch back to conda.
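
A rough first check along those lines is whether the venv's files are readable and executable by the owning group. This is only a sketch with a hypothetical helper name; permission bits are one part of the picture, and group membership, ACLs, and paths baked into the venv would still need testing with a real second user.

```python
import os
import stat


def group_can_use(path):
    """True if the owning group has both read and execute permission on path."""
    mode = os.stat(path).st_mode
    needed = stat.S_IRGRP | stat.S_IXGRP
    return (mode & needed) == needed
```

It could be pointed at the venv directory and at its bin/python, for example `group_can_use("/scratch/$USER/envs/mybsb")` with the real path substituted.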

@afontani
Collaborator

afontani commented Nov 2, 2023

This would be a nice-to-have feature: Issue 171

@nmerket nmerket marked this pull request as ready for review November 8, 2023 23:21
@nmerket
Member Author

nmerket commented Nov 8, 2023

This would be a nice-to-have feature: Issue 171

@afontani Let's add that once we have a better handle on what the appropriate limits are on Kestrel.

@nmerket nmerket requested a review from rHorsey November 8, 2023 23:24
@nmerket nmerket mentioned this pull request Nov 10, 2023
9 tasks
Contributor

@joseph-robertson joseph-robertson left a comment

I successfully ran a 100-datapoint test project (timeseries_frequency=none). Annual results were postprocessed and uploaded as expected.

@nmerket nmerket merged commit 24a6fa3 into develop Nov 21, 2023
6 checks passed
@nmerket nmerket deleted the kestrel branch November 21, 2023 20:26
Successfully merging this pull request may close these issues.

Kestrel Workflow
4 participants