Kestrel #405

Merged
merged 21 commits on Nov 21, 2023
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -29,8 +29,8 @@ log file here
```

**Platform (please complete the following information):**
- Simulation platform: [e.g. Eagle, AWS, local docker; please label with this as well]
- BuildStockBatch version, branch, or sha:
- Simulation platform: [e.g. Kestrel, Eagle, AWS, local docker; please label with this as well]
- BuildStockBatch version, branch, or sha:
- resstock or comstock repo version, branch, or sha:
- Local Desktop OS: [e.g. Windows, Mac, Linux, especially important if running locally]

2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE/pull_request_template.md
@@ -14,5 +14,5 @@ Not all may apply
- [ ] All other unit and integration tests passing
- [ ] Update validation for project config yaml file changes
- [ ] Update existing documentation
- [ ] Run a small batch run on Eagle to make sure it all works if you made changes that will affect Eagle
- [ ] Run a small batch run on Kestrel/Eagle to make sure it all works if you made changes that will affect Kestrel/Eagle
- [ ] Add to the changelog_dev.rst file and propose migration text in the pull request
3 changes: 2 additions & 1 deletion .gitignore
@@ -5,6 +5,7 @@ __pycache__
docs/_build
dask-worker-space
venv/*
.venv
*.DS_Store
*.zip
.pytest_cache
@@ -15,4 +16,4 @@ coverage/
.coverage
build/
.env
.history
.history
2 changes: 1 addition & 1 deletion buildstockbatch/base.py
@@ -3,7 +3,7 @@
"""
buildstockbatch.base
~~~~~~~~~~~~~~~
This is the base class mixed into the deployment specific classes (i.e. eagle, local)
This is the base class mixed into the deployment specific classes (i.e. kestrel, local)

:author: Noel Merket
:copyright: (c) 2018 by The Alliance for Sustainable Energy
2 changes: 1 addition & 1 deletion buildstockbatch/eagle.sh
@@ -12,4 +12,4 @@ df -h
module load conda singularity-container
source activate "$MY_CONDA_ENV"

time python -u -m buildstockbatch.eagle "$PROJECTFILE"
time python -u -m buildstockbatch.hpc eagle "$PROJECTFILE"
2 changes: 1 addition & 1 deletion buildstockbatch/eagle_postprocessing.sh
@@ -29,4 +29,4 @@ pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "df -i; df -h"
$MY_CONDA_ENV/bin/dask scheduler --scheduler-file $SCHEDULER_FILE &> $OUT_DIR/dask_scheduler.out &
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "$MY_CONDA_ENV/bin/dask worker --scheduler-file $SCHEDULER_FILE --local-directory /tmp/scratch/dask --nworkers ${NPROCS} --nthreads 1 --memory-limit ${MEMORY}MB" &> $OUT_DIR/dask_workers.out &

time python -u -m buildstockbatch.eagle "$PROJECTFILE"
time python -u -m buildstockbatch.hpc eagle "$PROJECTFILE"
371 changes: 229 additions & 142 deletions buildstockbatch/eagle.py → buildstockbatch/hpc.py
Member Author

Renamed this file from eagle.py ➡️ hpc.py.

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions buildstockbatch/kestrel.sh
@@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --tmp=1000000
Member Author

This line tells Slurm to give us a node with /tmp/scratch.


echo "Job ID: $SLURM_JOB_ID"
echo "Hostname: $HOSTNAME"
echo "QOS: $SLURM_JOB_QOS"

df -i
df -h

module load python apptainer
source "$MY_PYTHON_ENV/bin/activate"
Comment on lines +13 to +14
Member Author

You'll notice I abandoned conda as our Python package and environment manager. There was too much trouble between it and pip when installing buildstockbatch. I opted to go with the system-installed Python (3.11) and use a venv.
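
For concreteness, here is a minimal sketch of what that setup might look like on Kestrel. The `module load python` line matches the scripts in this PR, but the paths and the install command are placeholders and assumptions, not taken from this PR:

```bash
# Hedged sketch only: create a venv from the system Python module and install
# buildstockbatch into it. Adjust paths for your own environment.
module load python                                 # system Python (3.11 per the comment above)
python -m venv /scratch/$USER/buildstock-env       # example location; a later comment suggests /scratch or /projects
source /scratch/$USER/buildstock-env/bin/activate
pip install --upgrade pip
pip install -e /path/to/buildstockbatch            # placeholder path to a local checkout
```

The kestrel.sh and kestrel_postprocessing.sh scripts in this PR then activate whatever environment MY_PYTHON_ENV points at via `source "$MY_PYTHON_ENV/bin/activate"`.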

Contributor

I was trying to figure out whether we actually still use Ruby natively outside of the container, but it looks like we don't...


time python -u -m buildstockbatch.hpc kestrel "$PROJECTFILE"
34 changes: 34 additions & 0 deletions buildstockbatch/kestrel_postprocessing.sh
@@ -0,0 +1,34 @@
#!/bin/bash
#SBATCH --tmp=1000000
Member Author

This line tells Slurm to give us nodes with /tmp/scratch. Yes, we're using them in the postprocessing: Dask likes to spill extra data to disk, and we tell it to go there.


echo "begin kestrel_postprocessing.sh"

echo "Job ID: $SLURM_JOB_ID"
echo "Hostname: $HOSTNAME"

df -i
df -h

module load python apptainer
source "$MY_PYTHON_ENV/bin/activate"

export POSTPROCESS=1

echo "UPLOADONLY: ${UPLOADONLY}"
echo "MEMORY: ${MEMORY}"
echo "NPROCS: ${NPROCS}"

SCHEDULER_FILE=$OUT_DIR/dask_scheduler.json

echo "head node"
echo $SLURM_JOB_NODELIST_PACK_GROUP_0
echo "workers"
echo $SLURM_JOB_NODELIST_PACK_GROUP_1

pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "free -h"
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "df -i; df -h"

$MY_PYTHON_ENV/bin/dask scheduler --scheduler-file $SCHEDULER_FILE &> $OUT_DIR/dask_scheduler.out &
pdsh -w $SLURM_JOB_NODELIST_PACK_GROUP_1 "$MY_PYTHON_ENV/bin/dask worker --scheduler-file $SCHEDULER_FILE --local-directory /tmp/scratch/dask --nworkers ${NPROCS} --nthreads 1 --memory-limit ${MEMORY}MB" &> $OUT_DIR/dask_workers.out &
Member Author

This wasn't working for me when the Python environment was on /kfs2/shared-projects/envs. There are some permissions issues there; it seems the groups weren't being passed down to the compute nodes. Supposedly they're working on it. I recommend creating your virtualenv on /scratch or /projects for testing.


time python -u -m buildstockbatch.hpc kestrel "$PROJECTFILE"
14 changes: 7 additions & 7 deletions buildstockbatch/sampler/base.py
@@ -40,7 +40,7 @@ def __init__(self, parent):
"""
Create the buildstock.csv file required for batch simulations using this class.

Multiple sampling methods are available to support local & eagle analyses, as well as to support multiple\
Multiple sampling methods are available to support local & hpc analyses, as well as to support multiple\
sampling strategies. Currently there are separate implementations for commercial & residential stock types\
due to unique requirements created by the commercial tsv set.

@@ -52,7 +52,7 @@
ContainerRuntime.LOCAL_OPENSTUDIO,
):
self.csv_path = os.path.join(self.project_dir, "housing_characteristics", "buildstock.csv")
elif self.container_runtime == ContainerRuntime.SINGULARITY:
elif self.container_runtime == ContainerRuntime.APPTAINER:
self.csv_path = os.path.join(self.parent().output_dir, "housing_characteristics", "buildstock.csv")
else:
self.csv_path = None
@@ -81,8 +81,8 @@ def run_sampling(self):
"""
if self.container_runtime == ContainerRuntime.DOCKER:
return self._run_sampling_docker()
elif self.container_runtime == ContainerRuntime.SINGULARITY:
return self._run_sampling_singularity()
elif self.container_runtime == ContainerRuntime.APPTAINER:
return self._run_sampling_apptainer()
else:
assert self.container_runtime == ContainerRuntime.LOCAL_OPENSTUDIO
return self._run_sampling_local_openstudio()
@@ -95,11 +95,11 @@ def _run_sampling_docker(self):
"""
raise NotImplementedError

def _run_sampling_singularity(self):
def _run_sampling_apptainer(self):
"""
Execute the sampling in a singularity container
Execute the sampling in an apptainer container

Replace this in a subclass if your sampling needs docker.
Replace this in a subclass if your sampling needs apptainer.
"""
raise NotImplementedError

2 changes: 1 addition & 1 deletion buildstockbatch/sampler/commercial_sobol.py
@@ -39,7 +39,7 @@ def __init__(self, parent, n_datapoints):
"""
super().__init__(parent)
self.validate_args(self.parent().project_filename, n_datapoints=n_datapoints)
if self.container_runtime == ContainerRuntime.SINGULARITY:
if self.container_runtime == ContainerRuntime.APPTAINER:
self.csv_path = os.path.join(self.output_dir, "buildstock.csv")
else:
assert self.container_runtime in (
10 changes: 5 additions & 5 deletions buildstockbatch/sampler/residential_quota.py
@@ -88,16 +88,16 @@ def _run_sampling_docker(self):
)
return destination_filename

def _run_sampling_singularity(self):
def _run_sampling_apptainer(self):
args = [
"singularity",
"apptainer",
"exec",
"--contain",
"--home",
"{}:/buildstock".format(self.buildstock_dir),
"--bind",
"{}:/outbind".format(os.path.dirname(self.csv_path)),
self.parent().singularity_image,
self.parent().apptainer_image,
"ruby",
"resources/run_sampling.rb",
"-p",
@@ -107,9 +107,9 @@
"-o",
"../../outbind/{}".format(os.path.basename(self.csv_path)),
]
logger.debug(f"Starting singularity sampling with command: {' '.join(args)}")
logger.debug(f"Starting apptainer sampling with command: {' '.join(args)}")
subprocess.run(args, check=True, env=os.environ, cwd=self.parent().output_dir)
logger.debug("Singularity sampling completed.")
logger.debug("Apptainer sampling completed.")
return self.csv_path

def _run_sampling_local_openstudio(self):
3 changes: 2 additions & 1 deletion buildstockbatch/schemas/v0.3.yaml
@@ -6,6 +6,7 @@ weather_files_url: str(required=False)
sampler: include('sampler-spec', required=True)
workflow_generator: include('workflow-generator-spec', required=True)
eagle: include('hpc-spec', required=False)
kestrel: include('hpc-spec', required=False)
Member Author

Just add a kestrel key, like the eagle one you already have, to your project file. Adjust the number of jobs, file locations, and so on. It's all the same structure and format, though (see the sketch at the end of this file's diff).

aws: include('aws-spec', required=False)
output_directory: regex('^(.*\/)?[a-z][a-z0-9_]*\/?$', required=True)
sys_image_dir: str(required=False)
@@ -48,7 +49,7 @@ hpc-spec:
hpc-postprocessing-spec:
time: int(required=True)
n_workers: int(min=1, max=32, required=False)
node_memory_mb: enum(85248, 180224, 751616, required=False)
node_memory_mb: int(min=85248, max=751616, required=False)
n_procs: int(min=1, max=36, required=False)
parquet_memory_mb: int(min=100, max=4096, required=False)
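
To illustrate the comment above about adding a kestrel key, a minimal project-file block might look like the sketch below. Only account, minutes_per_sim, sampling.time, and the postprocessing keys are visible in this diff or the test fixtures; everything else, notably the n_jobs key name and the specific values, is an assumption:

```yaml
# Illustrative sketch only; same structure as an existing eagle block.
kestrel:
  account: myallocation        # placeholder HPC allocation name
  n_jobs: 50                   # "adjust the number of jobs"; exact key name is an assumption
  minutes_per_sim: 3
  sampling:
    time: 20
  postprocessing:
    time: 120
    n_workers: 4
    node_memory_mb: 240000     # any int from 85248 to 751616 now validates (replaces the old Eagle enum)
```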

5 changes: 2 additions & 3 deletions buildstockbatch/test/conftest.py
@@ -3,7 +3,6 @@
import shutil
import tempfile
import yaml
from pathlib import Path

OUTPUT_FOLDER_NAME = "output"

@@ -12,7 +11,7 @@
def basic_residential_project_file():
with tempfile.TemporaryDirectory() as test_directory:

def _basic_residential_project_file(update_args={}, raw=False):
def _basic_residential_project_file(update_args={}, raw=False, hpc_name="eagle"):
output_dir = "simulations_job0" if raw else "simulation_output"
buildstock_directory = os.path.join(test_directory, "openstudio_buildstock")
shutil.copytree(
@@ -101,7 +100,7 @@ def _basic_residential_project_file(update_args={}, raw=False):
"options": [{"option": "Infiltration|11.25 ACH50"}],
}
],
"eagle": {
hpc_name: {
"sampling": {"time": 20},
"account": "testaccount",
"minutes_per_sim": 1,