Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

28 create simplest running azure batch kick off script #30

Draft
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

natemcintosh
Copy link
Collaborator

Initial take on this, based on discussion here.

@natemcintosh natemcintosh linked an issue Mar 18, 2024 that may be closed by this pull request
9 tasks
@natemcintosh
Copy link
Collaborator Author

natemcintosh commented Mar 19, 2024

Still to do:

  • Determine a pool name. E.g. multisignal-epi-inference
  • Provide actual implementations of
    • create_docker_cmd()
    • create_mdl_cfg_filename()
    • create_pp_cfg_filename()

The hard part is figuring out exactly how to name the model configuration files, and the post production configuration files. I think that part will depend on the geographies that the models are running on, and the dates. Once that part is better defined, I think this will be easy to do.

@ghost
Copy link

ghost commented Mar 22, 2024

Hey @natemcintosh, what's the plan for this? Do you need anything from me?

@natemcintosh
Copy link
Collaborator Author

Hi George, I've mostly been working on setting up the gold tables for the NSSP pull.

The list of tasks above is pretty much what still needs to be done. If you have any input on the best way to name the configuration files (they each need to have a unique name for when we upload to Azure blob storage), that would be helpful.

The other major improvement that could be made is to improve the testing. Right now test_local_e2e() only tests that at least one call to client.add_task() is made. It would be better if we could verify the correct number of calls is made, with the correct arguments, and that the the other client methods are called correctly as well.

@ghost
Copy link

ghost commented Mar 22, 2024

I'll think about names and start planning on an e2e example

@ghost
Copy link

ghost commented Mar 22, 2024

Naming

For naming, I think we need to consider the following:

  1. The version control system is with us, so there is no need to have the versioning part of the name (only date, maybe).
  2. I like the pattern where the sub-files follow the name of the main file and add a suffix, e.g., flu-and-covid-winter2025.json -> (flu-and-covid-winter2025-all-us.json, flu-and-covid-winter2025-state1-and-state2.json, etc.)
  3. I think we should have a way to sort files based on creation date, e.g., 00-flu-and-covid-winter2025, 01-flu-winter2025, 02-flu-and-covid-winter2025-pooled.

Idea: Name files using both the date and a descriptive handle, e.g., YYYY-mm-dd_[handle]+.json for the main file, and YYYY-mm-dd_[handle].[sub handle].json for the sub-files; [handle] and [sub handle] should be [:alnum:]+ so it's easier to parse files.

Another thought, since the number of files will be large, perhaps each project should be within a folder named YYYY-mm-dd_[handle], and under the hood, have:

YYYY-mm-dd_[handle 1]/
  - `00-config.json`
  - `01-log-[sub handle 1].log`
  - `01-log-[sub handle 2].log`
  - ...
  - `02-model-[sub handle 1].json`
  - `02-model-[sub handle 2].json`
  - ...
  - `03-post-[sub handle 1].json`
  - `03-post-[sub handle 2].json`
  - ...
YYYY-mm-dd_[handle 2]/
  - `00-config.json`
  - `01-log-[sub handle 1].log`
  - `01-log-[sub handle 2].log`
  - ...
  - `02-model-[sub handle 1].json`
  - `02-model-[sub handle 2].json`
  - ...
  - `03-post-[sub handle 1].json`
  - `03-post-[sub handle 2].json`
  - ...

test_local_e2e()

It could make sure that the job, whatever it is, saves the right information (or a message) to a logger, which could be the 01-log-[sub handle].log.

@natemcintosh
Copy link
Collaborator Author

@gvegayon I like the idea of having top level folder with a date and "run" name, e.g. YYYY-mm-dd_flu_and_covid, and then various files associated with this run live under it. Perhaps the run name could be a third top level field in the large input config file along with model and post_production.

Something we have been doing at the top level of the Rt pipeline is using two storage containers: /input/ and /output/. If we continued this practice, there would be a YYYY-mm-dd_flu_and_covid in both the input and output storage containers. I believe cfa-azure currently supports this structure.

However, if we think it would be more advantageous to store both input and output under the same run folder (which I think is what you are suggesting above?), it would certainly make it easier to see both the inputs and outputs for a model run in one place. I'm not 100% sure that cfa-azure currently supports bind mounting to a single storage container, but I'm sure it won't be too hard to figure out. If we do it this way, perhaps we could have a structure like

YYYY-mm-dd_[handle 1]/
  - input/
    - `00-config.json`
    - `01-model-[sub handle 1].json`
    - `01-model-[sub handle 2].json`
    - `02-post-[sub handle 1].json`
    - `02-post-[sub handle 2].json`
    - ...
  - output/
    - `log-[sub handle 1].log`
    - `log-[sub handle 2].log`
    - ...
YYYY-mm-dd_[handle 2]/
  - input/
    - `00-config.json`
    - `01-model-[sub handle 1].json`
    - `01-model-[sub handle 2].json`
    - `02-post-[sub handle 1].json`
    - `02-post-[sub handle 2].json`
    - ...
  - output/
    - `log-[sub handle 1].log`
    - `log-[sub handle 2].log`
    - ...

What do you think?

Copy link

codecov bot commented May 20, 2024

Codecov Report

Attention: Patch coverage is 84.12698% with 20 lines in your changes are missing coverage. Please review.

Project coverage is 90.89%. Comparing base (45deb0d) to head (b09fd5c).

Files Patch % Lines
pipeline/pipeline/submit_main.py 76.47% 20 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
- Coverage   92.24%   90.89%   -1.35%     
==========================================
  Files          34       34              
  Lines         683      802     +119     
==========================================
+ Hits          630      729      +99     
- Misses         53       73      +20     
Flag Coverage Δ
unittests 90.89% <84.12%> (-1.35%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create simplest, running, Azure Batch kick-off script
1 participant