
--model_dir is inconsistent, confusing, and unnecessary #340

Open
athewsey opened this issue Apr 7, 2020 · 4 comments

Comments

@athewsey

athewsey commented Apr 7, 2020

Raising this here (where I believe the implementation lives) rather than on the SageMaker SDK, which as I understand it just documents the functionality.

Every other SageMaker framework container that I've seen (see docs for MXNet, PyTorch, Scikit-Learn, even Chainer) advises determining the local path to store your model via a pattern like:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

...which is nice and flexible: I can consume all my parameters through argparse, and choose when debugging locally whether to specify via env var or CLI. I simply save my model to args.model_dir.

(FWIW, the few other frameworks I've checked closely don't seem to actually pass it in through the CLI: just the env-var default gets used.)

I don't understand why the TF container insists on passing in the slightly different --model_dir, which:

  • Can't easily be used together with --model-dir, because argparse shifts hyphens to underscores on the output object by default
  • Per SageMaker SDK issue #1355, reports the destination S3 URI for single-instance training but the local model folder for multi-instance training!
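The first bullet's clash can be demonstrated concretely (the paths and S3 URI below are made up). Because argparse converts hyphens to underscores when deriving the attribute name, `--model-dir` and `--model_dir` both land on `args.model_dir`, and whichever value is parsed last wins:

```python
import argparse

parser = argparse.ArgumentParser()
# Intended local output directory (dest becomes "model_dir").
parser.add_argument('--model-dir', default='/opt/ml/model')
# The container-injected flag maps to the *same* dest, so the two collide.
parser.add_argument('--model_dir', default=None)

# Simulating the TF container passing its S3 URI on the command line:
args = parser.parse_args(['--model_dir', 's3://bucket/job/model'])
# args.model_dir is now the S3 URI; the local default is clobbered.
```

There is no separate attribute left for the local directory, so the script has to pick a different name for one of the two.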

I'd argue this doesn't work well for power users (because it's inconsistent with the consensus across other frameworks), but it doesn't work well for beginners either: they have to grapple with a "dir" parameter that isn't a directory at all but an S3 location, and may be misled into thinking they have to upload their model there when in fact SageMaker does it for them.

For extra evidence, see also this repo's #115 (which looks like a case of the confusion this causes) and #130 (where the user specifically commented in code that the parameter is externally useless).

I don't really understand what the use case for the S3 URI in user code would ever be. Whether training is single- or multi-instance, SageMaker can upload the final contents of SM_MODEL_DIR (for multi-instance, you can simply save the model only on your rank-0 instance, right?).

Nothing in the amazon-sagemaker-script-mode samples seems to use the S3 URIs, and in many cases (e.g. A, B, C, D) they specifically implement a separate --model_output_dir to work around the issue.
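The workaround those samples use can be sketched roughly as follows (the `--model_output_dir` name, the fallback path, and the example S3 URI are illustrative, not a prescribed API):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Local directory whose contents SageMaker uploads to S3 after training.
parser.add_argument('--model_output_dir', type=str,
                    default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
# Swallow the container-injected --model_dir so parsing doesn't fail;
# the script never uses its value.
parser.add_argument('--model_dir', type=str, default=None)

args = parser.parse_args(['--model_dir', 's3://bucket/job/model'])
# Save the model under args.model_output_dir and ignore args.model_dir.
```

The two flags now have distinct attribute names, at the cost of an extra argument that exists only to absorb what the container passes in.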

aws/sagemaker-python-sdk#1355 was resolved as a documentation change... To be clear, I'm not claiming the documentation is inaccurate or that there aren't other ways to achieve the same result. I'm asking for a (potentially API-breaking) implementation change, on the basis that the current API is bad for developers.

It would be great to hear what people are using --model_dir S3 URIs for, if I'm wrong!

@laurenyu
Contributor

sorry for the delayed response here.

The main use case for an S3 model_dir is TensorFlow's support for S3 checkpointing during distributed training. TF (the framework itself) does its own uploading to S3, which is why it behaves differently from the other frameworks.

I see your point with

parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

but as you point out later, the argument can be named whatever you supply, so if you're looking to write a framework-agnostic script that runs on each of the images (though I'm not sure what the use case would be), you could always define it as model-output-dir or something else. As you also point out, none of the frameworks actually passes this in as a hyperparameter by default, so you could always save the value of os.environ['SM_MODEL_DIR'] to a variable of your choice in your script.
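Reading the environment variable directly sidesteps argparse entirely, so the name clash with --model_dir never arises. A minimal sketch (the fallback path is the conventional SageMaker location, included here as an assumption):

```python
import os

# Take the output directory straight from the environment; SageMaker sets
# SM_MODEL_DIR inside training containers, and the local fallback keeps
# the script runnable when debugging outside SageMaker.
model_output_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
```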

As for why the TF parameter is named "model_dir", I believe it's to match TF's RunConfig parameter name.

I can see how this has led to confusion, so I've added an item to our internal backlog to revisit this. It would be a breaking change, so it's something that would take some more consideration.

@athewsey
Author

Thanks for explaining the background and considering it on the backlog! 😄

My $0.02 is still that the risk of beginners misunderstanding the basics (thinking they need to upload their model) outweighs the value of consistency with RunConfig - but as you say the change would be breaking so not to be taken lightly!

Also just FYI for anybody who arrives here and chooses to use argparse.parse_known_args() to omit writing unused parser.add_argument('--model_dir', ...) in their scripts: Do note the warning about prefix matching in the argparse doc.
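That prefix-matching warning can bite silently. A minimal illustration (argument names and the S3 URI are made up): with argparse's default allow_abbrev=True, an unknown --model_dir is treated as an abbreviation of any similarly-named argument and hijacks its value, while allow_abbrev=False leaves it in the leftovers where it belongs.

```python
import argparse

# Default allow_abbrev=True: --model_dir is an unambiguous prefix of
# --model_dir_out, so argparse consumes it and overwrites that value.
p = argparse.ArgumentParser()
p.add_argument('--model_dir_out', default='/opt/ml/model')
args, extra = p.parse_known_args(['--model_dir', 's3://bucket/job'])
# args.model_dir_out is now the S3 URI and extra is empty.

# Disabling abbreviation keeps the unknown flag (and its value) in extras.
p2 = argparse.ArgumentParser(allow_abbrev=False)
p2.add_argument('--model_dir_out', default='/opt/ml/model')
args2, extra2 = p2.parse_known_args(['--model_dir', 's3://bucket/job'])
```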

@dslove

dslove commented Oct 16, 2020

I'm trying to use this arg to save trained TF models to the location I want, but it doesn't behave as I expect. Honestly, I haven't yet figured out how it works, and all the documentation discussing it sounds confusing to me. Echoing the OP.

@nonoesp

nonoesp commented Dec 2, 2021

I've definitely found this to be a source of headaches and bugs as well.
