Datasets and Datastore clarification #9

Open
aminsaied opened this issue Dec 7, 2020 · 3 comments

@aminsaied
Collaborator

The current page on datasets and datastores needs clarifying:

  • There are too many ways to do the same thing. Make the recommended approach (datasets) clear.
  • For adding a DataReference to ScriptRunConfig we mention secret environment variables. Make it clear that the user does not need to know about these to use DataReference in their script.
@aminsaied
Collaborator Author

It would also be good to have an OutputDatasetConsumptionConfig example, and to explain how backing a FileDataset with a Datastore works for input versus output data, and why the two differ (i.e. ReadOnly vs. ReadWrite).

@aminsaied
Collaborator Author

Include an example of using dataset.as_mount() with a command, like this:

command=["python train.py --training-data", dataset.as_mount()]

If you want to use an environment variable instead, pass in dataset.as_named_input('env_varname').as_mount().
azureml.data.abstract_dataset.AbstractDataset class - Azure Machine Learning Python | Microsoft Docs
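
For the script side of this, here is a minimal sketch of how train.py might pick up the mounted path, using only the standard library. It assumes the mount path arrives either as the value of --training-data or in an environment variable with the same name that was passed to as_named_input(). The function name resolve_training_data and the constant INPUT_NAME are hypothetical, and the environment-variable behaviour should be checked against the SDK docs before relying on it.

```python
import argparse
import os

# Hypothetical name: it must match the string given to as_named_input()
# when the run is submitted.
INPUT_NAME = "env_varname"


def resolve_training_data(argv=None, environ=None):
    """Return the dataset path from --training-data, else from the environment."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    parser.add_argument("--training-data", default=None)
    # Ignore any other arguments the training script may define elsewhere.
    args, _unknown = parser.parse_known_args(argv)

    # A path given on the command line takes precedence.
    if args.training_data:
        return args.training_data

    # Assumption: the run exposes the named input's path under this variable.
    path = environ.get(INPUT_NAME)
    if path is None:
        raise SystemExit(
            "no dataset path: pass --training-data or set " + INPUT_NAME
        )
    return path
```

In train.py this would be called once at startup, e.g. data_dir = resolve_training_data(), and the files read from data_dir as ordinary local paths.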

@arnabbiswas1

The AML documentation no longer recommends using the DataReference class (source):

It is no longer the recommended approach for data access and delivery in Azure Machine Learning. Dataset supports accessing data from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL through unified interface with added data management capabilities. It is recommended to use dataset for reading data in your machine learning projects.

Maybe it makes sense to remove the DataReference (link) section from the cheat sheet. Otherwise, it's easy to confuse Dataset and DataReference.

I personally was not aware of DataReference before and thought it was now the recommended way (after the API enhancement).
