Document how to do SLSA for ML and highlight gaps #978

Open
MarkLodato opened this issue Oct 9, 2023 · 1 comment

Comments

@MarkLodato
Member

There have been some questions as to what "SLSA for ML" looks like. This issue attempts to give a short synopsis so that we can hopefully agree and turn that into durable documentation.

First, Machine Learning (ML) models fit the SLSA Model at a high level:

  • Any transformation process is a "build", such as data cleaning or model training.
  • Training data and input models are "dependencies".

This is not obvious to most readers, so we should document it.
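To make the mapping concrete, here is a minimal sketch of what provenance for a single training step could look like, using the field names from the in-toto Statement v1 and SLSA Provenance v1 formats. The `buildType` and builder `id` URIs, the file names, and the `epochs` parameter are made-up placeholders for illustration, not values defined by SLSA.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def training_provenance(model: bytes, dataset: bytes, base_model: bytes) -> dict:
    """Describe a training run as a SLSA "build".

    The trained model is the subject (the build output); the training data
    and the input model are listed as resolved dependencies.
    """
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": "model.bin", "digest": {"sha256": sha256_hex(model)}},
        ],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": {
            "buildDefinition": {
                # Hypothetical build type URI for a training process.
                "buildType": "https://example.com/ml-training/v1",
                # Hypothetical tenant-controlled parameter.
                "externalParameters": {"epochs": 3},
                "resolvedDependencies": [
                    {"name": "train.csv",
                     "digest": {"sha256": sha256_hex(dataset)}},
                    {"name": "base-model.bin",
                     "digest": {"sha256": sha256_hex(base_model)}},
                ],
            },
            # Hypothetical identifier for the training platform.
            "runDetails": {"builder": {"id": "https://example.com/trainer"}},
        },
    }


if __name__ == "__main__":
    statement = training_provenance(b"weights", b"rows", b"base weights")
    print(json.dumps(statement, indent=2))
```

A data cleaning step would look the same, with the cleaned dataset as the subject and the raw dataset as a dependency.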

Second, ML highlights some gaps or challenges in SLSA that are not really specific to ML but may be a higher priority or more painful for ML. They include:

  • ML training processes often use specialized ML hardware, highly distributed training jobs, and/or highly iterative notebooks like Colab. These may be more work to adapt to a verifiable build architecture where there is a trusted control plane that the tenant cannot influence. Alternatively, a reproducible build architecture may be challenging due to non-determinism.
  • Training data (considered "dependencies") is critical to the ML training process, yet:
    • Standards for identifying and labeling datasets are less mature than they are for conventional software.
    • The SLSA Build track does not yet have a level that sets a minimum completeness of dependencies (Workstream: SLSA Build L4 #977).
    • SLSA does not yet have a transitive concept to describe properties of the entire ML supply chain, beyond a single build step. This is important for conventional software but critical for ML, since ML supply chains are very deep.

All of these are surmountable, but they are worth documenting.
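As a rough illustration of the transitive gap, the sketch below walks a multi-step ML supply chain and reports which artifacts have no provenance at all. It is not something SLSA defines today: the `ProvenanceIndex` type, the `transitive_dependencies` function, and the artifact names are all hypothetical, and it assumes each step's provenance has already been verified and reduced to a list of dependency identifiers.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical index: artifact identifier -> identifiers of the dependencies
# recorded in that artifact's (already verified) provenance.
ProvenanceIndex = Dict[str, List[str]]


def transitive_dependencies(
    root: str, index: ProvenanceIndex
) -> Tuple[Set[str], Set[str]]:
    """Walk the supply chain starting at `root`.

    Returns (reachable, unattested): every artifact reachable from `root`,
    including `root` itself, and the subset that has no provenance entry.
    """
    reachable: Set[str] = set()
    unattested: Set[str] = set()
    stack = [root]
    while stack:
        artifact = stack.pop()
        if artifact in reachable:
            continue  # already visited; also guards against cycles
        reachable.add(artifact)
        if artifact not in index:
            # No provenance: either an original source (e.g. a raw dataset)
            # or a gap in the chain. This sketch cannot tell which.
            unattested.add(artifact)
            continue
        stack.extend(index[artifact])
    return reachable, unattested


if __name__ == "__main__":
    index: ProvenanceIndex = {
        "served-model": ["finetune-data", "base-model"],
        "base-model": ["pretrain-data"],
        "finetune-data": ["raw-data"],
    }
    reachable, unattested = transitive_dependencies("served-model", index)
    print(sorted(reachable))
    print(sorted(unattested))
```

Even this toy chain is three steps deep, which is the point: a policy that only looks at the provenance of `served-model` says nothing about `pretrain-data` or `raw-data`.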

Any thoughts in agreement or disagreement? I'll try to update this top post with the consensus. If you have other challenges, I can add them as well.

@joshuagl
Member

Thanks for the initial thoughts on this. This does seem worth documenting. I agree strongly with the assertion that any transformation process maps to a "build" and that data inputs map well to dependencies.
