Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for Unified Pluggable Scanner API #164

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

prahaladdarkin
Copy link
Contributor

Scope of change:

Introduced the Unified Pluggable Scanner API proposal based on the discussions in the Harbor Scanner Workgroup.

Signed-off-by: prahaladd prahaladd@vmware.com

@danielpacak danielpacak self-requested a review May 28, 2021 13:05
Copy link
Contributor

@danielpacak danielpacak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @prahaladdarkin This is great start. I left a few comments / asks for clarification. Overall I'd focus more on onboarding new tools and keep API structures as close to v1.1 as possible without breaking compatibility.

Adding more examples of a hypothetical adapter that supports both vulnerability scan and license information, including happy path and partial errors would be very helpful to understand the end to end integration.

@@ -0,0 +1,177 @@

# Proposal: Unified API for Pluggable Image Scanners
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this title suggests the ultimate goal of this proposal.

IIRC we wanted to leverage API v1.1 which is already defined as a minimal set of API endpoints and leverages MIME types. IMO the main focus should be around new MIME types and new scanners / security tools to onboard incrementally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielpacak The reason I thought about this title is because of the fact that we are presenting unified endpoints for scanning different types of data. The details about the MIME types being supported are still subject to discussion with respect to the names to be used etc. The importance of these MIME types has been highlighted in the section that describes how the client can retrieve the various types of results and also in the section Supporting Multiple Mime Types.
We can elaborate on these MIME types if required

"repository": "library/mongo",
"digest": "sha256:917f5b7f4bef1b35ee90f03033f33a81002511c1e0767fd44276d4bd9cd2fa8e"
}
"scanTypes": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the scanType property is a subset of produced MIME types advertised by the /metadata endpoint maybe we can think about different / consistent naming here. More importantly, I'd use an array of objects instead of a map with MIME types as a key to enable further extensibility:

{
  "scanTypes": [
    {
      "mimeType": "application/vnd.security.vulnerability.report; version=1.1",
      "params": {}
    },
    {
      "mimeType": "application/vnd.security.bom.report; version=1.0",
      "params": {
        "allowGNU": false,
        "format": "spdx"
      },
      "propertyAddedInTheFuture": {}
    }
  ]
}

I'd also justify why combining all mime types in a single request is better than multiple scan requests per mime type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the MIME type naming part. perhaps we can all agree upon some well defined names. These could also be specified within the OCI spec that @steven-zou started.

Additionally, I am in fact using an array of objects to specify scan types. The value of the mimeType field is used to activate a particular type of scanner. However, the activation can be controlled using the params object that can have scanner specific parameters or settings for fine-tuning.
With respect to the specification of MIME types in a request, it is up to the client to decide the data fetch mechanism. For e.g. the above structure allows the clients to make multiple API invocation each time specifying a single MIME type in the request. Another advanced client can on the other hand specify multiple MIME types and wait for the results of all being available. Hence this provides the required flexibility to the client who can decide the best approach to handle scan requests based on it's capabilities. I will add this part to the proposal


The response from the scanner are packaged in an uber-container structure again keyed by the MIME type. The object against each MIME type would contain the following
- Internal scan tracking identifier for the composite scan.
- The scan id and status for each MIME type. The status would be either one of started, completed, running, aborted, failed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this complexity. It's non backward compatible and requires designing and implementing a workflow to rescan for a particular MIME types in case of partial errors. I believe that most of the time a scanner adapter will focus on one MIME types (e.g. existing vulnerability scanners). For scanners that support multiple MIME types it's probably acceptable to retry all scan types.

Beyond that, scanner adapters will have to maintain composite identifiers to implement state machine for each scan type. We'll have to upgrade all existing adapters as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will give in some more thought to this aspect. Lets discuss this in a meeting.

* Create an uber-container data structure which will contain the scan results keyed by the corresponding MIME types and send the result to client.

### Scan API response handling
The scanning process would be executing multiple types of scans asynchronously and hence a sophisticated response handling mechanism would required.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the example above there's one scan request with multiple scanTypes indexed by mime type. On the other hand we'd be tracking multiple scan identifiers embedded in the scan response. Why not sending multiple scan requests and tracking them one by one? Or if we stick to one request with multiple scanTypes we could track all mime types as a whole

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Th reason I maintained per scan type based ID is because of the fact that some scans may take longer than expected to complete (with either success or failure). In such cases given that not all scan types finish at the same time we may want the client to start reporting status for each type of scan job separately thereby making it more response as opposed to waiting for all scans to complete. With such a workflow the scan ID would be needed to be per scan type since we would use the existing reporting mechanisms that fetch scan status based on scan identifiers.
If we all agree upon a workflow where-in the client would wait for all the component scans to finish before reporting a success/failure status then we can track it using a single scan Id.

### Scan API response handling
The scanning process would be executing multiple types of scans asynchronously and hence a sophisticated response handling mechanism would required.

For a given job, the multiplexer component would maintain a mapping between the composite scan Id and the granular status of each of the component scans as and when received from the scanner.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some scanners keeping such granularly and multiple scan identifiers might be overly complex. Especially for existing scanners that support only one mime type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give in some more thoughts to this aspect. Lets discuss it in our meeting

- how is the error reported to the caller?
- what is the impact of an error on the overall scan job?

A clean, simple and efficient mechanism to handle errors would be as follows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return an array of errors. Each error might have a mimeType or scanType property to indicate which one is failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what exactly I have mentioned below. We will have an array of objects (one per MIME type) and the corresponding results/error data. Did you imply that some other detail was missing?
I will add a response type example for such a situation.

2. Adopts a "what you send is what you receive" model.
3. Allows clients to control what data they want from the scanner adapter - when the same scanner supports multiple types of scan functionalities
4. Provides a mechanism to fine tune scanning operation.
5. Incremental changes on the existing API model exposed by Harbor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that complex structure of scan identifiers per mime type would required lots of changes in Harbor core / jobservice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss this. will also like to know what @steven-zou thinks about this part.
from what I understand - the existing Harbor job service would only have to implement a thin aggregation layer where-in the a mapping of the scan request to it's corresponding scan type jobs are maintained.
The underlying service would still have the same flow. I am however yet to gain some insights in this area and hence any inputs from @steven-zou and @heww would also help clarify

@steven-zou steven-zou added the area/interrogation-service Services like vulnerability scanning and compliance checking etc. label Jun 10, 2021
Signed-off-by: prahaladd <prahaladd@vmware.com>
Copy link
Contributor Author

@prahaladdarkin prahaladdarkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to some review comments with some clarifications. We can discuss in detail in the meeting and then I will make the required changes.

@@ -0,0 +1,177 @@

# Proposal: Unified API for Pluggable Image Scanners
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielpacak The reason I thought about this title is because of the fact that we are presenting unified endpoints for scanning different types of data. The details about the MIME types being supported are still subject to discussion with respect to the names to be used etc. The importance of these MIME types has been highlighted in the section that describes how the client can retrieve the various types of results and also in the section Supporting Multiple Mime Types.
We can elaborate on these MIME types if required

"repository": "library/mongo",
"digest": "sha256:917f5b7f4bef1b35ee90f03033f33a81002511c1e0767fd44276d4bd9cd2fa8e"
}
"scanTypes": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the MIME type naming part. perhaps we can all agree upon some well defined names. These could also be specified within the OCI spec that @steven-zou started.

Additionally, I am in fact using an array of objects to specify scan types. The value of the mimeType field is used to activate a particular type of scanner. However, the activation can be controlled using the params object that can have scanner specific parameters or settings for fine-tuning.
With respect to the specification of MIME types in a request, it is up to the client to decide the data fetch mechanism. For e.g. the above structure allows the clients to make multiple API invocation each time specifying a single MIME type in the request. Another advanced client can on the other hand specify multiple MIME types and wait for the results of all being available. Hence this provides the required flexibility to the client who can decide the best approach to handle scan requests based on it's capabilities. I will add this part to the proposal

* Create an uber-container data structure which will contain the scan results keyed by the corresponding MIME types and send the result to client.

### Scan API response handling
The scanning process would be executing multiple types of scans asynchronously and hence a sophisticated response handling mechanism would required.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Th reason I maintained per scan type based ID is because of the fact that some scans may take longer than expected to complete (with either success or failure). In such cases given that not all scan types finish at the same time we may want the client to start reporting status for each type of scan job separately thereby making it more response as opposed to waiting for all scans to complete. With such a workflow the scan ID would be needed to be per scan type since we would use the existing reporting mechanisms that fetch scan status based on scan identifiers.
If we all agree upon a workflow where-in the client would wait for all the component scans to finish before reporting a success/failure status then we can track it using a single scan Id.

### Scan API response handling
The scanning process would be executing multiple types of scans asynchronously and hence a sophisticated response handling mechanism would required.

For a given job, the multiplexer component would maintain a mapping between the composite scan Id and the granular status of each of the component scans as and when received from the scanner.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give in some more thoughts to this aspect. Lets discuss it in our meeting


The response from the scanner are packaged in an uber-container structure again keyed by the MIME type. The object against each MIME type would contain the following
- Internal scan tracking identifier for the composite scan.
- The scan id and status for each MIME type. The status would be either one of started, completed, running, aborted, failed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will give in some more thought to this aspect. Lets discuss this in a meeting.

- how is the error reported to the caller?
- what is the impact of an error on the overall scan job?

A clean, simple and efficient mechanism to handle errors would be as follows
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what exactly I have mentioned below. We will have an array of objects (one per MIME type) and the corresponding results/error data. Did you imply that some other detail was missing?
I will add a response type example for such a situation.

2. Adopts a "what you send is what you receive" model.
3. Allows clients to control what data they want from the scanner adapter - when the same scanner supports multiple types of scan functionalities
4. Provides a mechanism to fine tune scanning operation.
5. Incremental changes on the existing API model exposed by Harbor.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss this. will also like to know what @steven-zou thinks about this part.
from what I understand - the existing Harbor job service would only have to implement a thin aggregation layer where-in the a mapping of the scan request to it's corresponding scan type jobs are maintained.
The underlying service would still have the same flow. I am however yet to gain some insights in this area and hence any inputs from @steven-zou and @heww would also help clarify

| Mime Type | Type | Version | Status |
| --- | --- | --- | --- |
| application/vnd.security.vulnerability.report; version=1.1 | OS/Package vulnerability data | 1.1 | Released in Harbor 2.3.0 |
| application/vnd.security.bom.report; version=1.0 | Bill of Materials (BoM) | 1.0 | Proposed |
Copy link
Contributor

@zhill zhill Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several near-standard formats that would make sense for SBoM instead of a new proprietary one: SPDX and CycloneDX are the two with the most traction. Those will help with interoperability, and may include enough information for the FS verification type as well. SPDX in particular supports files and digests as long as you are ok with very large manifests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/interrogation-service Services like vulnerability scanning and compliance checking etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants