[Discussion] Findings on Discrepancy Assessments within the SBOM Ecosystem. #905

Open · dw763j opened this issue Apr 11, 2024 · 7 comments

dw763j commented Apr 11, 2024

Assessment results on discrepancies in the SBOM ecosystem, with some suggestions

Background

As SBOMs are widely used in software supply chain management, the capabilities and issues within the SBOM ecosystem directly influence user adoption, so an accurate assessment of the current state of SBOM is important. To this end, we have conducted a series of assessments of key characteristics of SBOM applications to reveal potential discrepancies hindering usage.

Questions

We asked 3 questions:
1. Compliance: Do SBOM tools generate outputs that adhere to user requirements and standards?
2. Consistency: Do SBOM tools maintain consistency when transforming the produced SBOMs?
3. Accuracy: How accurately do the SBOMs produced by tools reflect the actual software?

We assessed these questions on 9,970 SBOM documents generated by 6 SBOM tools (sbom-tool, ort, syft, gh-sbom, cdxgen, and scancode) in both SPDX and CycloneDX formats across 1,162 GitHub repositories. To evaluate accuracy, 100 repositories were annotated as a benchmark, comprising 660 components and 4,000 data fields.

Results

The table below shows the average results across all 6 tools, at the package level. Note that results for information about the software itself are quite poor: for instance, 89.59% of the repositories contain licenses, yet only a minority are identified.

| Attr.       | pkg_name | version | author | purl   | license | copyright |
|-------------|----------|---------|--------|--------|---------|-----------|
| Compliance  | 79.61%   | 74.99%  | 17.84% | 67.53% | 32.34%  | 14.17%    |
| Consistency | 18.44%   | 22.24%  | 0.11%  | 24.99% | 2.12%   | -         |
| Accuracy    | 25.81%   | 10.66%  | 4.94%  | -      | 10.66%  | -         |

The findings indicate that while SBOM tools fully (100%) support the mandatory requirements of the standards (Doc.: specVersion, License, Namespace, Creator; Comp.: Name, Identifier, downloadLocation, verificationCode), their support for user use cases is at 49.37%, and consistency within these supported use cases averages 17.63% (as the table shows). Accuracy assessments reveal significant discrepancies, with accuracy rates of 8.62%, 25.81%, and 12.3% for software metadata, identified dependent components, and detailed component information respectively, underscoring substantial areas for improvement within the SBOM ecosystem.

Suggestions

  1. In component sections, some tools record the package name together with its information source (e.g. pip, maven, npm), while others do not. Tools also vary in how they record versions, e.g. whether they prepend a 'v' to the version string; this leads to problems when combining SBOMs from different tools. We suggest requiring tools to specify their recording pattern for any information the standard does not explicitly specify (see the sketch after this list).
  2. The meanings of NOASSERTION, NONE, and None can be confusing in specific data fields. For instance, a version can naturally be empty when developers did not record it in the software; tools turn such empty values into an empty string or one of the three forms above, which leads to inconsistency in further exchange. We suggest providing specific markers for these naturally empty data fields.
  3. For hashes, we found that different tools using the same hash algorithm on the same single file produce different checksums in SPDX; there is not even a consistent checksum across all the software and packages. In CycloneDX, the hashes do not even specify the object the hash was computed over. We suggest requiring tools that create checksums to explicitly describe their process, e.g. salt values or other preprocessing.
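
As a rough illustration of suggestions 1 and 2, a minimal normalization sketch that could be applied before comparing SBOMs from different tools. The prefix styles and empty-value markers below are assumptions for illustration, not an exhaustive list:

```python
# Sketch: normalize package names, versions, and "empty" markers before
# comparing SBOMs from different tools. The prefixes and markers here are
# illustrative assumptions, not a complete list.
ECOSYSTEM_PREFIXES = ("pip:", "maven:", "npm:")   # assumed prefix styles
EMPTY_MARKERS = {"", "NONE", "None", "NOASSERTION"}

def normalize_name(pkg_name: str) -> str:
    """Strip an ecosystem prefix such as 'pip:' if a tool added one."""
    for prefix in ECOSYSTEM_PREFIXES:
        if pkg_name.startswith(prefix):
            return pkg_name[len(prefix):]
    return pkg_name

def normalize_version(version):
    """Map the various 'empty' forms to None and drop a leading 'v'/'V'."""
    if version is None or version.strip() in EMPTY_MARKERS:
        return None  # naturally empty vs. unknown: indistinguishable today
    return version.lstrip("vV")

assert normalize_name("pip:requests") == "requests"
assert normalize_version("v2.31.0") == "2.31.0"
assert normalize_version("NOASSERTION") is None
```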

We hope our findings can help improve the SBOM ecosystem; any questions or discussion are welcome.

Fast check code

We provide fast-check code here, based on part of our dataset.

Examples:

For checksums, here are examples where the file and hash algorithm both match, yet the tools still did not produce the same checksums:
72bbf30067c969df63b17327473ccd3
c136b07d260c9e95beb14d1c75fb80e
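
For reference, a checksum computed directly over the raw file bytes, with no preprocessing, should be identical across tools; divergent results imply undocumented preprocessing. A minimal sketch (the file path in the usage line is hypothetical):

```python
import hashlib

def file_checksum(path: str, algorithm: str = "sha1") -> str:
    """Compute a checksum over the raw file bytes, with no preprocessing.

    Any two tools doing exactly this must agree; divergent results imply
    undocumented preprocessing (salting, normalization, etc.).
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare the digests reported by two SBOM tools
# against this baseline to see which one deviates.
print(file_checksum("setup.py", "sha1"))
```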

kestewart added this to the 3.1 milestone Apr 14, 2024
dw763j (Author) commented Apr 16, 2024

We have made some changes to this issue to clarify the details and the code.

rnjudge (Contributor) commented Apr 24, 2024

Thanks for the analysis @dw763j - very interesting. This would be a great analysis to have at our next SPDX DocFest, since some of the analysis and suggestions seem more relevant to the specific tooling used to generate the SBOMs than to the SPDX spec itself. I have also tried to respond to some of your suggestions below:

> In component sections, some tools record the package name together with its information source (e.g. pip, maven, npm), while others do not. Tools also vary in how they record versions, e.g. whether they prepend a 'v' to the version string; this leads to problems when combining SBOMs from different tools. We suggest requiring tools to specify their recording pattern for any information the standard does not explicitly specify.

This seems to me like a general ecosystem naming problem. Having hard naming requirements in SPDX would make SPDX too rigid and hard to adapt. Encouraging tools to try to use the same naming conventions would definitely be helpful, though. If you are ever able to attend the SPDX implementers call every other Wednesday morning, this would be a good topic to discuss there.

> The meanings of NOASSERTION, NONE, and None can be confusing in specific data fields. For instance, a version can naturally be empty when developers did not record it in the software; tools turn such empty values into an empty string or one of the three forms above, which leads to inconsistency in further exchange. We suggest providing specific markers for these naturally empty data fields.

The package version field can be omitted if the tool finds nothing. It is not required to use NONE or NOASSERTION. As you mentioned, there are a variety of reasons the version might be empty (e.g. how the package was built, the package manager used, the tool used, etc.). If you know and want to indicate the specific reason, I suggest using the package comment field to explain the empty/omitted field. If the field is empty or omitted, no inference can be made as to the reason, which is likely unknown by the tool generating the SBOM. Regardless of the reason the field is empty/omitted, the effect is the same for anybody consuming an SBOM in that it is an empty/unknown value.
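
As a sketch of that suggestion (field names follow the SPDX 2.3 JSON serialization; the values are hypothetical), a package with an omitted version and an explanatory comment might look like:

```python
# Sketch: an SPDX 2.3-style package entry (a Python dict mirroring the
# JSON serialization) with the version omitted and the reason recorded in
# the package comment field. All values here are hypothetical.
package = {
    "SPDXID": "SPDXRef-Package-example",
    "name": "example-package",
    # "versionInfo" is intentionally omitted: no version metadata was found.
    "comment": "Version omitted: no version metadata was present in the "
               "package manifest at analysis time.",
    "downloadLocation": "NOASSERTION",
}
```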

> For hashes, we found that different tools using the same hash algorithm on the same single file produce different checksums in SPDX; there is not even a consistent checksum across all the software and packages. In CycloneDX, the hashes do not even specify the object the hash was computed over. We suggest requiring tools that create checksums to explicitly describe their process, e.g. salt values or other preprocessing.

In SPDX 3.0, for any Element you may have, there's an optional verifiedUsing property which you can use to provide an IntegrityMethod with which the integrity of an Element can be asserted. In this field you can provide a Hash and specify the HashAlgorithm used. SPDX can't force tools to provide this information, but we encourage tools to do so by having fields in the spec for them to communicate this information.
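
For illustration, a minimal sketch of what this could look like, built as a Python dict mirroring the SPDX 3.0 JSON-LD serialization. The type and property names are my reading of the 3.0 model, and the hash value is hypothetical:

```python
# Sketch: an SPDX 3.0-style element carrying a verifiedUsing integrity
# method. Type/property names follow my reading of the 3.0 model; the
# spdxId and hashValue are hypothetical.
element = {
    "type": "software_Package",
    "spdxId": "https://example.org/spdx/pkg-example",
    "name": "example-package",
    "verifiedUsing": [
        {
            "type": "Hash",
            "algorithm": "sha256",
            "hashValue": "9f86d081884c7d659a2feaa0c55ad015"
                         "a3bf4f1b2b0b822cd15d6c15b0f00a08",
        }
    ],
}
```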

Thanks again for your detailed analysis. I'd love to continue the discussion in the Implementers call or on this issue. cc'ing @goneall to see if he has other comments on the matter.

dw763j (Author) commented Apr 25, 2024

Thanks for your reply @rnjudge; we are interested in improving the applications of SBOMs. Let's discuss your points in detail.

> Having hard naming requirements in SPDX would make SPDX too rigid and hard to adapt.

Agreed. While more mandatory requirements could lead to acceptance issues, especially for tools, we can instead provide "suggested" use cases for tools and users to follow. Our analysis indicates that SBOMs can fully support the mandatory requirements from the standard, but only partially support certain use cases. Tools often have discrepancies in these partially supported use cases due to the lack of official, detailed suggestions for these scenarios. Providing "suggested" use cases can guide tool implementation and potentially enhance the performance of SBOMs in real-world applications.

> If you are ever able to attend the SPDX implementers call every other Wednesday morning, this would be a good topic to discuss there.

I'm interested in participating in the next call 😊.

> Regardless of the reason the field is empty/omitted, the effect is the same for anybody consuming an SBOM in that it is an empty/unknown value.

That's correct. In terms of consumption, this is not an issue. However, in the context of transformation, where consistency across tools is crucial, the distinction between None, NOASSERTION, and the empty string "" can lead to misunderstandings. For example, some tools set the field to "" when they can't find any information, when it should be recorded as None or NOASSERTION for consistency. Thus, we think the empty "" and unknown values should be distinguished.

> In SPDX 3.0, for any Element you may have, there's an optional verifiedUsing property which you can use to provide an IntegrityMethod with which the integrity of an Element can be asserted.

As you mentioned, this is an excellent way to avoid discrepancies in fields like hashes. However, developers may not be aware of this feature. Directly suggesting improvements for use cases such as the hash field could be beneficial. SPDX 3.0 provides profiles such as Software, Security, Licensing, and more; making specific use case suggestions could help tools focus their development efforts.

> I'd love to continue the discussion in the Implementers call or on this issue.

Thank you for your reply and participation in the discussion. We hope our findings can be beneficial to the community. We are considering making a pull request to clarify our suggestions.

goneall (Member) commented Apr 25, 2024

Having run across SBOMs with a wide range of quality in my day job, I'm quite interested in any efforts to reduce the inconsistencies.

I really appreciate the analysis done - this will really help put data behind some of the solution discussions.

I have several thoughts (too many to list here), but one thing I'd like to offer for consideration is the creation of specific profiles that change the mandatory field requirements if the producer claims to support the profile. For example, requiring checksums on artifacts. This would make it easy for producers to set expectations on SBOM quality for consumers and transform utilities.
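
As a rough sketch of that idea (the profile name and field names are hypothetical, not anything in the spec), a consumer-side validator could tighten its requirements when a producer claims a profile:

```python
# Sketch: tighten validation when a producer claims a (hypothetical)
# "checksummed-artifacts" profile. All field names are illustrative only.
def validate(sbom: dict) -> list[str]:
    errors = []
    if "checksummed-artifacts" in sbom.get("profiles", []):
        for pkg in sbom.get("packages", []):
            if not pkg.get("checksums"):
                errors.append(f"{pkg.get('name', '?')}: profile requires "
                              "a checksum, but none was provided")
    return errors
```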

I look forward to our next discussion / DocFest where we can discuss in real time.

Just the package version alone is worth some time discussing. I've found wild inconsistencies in SBOMs - some are due to tooling omissions and some are due to the information just not being available. I'm thinking that taking into account the SBOM type and the primary package purpose could be used to identify whether the version information "should" be available. Anyway - perhaps a better real-time discussion.

dw763j (Author) commented Apr 29, 2024

I see the arrangement in tech-team-meetings, but what is the exact date 🧐? @goneall @rnjudge

Implementers group meetings

goneall (Member) commented Apr 29, 2024

@dw763j - This week - May 1.

goneall (Member) commented Apr 29, 2024

BTW - I'll be 30 minutes late to this week's call.
