Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow project / package "duplicates" with the same VCS location #6533

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

sschuberth
Copy link
Member

Please have a look at the individual commit messages for the details.

This is analogous to `addPackages()` and will be used in an upcoming
test.

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
// Some source code repositories contain projects that are used as packages by other contained projects. So the
// same code is once seen as a project, and once as a package by ORT. Allow such "duplicates" as the ids
// actually refer to the same thing by requiring duplicates to refer to different VCS locations.
val realDuplicates = duplicates.filter { (_, duplicates) ->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oss-review-toolkit/core-devs, what do you thing about this? Should we rather only keep one of these (either the project or the package) instead of keeping both of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm specifically asking as I currently don't oversee all side-effects that might occur if we keep a project and a package with the same id in the result.

Copy link
Member

@fviernau fviernau Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's likely that assumption that Id's are unique are spread all over the place. To list a few, while guessing many further things can be found:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's likely that assumption that Id's are unique are spread all over the place.

Right. But some of these only assume ids to be unique within projects or packages, but not within the union of both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But some of these only assume ids to be unique within projects or packages, but not within the union of both.

Yeah, that's for sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ISTR that there was also a problem with the WebApp report when duplicate IDs were present: The JavaScript part uses an index-based mapping to reference objects, and if there were projects and packages with the same ID, this mapping got broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered as alternative solution: Making sure that the Identifier.type for projects differs from Identifier.type for packages?

Actually, I was playing around with similar ideas 3 years (!) ago:

  • In the identifier-types branch I tried to turn an Identifier's type into a sealed class with ProjectIdentifier and PackageIdentifier sub-classes.
  • In the multiple-identifier-classes branch the Identifier class itself is tried to be turned in a sealed class.

Both approaches had the goal to distinguish project and package ids, and to iterate over all known id types (e.g. for when statements).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @fviernau that there are probably lots of places in the code which assume that identifiers are unique and it could be very hard to find all of them, and with @oheger-bosch that likely the WebApp report and probably other report formats would break.
There are also situations where the same package could appear twice with different metadata (e.g. if different repositories are used), I'm not sure if we currently fail in this case or just ignore it.
I think properly addressing this issue would be a big effort, another approach than the sealed class you suggested could also be to extend the identifier with properties like isProject and provider details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think properly addressing this issue would be a big effort

Indeed. That's why I never completed my branches mentioned above...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also situations where the same package could appear twice with different metadata (e.g. if different repositories are used), I'm not sure if we currently fail in this case or just ignore it.

I verify this. We have a repo in which ORT is failing because three or four packages appear twice due to different metadata.

@sschuberth sschuberth force-pushed the lenient-project-package-duplicates branch from d9327fc to 7b1c5c9 Compare February 22, 2023 11:07
@@ -46,9 +46,18 @@ class AnalyzerResultBuilder {

fun build(excludes: Excludes = Excludes.EMPTY): AnalyzerResult {
val duplicates = (projects.map { it.toPackage() } + packages).getDuplicates { it.id }
require(duplicates.isEmpty()) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some story maybe worth sharing here: I recently scanned ORT with ORT locally,
it exited complaining in about duplicates in this function. I debugged it only quick and had to stop at some point due to other tasks. If I didn't make a mistake the finding was that there were multiple packages (not projects) with the same id which were completely identical except for on attribute. IIRC it was the homepage URL.

That's just FYI, didn't look into this PR deeply yet .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had the case of multiple packages with the same identifier. This happened when a package was referenced from multiple package managers using the same package type (for instance GoMod and GoDep). If the package metadata is not exactly the same, multiple Package instances are created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my above case the full identifiers were identical, including the type.

@sschuberth sschuberth marked this pull request as ready for review February 23, 2023 13:50
@sschuberth sschuberth requested a review from a team as a code owner February 23, 2023 13:50
@sschuberth sschuberth requested review from a team and removed request for a team February 23, 2023 13:50
@sschuberth sschuberth marked this pull request as draft February 27, 2023 10:00
@gmathiou4
Copy link

The change have been tested and is fine from our side

@sschuberth
Copy link
Member Author

sschuberth commented Mar 30, 2023

The change have been tested and is fine from our side

Thanks for the feedback. However, this PR will more likely evolve into the direction of dropping the duplicate package in favor of only keeping the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants