Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cache failing to resolve dependencies with name collision #2286

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mcfedr
Copy link

@mcfedr mcfedr commented Dec 7, 2017

Fixes #2150

Changes the cache dir to owner_name which should prevent 99% of cache failures.

The problem happens when I had github "mcfedr/library" ~> 1 in one project, and github "someone/library" ~> 1 in another project, as they both get caches in the folder library.

Now they will be cached separately, and nothing will break.

Bonus, when testing this I noticed a few cases of url that might have issues where the name isnt correctly resolved, largely going to be edge cases as I imagine nearly all deps are coming from github anyway, but the sort of thing that would be annoying when you run into it.

@mdiep
Copy link
Member

mdiep commented Dec 8, 2017

Thanks for the PR! We always appreciate when people take the time to fix issues that they encounter.

Unfortunately, this isn't the solution we've been hoping for for this particular issue. As noted in #569:

This is “by design” because it allows all repositories in a GitHub network (i.e., forks and their parents) to share the repository cache, but I agree that the evidenced behavior is incorrect and surprising.

The best way to fix this would be to share the repository across all projects with the same name—but to use a different remote for each project. This allows projects with the same name to share the same object database, potentially saving a lot of disk space and preventing some expensive clones, but would correct the issue you've run up against.

@mcfedr
Copy link
Author

mcfedr commented Dec 12, 2017

I thought about that as an option, but it seems more risky and error prone, for not much gain.

As I see the key problems being

  • Its possible that a fork of a project will have overlapping branch names
  • Its possible to have overlapping tag names

That means that it will affect enumerations of versions, and semvar checking. It might be possible to use the remote name prefix for all the git commands, but it certainly complicates the process. When the gain in most cases is 10s of megabytes, and also, people are probably not changing between forks very much, and if they are its unlikely to be more than 1 fork.

So this seemed like a simple way to make the cache unique and avoid these kind of issues.

I am actually now thinking that the non-github types should probably just use the whole url as the cache key.

I think reliability is the most important part of the cache, it shouldn't give the wrong result. This is particularly true for CI systems that might be automatically building and uploading to the store.

@mdiep
Copy link
Member

mdiep commented Dec 14, 2017

Hmm… it seems you're right as far as tags name go. Branches aren't typically a problem since they can be uniqued by origin.

Maybe we can share objects between repositories? That leads me to think that the best namespacing would be {cache}/{ProjectName}/{OwnerOrURL}.

@mcfedr
Copy link
Author

mcfedr commented Dec 15, 2017

Looking at git clone, I guess what you are looking for is the --reference <repo> option, which causes object sharing.

It comes with this warning about using shared objects,

NOTE: this is a possibly dangerous operation; do not use it unless you understand what it does. If you clone your repository using this option and then delete branches (or use any other Git command that makes any existing commit unreferenced) in the source repository, some objects may become unreferenced (or dangling). These objects may be removed by normal Git operations (such as git commit) which automatically call git gc --auto. (See git-gc(1).) If these objects are removed and were referenced by the cloned repository, then the cloned repository will become corrupt.

In my mind, this just bring me back to my previous points

  • this is complicated to implement (how do we decide which is the 'root' repo? how to handle deleted objects? or root repo being deleted?)
  • introduces unreliable behavior
  • too little gain
  • This is something that could be implemented retrospectively, right now there is a bug with caching that is causing lots of build failures

@mcfedr mcfedr force-pushed the fork_cache_fix branch 2 times, most recently from 8f5c926 to c267774 Compare December 15, 2017 13:44
@mdiep
Copy link
Member

mdiep commented Dec 15, 2017

Okay, I'm on board with using separate repos without worrying about sharing objects.

But I think we need two things to finish this up:

  1. Use {ProjectName}/{OwnerOrURL} instead of {OwnerOrURL}_{ProjectName}. I think this hierarchy will be more intuitive. Project names stand out more than owners and it makes more sense to group forks together than to group disparate projects from the same owner.

  2. We should try to handle the case where an existing cache exists. Maybe just change dependencies to repositories? That's more accurate now that there's a binaries cache anyway.

@mdiep
Copy link
Member

mdiep commented Dec 15, 2017

@ikesyo Do you have any concerns about this?

@mcfedr
Copy link
Author

mcfedr commented Dec 15, 2017

I've been thinking about how change over the current cache, should be simple enough to detect folder with the old name and rename it.

Can I suggest that we stick to flat folder names

  • To make it easier to move over from existing cache to new - otherwise not sure how to detect that there is a folder named project that is the old cache, so it should be moved to project/owner - How about I name it project_owner?
  • Also dont have to make sure that we create the project folder before cloning into the subfolder.

@jdhealy
Copy link
Member

jdhealy commented Dec 15, 2017

It comes with this warning about using shared objects,

NOTE: this is a possibly dangerous operation; do not use it unless you understand what it does. If you clone your repository using this option and then delete branches (or use any other Git command that makes any existing commit unreferenced) in the source repository, some objects may become unreferenced (or dangling). These objects may be removed by normal Git operations (such as git commit) which automatically call git gc --auto. (See git-gc(1).) If these objects are removed and were referenced by the cloned repository, then the cloned repository will become corrupt.

From what I've read, seems this scenario can be mitigated by the --dissociate flag on git clone --reference=<over there> for git versions above v2.3.0:

git clone --reference=«over there» learned the --dissociate option to go with it; it borrows objects from the reference object store while cloning only to reduce network traffic and then dissociates the resulting clone from the reference by performing local copies of borrowed objects.

@mcfedr
Copy link
Author

mcfedr commented Dec 15, 2017

I read about --dissociate, sounds like by using it you lose nearly all the reason for using it, obviously, less network traffic, but storage is the same as two separate clones. Again, seems like it would be perfectly possible to add at a later date as an improvement, but shouldn't be a requirement for fixing the conflicts situation that currently exists.

@ikesyo
Copy link
Member

ikesyo commented Dec 16, 2017

Okay, I'm on board with using separate repos without worrying about sharing objects.

@mdiep Then how about revisiting #2185? Which one do you prefer?

@mdiep
Copy link
Member

mdiep commented Dec 17, 2017

Then how about revisiting #2185? Which one do you prefer?

I'm not sure I have a preference. What do you think?

@mdiep
Copy link
Member

mdiep commented Dec 21, 2017

I've been thinking about how change over the current cache, should be simple enough to detect folder with the old name and rename it.

Can I suggest that we stick to flat folder names

To make it easier to move over from existing cache to new - otherwise not sure how to detect that there is a folder named project that is the old cache, so it should be moved to project/owner - How about I name it project_owner?

I think this is a mistake. Caches are just caches; it's fine for them to move if it's very infrequent. I think not reusing/moving the old cache is fine in order to change to an improved layout that fixes a bug. I don't really want to maintain the logic of moving old caches going forward. It adds a continued maintenance cost; I'd rather pay a one-time user cost.

I also don't think the method used to move the cache in 43db3dd is appropriate. It adds disk access and side effects into what should remain a pure method.

Also dont have to make sure that we create the project folder before cloning into the subfolder.

This shouldn't really be burdensome. We already need to validate that the Cache directory as a whole exists.

I think that the best solution here is to move the cache to ~/Library/Caches/org.carthage.CarthageKit/repositories/{ProjectName}/{OwnerOrURL}. That cleans things up quite a bit and makes the cache more navigable and debuggable.

@mcfedr
Copy link
Author

mcfedr commented Dec 24, 2017

So, playing with this a bit - here is the commit, on alt branch for the moment, with the cache in sub folders, mcfedr@5d0b692, but now I come down to editing cloneOrFetch because now this needs to have lots of extra logic

  • for the case that the cache doesnt exist - it needs to detect if the parent folder exists, and is a git repo, and delete it (this is the old cache)
  • when the dir does exist, is this exact dir a repo (isGitRepository returns true if the parent dir is a repo), otherwise delete it
    • for example Carthage/Carthage might has a dir Carthage in the repo

Again, this just feels overly complicated, it would be much simpler to use a flat folder that doesn't conflict with existing cache dirs, the cache is a machine readable cache, its not for people.

@mdiep
Copy link
Member

mdiep commented Jan 2, 2018

for the case that the cache doesnt exist - it needs to detect if the parent folder exists, and is a git repo, and delete it (this is the old cache)
when the dir does exist, is this exact dir a repo (isGitRepository returns true if the parent dir is a repo), otherwise delete it
for example Carthage/Carthage might has a dir Carthage in the repo

I don't think any of this needs to be done. That's why I suggested using ~/Library/Caches/org.carthage.CarthageKit/repositories instead of the existing ~/Library/Caches/org.carthage.CarthageKit/dependencies. Just leave the old cache at the old path and we have a shiny new one where we can do the right thing.

@tmspzz
Copy link
Member

tmspzz commented Mar 20, 2018

ping

@mcfedr
Copy link
Author

mcfedr commented Mar 27, 2018

Ill be honest, unfortunately, due to time constraints, I'm unlikely to do any more work on this, so please feel to pick up my fork and finish it, I think its pretty close to being ready to go.

@tmspzz
Copy link
Member

tmspzz commented Mar 27, 2018

No problem at all. Your effort will not go wasted.

@stale
Copy link

stale bot commented Jun 30, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache failing to resolve dependencies with name collision
5 participants