Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving on Filesystem #77

Open
nochso opened this issue Jan 25, 2016 · 6 comments
Open

Improving on Filesystem #77

nochso opened this issue Jan 25, 2016 · 6 comments

Comments

@nochso
Copy link
Contributor

nochso commented Jan 25, 2016

Currently the Filesystem class is only used by JsonReporter.
However I'd like to suggest extracting an interface from it and letting any scanner related classes use that.
I think it will definitely come in handy with the git version. For example you could avoid checking out entire working directories before scanning them. Instead just implement a GitFilesystem that has access to a repository and implements at least read and fileExists.

That way, scanning huge histories should become quite easy. Use getModifiedFiles of Gitter\Repository to find changed files only. Then use GitFilesystem to retrieve only modified files.
Should there be files that were not changed but still need analyzing (like super classes), GitFilesystem can easily fetch them when this is implement: #73

I'm not too set on "not checking the repository" out, so an alternative could be letting GitFilesystem handle the checkout or clone to a temp directory.

What do you think?

@tomzx
Copy link
Owner

tomzx commented Jan 26, 2016

I'm not too sure I see how that example you gave would work. My guess is that checking out a full git branch is probably more easy/effective than pulling out a list of files from a specific commit.

Overall, I don't see what we'd gain. If you can expose your idea a little more, maybe I'll understand what you're trying to achieve with it.

@nochso
Copy link
Contributor Author

nochso commented Jan 26, 2016

Assuming that

  • you want to analyze long histories of active projects
  • generally fewer files are modified than there are stale/untouched files
  • repositories could be of any size or age, might contain binaries, documentation, assets etc.

Analyzing those in one go, that's a lot of disk writes that are wasting time and I/O.

For example here: https://github.com/tomzx/gitter/blob/php-semver-checker/lib/Gitter/Repository.php#L628-L639
You're already using this to prepare a list of modified files. But then you check out everything anyway, read the whole working directory and filter it again.

Keeping the assumptions in mind, I'm sure running git diff-tree and a few git shows is more effective as you're only reading and git is pretty good at that.

Adding --name-status to git diff-tree lets you know whether a file was added, modified or deleted.
If it was added, we only need to get it out of R2. Deleted: get it out of R1. If it was modified we retrieve both versions, then parse and forget them.

You could even analyze repositories without ever checking anything out, keeping a bare repository which is quite handy.

Added bonus: Users can analyze without modifying their working directory at all making it friendlier.

Anyway, I'm more interested in allowing to do that by using a tiny Filesystem interface/abstraction so that I'm not forced to work with actual files.

@tomzx
Copy link
Owner

tomzx commented Jan 27, 2016

First, the reason I ask for the list of modified files is not to save time on checking out (which is generally very fast with git). The goal is to avoid READING/SCANNING all the files in the repository, which would take a lot of time and is essentially why this was put in place. On project like Symfony it would save a considerable amount of time from not having to parse ~5k files per compare operation.

Your idea sounds interesting in theory, but I'm skeptical we'll see any performance improvement in practice. It also seems to put a lot of burden on us to have to code a system which will do all these git operations while we can "already" just checkout the desired commits.

I'm however open to the idea, but I'd have to see it implemented and benchmarked.

Although, reading your last sentence, it sounds like you want to replace the Filesystem wrapper by something like Flysystem. Is that correct?

@nochso
Copy link
Contributor Author

nochso commented Jan 27, 2016

Well, it doesn't need to be a whole abstraction layer like Flysystem. I'll think about it some more and mess around with the -git version to see what might make sense.

@nochso
Copy link
Contributor Author

nochso commented Jan 27, 2016

The goal is to avoid READING/SCANNING all the files in the repository, which would take a lot of time and is essentially why this was put in place.

But by checking them out, you are reading all the files from the repository AND writing them to disk. On a related note: IMO both findFromString calls in git-checker's CompareCommand could be dropped right now for example. Can be expensive without exclude-* configuration and it should be enough to rely on getModifiedFiles.

@tomzx
Copy link
Owner

tomzx commented Jan 27, 2016

I should probably make it more clear that when I said read/scan, I meant the process of actually opening the file for reading and using PHPParser to parse its content.

I do think you are right that there could probably be some improvement done in order not to scan all the files in the included directory, but only those that are modified. We currently rely on the Finder to take care of exclusion for us, since the modified files can obviously return us an immense list of files we have no interest in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants