
Size of repository 1.3GB #473

Open

juangpc opened this issue Dec 14, 2021 · 18 comments

@juangpc
Collaborator

juangpc commented Dec 14, 2021

Hi all,

I'd like to bring up how big the brainstorm3 repository folder is (1.3 gigabytes). I'm not sure if this is a problem for everybody, or if I should bring it up as an issue at all. As always, I'm just trying to help make bs (even) better. If this is not the time, please feel free to close the issue.

[Screenshot: disk usage showing the brainstorm3 folder at 1.3 GB]

This only became clear to me after cloning the repo from scratch. Previously, updating through pulls, it didn't become so obvious or problematic, so perhaps some of you haven't noticed.
I think it is problematic in the sense that the actual code of the repo is 20 times smaller, and a lighter codebase is preferable. Granted, 1 GB of data is not limiting on today's computers either, but I still think it is desirable to aim for a lighter repository, or at least worth a conversation. On top of that, the main reason for such a big repository is that old versions of binary files are kept in the repo's history. Typically, binary files should be kept separate from the repo, we all know that, but I'd agree that brainstorm's nature is a bit particular in that sense. So I get it.
I'd propose to generate a "vault-previous-to-2022" type of branch where the history of these files could be kept as it is right now in master, and then relieve the master branch of such a big load.
A long-term solution would be to use the release functionality within GitHub. If needed, I could help out with that.

Interesting link: https://stackoverflow.com/questions/11050265/remove-large-pack-file-created-by-git

Thanks for reading.

@Edouard2laire
Contributor

Edouard2laire commented Dec 14, 2021

i just checked. my brainstorm folder is 3.37Go :o

du -h -d 1 brainstorm3/
 62M	brainstorm3//bin
 12K	brainstorm3//python
264K	brainstorm3//deploy
1,0M	brainstorm3//java
 24M	brainstorm3//defaults
 16M	brainstorm3//toolbox
696K	brainstorm3//doc
7,1M	brainstorm3//external
3,0G	brainstorm3//.git
3,2G	brainstorm3/

@ftadel
Member

ftadel commented Dec 15, 2021

This is also something that had been bothering me for a while:
every time I want to clone someone else's development branch, it downloads more than 1Gb of data... which quickly gets painful over a 4G connection :)
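As a stopgap for that pain while the history is still heavy, a shallow single-branch clone skips the old history entirely; a sketch (this only helps the initial download, it does not shrink the repository itself):

```shell
# Shallow clone: download only the latest snapshot of one branch,
# skipping the ~1 GB of history (useful over a slow connection).
git clone --depth 1 --single-branch \
    git@github.com:brainstorm-tools/brainstorm3.git
# Later, the full history can still be fetched on demand:
# git fetch --unshallow
```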

I could very efficiently pack it down from 1.4Gb to 200Mb using BFG:
https://rtyley.github.io/bfg-repo-cleaner/

$ java -jar bfg-1.14.0.jar --strip-blobs-bigger-than 1M brainstorm3.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

The annoying part is that this rewrites the entire commit history, making all the existing clones obsolete.
Everybody must delete their repositories and make new clones in order to get the new history...
Unless there is something that I did not understand?

We could notify our collaborators who have been committing to the brainstorm3 repo and they could simply update everything.
However, I'm wondering if this could be a problem for people who currently have branches with a lot of unmerged work in their forks (like @rcassani with the new database work)...

If I force push the current .git folder I have on my computer (200Mb), what would be the procedure to port all the development branches to this new compressed repo?

@ftadel
Member

ftadel commented Dec 15, 2021

I moved the compiled distribution (bin/R2020a/brainstorm3.jar) away from GitHub:
d5e19b0

The packed brainstorm3.git folder is now 85Mb (45Mb code + 40Mb history).
And after we clean it, it won't start growing again.

Counterpart: now the compiled bin/R2020a/brainstorm3.jar must be copied manually to the neuroimage server via FTP, and then moved to /root/brainstorm/bin/.

@juangpc
Collaborator Author

juangpc commented Dec 15, 2021

That is great. Thank you.

The annoying part is that this rewrites the entire commit history, making all the existing clones obsolete.
Everybody must delete their repositories and make new clones in order to get the new history...
Unless there is something that I did not understand?

My experience with rewriting history is that clones will remain sufficiently independent. If they pull from upstream master they'll get that branch updated but not obsolete. Importantly, it is worth noting that links to specific commits or lines of code (permalinks, etc.) will no longer work. That is inconvenient.

Forks under active development would probably need to rebase their work after you force-push your pruned master branch, to be able to merge their work easily later on. But that is not too painful. They will rewrite their own history, but I don't see that as too limiting.

I moved the compiled distribution (bin/R2020a/brainstorm3.jar) away from GitHub:

The way to go. 😄

Long term, I'd suggest using both the GitHub Actions and Releases functionalities, which exist in GitHub precisely to solve these problems we're having; on top of that, they are free for open-source projects.

Counterpart: now the compiled bin/R2020a/brainstorm3.jar must be copied manually to the neuroimage server via FTP, and then moved to /root/brainstorm/bin/.

If you want to keep the neuroimage server as the storage server for the binary files, you can upload the file with a specific action. For instance, every new change in the repo can automatically trigger a script to run. The server address, user name, password, protocol, and port can be stored in the Secrets section of this repository, to which only admins have access. Those secrets are exported as environment variables to any shell you are running in a GitHub action.

But perhaps it is a good point to start storing the binary files in the release section. The URL they are stored at is very predictable, something like: https://github.com/brainstorm-tools/brainstorm3/releases/download/<release-tag>/<name-of-file.extension>
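For illustration, a download from such a URL could be scripted as below; the tag v2022.01 and the asset name are hypothetical placeholders:

```shell
# Build the predictable release-asset URL and download it.
# REPO is real; TAG and ASSET are hypothetical examples.
REPO="brainstorm-tools/brainstorm3"
TAG="v2022.01"
ASSET="brainstorm3.jar"
URL="https://github.com/${REPO}/releases/download/${TAG}/${ASSET}"
echo "$URL"
# curl -L -o "$ASSET" "$URL"   # the actual download (needs the release to exist)
```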

Or it even makes more sense to use the release functionality in brainstorm-tools/bst-java repository.

In any of the previous cases you can automatically compile the .jar file and upload it.

Another benefit is that, as far as I know, there is no limit to the number of files you can keep in the download area of an open-source project. That is all provided by GitHub, so you can easily keep a compiled file for every single change in the java code.

@ftadel
Member

ftadel commented Dec 16, 2021

Forks under active development would probably need to rebase their work after you force-push your pruned master branch, to be able to merge their work easily later on. But that is not too painful. They will rewrite their own history, but I don't see that as too limiting.

@rcassani @Edouard2laire @HosseinShahabi @Moo-Marc @tmedani @LeoNouvelle @DavideNuzzi
How do you feel about this?

Importantly, it is worth noting that the links to specific commits or lines of code (permalinks, etc..) will no longer work. That is inconvenient.

All the commits will change ID?...
Not only the ones involved in the pruning?

Long-term, I'd suggest to use both github actions and releases functionalities which are here in github precisely to solve these problems we're having, and on top of that they are free for open-source projects.

I don't see how github actions/releases can help shrinking our history...
We want to distribute a single .zip file including: the latest commits on the .m scripts + the latest compiled Brainstorm version (which I compile from time to time only). This is not possible on GitHub, unless we create a new release at every commit, which does not seem like a good idea.

If you want to keep the neuroimage server as the storage server for the binary files you can upload the file with a specific action.

Uploading the compiled Brainstorm package on GitHub is not much easier than on the web server.
Plus the packaging script on the Neuroimage server does some additional stuff: it copies the license file to the website, removes some of the files from the GitHub repository, sets the execution right on the .sh and .command scripts, makes backups, etc.
This is executed daily from a cron job, or manually whenever needed after posting some updates by calling a PHP script (one click in my firefox toolbar). It's easy to work with the way it is now.
I was mentioning this upload topic not to complain about it, but to explain to everybody what happened with the compiled .jar :)

Or it even makes more sense to use the release functionality in brainstorm-tools/bst-java repository.
In any of the previous cases you can automatically compile the .jar file and upload it.

The repository brainstorm-tools/bst-java is recompiled only a few times per year and creates a file that is only 300 Kb (brainstorm3/java/brainstorm.jar). This is definitely not an issue.

The big .jar file that causes the dramatic history growth is the one created by the Matlab Compiler: the full compiled Brainstorm application. This is not something we can compile from GitHub.
https://neuroimage.usc.edu/brainstorm/Tutorials/Scripting#How_to_compile_Brainstorm

@juangpc
Collaborator Author

juangpc commented Dec 16, 2021

I see. Thanks for the detailed explanation.

I've been discussing this with Gabriel (@gabrielbmotta) and some of the ideas I share come from that conversation.

Two different problems: (1) reduce the repo size and (2) avoid this problem in the future.

Issue 1 is quite stimulating because it kind of goes against git's principles. Very cool.

Regarding your post.

Not only the ones involved in the pruning?

I'm applying logic here, so I'm not sure. But I'm assuming that if you modify a commit SHA, all the 'dependent' commits will also require modification. But there might be something I'm not understanding.

I don't see how github actions/releases can help shrinking our history...

It doesn't. I brought it up to try to solve 2.

We want to distribute a single .zip file including: the latest commits on the .m scripts + the latest compiled Brainstorm version (which I compile from time to time only). This is not possible on GitHub, unless we create a new release at every commit, which does not seem like a good idea.

Yes, it is possible! You can have a job run on almost anything, e.g. a defined schedule (say Mon-Wed-Fri :)), not only on a commit. It could be once a month, or you can have a specific named branch which, when pushed to, triggers a runner that does this for you, or a specific text tag on a commit, etc. One problem is that GitHub-hosted runners don't have Matlab installed, so that's a pain because you'd have to install Matlab and then run the compilation script. But for a project the size and complexity of bs, I think it is worth the long-term investment to set up a self-hosted runner: you have your own machine with Matlab installed, plus a client app that lets GitHub trigger a script on that machine, so Matlab is available there, no problem. Being an open-source project, we have to be careful with security. But that self-hosted runner can, only when you decide (not on every commit!), automatically check out the master branch, run the compile_bst.m script to generate the complete compiled Brainstorm application, and then push it as an "asset" to the release section of the repository. Think about it... one less thing to think of. :) (More than happy to help with that if needed.)
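A minimal sketch of such a workflow, assuming a self-hosted runner labeled `matlab`, a standing `nightly` release tag, and an output path `bin/brainstorm3.jar` (all three hypothetical):

```yaml
# .github/workflows/compile.yml -- hypothetical sketch, not a tested setup.
name: compile-brainstorm
on:
  schedule:
    - cron: '0 3 * * 1,3,5'   # Mon/Wed/Fri at 03:00 UTC
  workflow_dispatch: {}        # or trigger manually, only when you decide
jobs:
  compile:
    runs-on: [self-hosted, matlab]   # machine with Matlab installed
    steps:
      - uses: actions/checkout@v2
      - name: Compile Brainstorm
        run: matlab -batch "compile_bst"   # assumes compile_bst.m is on the Matlab path
      - name: Upload the result as a release asset
        run: gh release upload nightly bin/brainstorm3.jar --clobber
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

The point is only that the compile-and-publish step can be schedule- or event-driven rather than tied to every commit.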

Now... regarding the reduction of the size of the repo.

See this interesting discussion. Apparently it is not completely clear you can seamlessly reduce a repo size without rewriting history. https://stackoverflow.com/questions/17470780/is-it-possible-to-slim-a-git-repository-without-rewriting-history

Think about the first solution I mentioned: create a "vault"/"hibernate" repository in brainstorm-tools which clones the master branch as it is, and start a cleaner ~40 MB master branch without the old baggage, either the way you proposed or the way it is proposed in the previous link. Check this tool referenced in the GitHub docs: https://rtyley.github.io/bfg-repo-cleaner/ (you also shared it!)
One way to link two repositories would be this very neat trick: https://git-scm.com/book/en/v2/Git-Tools-Replace
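The git-replace trick can be sketched as follows: a slim clone grafts the old history back on only when it is explicitly fetched from a vault repository (the remote name and vault URL are hypothetical):

```shell
# In a slim clone, attach the archived history on demand with git replace.
# The vault repository URL is a hypothetical example.
git remote add vault git@github.com:brainstorm-tools/brainstorm3-vault.git
git fetch vault
# Graft the root commit of the slim history onto the tip of the vault history,
# so git log / git blame walk transparently into the old commits:
OLDEST_SLIM=$(git rev-list --max-parents=0 HEAD)
VAULT_TIP=$(git rev-parse vault/master)
git replace --graft "$OLDEST_SLIM" "$VAULT_TIP"
```

The replacement ref stays local unless pushed, so everyday clones never pay for the old history.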

Hope it helps.

@ftadel
Member

ftadel commented Dec 20, 2021

Yes, it is possible! You can have a job run on almost anything, e.g. a defined schedule (say Mon-Wed-Fri :)), not only on a commit. It could be once a month, or you can have a specific named branch which, when pushed to, triggers a runner that does this for you, or a specific text tag on a commit, etc. One problem is that GitHub-hosted runners don't have Matlab installed, so that's a pain because you'd have to install Matlab and then run the compilation script. But for a project the size and complexity of bs, I think it is worth the long-term investment to set up a self-hosted runner: you have your own machine with Matlab installed, plus a client app that lets GitHub trigger a script on that machine, so Matlab is available there, no problem. Being an open-source project, we have to be careful with security. But that self-hosted runner can, only when you decide (not on every commit!), automatically check out the master branch, run the compile_bst.m script to generate the complete compiled Brainstorm application, and then push it as an "asset" to the release section of the repository. Think about it... one less thing to think of. :) (More than happy to help with that if needed.)

But what is the advantage of having this done on GitHub rather on our server?

As a reminder: we want to keep this idea that general users need to register on our website rather than download/clone it directly from github. The github distribution is mostly for developers, not for general users. The registration and download statistics are major arguments for requesting public funding for this project :)

Think about the first solution I mentioned: create a "vault"/"hibernate" repository in brainstorm-tools which clones the master branch as it is, and start a cleaner ~40 MB master branch without the old baggage, either the way you proposed or the way it is proposed in the previous link.

As I reference commits in forum posts related to bug fixing, it would be good to keep the links to all the old commits working.
Could we keep all the URLs working with this solution?

@juangpc
Collaborator Author

juangpc commented Dec 21, 2021

But what is the advantage of having this done on GitHub rather on our server?

Simplicity? I mean, since we're all using github already, if we use github's (free) service you could give anyone in the development team the capacity to generate (from anywhere) these bst compilation files without any extra work.

Going back to the use of the "release" section in GitHub: regardless of the way we actually compile the binary files, if we were to use the release section we could automatically get a publication DOI for each stable release. Releases can be generated, as explained previously, manually or automatically, on a schedule or triggered by some event. This can make referencing in the academic context easier. There is a description of the process in GitHub's documentation. See:
https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

In relation to this, I think it is a mistake to continue uploading the compiled java code to brainstorm3/java, regardless of how small the files are or how infrequently they are updated. brainstorm-tools/bst-java is java code that generates compiled code that can be stored in the release section of that repository. Code repositories, I think, should be kept for text files containing code and code-related material, like documentation.

Going back to the actual problem we’re dealing with here: the size of the repo. Let me recapitulate:

It is clear by now that you cannot add/delete/modify any file in a repository without modifying its history. What we want is a slim repository without the file bin/R2020a/brainstorm3.jar or any reference to it in the history,
while keeping valid any permalink to code or files already shared in the tutorials, forum, etc.

We have a few options:

  • Create a new repository: I hear brainstorm4?
    brainstorm3 could be archived; this would render it read-only, but every link would be kept alive. By being archived, we would minimize confusion about which repository is in active development. See: https://docs.github.com/en/repositories/archiving-a-github-repository/archiving-repositories
    In the new repo we could start with a default “main” branch to be a little bit more modern than using the current “master”.
    To me, this seems a bit too much effort.

  • Another option: create a new "master" branch, rewrite its history and make it the default branch.
    In GitHub's API, links to files, lines of code, etc. seem to depend on 3 parameters: user/organization, repository, and commit SHA. So the branch name doesn't matter. In GitHub, the hash collision probability seems to be 1 in 10^48, so there is no problem having multiple branches sharing the SHA pool.
    We could create a new default "main" branch and then rename the current "master" as "main-vault", which would keep all the history of edits to each file. As long as you don't do git fetch --all, you will not have to deal with the 1.3 GB of old history, and all the links will continue to work. If someone downloads the vault branch by mistake, a simple git branch --delete main-vault will do. On top of that, if for any reason you need the history to blame some crash on someone, you can easily fetch and check out the vault branch.
    So we could try something like the following:

  1. Create a new branch: for instance “main” on top of current HEAD.
  2. Rewrite its history using this bfg-repo-cleaner tool.
  3. Make that branch the default one.
  4. Rename the current “master” branch as “main-vault”.
    All old links will continue to work, even with the new name.
  5. If we keep text files and only text files in the code repositories, this should never have to happen again.
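To illustrate why the renamed heavy branch would not burden everyday clones, a sketch with the branch names proposed above:

```shell
# Clone only the slim default branch; the heavy "main-vault" branch
# is never downloaded unless explicitly requested.
git clone --single-branch --branch main \
    git@github.com:brainstorm-tools/brainstorm3.git
cd brainstorm3
# Only if the full history is ever needed:
# git fetch origin main-vault:main-vault
```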

@Moo-Marc
Collaborator

Thanks @juangpc for suggesting this and working on a solution. Unfortunately I'm not a Git guru like @martcous, so I don't have much to contribute, but I just wanted to show my support for this effort. I think it's well worth it.

@martcous
Member

Great initiative Juan, and good idea to remove the JARs from the repository and its history!

I don't think there's any way to do this without rewriting the git history which would indeed fork the code base. I would create a new stable branch, and I believe a simple git merge --allow-unrelated-histories should update any dangling branch without significant conflicts (I just tried it in a simple repo, this should probably be tested and validated first though). Also, this should not affect most of our users that are installing Brainstorm from NeuroImage and updating it through the GUI.

Finally, like Juan alluded to, it could be a good time to rename the stable branch from master to main like the rest of the community is doing: https://github.com/github/renaming I don't think keeping the obsolete master branch is a good idea long term though, as that would not reduce the size of the repository.

@juangpc
Collaborator Author

juangpc commented Dec 23, 2021

Thank you @martcous !

After thinking about it for a while, I can offer a few specific ideas. I've created a testing repo and I think the approach could work.
First of all, this is quite an important change, so it's a good idea to be cautious.

I've been reading about the capacity to merge from unrelated histories, but I'm not sure how this can help. I think we can achieve a smaller repo size, while considering the following:

  1. Don't enter user-space. The main objective is to reduce the size of the repository so that it is more convenient for the developer. Great. But we have to find a way so that the end-user is not affected. (Here, by end-user we mean both the forum user who might click on a link to a certain commit, fix, or section of the code, and the developer who chooses not to deal too much with version-control matters.)
  2. Make life easier for the developer, ok, but without too much interference. Let's not change too many things.

Option A. Two branches: master & history
The worst part of this option is that you end up with two branches. I kind-of agree with this comment by @martcous :

I don't think keeping the obsolete master branch is a good idea long term though, as that would not reduce the size of the repository.

But I think that given git and github's apis, it is either we stick to 1.3GB, or we modify the history. And in order to modify the history while keeping the links and references valid, we need two branches.
So this is how to achieve it.

  1. Clone brainstorm3.git (prefer ssh):
git clone git@github.com:brainstorm-tools/brainstorm3.git
  2. Switch to the master branch.
  3. Create a new branch with a name like history, main-before-2022, or main-with-big-binary-files:
git switch -c history
git push origin history
  4. Delete all the branches in our local repo except the master branch. This step is important because in the next step we will rewrite history; by deleting the local branch we keep the cleaning tool from rewriting that branch's history too.
git switch master
git branch --delete history
git remote remove origin
  5. Use the bfg cleaner tool. As mentioned, this will modify the history of all local branches, so the ones that are not downloaded/fetched will not be affected.
cd ..
bfg --delete-files brainstorm3.jar brainstorm3
cd brainstorm3
git reflog expire --expire=now --all && git gc --prune=now --aggressive
  6. Add the remote repository back and force-push the modified master branch:
git remote add origin git@github.com:brainstorm-tools/brainstorm3.git
git push -f origin master

Interestingly, because of the way bfg works, github will think that changes have been made in the history branch.
7. At this point, all the developers can work perfectly fine with the master branch.
The history branch is there to support previous links, references, etc. in the forum, the issue section, and so on.
As mentioned, these links depend on the repository and commit, so there is no way around it.
If a developer fetches the history branch by mistake, it can be deleted either by removing and re-adding the remote:

git remote remove origin
git remote add origin git@github.com:brainstorm-tools/brainstorm3.git

or removing the references

git update-ref -d refs/remotes/origin/history
git gc --prune=now 
  8. In order to avoid confusion among developers, I would add a set of protection rules to the history branch: (i) only contributions through PRs; (ii) admins must also go through this rule.

Option B. Two branches: main and history, but resetting all previous contributions up to this point into a single commit. This reduces the history to practically zero and the repository size (if you only fetch the master branch) to ~35 MB. Follow this tutorial: https://git-scm.com/book/en/v2/Git-Tools-Replace

Notes.

  1. I'm not sure if bfg --strip-blobs-bigger-than 1M brainstorm3 would affect some anatomy files.
    If you check with the following command (macOS users have to brew install coreutils first) (source):
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 500 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

You'll see that we have blobs bigger than 1M that are not related to the compiled java file.

7e278956df14  646KiB external/CEDS64ML/x64/msvcp110.dll
72607432cd6d  756KiB defaults/anatomy/ICBM152/tess_cortex_pial_low.mat
d6e7b6896d68  756KiB defaults/anatomy/ICBM152/tess_cortex_pial_low.mat
dd484a5885c1  829KiB external/CEDS64ML/x64/msvcr110.dll
b7edb94384c4  2.8MiB defaults/anatomy/ICBM152/tess_aseg.mat
84cc80038f60  5.3MiB defaults/anatomy/ICBM152/subjectimage_T1.mat
f9d2566f5a02  5.3MiB defaults/anatomy/ICBM152/subjectimage_T1.mat
d84eb11f623a  5.3MiB defaults/anatomy/ICBM152/subjectimage_T1.mat
0e05f66254c0   14MiB defaults/anatomy/ICBM152/tess_cortex_pial_high.mat
fbb5f70957fc   32MiB bin/R2015b/brainstorm3.jar

I think that these .mat files would be better off in a different location (Neuroimage? or the assets section, for instance). It would be the perfect time to also get rid of them, and restart the repo with only minimal non-code content.
2. With bfg focused only on brainstorm3.jar-related blobs, I get the repo down to 88 MB.
3. I totally agree with @martcous: it is a good time to update the default branch from master to main. It will probably save headaches in the future.
4. All feature branches will have to rebase onto master after this. But that should not be too problematic.

Whatever you decide, I'm sure it will be the best option.

@ftadel
Member

ftadel commented Dec 23, 2021

Thanks for all this research!
This repository was initially a crude copy of the 1990s SVN repository; I did not think that far ahead when I first put all the Brainstorm files on GitHub...

New branch vs. new repository

We loosely discussed moving to brainstorm4 when releasing the new database system on which @martcous (before) and @rcassani (now) have been working. The management of the database on the hard drive will change substantially, from simple file storage with all the data structures loaded in memory and an unprotected protocol.mat file at the root of the data folder, to a SQLite database protocol.db allowing safe concurrent access by multiple users to the same database.
This will lead to some major code refactoring (currently in a branch managed by @rcassani), and the database folders will not be fully compatible with the older version (new possibilities, like having subfolders, will not be compatible with brainstorm3).

This major update will not be handled like the regular updates. Changing the "version" number (and the logo/splash screen and installation folder) would be a clear way to show users that something changed.

If we want to create a new repository brainstorm4, that would be a good time.
I guess this would happen sometime next year.

java/brainstorm.jar

I thought initially that we would need to link the correct version of brainstorm3/java/brainstorm.jar to each commit in the brainstorm3 repository. But looking at what has happened in the past years, we've only been adding classes and icons. I think a new version of brainstorm.jar should work with old versions of Brainstorm (as long as we keep compiling it with an old version of the JDK). So we could probably consider it an external library for which we don't need to keep a full history in the brainstorm3 repository.

What would be your suggestion for having the brainstorm.jar always available when you clone the repository, without having it directly saved in the repository? (like the other libraries in the brainstorm3/java folder)

Can we easily create a dependency link from the brainstorm3 repo to the bst-java/brainstorm repository?
And to the releases of that repository? (to grab directly a .jar file, not the release .zip)
The brainstorm.jar file is not part of the bst-java repository; it needs to be recompiled after committing to the bst-java repository (only the subfolder brainstorm), which could probably be done automatically, together with the creation of a new release, as Juan mentioned at the beginning of this thread. But then can we easily link the newly compiled brainstorm.jar from bst-java to brainstorm3/java/brainstorm.jar?
Note that setting up the compilation on the github servers might be a pain as well, as for the moment it is only available as a netbeans project. I'm not sure I want to spend a week on that right now.
Maybe we could compile it manually on our computers using netbeans, and then create a release manually to upload on github the resulting dist/brainstorm.jar?

(Note that in the future, we might not have much more than 5 commits per year for this file, so 330Kb*5=1.6Mb added every year. I would tend to consider this negligible and keep it as it is now, with brainstorm.jar committed directly like the other .jar libraries, but if you guys really want to do it properly, I'd follow you.)
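One hedged way to create such a dependency link is a git submodule, which pins an exact bst-java commit to each brainstorm3 commit (note this tracks the source, not the compiled .jar, so a compile or download step would still be needed; the destination path is a hypothetical choice):

```shell
# Sketch: track bst-java as a submodule inside brainstorm3.
# The destination path java/bst-java is a hypothetical example.
git submodule add https://github.com/brainstorm-tools/bst-java.git java/bst-java
git commit -m "Track bst-java as a submodule"
# Collaborators then clone with:
# git clone --recurse-submodules https://github.com/brainstorm-tools/brainstorm3.git
```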

Other libraries

Now that we have this plugin interface that allows downloading files dynamically on the fly, we could move most of the third-party I/O libraries from the external folder out of the main repository and turn them into plugins. In the external folder, we could get rid of (total ~5 Mb):

  • CEDS64ML
  • easyh5
  • edfimport-1.0.4
  • eeprobe
  • mne
  • son-2.32
  • ricoh
  • yokogawa

Additionally, we will soon be able to decommission the support for JOGL completely (needed for the old connectivity graph display). This is causing trouble on many computers, and the new graph display seems to be working OK (we haven't had bug reports about it in a while).

ICBM152 template

I think that these .mat files would be better off in a different location (Neuroimage? or the assets section, for instance). It would be the perfect time to also get rid of them. And restart the repo with only minimal non-code content.

We want the ICBM152 template, which is copied to all new protocols, to be completely versioned and fully linked with all the commit history. A minor change in these files (e.g. the number of vertices of the cortex surface used for source reconstruction) would impact the results of any group study. It would be good to have it copied into every clone instead of downloaded on the fly.

I could move these files to the neuroimage server and have the daily cron job copy them into the .zip package available on the download page. But it would make it more complicated for any user to reconstruct the software at any given point in history...

Can we have git automatically clone files from external links (copy files from the neuroimage server for example)?
And could the history of these links be kept?
(I could keep all the old versions of the ICBM152 on the neuroimage server available, with different URLs, but we need the correspondence to each commit to be kept)
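One way to keep that commit-to-file correspondence without storing the binaries in git: commit a small plain-text manifest of checksums and URLs, and fetch through it. Everything below (the function name, file names, URL, and manifest format) is a hypothetical sketch, and `sha256sum` is the GNU tool (on macOS: `shasum -a 256`):

```shell
# fetch_pinned <sha256> <url> <destination>
# Downloads a file and verifies it against the checksum recorded in git,
# so every commit pins exact versions of the external binaries.
fetch_pinned() {
    sha="$1"; url="$2"; dest="$3"
    mkdir -p "$(dirname "$dest")"
    curl -sL -o "$dest" "$url"
    echo "$sha  $dest" | sha256sum -c --quiet -
}
# Hypothetical usage, driven by a versioned manifest file:
# fetch_pinned 9f86d081...  https://neuroimage.usc.edu/.../subjectimage_T1.mat \
#              defaults/anatomy/ICBM152/subjectimage_T1.mat
```

Git LFS solves the same problem in a more integrated way, at the cost of running or paying for an LFS store; the manifest approach keeps everything in plain git and works with any HTTP server.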

Sorry, this is a bit drafty, I'm thinking out loud... I'm OK with moving these away from the GitHub repo, but I'm not sure in which direction I should go.

Reproducibility considerations

For full reproducibility of an analysis, we want the users to archive the version of the software they used for processing together with the dataset (and get a DOI attached to the whole thing). The ICBM152 template should always be included in the cloned brainstorm3 folder, and so should all the plugins it depends on.
This discussion reminded me that I should add a menu somewhere for users to create this full software package to be released with any article (brainstorm3 + anatomy template + downloaded plugins in $HOME/.brainstorm).

@juangpc
Collaborator Author

juangpc commented Dec 23, 2021

Thank you, I see. Thanks for all the background info.

If we want to create a new repository brainstorm4, that would be a good time.
I guess this would happen sometime next year.

That sounds great, and it makes a lot of sense. Brainstorm is now a mature project with thousands of things occurring all at once, and I'm very happy to be witnessing this. Through this conversation I'm gaining insight into what good version-control practices might mean, so thank you for that. While we're at it, allow me a minute to share my thoughts: this is brainstorm (without external/, though): 16MB of hard work...

Screen Shot 2021-12-23 at 3 46 47 PM

Perhaps only slightly related to the current issue, but I think it is worth thinking about how 16MB can turn into 1.3GB. The conversation might be helpful in the near future, with either the new brainstorm4 repo or whatever you decide to do, but especially in relation to the new database changes. So I would not push database binary files, I would enforce that pushes to the main branch always go through a PR+review process, and I would avoid committing very long files such as updates.txt or log files.

...

But, going back to the actual issue at hand. If you go for the new-repo option, that would solve the repository size problem. You could (1) archive brainstorm3, so that all existing links are preserved and confusion is avoided, and (2) rewrite history in brainstorm4 so that the repo shrinks to ~90MB or less. At some point you'll face this decision: whether to keep all the commits or to squash them into a single one. You could set an initial commit in brainstorm4 that squashes all contributions to brainstorm3; squashing the whole history would probably bring the repo down to the ~25-30MB area. I'm not sure which is the best option; I'm just sharing some decisions that will have to be made eventually. Personally, I'd keep the history but avoid any non-text file, which would probably shrink the repo to the ~30MB area. For a repository of brainstorm's moderate importance, that seems fair.
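The squash-into-one-commit option can be sketched entirely offline. The demo below uses a throwaway stand-in repository (names and paths are illustrative, not the real brainstorm3): it keeps the current working tree, drops the `.git` history, and re-initializes with a single commit.

```shell
set -e
# Local demo (stand-in repo, no network): squashing a repo's history
# into a single initial commit for a hypothetical brainstorm4.
tmp=$(mktemp -d)
cd "$tmp"
git init -q brainstorm3-demo && cd brainstorm3-demo
git config user.email demo@example.com
git config user.name demo
echo "v1" > brainstorm.m && git add -A && git commit -qm "commit 1"
echo "v2" > brainstorm.m && git add -A && git commit -qm "commit 2"
cd "$tmp"
# "brainstorm4": keep the working tree, drop the history, re-init
cp -r brainstorm3-demo brainstorm4-demo && rm -rf brainstorm4-demo/.git
cd brainstorm4-demo && git init -q
git config user.email demo@example.com
git config user.name demo
git add -A && git commit -qm "Initial commit: squash of brainstorm3 history"
git rev-list --count HEAD   # prints 1
```

The resulting repository contains the latest files but only one commit, which is what makes the clone so much smaller.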

What would be your suggestion for having the brainstorm.jar always available when you clone the repository, without having it directly saved in the repository? (like the other libraries in the brainstorm3/java folder)

First thing I think of is git submodules, but I wouldn't go that route: the mix of languages (Matlab vs. Java) and the difficulty of operating git for inexperienced users make me think you should just keep them separated. Forgive me for replying with a question: why would you want a binary file with compiled code to always be available when you clone the repo? A version control system is a great tool for maintaining versions of code, but maybe not so much for deploying an application. You don't need compiled Java code to maintain versions of a Matlab codebase; you need it to run the application. So make the brainstorm.m script download it from a URL (webread and so on). And I'd make a design decision and then always stick to it, for simplicity: if the Neuroimage server is the way to go, then use the Neuroimage server for all downloads, with the GitHub releases sections (in the brainstormN and bst-java repos) as backup for whenever Neuroimage goes down.
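The "download at startup if missing" logic is simple; a shell sketch of what brainstorm.m could do (the path and the URL in the usage comment are illustrative assumptions, not the real locations):

```shell
# Fetch a jar only when it is absent; otherwise do nothing.
ensure_jar() {
  jar=$1
  url=$2
  if [ -f "$jar" ]; then
    echo "already present"
  else
    echo "downloading"
    curl -fsSL -o "$jar" "$url"
  fi
}
# Usage (hypothetical URL):
# ensure_jar java/brainstorm.jar https://neuroimage.usc.edu/bst/brainstorm.jar
```

Re-running it is cheap, so it can safely be executed on every startup.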

To upload to a GitHub release you can either drag and drop or use a script (which can also be triggered automatically by a GitHub action, either on a calendar schedule or whenever there is an event). I just created this release for demo purposes. Here you can see how we upload a tar file to the release section through a GitHub action.

Can we easily create a dependency link from the brainstorm3 repo to the bst-java/brainstorm repository?
And to the releases of that repository? (to grab directly a .jar file, not the release .zip)
But then can we easily link the new compiled brainstorm.jar from bst-java to brainstorm3/java/brainstorm.jar?

See the linked release I just created. The release-related URLs are immutable and point directly to the jar file. Not only that: you can link either to a specific release or to the latest stable release, and the URL will resolve to whatever release is the latest.
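GitHub serves a stable redirect for the assets of the most recent release, so the URL can be hard-coded once. A sketch of the pattern (the repo and asset names below are examples, assuming the jar ends up published as a bst-java release asset):

```shell
# Build the stable "latest release asset" URL from its parts.
owner=brainstorm-tools
repo=bst-java
asset=brainstorm.jar
url="https://github.com/$owner/$repo/releases/latest/download/$asset"
echo "$url"
# Then, e.g.: curl -L -o "$asset" "$url"
```

The `-L` flag matters because GitHub answers with an HTTP redirect to the actual asset.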

Maybe we could compile it manually on our computers using netbeans, and then create a release manually to upload on github the resulting dist/brainstorm.jar?

Drag and drop in github's page. See: https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository

Related to the ICBM152 template: perhaps making a brainstorm-tools/anatomy repository? I'm not sure. My opinion should be clear by now, but it might not be the best option; to every good-practice rule there is a suitable exception. You surely know better.

Can we have git automatically clone files from external links (copy files from the neuroimage server for example)?
And could the history of these links be kept?

You can make GitHub run a script on a schedule and have that script do that, for sure. Here is an example of a script which downloads some files from Dropbox (see). Alternatively, you can implement a git post-checkout hook to run another script, but hooks are complicated because (i) they depend on each user's configuration and (ii) they are difficult for inexperienced users. So I'd go with actions.
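A scheduled workflow of that kind is a small YAML file in the repo. A minimal sketch (the file name, schedule, and script path are assumptions for illustration):

```yaml
# .github/workflows/sync-assets.yml (hypothetical)
name: Sync external assets
on:
  schedule:
    - cron: '0 4 * * *'    # every day at 04:00 UTC
  workflow_dispatch: {}     # also allow manual runs
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Download files from the external server
        run: ./scripts/download_assets.sh   # hypothetical script
```

The `workflow_dispatch` trigger is convenient during testing, since it lets you run the job on demand from the Actions tab.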

Related to the reproducibility considerations: I would add to your proposition some sort of report similar to Matlab's version report (ver), where you get all the version numbers, dates, and commit hashes (and, for compiled languages, the build date) of the application and all the plugins/dependencies. I don't think this is an easy task. During the past year I learned how, up until recently, deploying Python projects was difficult and problematic, and only somewhat solved with the concept of an "environment" (by Anaconda, for example). Or think about the software container industry. Tracking the versions of dependencies is a really complex issue, but also a necessity in mature projects. Really worth thinking about long-term.

@ftadel and all, thanks. It's been great to find possible solutions to this issue.

@ftadel
Member

ftadel commented Dec 27, 2021

Thank you for all these suggestions!

I started implementing some of them in order to move all the binary files out of the github repository:

  • I removed the JOGL support (and therefore the old connectivity graph display) as this was going to cause quickly more and more maintenance problems, and it was already a pain to deploy to the HBP servers: https://neuroimage.usc.edu/forums/t/brainstorm-on-ubuntu-20-04-bug-with-matlab-jogl/31558/8
  • I moved the brainstorm.jar file to the bst-java repository. We'll think about cleaning the bst-java repository some other time if we ever need it... At the moment, its size is much less critical.
    • The file brainstorm-tools/bst-java/dist/brainstorm.jar must be copied into brainstorm-tools/brainstorm3/java/brainstorm.jar
    • For the Brainstorm package downloaded from the website: this is done by a cron job on the neuroimage server
    • For a git clone: brainstorm.m now downloads brainstorm.jar at startup if it is missing
  • I removed the years in the license notice at the top of all the files, so that we don't have to update this every year anymore. This massive recommitting of all the files I just did will be the last one, now we'll rely only on the authors information+date.

The next things on my todo list:

  • Move the ICBM152 anatomy files to separate repository (and maybe all the eeg default positions): I need to think a bit more about this, because I need to update it at the same time (I reprocessed the ICBM152 brain with FreeSurfer 7.2 and I'll add more volume and surface atlases to the basic template)
  • Replace some of the external subfolders with new plugins.

For the actual repository cleaning (new repo vs. new branch) we'll discuss with @rcassani and the rest of the team at our next weekly meeting in January.

I would add to your proposition, to have some sort of report similar to matlab's version report (ver) where you get all the version numbers, dates and all the commits hash-codes (and if compiled language, the build date) of the application and all the plugins/dependencies.

I already started something like this with the first version of the plugin manager. I need to find a way to get the last commit hash, instead of github-master. Do you know how I can do this using only web interfaces by any chance? (as we can't expect git to be available on a regular Windows computer)
image

@martcous
Member

That was quick, great work, François!

I need to find a way to get the last commit hash, instead of github-master. Do you know how I can do this using only web interfaces by any chance? (as we can't expect git to be available on a regular Windows computer)

You could use GitHub's public API and parse the JSON result:
GET https://api.github.com/repos/<owner>/<repository>/commits/<branch>
Field sha contains the latest commit hash. Get the first 7 characters to get the shortened hash.
You can also send the header Accept: application/vnd.github.VERSION.sha, and then the above URL will return only the hash. I'm not sure we can do that with Matlab, however.

Example: https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master
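If you ever need the same thing from a plain shell (e.g. in the cron job on the neuroimage server), the short hash can be cut out of the JSON reply without any JSON tooling. A sketch, where the sample string stands in for the actual API response:

```shell
# Extract the 40-char commit SHA from a GitHub API JSON reply,
# then shorten it to the usual 7 characters.
json='{"sha":"3545064183474cfb859d7a142062373b64372498","node_id":"C_kw..."}'
sha=$(printf '%s' "$json" | sed -n 's/.*"sha":"\([0-9a-f]\{40\}\)".*/\1/p')
short=$(printf '%s' "$sha" | cut -c1-7)
echo "$short"   # prints 3545064
```

In a real script, `json` would come from `curl -s` against the commits URL shown above.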

@juangpc
Collaborator Author

juangpc commented Dec 27, 2021

Yes, apparently it is as easy as @martcous points out. You just make an HTTP request:

uri = matlab.net.URI("https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master")
request = matlab.net.http.RequestMessage;
r = send(request,uri)

r.Body.Data

shows...

>> r.Body.Data

ans = 

  struct with fields:

             sha: '3545064183474cfb859d7a142062373b64372498'
         node_id: 'C_kwDOBEMw89oAKDM1NDUwNjQxODM0NzRjZmI4NTlkN2ExNDIwNjIzNzNiNjQzNzI0OTg'
          commit: [1×1 struct]
             url: 'https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/3545064183474cfb859d7a142062373b64372498'
        html_url: 'https://github.com/brainstorm-tools/brainstorm3/commit/3545064183474cfb859d7a142062373b64372498'
    comments_url: 'https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/3545064183474cfb859d7a142062373b64372498/comments'
          author: [1×1 struct]
       committer: [1×1 struct]
         parents: [1×1 struct]
           stats: [1×1 struct]
           files: [1×1 struct]

so, then

sha = r.Body.Data.sha

gives you the string.

I'm not sure we can do that with Matlab however.

uri = matlab.net.URI("https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master")
request = matlab.net.http.RequestMessage
request = request.addFields(matlab.net.http.HeaderField("Accept", "application/vnd.github.VERSION.sha"))
r = send(request,uri)
sha = r.Body.Data

According to the docs, the matlab.net.http.ResponseMessage class was introduced in 2016.

@ftadel
Member

ftadel commented Dec 28, 2021

Thanks!

Implemented here: 101e36f

image

image

@ftadel
Member

ftadel commented Feb 28, 2022

While this is still a work in progress, I just wanted to share the fact that it is possible to clone a repository without its history (and avoid downloading gigabytes of useless, outdated binary files). By adding --depth=1, you get only the tip of a given branch. The command below downloads only 74MB:
git clone --depth=1 https://github.com/brainstorm-tools/brainstorm3.git

This is definitely useful for testing pull requests...
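A shallow clone is not a dead end either: the full history can still be fetched later. The offline demo below (a throwaway stand-in repository, no network) shows both halves of that workflow:

```shell
set -e
# Local demo: a --depth=1 clone gets one commit; --unshallow
# recovers the rest of the history afterwards.
tmp=$(mktemp -d) && cd "$tmp"
git init -q upstream && cd upstream
git config user.email demo@example.com && git config user.name demo
for i in 1 2 3; do echo "$i" > f; git add f; git commit -qm "commit $i"; done
cd "$tmp"
# file:// forces the smart transport, so --depth is honored locally
git clone -q --depth=1 "file://$tmp/upstream" shallow
cd shallow
git rev-list --count HEAD   # prints 1
git fetch -q --unshallow
git rev-list --count HEAD   # prints 3
```

So a contributor who started from a shallow clone can still "upgrade" it in place instead of re-cloning.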
