HTML manuscript not updating #1177

Open
agitter opened this issue Oct 5, 2022 · 7 comments

@agitter
Collaborator

agitter commented Oct 5, 2022

The last version of the HTML manuscript is from August 17, 2022. gh-pages branch commits 471a614 and c269dc0 failed to update. Both have build logs that end with the messages

Current status: deployment_in_progress
Current status: 
Timeout reached, aborting!
Error: Timeout reached, aborting!
Deployment cancelled with https://api.github.com/repos/greenelab/covid19-review/pages/deployment/cancel/471a61443203f082187a2a4e05b491674264152d

The last successful HTML deployment logged a similar failure, but without the cancellation message:

Current status: deployment_in_progress
Timeout reached, aborting!
Error: Timeout reached, aborting!

I'm not sure how to debug this, so I asked for advice in the GitHub community discussions. My initial guess is that the content on our gh-pages branch has gotten too large.

@rando2
Collaborator

rando2 commented Oct 5, 2022

Oh no!!! If that's the case, I wonder if it's time to move to true book-like HTML formatting, with a TOC and "read next section" links at the bottom. It is so hard to load anyway... As for how to do that, I can play around and see how the suggestion from this post builds locally!

@agitter
Collaborator Author

agitter commented Oct 5, 2022

I don't think the problem is the size of a single manuscript file. Rather, the branch also tracks an archived copy of every version of the manuscript we've deployed before so we can maintain permalinks. The contents of https://github.com/greenelab/covid19-review/tree/gh-pages/v show about 600 past copies of the manuscript archived there.
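
As a rough way to check that count from a local checkout of gh-pages, here is a minimal sketch. The function name summarize_versions is hypothetical, and it assumes each archived version is one subdirectory of v/.

```shell
# Summarize the versioned outputs under a gh-pages checkout.
# Usage: summarize_versions [path-to-v-directory]
# (summarize_versions is a hypothetical helper name, not part of Manubot.)
summarize_versions() {
    vdir="${1:-v}"
    # Count immediate subdirectories, assumed to be one per archived version.
    count=$(find "$vdir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
    echo "archived versions: $count"
    # Disk usage of the five largest versions, in kilobytes.
    du -sk "$vdir"/*/ 2>/dev/null | sort -rn | head -n 5
}
```

Running summarize_versions v from the root of the checkout prints the count followed by the largest version directories.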

Bringing in @dhimmel in case the permalinks and archiving are a possible cause.

@rando2
Collaborator

rando2 commented Oct 5, 2022

They really need a 😱 reaction for posts like this... That makes sense and would definitely explain the timeouts!

@dhimmel
Collaborator

dhimmel commented Oct 5, 2022

Looking at the raw CI logs for this build:

2022-09-26T18:46:40.7358609Z Created deployment for c269dc06246f43081bfbfb4e8ae789a0f745d01b
2022-09-26T18:46:40.7360771Z {"page_url":"https://greenelab.github.io/covid19-review/","status_url":"https://api.github.com/repos/greenelab/covid19-review/pages/deployment/status/c269dc06246f43081bfbfb4e8ae789a0f745d01b","preview_url":""}
2022-09-26T18:46:40.7363072Z 
2022-09-26T18:46:46.0259664Z Current status: deployment_in_progress
...
2022-09-26T18:56:40.2791010Z Current status: deployment_in_progress
2022-09-26T18:56:45.5190618Z Current status: 
2022-09-26T18:56:45.5191563Z Timeout reached, aborting!
2022-09-26T18:56:45.5240173Z ##[error]Timeout reached, aborting!
2022-09-26T18:56:45.8961141Z Deployment cancelled with https://api.github.com/repos/greenelab/covid19-review/pages/deployment/cancel/c269dc06246f43081bfbfb4e8ae789a0f745d01b

So the deployment_in_progress step is likely limited to 10 minutes. What about creating a branch from gh-pages to preserve the existing versioned outputs, then editing gh-pages to delete most of the versions?
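
The first half of that suggestion could look roughly like the sketch below. The branch name gh-pages-archive and the function name are hypothetical, and the remote is assumed to be called origin.

```shell
# Sketch: keep the current gh-pages history on a separate archive branch.
# "gh-pages-archive" is a hypothetical branch name; "origin" is assumed
# to be the remote that hosts gh-pages.
archive_gh_pages() {
    archive_branch="${1:-gh-pages-archive}"
    git fetch origin gh-pages &&
    # Point a new local branch at the fetched tip, without checking it out.
    git branch "$archive_branch" origin/gh-pages &&
    # Publish it so the versioned outputs survive pruning of gh-pages.
    git push origin "$archive_branch"
}
```

Because the archive branch shares history with gh-pages, this preserves the old files without copying them. Pruning would then happen in a later commit on gh-pages itself.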

@agitter
Collaborator Author

agitter commented Oct 5, 2022

https://github.com/orgs/community/discussions/35197 provides more details. The artifacts grew to 10 GB in size, which leads to the 10 min timeout @dhimmel detected. We can monitor the artifacts size from the actions pages such as https://github.com/greenelab/covid19-review/actions/runs/3129872205

Archiving the old versions of the manuscript files isn't too hard. We already have a Zenodo repository linked to releases of this GitHub repository, so we could create a release from the gh-pages branch and then delete most of the versions.

That would destroy our old permalinks, which is unfortunate. We could manually try to preserve the old versions that correspond to releases (e.g. the arXiv preprints), but would miss some. I don't see a general solution though if we are going to continue hosting the manuscript on GitHub pages.

Maybe we set the old permalinks to redirect to the Zenodo DOI? That would be better than a 404.
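
One way the redirect idea might work on a static host like GitHub Pages is to leave a small HTML stub at each pruned permalink. This is only a sketch: write_redirect is a hypothetical helper, and the target URL is supplied by the caller (for example, the DOI link Zenodo assigns to the archive).

```shell
# Write a minimal HTML page that redirects a pruned permalink elsewhere.
# Usage: write_redirect <version-directory> <target-url>
# (write_redirect is a hypothetical helper; the target URL is an assumption
# supplied by the caller, e.g. the archive's DOI link.)
write_redirect() {
    dir="$1"
    url="$2"
    mkdir -p "$dir"
    cat > "$dir/index.html" <<EOF
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url=$url">
    <link rel="canonical" href="$url">
    <title>Archived manuscript version</title>
  </head>
  <body>
    <p>This version has been archived. See <a href="$url">$url</a>.</p>
  </body>
</html>
EOF
}
```

A stub like this is only a few hundred bytes per version, so keeping one for every old permalink should add little to the artifact size.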

@dhimmel do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo, or does this review just push the Manubot workflow to the extreme?

@dhimmel
Collaborator

dhimmel commented Oct 11, 2022

> That would destroy our old permalinks, which is unfortunate

Slightly unfortunate, but you could do something in between like just delete the images directory.

> We could manually try to preserve the old versions that correspond to releases

Yeah, I don't think Manubot creates permalinks for git tags, but that would be a nice feature if it did.

> do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo

Possibly. I think it's a reason to recommend embedding images by link if you plan to have large images and many commits. The insights from discussions#35197 might be valuable in USAGE, as well as whatever you end up deciding in terms of pruning things.

@agitter
Collaborator Author

agitter commented Oct 22, 2022

I'm documenting my process to prune the gh-pages branch here.

Check out gh-pages locally, confirm I have the output from the last commit, and create a local copy for safekeeping.

$ git checkout origin/gh-pages
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ cp -R . ../gh-pages-archive-2022-10-22

Our repository also has Zenodo archiving enabled, so I make a tag and release to archive the gh-pages contents before pruning.

$ git tag -a gh-pages-2022-10-22 -m "Archive gh-pages branch 2022-10-22"
$ git push origin gh-pages-2022-10-22

Zenodo created an archive of the release that is 8.2 GB compressed. (I also noticed at https://help.zenodo.org/ that Zenodo now supports metadata in a .zenodo.json file in the GitHub repo. The lack of that was always one of my gripes with archiving GitHub releases on Zenodo, and it is something we may want for this repo.) I downloaded the zip and checked that a few of the versioned PDFs look good. Time to start deleting!

I start by checking the size of the contents and iteratively delete until it is back to a reasonable size.

$ du -sh .
20G     .
$ rm v/*/images/*
$ du -sh .
14G     .
$ rmdir v/*/images
$ rm v/*/*.pdf
$ du -sh .
9.9G    .

It's still huge even after removing images and PDFs. Time to remove entire manuscripts arbitrarily.

$ rm -rf v/0*
$ rm -rf v/1*
$ du -sh .
9.8G    .

Removing those HTML files barely changed the total, which is a reminder that the disk usage must be elsewhere. It's in the .git subdirectory, which I am not touching.

$ du -sh v/
827M    v/
$ du -sh .git/
8.9G    .git/
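
To separate the two numbers in one step, something like the following sketch may help. The function name repo_size_report is hypothetical. It relies only on du -sk and git count-objects -v, and it says nothing about what the Pages deployment artifact actually contains.

```shell
# Report how much of a checkout is Git metadata versus checked-out files.
# Usage: repo_size_report [path-to-checkout]
# (repo_size_report is a hypothetical helper name.)
repo_size_report() {
    repo="${1:-.}"
    # Size of the object database and other Git metadata, in kilobytes.
    git_kb=$(du -sk "$repo/.git" | cut -f1)
    # Size of everything in the checkout, including .git.
    total_kb=$(du -sk "$repo" | cut -f1)
    echo "total_kb=$total_kb git_kb=$git_kb worktree_kb=$((total_kb - git_kb))"
    # Git's own accounting of loose and packed objects.
    git -C "$repo" count-objects -v
}
```

The worktree_kb figure is the one that shrinks as versions are deleted, while git_kb stays large because the history still holds every old file.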

Let's blast a few more HTML manuscripts. My favorite number is "5" so it stays.

$ rm -rf v/2*
$ rm -rf v/3*
$ rm -rf v/4*
$ rm -rf v/6*
$ rm -rf v/7*
$ rm -rf v/8*
$ rm -rf v/9*
$ du -sh v/
377M    v/
$ ls -l v/*/*.html | wc -l
249
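
The prefix-by-prefix deletion above could also be driven by an explicit list of versions to keep. The sketch below is an alternative, not what was run here. prune_versions and the keep file are hypothetical, and it assumes version directories are named with full 40-character commit hashes so that other entries, such as latest, are left alone.

```shell
# Remove versioned output directories except those listed in a keep file.
# Usage: prune_versions <v-directory> <keep-file>
# (prune_versions and the keep file are hypothetical; the keep file holds
# one full commit hash per line.)
prune_versions() {
    vdir="$1"
    keep_file="$2"
    # Refuse to run without a readable keep list, to avoid deleting everything.
    [ -r "$keep_file" ] || { echo "cannot read keep file: $keep_file" >&2; return 1; }
    for path in "$vdir"/*; do
        name=$(basename "$path")
        # Only consider real directories (not symlinks) ...
        [ -d "$path" ] && [ ! -L "$path" ] || continue
        # ... whose name looks like a 40-character commit hash.
        printf '%s\n' "$name" | grep -Eq '^[0-9a-f]{40}$' || continue
        # Skip versions that should keep their permalink.
        grep -Fxq "$name" "$keep_file" && continue
        rm -rf "$path"
    done
}
```

With a keep file listing the hashes of releases and cited versions, one call replaces the series of rm -rf commands and the later restores from the local copy.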

I'm stopping here. If we address the problem below, we have a reasonable artifact size and many past versions of the manuscript left (HTML only though).

I can restore complete archives from my local copy, and anyone could do this by downloading the zip from Zenodo. I'm only restoring the two versions we refer to in the manual references for now and the latest version. I could restore more later that correspond to releases or other special versions.

$ cp -R ../gh-pages-archive-2022-10-22/v/910dd7b7479f5336a1c911c57446829bef015dbe v/910dd7b7479f5336a1c911c57446829bef015dbe
$ ls v/910dd7b7479f5336a1c911c57446829bef015dbe
$ cp -R ../gh-pages-archive-2022-10-22/v/32afa309f69f0466a91acec5d0df3151fe4d61b5 v/32afa309f69f0466a91acec5d0df3151fe4d61b5
$ ls v/32afa309f69f0466a91acec5d0df3151fe4d61b5
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ cp -R ../gh-pages-archive-2022-10-22/v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3 v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ du -sh v/
478M    v/

I noticed I broke the symbolic links for v/latest.

git status excerpt
        deleted:    v/latest/images/4.2-summary-R-M-smallMoleculeDrugs.pdf
        deleted:    v/latest/images/4.3-summary-R-M-biologicsDrugs.pdf
        deleted:    v/latest/images/4.3.1-summary-R-M-moreTocilizumab.pdf
        deleted:    v/latest/images/4.3.2.1-summary-R-U-moreMonoclonal.pdf
        deleted:    v/latest/images/4.3.4.1.1-summary-L-M-DNAVaccine.pdf
        deleted:    v/latest/images/4.3.4.1.2-summary-L-L-RNAVaccine.pdf
        deleted:    v/latest/images/FIgX1.jpg
        deleted:    v/latest/images/N000-overview.pdf
        deleted:    v/latest/images/N000-overview.png
        deleted:    v/latest/images/N001-LifeCyclePlusDrugs.pdf
        deleted:    v/latest/images/N001-LifeCyclePlusDrugs.png
        deleted:    v/latest/images/N002-Vaccines.pdf
        deleted:    v/latest/images/N002-Vaccines.png
        deleted:    v/latest/images/SARS_CoV_2.png
        deleted:    v/latest/images/Summary.pdf
        deleted:    v/latest/images/cell-lines-moi-partB.afdesign
        deleted:    v/latest/images/cell-lines-moi.afdesign
        deleted:    v/latest/images/cell-lines-moi.jpg
        deleted:    v/latest/images/covid-19-review-workflow-figure.pdf
        deleted:    v/latest/images/covid-19-review-workflow-figure.png
        deleted:    v/latest/images/covid-19-review-workflow-figure.svg
        deleted:    v/latest/images/covid-19-review-workflow-horizontal-cropped.pdf
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.pdf
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.png
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.svg
        deleted:    v/latest/images/diagnostics.png
        deleted:    v/latest/images/ebmdatalab-trials-original.png
        deleted:    v/latest/images/genome-structure.png
        deleted:    v/latest/images/github.svg
        deleted:    v/latest/images/interests.png
        deleted:    v/latest/images/orcid.svg
        deleted:    v/latest/images/summary-M-M-Covid19Mechanism.pdf
        deleted:    v/latest/images/thumbnail.png
        deleted:    v/latest/images/twitter.svg
        deleted:    v/latest/manuscript.pdf

I restored those and then committed the other changes. I had to do the taboo git add . because my other attempts to add by pattern failed. I made an absolute mess of the commits and of pushing them to origin because I hadn't checked things out locally properly. Eventually, the commit made it.

$ git checkout v/latest/*
$ git checkout v/latest/images/*
$ git add .
$ git commit -m "Prune most old versioned manuscripts"
$ git log
commit 62720cec39d92945ce6733925bb35218947541e4 (HEAD)
Author: Anthony Gitter <gitter@biostat.wisc.edu>
Date:   Sat Oct 22 16:29:34 2022 -0500
    Prune most old versioned manuscripts
$ git checkout --track origin/gh-pages
$ git branch prune-gh-pages 62720cec
$ git checkout prune-gh-pages
$ git branch --set-upstream-to origin/gh-pages prune-gh-pages
Branch 'prune-gh-pages' set up to track remote branch 'gh-pages' from 'origin'.
$ git push origin HEAD:gh-pages
Enumerating objects: 499, done.
Counting objects: 100% (499/499), done.
Delta compression using up to 8 threads
Compressing objects: 100% (250/250), done.
Writing objects: 100% (250/250), 15.17 KiB | 706.00 KiB/s, done.
Total 250 (delta 249), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (249/249), completed with 249 local objects.
To https://github.com/greenelab/covid19-review.git
   483a8dde..62720cec  HEAD -> gh-pages
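
In hindsight, starting from a local branch that tracks origin/gh-pages, rather than a detached HEAD, might have avoided most of that mess. Here is a sketch of the same prune as a single pass. The function name and its prune_cmd argument are hypothetical stand-ins for whatever deletion step is used, and the remote is assumed to be origin.

```shell
# Sketch of the prune on a local tracking branch instead of a detached HEAD.
# Usage: prune_and_push '<shell command that deletes files under v/>'
# (prune_and_push and prune_cmd are hypothetical; "origin" is assumed.)
prune_and_push() {
    prune_cmd="$1"
    # Create or reset a local gh-pages branch at the remote tip.
    git fetch origin gh-pages &&
    git checkout -B gh-pages origin/gh-pages &&
    # Run the deletions, then undo any that touched v/latest.
    sh -c "$prune_cmd" &&
    git checkout -- v/latest &&
    # Stage only changes under v/ (deletions included), not the whole tree.
    git add -A -- v &&
    git commit -m "Prune most old versioned manuscripts" &&
    git push origin gh-pages
}
```

Limiting git add to the v directory avoids the blanket git add . while still recording the deletions. Note that git checkout -B discards any divergent local gh-pages commits, so this assumes there are none worth keeping.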

If you browse gh-pages you'll see the pruned versioned manuscripts. And now the GitHub Pages deploy process works again so https://greenelab.github.io/covid19-review/ shows our latest manuscript!

We still should do this before closing the issue or merging too many more changes to the manuscript:

> embedding images by link if you plan to have large images and many commits

Every time we push to gh-pages, we are creating a new copy of all the images in content/images. For this project that is a lot of copies of a lot of images. We could move these to external-resources even though they are not really external. @rando2 could you work on that? It may be a while before I can do manuscript maintenance again.
