Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Add a CI check to rebuild documentation from a cached documentation build #16161

Closed
namurphy opened this issue Mar 5, 2024 · 19 comments
Closed

Comments

@namurphy
Copy link
Contributor

namurphy commented Mar 5, 2024

What is the problem this feature will solve?

It takes a long time to build Astropy's documentation from scratch. From recent builds on Read the Docs, it takes ∼1000 s.

One of the most important qualities of continuous integration testing is to get rapid feedback. One thing that has happened a lot to me over at PlasmaPy has been that I run into a cryptic Sphinx error, and then have to wait ∼500 s between iterations trying to get the reStructuredText syntax right. (We have a saying over at PlasmaPy that "Sphinx rabbit holes often have dragons in them." 🐰 🕳️ 🐉)

Describe the desired outcome

We could use the GitHub Action for caching to cache a documentation build from main, as well as the Python environment used to build it. This cache could be recreated daily.

We could then add a workflow to be run in CI that would rebuild the documentation from the cached build (using the -j auto flag to have the sphinx-build be done in parallel while auto-detecting the number of processes to use). We could potentially reduce the documentation build time in CI to ≲ 3 min.

Additional context

This probably should not be a required check, at least not at first. I suspect there may be some occasional situations where a rebuild might fail or succeed incorrectly, such as when adding a Sphinx extension or modifying the configuration file. It might help to have a label for skipping the documentation cache, and/or for the cache key to use hashFiles on conf.py and maybe also pyproject.toml.

I've been trying to do this in PlasmaPy/PlasmaPy#2553 by using a tox environment to do this, but it seems that it rebuilds the documentation build from scratch (though caching .tox between runs did speed things up by a minute or two). It may be necessary to use sphinx-build directly rather than creating a tox environment for this.

Edited to add: Locally, it took my computer ∼7 minutes to do a fresh doc build for Astropy, and ∼2.4 min to do a repeat documentation build. This was with make html -j 2 for two processors, which is what each GitHub Action allows.

@namurphy
Copy link
Contributor Author

namurphy commented Mar 5, 2024

Incidentally, PlasmaPy's documentation guide has a troubleshooting guide which describes a whole bunch of cryptic Sphinx errors and possible ways to fix them. If the documentation build in CI fails, it prints out a link to the troubleshooting guide.

@namurphy namurphy changed the title Add a CI check to rebuild documentation from a cached documentation build DOC: Add a CI check to rebuild documentation from a cached documentation build Mar 5, 2024
@pllim
Copy link
Member

pllim commented Mar 5, 2024

Over here, doc build CI is handled directly by RTD via https://docs.readthedocs.io/en/stable/pull-requests.html because the HTML preview is really essential in some PRs. To add another CI via tox to essentially do the same build would be a waste of resources. I wonder if RTD will be able to offer caching directly. Maybe @humitos can advice or point us to someone who can. 🙏

@namurphy
Copy link
Contributor Author

namurphy commented Mar 5, 2024

Thank you for the feedback! My view is that it would be a worthwhile use of resources since it has the potential to shorten the feedback loop for the documentation build by possibly ten minutes or so. This would only get us the Sphinx error messages, so the time for the HTML preview (which I agree is absolutely essential!) would remain unchanged.

(Relatedly, addressing #15149 might reduce the documentation build time by several minutes across platforms.)

@pllim
Copy link
Member

pllim commented Mar 5, 2024

Hmm ok. I added this to 2024-03-07 dev telecon agenda. Thanks!

@humitos
Copy link

humitos commented Mar 6, 2024

Hello 👋🏼 . Thanks for pinging me on this issue.

In the past, Read the Docs used to cache the Git repository cloned, the virtualenv with all the requirements installed and the Sphinx environment to be able to re-use it between builds. However, this brought a lot of inconsistencies in the documentation (e.g. documentation not updated with latest changes, different build times when using/not using the cache, etc) and plenty of confusions to our users (e.g. how to purge the cache). Due to this, we decided to remove this feature and make all the builds consistent between them. This has helped a lot to remove these inconsistencies and users confusions 👍🏼

Later, with the introduction of the readthedocs.yaml file (https://docs.readthedocs.io/en/stable/config-file/) we thought we could add a cache key (readthedocs/readthedocs.org#9183) and expose this feature to users. However, we didn't get too much traction on it and we weren't able to prioritize either due to it's complexity.

Even if we implement the cache here, I suppose that it will bring the confusion to users again increasing our support requests with questions like "Why my documentation doesn't reflect the latest changes in my repository?" and similar ones. IMO, it's a pretty advanced feature that requires a good understanding on how the build process works to take into account all the cases/file changes that could affect the output and consider all of them to generate a hash for the cache correctly. It's a double-edged sword ⚔️

I'm open to receive more feedback here or on the linked issue where we are having the cache key conversation 👨🏼‍💻

@namurphy
Copy link
Contributor Author

namurphy commented Mar 7, 2024

@humitos — Thank you for the insightful response! Given your response, along with the scale and tradeoffs to consider for the RTD ecosystem, it makes much more sense how caching doc builds would be really difficult there!

My inclination would be for Astropy's CI to include a quick rebuild of the docs on GitHub Actions, in addition to the much more reliable RTD build with a preview. Being able to get rapid feedback is essential for CI, even if the feedback is partial. GitHub Actions also lets workflows be run on two processors, which itself might be able to reduce fresh doc build times by ∼25-40% compared to a serial run.

Perhaps the next step would be for me to try this out for PlasmaPy to see how much time caching the doc build would actually save in CI, and then compare it to other possibilities for speeding up doc build times.

@humitos
Copy link

humitos commented Mar 7, 2024

GitHub Actions also lets workflows be run on two processors, which itself might be able to reduce fresh doc build times by ∼25-40% compared to a serial run.

This a good point. I just wanted to note that Read the Docs uses AWS instances with 2 CPU cores as well. However, the default Sphinx command to build the documentation 1 does not use -j auto anymore. We implemented this option as a feature flag some time ago in readthedocs/readthedocs.org#7128 2 but we found there many Sphinx extension that don't support building in parallel and raise an error/warning message making the build to fail (in particular if using sphinx.fail_on_warning)

Because of this, together with not a lot of benefits in most of the projects (a few seconds faster) and the low adoption of the feature flag, we removed it in readthedocs/readthedocs.org#10423.

I'd be curious of the performance/optimization numbers you may get in case you do this research with GitHub Action or even with Read the Docs using build.commands and running a custom sphinx-build command with -j auto to build the documentation.

Let me know if I can be useful here in something else 👍🏼

Edit: another interesting issues

Footnotes

  1. python -m sphinx -T -W --keep-going -b html -d _build/doctrees -D language=en . $READTHEDOCS_OUTPUT/html

  2. note that @pllim asked for enabling this feature flag on this project when we released the feature 😅

@namurphy
Copy link
Contributor Author

namurphy commented Mar 7, 2024

I just did some timings on freshly building PlasmaPy's docs locally with make html, make html -j 2, make html -j 12. All were within a few seconds of 3 min. So...I take it back that parallelizing sphinx-build will likely help.

The biggest successes I've had with speeding up doc builds in CI for PlasmaPy have been caching .tox between runs so that the Python environment need not be recreated every time, and using sphinx-globalsubs (#16162).

Before anything else, I should do some timings to see how much caching the doc build will help after making a minor change in the docs in between.

@namurphy
Copy link
Contributor Author

namurphy commented Mar 7, 2024

After doing some timings and having discussions here and at the astropy dev telecon, I'm now more inclined to support an additional fresh "lite" build of the docs that skips evaluating example Jupyter notebooks.

Best case scenario: someone rewrites Sphinx (or at least the most time-consuming parts of it) in Rust. I wish Astral had a bat-signal... 🦇

@hamogu
Copy link
Member

hamogu commented Mar 7, 2024

We discussed this in the astropy-dev telecon. I'm trying to summarize the discussion here, so what I write is not necessarily my opinion:

  • We do need to keep the RTD job so we get a preview of the docs and so we know authoritatively that the build works.
  • Any GHA CI would be in addition to not instead of the long-running RTD job.
  • Any GHA CI needs to be maintained, so even if it's a fast jobs, there is a cost to it.
  • What would it mean is GHA fails, but RTD passes?
  • Main goal is quicker feedback for doc changes, but that can be done locally. A simple "make html" in docs/ should work (even without using tox). The first build will be slow, al later builds will use local cache and be much faster (except for very special cases like changing the conf.py or working in different PRs with substantial docs changes at the same time, but if those changes invalidates the cache, caching on CI won't help either).
  • Another apporach would be (suggested by @eteq ) to allow "partial doc builds" of just a few sphinx pages of interest. However, internal of sphinx are not easy to hack and not clear if that is worth the effort.

Personally, I suggest to not add another redundant CI job; instead we can look if the docs for how to build the docs locally can be improved. We can also do other speedups e.g. replace the rst epilogue with some other dictionary mechanism that is faster, if there are easy ways to do them.

@pllim
Copy link
Member

pllim commented Mar 7, 2024

At the dev telecon today (2024-03-07), @hamogu and I voiced concerns for this request:

  • Two different jobs (one in Actions via tox, another in RTD) doing a similar build would cause confusion more than help when one fails but the other doesn't. This is because tox might pull an outdated cache while RTD always builds from scratch.
  • Given RTD does not understand [ci skip] and has its own spam and auto-cancel policies, if the goal of a faster separate job is to be able to push a lot of little commits to fix docs, it would end up hammering RTD instead.

Instead of re-adding this doc CI check (we used to have one before RTD PR build was a thing), these alternatives were discussed:

  1. DOC: Move sphinx-gallery examples out of core (this) repository #16175
  2. Improve dev docs to help contributors build docs locally because then contributor can work off their local cache as needed. Maybe start here https://docs.astropy.org/en/latest/development/docguide.html#building-the-documentation-from-source ?

@pllim
Copy link
Member

pllim commented Mar 7, 2024

Oh oops, I didn't refresh the browser tab, so I missed the other comments posted at the time of dev telecon, sorry if I duplicated some stuff in my comment above. For completeness...

Nick also mentioned in the dev telecon that he tried using parallel build feature but didn't notice any speed up.

"partial doc builds"

This is a very old, still open issue at #10470 . Our docs are so intertwined with each other, no one could think of a simple solution to make it happen.

@humitos
Copy link

humitos commented Mar 7, 2024

Given RTD does not understand [ci skip] and has its own spam and auto-cancel policies, if the goal of a faster separate job is to be able to push a lot of little commits to fix docs, it would end up hammering RTD instead.

@pllim I'm sorry I'm chiming in again here 😅 , but I just wanted to note that we've built a feature to skip a build some time ago: "if there is any command that exits with code 183, the whole build is skipped". We talked about this in the past in readthedocs/readthedocs.org#9649 (comment) and I suggested a way to check for [ci skip] in the commit message and return code 183. This was the command:

case `git --no-pager log --pretty="tformat:%s" -1` in *"skip ci"*) exit 183;; *);; esac

There is a variation of that command that is:

(git --no-pager log --pretty="tformat:%s -- %b" -1 | grep -viq "skip rtd") || exit 183

Not sure what's the best one and which work for your use case, but I you can get the idea from there and adapt as you need. Let me know if that helps 👍🏼

@hamogu
Copy link
Member

hamogu commented Mar 7, 2024

Thanks got @humitos to point out this feature of RTD.

@pllim
Copy link
Member

pllim commented Mar 7, 2024

@humitos , yes, I actually implemented that for another package here:

https://github.com/spacetelescope/jdaviz/blob/c731eef855a0a65a6172fe53fca499ff6a98b3d9/.readthedocs.yaml#L13

However, implementing it for astropy needs more thoughts:

  • We used to have a [docs only] directive that is basically what [skip ci] does now (skips the Actions but not RTD). I (ab)use this feature sometimes.
  • To keep the above, we can introduce [skip rtd] like I did with the other package. Problem is because it is a completely new directive, most of the time I forgot to use it or too lazy to type more stuff into the commit. So it is not used a lot in the end.

@namurphy
Copy link
Contributor Author

namurphy commented Mar 7, 2024

I just did some benchmarks for PlasmaPy's documentation in PlasmaPy/PlasmaPy#2562 and found that:

  • Building the docs in parallel was slightly slower than building the docs on one processor.
  • Skipping the execution of example notebooks took 83% as long as a full doc build. (Some of our notebooks are already pre-executed.)

I did another benchmark on my workstation and found that:

  • Rebuilding the docs after changing a .rst file took almost as long as building the docs from scratch. 🥲 Edited to add: The timings below show this does not apply to Astropy.

Given this final result, I'm going to close this issue and go back to praying that Sphinx gets re-implemented in Rust. Edited to add: The timings below show that caching could significantly speed up the doc build, but there were wasn't much support at the dev meeting for a GitHub Action to rebuild docs, so I'll leave it closed.

The caveat here is that there might be something about our Sphinx config over there which is leading to longer rebuild times, but I've jumped into enough Sphinx dragon holes for the time being... 🐉 🕳️

Here's a summary of what actually has helped with speeding up PlasmaPy's doc build in CI:

@namurphy namurphy closed this as not planned Won't fix, can't repro, duplicate, stale Mar 7, 2024
@pllim
Copy link
Member

pllim commented Mar 7, 2024

Thanks for investigating! Really helpful insights here.

Building the docs in parallel was slightly slower than building the docs in parallel.

I have trouble understanding this conclusion though. 😅

@namurphy
Copy link
Contributor Author

namurphy commented Mar 7, 2024

Oops! I fixed it above.

@saimn
Copy link
Contributor

saimn commented Mar 8, 2024

For astropy, locally a full build take ~6 min and a rebuild a bit more than 1 min.

For rebuilds, most of the time is spent writing files again, I think mostly due to always writing all API files, since automodapi will always regenerate those. Not sure if there is a way to avoid that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants