Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The CI improvements we need [tracking issue] #1886

Open
10 of 16 tasks
thedavidprice opened this issue Feb 27, 2021 · 20 comments
Open
10 of 16 tasks

The CI improvements we need [tracking issue] #1886

thedavidprice opened this issue Feb 27, 2021 · 20 comments

Comments

@thedavidprice
Copy link
Contributor

thedavidprice commented Feb 27, 2021

Goal

The process of drafting and publishing a new release (with high quality confidence) should take no more than 2-3 hours.

Current Problems

  1. QA Bottlenecks: too much manual QA that happens too infrequently (and mostly only when ready for release process)
  2. Misc tasks-required-to-release Bottlenecks: merged PRs often produce other task requirements to complete before release (e.g. new docs or updates to existing, manual QA tests, Release Note copy e.g. Upgrade Instructions or Breaking explanations)
  3. David P.: I’ve become the keeper of QA and to-dos related to each release, which makes me the bottleneck

Improvements and Solutions

  1. Improve the integrity of each PR (add Cypress E2E check) and add additional CI automation (e.g. deploy)
  2. Improve the process when meging PRs: e.g. better organization + communication amongst maintainers
  3. Collaboration: improve documentation and figure out the collaborative tools to use

Cypress E2E runner

Current Cypress test covers Tutorial parts 1 - 6. It's available to run manually against either the current branch or against a specific project using the packages installed in the project:

note: I'm not convinced we need to add every part of Tutorial I and Tutorial II. But we do need to make sure each functional element is working.

Improvements

  • Create a GitHub Action to run Cypress as a required PR Check
  • matrix run: Node and OS [Windows, Linux]
    • currently only targeting Node v14; need to add 16 sometime before October
    • need to add Windows to E2E
  • test CSS styling on Scaffold UI (this was almost missed last time; currently requires manually watching runner)
  • test misc CLI commands, e.g.: dataMigrate, setup auth [netlify], setup deploy [netlify] (note: just need to confirm commands run and complete; not sure if we need to then test functionality of each at this point)

Test and Storybook Commands

For full automation, ideally, we would create a project using the E2E runner. If passing, we would then run these checks:

  • Test: run rw test against a project that has passing Web and API tests
  • Storybook: run rw storybook against a project with Cells, pages, Layout, etc. and then use Cypress to confirm one of each component is displaying correctly (including states for Cells)

Prerender

Need to determine testing for pretender. Both during dev and deploy:

  • TO DO

Next Priorities (additions to original list)

Deployment + API/DB Connection

I'm currently running manual Netlify and Vercel deploys, which requires 1) checking build logs for clean build and 2) checking deployed site to confirm API connection and DB are working accordingly.

We need to automate deploy and API/DB tests:

  • when canaries are published, could trigger a workflow to install, modify, and deploy a project (note: modification needs to be automated as well for consistency across a growing number of deploy targets)
  • would need DBs and, likely, resetting a DB before each deploy
  • use Pingdom or some other cron tasks to consistently check API endpoints
  • alerts+notifications: recommend create a notification channel in the Core Team's Slack Workspace

Release Notes

At this time, I don't believe we need specific automation for drafting the Release Notes themselves. We have a good process, and manually reviewing and writing feels like helpful practice at this point in the project.

What would be helpful is consolidating where we draft Release Notes and possibly automating distribution (e.g. GitHub, Forums, and Twitter).

  • needs discussion

Why do PR checks take so much time to complete?

We currently have the following PR checks:

  • yarn install
  • yarn build
  • yarn lint
  • yarn test

And we run a matrix against Node.js [v12, v14] && OS [Windows, Linux]. So there are 4 runners in parallel.

Linux runners complete in 3-4 minutes. ✅
Windows runners complete in 9-10 minutes. ⛔️

Why, Windows? WHY?

TS

TS build takes time. It's not terrible though

Windows: Caching and Connection Timeouts

Take a look at a Windows run, which is often around 10 minutes in total. The bulk of the time (4 minutes) is yarn install with the yarn build and yarn test tieing for 2nd place at 2 minutes.

Windows node_modules cache is unreliable

We are able to use the GitHub Action node_modules cache on Linux. However, last time we tried the Windows cache was unreliable and would often have false negatives when checking if cache needed to be updated -- this led to stale node_modules being used, which threw confusing errors (e.g. if a package was added or upgraded, the runner wouldn't be using the latest).

However, even without the cache, I was surprised to find out the real problem adding to total time is a connection error.

"There appears to be trouble with your network connection..."

I've watched and timed the installation process, and it turns out a connection error takes 1-2 minutes of time per installation:

info There appears to be trouble with your network connection. Retrying...

You can see it here:
Screen Shot 2021-02-27 at 12 57 51 PM

About 3-4 months ago, I spent at least 4 hours attempting to diagnose and resolve this connection error. Unsuccessfully. I determined it was either a bug with Yarn on Windows or it's an issue with GitHub's networking for the Windows runners.

I owe 🍻 to whoever can figure this one out.

@thedavidprice thedavidprice pinned this issue Feb 27, 2021
@thedavidprice
Copy link
Contributor Author

I intentionally left out the topic of "Performance: Benchmarks and Checks". I have some ideas and examples to draw from. For me, this feels like the next step after the kinds of CI improvements I mentioned above.

@Tobbe
Copy link
Member

Tobbe commented Feb 28, 2021

Regarding Windows

I'm seeing info There appears to be trouble with your network connection. Retrying... too, when running locally on Windows. I've just always assumed it was an actual problem with my flaky connection, but maybe it isn't.

Windows' file system is slower than Linux's, so when reading or writing lots of small files, Windows is going to see a lot worse performance than Linux. And guess what yarn install is doing...

Googling around a bit, I found this thread https://travis-ci.community/t/yarn-network-troubles/333 The suggest disabling Windows Defender. Combining the info there, with these Windows Defender instructions for Github Actions could be worth exploring. actions/runner-images#855

@peterp
Copy link
Contributor

peterp commented Mar 1, 2021

Here's my e2e (cypress) GitHub workflow:

name: E2E Tests
on: [push]
jobs:
  cypress-run:
    runs-on: ubuntu-16.04

    services:
      postgres:
        image: postgres:11.5
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-interval 10s --health-timeout 5s --health-retries 5
        env:
          POSTGRES_PASSWORD: postgres

    steps:
      - name: Checkout
        uses: actions/checkout@v1
      - name: Install modules
        run: yarn install --frozen-lockfile

      - name: Migrate test database
        run: yarn rw prisma migrate deploy
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost/postgres

      - name: Start server in background
        run: yarn rw dev &
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost/postgres

      - name: Cypress run
        uses: cypress-io/github-action@v2
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost/postgres
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          wait-on: 'http://localhost:8910'
          record: true
          working-directory: scripts/e2e

@Tobbe
Copy link
Member

Tobbe commented Mar 1, 2021

Cypress on Windows has a bug that we'd have to work around if we want to get cypress tests running on both ubuntu and windows cypress-io/cypress#789

@thedavidprice
Copy link
Contributor Author

@Tobbe

Bummer all around regarding these Windows issues. ☹️

Googling around a bit, I found this thread https://travis-ci.community/t/yarn-network-troubles/333 The suggest disabling Windows Defender. Combining the info there, with these Windows Defender instructions for Github Actions could be worth exploring. actions/runner-images#855

I previously did try implementing several of the settings mentioned in this thread, which did not resolve the connection issue for me. Definitely worth a second-round attempt with better documentation this time. Also, I could only test via the GitHub action since I’m on Mac — testing on your local machine first would be a very helpful.

@Tobbe
Copy link
Member

Tobbe commented Mar 15, 2021

Was getting errors on the "windows-latest | Node 14" GH actions run for a PR, and found this issue that matches the error I saw yarnpkg/yarn#8242

The error I was seeing is error An unexpected error occurred: "https://registry.yarnpkg.com/date-fns/-/date-fns-2.16.1.tgz: ESOCKETTIMEDOUT".
https://github.com/redwoodjs/redwood/pull/1967/checks?check_run_id=2111565278

And the this is why yarn closed the issue on their end

Closing as fixed in v2 where the timeout logic is less susceptible to this sort of issue

So, not "fixed", but "less susceptible".

@thedavidprice
Copy link
Contributor Author

thedavidprice commented Mar 17, 2021

Next Priorities

added to OP

@thedavidprice
Copy link
Contributor Author

@Tobbe re: Cypress on Windows bug mentioned by you above:

Based on this comment in the Issue you linked, it seems possible for us to add a Windows runner to our Cypress CI, yes? Or am I misunderstanding the related Windows issues.

@Tobbe
Copy link
Member

Tobbe commented Apr 3, 2021

@thedavidprice I haven't tried. Latest version of cy I've tested with is 6.4.0. That comment you linked is about 6.5.0. But looking at the changelog for v6.5.0 I don't see anything that should affect this issue. What makes me cautiously optimistic is that she says it works using cmd and ps, but is broken on git-bash. I've only tried git-bash. So maybe it would have worked for me as well had I been using ps.

Do we know what shell the Windows runner is using?

Could be worth it to just try, and see how/where/if it breaks.

@thedavidprice
Copy link
Contributor Author

@Tobbe looks like we have three options. Not sure how we choose, but I'm optimistic this makes it possible. I'd definitely like to give it a go as the feels like a huge hole in our CI right now:
https://github.com/actions/virtual-environments/blob/main/images/win/Windows2019-Readme.md#shells

@thedavidprice
Copy link
Contributor Author

If we build the script in this RFC #2218, I think we are only a few steps away from being able to automate deployment testing.

@thedavidprice
Copy link
Contributor Author

Update: added , Scaffold CSS, and Toast checks here #2164

@Tobbe
Copy link
Member

Tobbe commented Apr 6, 2021

looks like we have three options. Not sure how we choose, but I'm optimistic this makes it possible.

All three seems to be based on bash. The shells Sandra said it worked in were TS and Cmd, i.e. not bash based.

I'll take a look tonight at rewriting those cy.exec calls to something that works on Windows too.

@thedavidprice
Copy link
Contributor Author

@Tobbe check out this (almost) successful test PR: #2227

I possibly verified that the GitHub Action cypress-io/github-action@v2 does run on Windows. What we need to adjust is the bash commands in the steps leading up to the Cypress run in our Workflow E2E file.

  1. I explicitly set the shell to bash (would default to Powershell and fail 'cause Linux commands)
  2. Using bash, I ran into a strange directory failure on the step running rw dev
  3. I then removed the rw dev step just to see if Cypress would start successfully, which it seemed to do!

FYI: currently I've re-added the step to start rw dev

Could you take a stab at diagnosing the issue with the rw dev step? Also, I welcome alternatives to setting 'bash' and/or using a conditional to run Windows-specific commands.

@Tobbe
Copy link
Member

Tobbe commented Apr 7, 2021

I'm afraid getting Cypress to start is the easy part. What's going to be difficult is getting the tests to pass. I tried here: #2228. They're super flaky, but maybe we can improve that. What really got me stuck though is deleting the dev db file. File locks don't work the same way on Linux and Windows.

@thedavidprice
Copy link
Contributor Author

Services to manage updating and deploying projects

Prisma has a great CI + Actions set up to test deploys and automatically update versions of project:

@janpio
Copy link

janpio commented Apr 9, 2021

Happy to help out with any Renovate or Prisma e2e-tests related questions - we are pretty happy with this setup and it helps us catch loads of things on our side and on the side of partners as well 👍 (@janpio on shared Slack or Prisma Slack)

@thedavidprice
Copy link
Contributor Author

thedavidprice commented Apr 30, 2021

  • add CI for yarn rw serve

Added to OP

@thedavidprice
Copy link
Contributor Author

thedavidprice commented Feb 23, 2022

Made good progress on Deploy Automation in #4484

Remaining TO DO (post v1):

  • Serverless CD: auth via GH secret
  • Render: Have no idea how to set up CD for monorepo and workspaces
  • Rebuild projects: ideally when breaking changes happen we'd have automation to rebuild each project

@thedavidprice
Copy link
Contributor Author

Update: Deployment Automation

This CI project is complete! Repo here: https://github.com/redwoodjs/deploy-target-ci

What it does
Each time a new canary is published, the following events are triggered:

  1. Renovate updates the @redwoodjs/...@canary packages for all projects
  2. CI does a build and test run for each project; Netlify and Vercel previews run
  3. If CI passes, there’s another Action to Automerge (every 5 minutes)
  4. Merge triggers CD for all 4 providers
  5. Deployment success/fail notifications are sent to Slack channel
  6. Pingdom is continually monitoring the deploy URLs: home page, graphql function, and query

🚀

@jtoar jtoar removed the v1/priority label May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

No branches or pull requests

5 participants