Skip to content

Commit

Permalink
Run all non-k8s go tests with bazel (#9626)
Browse files Browse the repository at this point in the history
This runs all of the Go non-k8s tests with Bazel.

Our whole test suite runs in 5 minutes now, and faster if you didn't
edit something that makes a lot of tests run. (Pretty much every test
depends on everything, though, because of how realenv works.) If you run
the tests, notice a test fail, and fix the test, it should be less than
two minutes to get CI green now.

Everything is now hermetic; tests that depend on `match` and `pachctl`
get a binary injected into the test runfiles and the tests use that. You
don't need to "go install" either of those while running the test suite
locally.

I fixed a bug where `pachctl diff file` hangs forever if `less` is not
installed. It happens to not be installed in the CI image, and so I
fixed that bug. (If $PAGER can't run, then we just fall back to
io.Copy.)

I dramatically sped up the tests by running the postgres container with
fsync off. With it on, tests running in parallel take more than a minute
to create test database. With it off, it takes a few hundred
milliseconds. This allows all the tests to run in parallel on a 32 core
machine in less than 5 minutes. (Previously, 20 minutes * 6 machines.)

There are two configurations being added here; one Docker-based CI run
for most of the tests, and then a machine-based run for the fuse tests,
which require privileges not available on CircleCI's docker instances.
Both test runners run under Docker, however, so changing the
`/etc/ci-image` image will affect both tests in the same way.

I noticed a lot of contention for creating the Postgres database
container at the start of testing, so I split that code out into a
separate binary, `//src/testing/cmd/dockertestenv`, so that can run
contention-free. This greatly improves the reliability of the tests; in
my various runs I haven't seen any flakes.

I did some test refactoring to split unrelated tests into separate test
files, so that gazelle can generate separate test targets for each, and
they can run in parallel. I've also added some `shard_count` to larger
test packages so that tests inside the suite can run in parallel. This
mostly affects reruns during flakes; only 1/shard_count tests need to be
rerun. I'm targeting each test running in about 60 seconds. There are
some tests that can't be parallelized; mostly src/server/pfs/fuse,
because of the design of the mount server. We need to go fix those.

I have tested this on Linux and ARM64 Mac OS. On OS X, all the tests
pass now, but not necessarily the first time with a "go test ..."
invocation. Too CPU constrained; things time out if they run on the
efficiency cores, I think. I think everyone has a faster Mac than my
first-gen M1, though, so they probably won't have any problems. Because
of test caching, the second Bazel invocation gets the stragglers. (We
probably want `--local_test_jobs=4` on that machine. Haven't tried yet.)

One thing that's missing is sending test results to our Pachyderm
instance. There is some Bazel flag I need to get it to download the test
XML results for cached tests so they can be analyzed on the CI machine
and sent away. We might not need the analysis anymore, though...
buildbuddy tracks test time and flakiness for us.

Test coverage still works; there are some missing lines in the coverage
report below because `pachctl`-only code doesn't get covered (never did,
but we should do it!), and we don't run `dockertestenv` code during the
tests very much anymore (because it runs in the cmd before the tests,
and that's not contributing to coverage).

Notes to reviewers:

@pachyderm/pfs I split up some of your tests into multiple files.

@pachyderm/int Not sure why you're required for this review ;)

@djanicekpach Let me know which stuff in the commented-out test-go you
want me to preserve in addition to what's here now.

Part of CORE-2080
  • Loading branch information
jrockway committed Jan 16, 2024
1 parent 75e7441 commit 7407504
Show file tree
Hide file tree
Showing 124 changed files with 1,052 additions and 523 deletions.
28 changes: 25 additions & 3 deletions .bazelrc
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,36 @@ build:remotecache --nolegacy_important_outputs
# This is only relevant for remote caching, but using the same value of modify_execution_info allows
# sharing the analysis cache between local and remotecache builds, saving time when you switch the
# flag.
build --modify_execution_info=PackageTar=+no-remote,GoLink=+no-remote-cache,OCIImage=+no-remote-cache
build --modify_execution_info=PackageTar=+no-remote,ConvertStatusToJson=+no-remote,Jq=+no-remote,CopyFile=+no-remote,OCIImage=+no-remote-cache

# Common build options.
build --workspace_status_command=$(pwd)/workspace_status.sh
build --stamp

# Common options.
common --test_env=ENT_ACT_CODE
common --test_env=ENT_ACT_CODE # Tests need this to active Pachyderm enterprise.
common --test_env=SKIP_DOCKER_POSTGRES_CREATE
common --test_env=DOCKER_MACHINE_NAME
common --test_env=AMAZON_CLIENT_ID
common --test_env=AMAZON_CLIENT_SECRET
common --test_env=AMAZON_CLIENT_BUCKET
common --test_env=AMAZON_CLIENT_REGION
common --test_env=ECS_CLIENT_ID
common --test_env=ECS_CLIENT_SECRET
common --test_env=ECS_CLIENT_BUCKET
common --test_env=ECS_CLIENT_CUSTOM_ENDPOINT
common --test_env=GOOGLE_CLIENT_BUCKET
common --test_env=GOOGLE_CLIENT_CREDS
common --test_env=GOOGLE_CLIENT_HMAC_ID
common --test_env=GOOGLE_CLIENT_HMAC_SECRET
common --test_env=GOOGLE_CLIENT_BUCKET
common --test_env=GOOGLE_CLIENT_REGION
common --test_env=MICROSOFT_CLIENT_ID
common --test_env=MICROSOFT_CLIENT_SECRET
common --test_env=MICROSOFT_CLIENT_CONTAINER
common --test_env=GO_TEST_WRAP_TESTV=1 # This makes go tests run in -v mode.

# Cover all packages.
coverage --instrumentation_filter="^//" --combined_report=lcov

# --config=race for running tests with the race detector.
test:race --@rules_go//go/config:race
122 changes: 66 additions & 56 deletions .circleci/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -69,57 +69,6 @@ executors:
- image: cimg/python:3.10-node

jobs:
test-go:
machine:
image: << pipeline.parameters.machine_image >>
resource_class: large
parallelism: 5
environment:
TEST_RESULTS: /tmp/test-results
steps:
- checkout
- go/install:
version: << pipeline.parameters.go-version >>
- run: mkdir ${TEST_RESULTS}
- restore_cache:
keys:
- pach-go-unittest-mod-cache-v1-{{arch}}-{{ checksum "go.sum" }}
- run: go mod download
- run:
name: Collect tests to run
command: GOMAXPROCS=2 PACHYDERM_LOG_LEVEL="debug" go run ./src/testing/cmds/harness/test-collector/main.go -file=tests_to_run_unit.csv -procs=16
background: true
- run: go install gotest.tools/gotestsum@latest
- run: CGO_ENABLED=0 GOMAXPROCS=2 go install ./src/server/cmd/pachctl
- run: go install ./src/testing/match
- run: etc/testing/circle/install.sh
- run:
name: Run Go Unit Tests
no_output_timeout: 20m
command: |-
GOMAXPROCS=7 GOMEMLIMIT=8000MiB go run src/testing/cmds/harness/test-runner/main.go \
-file=tests_to_run_unit.csv \
-shard=$CIRCLE_NODE_INDEX \
-total-shards=$CIRCLE_NODE_TOTAL \
-threads=2 \
-gotestsum-args="--format=testdox" \
-gotest-args="-parallel=6 -timeout=20m -cover -test.gocoverdir=\"$TEST_RESULTS\" -coverprofile=${TEST_RESULTS}/coverage.txt -covermode=atomic -coverpkg=./..."
- run:
name: Format code coverage
when: always
command: etc/testing/circle/format-coverage.sh
- store_artifacts: # upload test summary for display in Artifacts
path: "/tmp/test-results/"
destination: raw-test-output
- store_test_results: # upload test results for display in Test Summary
path: "/tmp/test-results/circle"
- codecov/upload:
file: /tmp/test-results/coverage.txt
- collect-test-results
- save_cache:
key: pach-go-unittest-mod-cache-v1-{{arch}}-{{ checksum "go.sum" }}
paths:
- /home/circleci/go/pkg/mod
static-go-checks:
resource_class: xlarge
machine:
Expand Down Expand Up @@ -1601,20 +1550,81 @@ jobs:
- run: git checkout MODULE.bazel.lock # For OSX-only changes.
- run: git diff --exit-code
- run: bazel test --config=remotecache --remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} //:buildifier_test
- run: cp -a bazel-testlogs/ /tmp/bazel-testlogs # store_test_results can't handle bazel-testlogs being a symlink
- run:
command: cp -a bazel-testlogs/ /tmp/bazel-testlogs # store_test_results can't handle bazel-testlogs being a symlink
when: always
- store_test_results:
path: /tmp/bazel-testlogs/
bazel-tests:
resource_class: xlarge
resource_class: 2xlarge+
docker:
- image: << pipeline.parameters.ci_runner_image >>
steps:
- checkout
- setup_remote_docker
- run: bazel test --config=remotecache --remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} //oci:all
- run: cp -a bazel-testlogs/ /tmp/bazel-testlogs # store_test_results can't handle bazel-testlogs being a symlink
- run: |
DOCKER_MACHINE_NAME=172.17.0.1 \
bazel run --config=remotecache --remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} \
//src/testing/cmds/dockertestenv
- run: |
SKIP_DOCKER_POSTGRES_CREATE=1 \
DOCKER_MACHINE_NAME=172.17.0.1 \
bazel \
coverage \
--flaky_test_attempts=3 \
--config=remotecache \
--remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} \
...
- run:
command: cp -a bazel-testlogs/ /tmp/bazel-testlogs # store_test_results can't handle bazel-testlogs being a symlink
when: always
- store_test_results:
path: /tmp/bazel-testlogs/
- codecov/upload:
file: bazel-out/_coverage/_coverage_report.dat
bazel-privileged-tests:
resource_class: large
machine:
image: << pipeline.parameters.machine_image >>
steps:
- checkout
- run: cd /home/circleci/project
- run: mkdir /home/circleci/out
# Build the thing to start the database containers.
- run: |
docker run \
-v /home/circleci/out:/home/circleci/out \
-v /home/circleci/project:/home/circleci/project \
<< pipeline.parameters.ci_runner_image >> \
bazel \
--output_base=/home/circleci/out \
build \
--config=remotecache \
--remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} \
//src/testing/cmds/dockertestenv
# Start database containers.
- run: /home/circleci/out/execroot/_main/bazel-out/k8-fastbuild/bin/src/testing/cmds/dockertestenv/dockertestenv_/dockertestenv
# Run tests without needing to talk to the Docker daemon during the tests.
- run: |
docker run \
-e ENT_ACT_CODE \
-e SKIP_DOCKER_POSTGRES_CREATE=1 \
-e DOCKER_MACHINE_NAME=172.17.0.1 \
-v /home/circleci/out:/home/circleci/out \
-v /home/circleci/project:/home/circleci/project \
--privileged \
<< pipeline.parameters.ci_runner_image >> \
bazel \
--output_base=/home/circleci/out \
coverage \
--flaky_test_attempts=3 \
--config=remotecache \
--remote_header=x-buildbuddy-api-key=${BUILDBUDDY_API_KEY} \
//:privileged_tests
- store_test_results:
path: /home/circleci/out/execroot/_main/bazel-out/k8-fastbuild/testlogs/
- codecov/upload:
file: /home/circleci/out/execroot/_main/bazel-out/_coverage/_coverage_report.dat

workflows:
functional-tests:
Expand All @@ -1623,8 +1633,8 @@ workflows:
- build-docker-images
- build-docker-images:
enableCoverage: true
- test-go
- bazel-tests
- bazel-privileged-tests
- integration-tests
linting:
when: << pipeline.parameters.run-core-jobs >>
Expand Down
86 changes: 70 additions & 16 deletions BAZEL.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,15 @@
# Bazel

You may have noticed some `BUILD.bazel` files throughout the repository. We are in the process of
building everything here with [Bazel](https://bazel.build/), so that day 1 setup is simplified, and
so that test tooling can be shared between languages.
building everything here with [Bazel](https://bazel.build/), so that day 1 setup is simplified and
so that test tooling can be shared between languages. We also want to cache tests so that editing
README.md doesn't involve running all of our Go tests.

Right now, only proto generation requires Bazel. In the future, everything will be possible to work
on with normal Go tools, but it will also be possible to use Bazel. It is likely that most
Pachyderm-specific development tasks, like deploying a dev environment and syncing your code changes
to it, will require Bazel. We'll probably also use it for CI, because of the test result caching.
But, because people depend on `github.com/pachyderm/pachyderm/v2` with "go get", we can never fully
convert to only using Bazel as the build system. Rather, we will use both Bazel and go.mod for the
foreseeable future. Hence, we check in all of our generated code.

Additionally, the relase process will also use `goreleaser` for a short while.

TL;DR: Bazel is strictly an overlay unless you actively work on this codebase day-to-day.
Currently, Bazel is only for environment setup, `make proto`, and Go tests that don't require
Kubernetes. Because people depend on installing `github.com/pachyderm/pachyderm/v2` with "go get",
we can never fully convert to only using Bazel as the build system. Rather, we will use both Bazel
and `go.mod` for the foreseeable future. Hence, we check in all of our generated code, and you can
use `go test` for everything if you hate Bazel for some reason.

## Setup

Expand All @@ -35,6 +30,13 @@ Note: you will need a C++ compiler installed. `apt install build-essential` on D
xcode dance on Mac OS. The C++ compiler is used to build some internal Bazel tools like the shell
script wrapper.

Note: Every time you update Mac OS, you need to run
`/usr/sbin/softwareupdate --install-rosetta --agree-to-license`. Some of our dependencies don't
build `darwin_arm64` binaries, and we fall back to `darwin_amd64` binaries. Rosetta is what lets
M1/M2/M3 Macs run `amd64` binaries. You will see `Bad CPU type in executable` if Rosetta isn't
working. (You may need to run `bazel shutdown` after installing Rosetta, though this is only
suggested by Apple and doesn't appear to be required 100% of the time.)

Note: you will need a Python 3 interpreter installed as `python3`. `apt install python3` on Debian.
This is because `rules_python` uses the host `python3` to find the version of Python installed via
`rules_python` in `MODULE.bazel`.
Expand Down Expand Up @@ -82,6 +84,12 @@ needs the cert in `/etc/ssl/certs`.

## Use

## Write Go code

Writing Go code works like it always did. Your editor can understand the entire source tree without
any help. Whenever you add or delete files, or change dependencies, run `bazel run //:gazelle` to
update the associated BUILD files for Bazel.

## Regenerating protos

Right now, `make proto` calls out to `bazel run //:make_proto` to generate the Go protos. When you
Expand Down Expand Up @@ -111,6 +119,26 @@ this into a registry to pull with `skopeo copy oci:bazel-bin/oci/pachd_image ...

Very soon, pushing to a dev environment will be automated.

## Run Tests

`bazel test ...` will run all tests. Test caching means tests whose results couldn't have changed
since the last time they were run will not actually be run, so if you're just editing one file it's
not much slower to run this. You can also pick the test you want to run, like
`bazel test //src/internal/archiveserver:archiveserver_test`. `--test_output=streamed` will show you
the messages from the test as they run (overwhelming if `...`), otherwise, bazel will print the path
of the test logs for failing tests. They are also available in the BuildBuddy UI.

For more Go details, see
https://github.com/bazelbuild/rules_go/blob/master/docs/go/core/rules.md#go_test

In the past, many tests had dependencies on a freshly-built `pachctl` or `match` binary. This is not
the case if you run the tests with Bazel; Bazel automatically includes the version of those binaries
based on your working tree in the test's environment.

### Run one test case

`bazel test //the:test --test_filter=TestCaseIWantToRun`

## Tools

### Gazelle
Expand Down Expand Up @@ -148,9 +176,6 @@ To run pachctl, `bazel run //:pachctl`. You can also build the binary and copy i
like. After `bazel run //:pachctl` or `bazel build //:pachctl`, it will be in
`bazel-bin/src/server/cmd/pachctl/pachctl_/pachctl`.

This version of pachctl has the current version number, based on `workspace_status.sh`, baked into
it. Thus, `pachctl version` on the binary will lead you to the exact code it was built with!

## Hints

`bazel run //target -- ...` prevents `...` from being interpreted as an argument to Bazel, which is
Expand Down Expand Up @@ -217,3 +242,32 @@ whether or not it affected anything.

Pulumi eventually depends on github.com/cloudflare/circl, which will require this workaround:
https://github.com/bazelbuild/bazel-gazelle/issues/1421#issuecomment-1424075874

### Profiling

Sometimes you want a whole-system profile of some tests running or something. Build the Go binaries
in debug mode with `-c dbg` to get the symbols, and run the tests outside of the Bazel sandbox with
`--spawn_strategy=local` so that `hotspot` can later find the binaries to inspect.

For example, to see what is making a database-based test slow, pick one at random and run 64 copies
at the same time:

perf record -F 99 -e cpu-clock -ag -- bazel test //src/server/auth/server/testing:auth_test -c dbg \
--test_filter=TestListRepoNotLoggedInError --runs_per_test=64 --jobs=64 --local_test_jobs=64

Then look at `perf.out` with `hotspot`:

hotspot perf.out

This did not help with debugging database speed, but I wanted to document it anyway.

### ssh-ing to CI

Sometimes your tests will fail in CI, but not on your workstation. Traditionally, you would click
"Rerun with SSH" in Circle. You can still do that, but you can also `bazel run //etc/ci-image:run`
to be placed in a shell that is configured identically to CI; CI uses the same Docker image. Your
current working copy of Pachyderm will be mounted in `/home/circleci/project`, just like CI. You can
edit files on your workstation and they are immediately there; it's just a read-only bind mount. You
can run `git` or `bazel` inside this environment as well. While converting to Bazel, I've had a lot
of weird failures on CI, and this has always solved them without wasting time waiting for CI to
restart!
6 changes: 6 additions & 0 deletions BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -122,3 +122,9 @@ alias(
name = "pachctl",
actual = "//src/server/cmd/pachctl",
)

test_suite(
name = "privileged_tests",
tags = ["manual"],
tests = ["//src/server/pfs/fuse:fuse_test"],
)
1 change: 0 additions & 1 deletion MODULE.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -186,7 +186,6 @@ use_repo(
"com_github_fsouza_go_dockerclient",
"com_github_go_logr_zapr",
"com_github_go_sql_driver_mysql",
"com_github_gogo_protobuf",
"com_github_golang_protobuf",
"com_github_google_btree",
"com_github_google_go_cmp",
Expand Down

0 comments on commit 7407504

Please sign in to comment.