Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

verify job is taking >15 min #1849

Closed
alculquicondor opened this issue Mar 14, 2024 · 21 comments
Closed

verify job is taking >15 min #1849

alculquicondor opened this issue Mar 14, 2024 · 21 comments
Assignees
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@alculquicondor
Copy link
Contributor

What would you like to be cleaned:

Verify is taking too long

https://testgrid.k8s.io/sig-scheduling#pull-kueue-verify-main&width=20

Would increasing the number of CPUs help? Otherwise, what is being particularly slow?

Why is this needed:

This adds more time to the release process.

@alculquicondor alculquicondor added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 14, 2024
@alculquicondor
Copy link
Contributor Author

@Vandit1604 is this something you would be interested in invesitgating?

@tenzen-y
Copy link
Member

@alculquicondor
Copy link
Contributor Author

@Vandit1604 please write /assign in a comment

@Vandit1604
Copy link
Contributor

/assign
I'll take a look at this

@Vandit1604
Copy link
Contributor

Vandit1604 commented Mar 17, 2024

The delay is because of shell-lint rule. I updated it taking https://github.com/kubernetes/kubernetes/blob/0a31504ee114e363df9d69158c2e639858b8273b/hack/verify-shellcheck.sh as reference.

Here is the diff with the changes

diff --git a/hack/verify-shellcheck.sh b/hack/verify-shellcheck.sh
index c566f772..b4850dbf 100755
--- a/hack/verify-shellcheck.sh
+++ b/hack/verify-shellcheck.sh
@@ -17,11 +17,44 @@
 # allow overriding docker cli, which should work fine for this script
 DOCKER="${DOCKER:-docker}"
 
+
 SHELLCHECK_VERSION="0.9.0"
 SHELLCHECK_IMAGE="docker.io/koalaman/shellcheck-alpine:v0.9.0@sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7"
 
+# common arguments we'll pass to shellcheck
+SHELLCHECK_OPTIONS=(
+  # allow following sourced files that are not specified in the command,
+  # we need this because we specify one file at a time in order to trivially
+  # detect which files are failing
+  "--external-sources"
+  # include our disabled lints
+  "--exclude=${SHELLCHECK_DISABLED}"
+  # set colorized output
+  "--color=${SHELLCHECK_COLORIZED_OUTPUT}"
+)
+
 # Currently disabled these errors will take care of them later
-DISABLED_ERRORS="SC2002,SC3028,SC3054,SC3014,SC3040,SC3046,SC3030,SC3010,SC3037,SC3045,SC3006,SC3018,SC3016,SC3011,SC3044,SC3043,SC3060,SC3024,SC1091,SC2066,SC2086,SC2034,SC1083,SC1009,SC1073,SC1072,SC2155,SC2046"
+SHELLCHECK_DISABLED="SC2002,SC3028,SC3054,SC3014,SC3040,SC3046,SC3030,SC3010,SC3037,SC3045,SC3006,SC3018,SC3016,SC3011,SC3044,SC3043,SC3060,SC3024,SC1091,SC2066,SC2086,SC2034,SC1083,SC1009,SC1073,SC1072,SC2155,SC2046"
+
+scripts_to_check=("$@")
+if [[ "$#" == 0 ]]; then
+  # Find all shell scripts excluding:
+  # - Anything git-ignored - No need to lint untracked files.
+  # - ./_* - No need to lint output directories.
+  # - ./.git/* - Ignore anything in the git object store.
+  # - ./vendor* - Vendored code should be fixed upstream instead.
+  # - ./third_party/*, but re-include ./third_party/forked/*  - only code we
+  #    forked should be linted and fixed.
+  while IFS=$'\n' read -r script;
+    do git check-ignore -q "$script" || scripts_to_check+=("$script");
+  done < <(find . -name "*.sh" \
+    -not \( \
+      -path ./_\*      -o \
+      -path ./.git\*   -o \
+      -path ./vendor\* -o \
+      \( -path ./third_party\* -a -not -path ./third_party/forked\* \) \
+    \))
+fi
 
 # Download shellcheck-alpine from Docker Hub
 echo "Downloading ShellCheck Docker image..."
@@ -29,6 +62,9 @@ echo "Downloading ShellCheck Docker image..."
 
 # Run ShellCheck on all shell script files, excluding those in the 'vendor' directory
 echo "Running ShellCheck..."
-"${DOCKER}" run --rm -v "$(pwd):$(pwd)" -w "$(pwd)" "${SHELLCHECK_IMAGE}" \
-  find .. -type f -name '*.sh' -exec shellcheck --exclude="$DISABLED_ERRORS" {} +
+"${DOCKER}" run \
+  --rm -v "$(pwd)" -w "$(pwd)" \
+    "${SHELLCHECK_IMAGE}" \
+  shellcheck "${SHELLCHECK_OPTIONS[@]}" "${scripts_to_check[@]}" >&2 || res=$?
+
 echo "Shellcheck ran successfully"
diff
root@vandit-box:/home/vandit/Github-Forks/kueue# time make verify
go mod tidy
git --no-pager diff --exit-code go.mod go.sum
/home/vandit/Github-Forks/kueue/bin/golangci-lint run --timeout 15m0s
./hack/verify-shellcheck.sh
Downloading ShellCheck Docker image...
docker.io/koalaman/shellcheck-alpine@sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7: Pulling from koalaman/shellcheck-alpine
Digest: sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7
Status: Image is up to date for koalaman/shellcheck-alpine@sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7
docker.io/koalaman/shellcheck-alpine:v0.9.0@sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview docker.io/koalaman/shellcheck-alpine:v0.9.0@sha256:e19ed93c22423970d56568e171b4512c9244fc75dd9114045016b4a0073ac4b7
Running ShellCheck...
Unknown value for --color. Valid options are: auto, always, never
Shellcheck ran successfully
./hack/verify-toc.sh
Checking table of contents are up to date...
Cleaning up...
/home/vandit/Github-Forks/kueue/bin/controller-gen \
	crd:generateEmbeddedObjectMeta=true output:crd:artifacts:config=config/components/crd/bases\
	paths="./apis/..."
/home/vandit/Github-Forks/kueue/bin/controller-gen \
	rbac:roleName=manager-role output:rbac:artifacts:config=config/components/rbac\
	webhook output:webhook:artifacts:config=config/components/webhook\
	paths="./pkg/controller/...;./pkg/webhooks/...;./pkg/util/cert/...;./pkg/visibility/..."
go mod download
/home/vandit/Github-Forks/kueue/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./apis/..."
./hack/update-codegen.sh go
Generating defaulter code for 2 targets
Generating conversion code for 2 targets
Generating openapi code for 3 targets
Generating applyconfig code for 3 targets
Generating client code for 3 targets
Generating lister code for 3 targets
Generating informer code for 3 targets
SED=/usr/bin/sed ./hack/update-helm.sh
cd  /home/vandit/Github-Forks/kueue/site/genref/ && /home/vandit/Github-Forks/kueue/bin/genref  -o /home/vandit/Github-Forks/kueue/site/content/en/docs/reference
I0317 20:44:00.067350   25439 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/kueue/v1beta1
I0317 20:44:04.056445   25439 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue.v1beta1.md
I0317 20:44:04.056472   25439 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/config/v1beta1
I0317 20:44:32.410099   25439 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue-config.v1beta1.md
I0317 20:44:32.410127   25439 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/kueue/v1alpha1
I0317 20:44:35.994253   25439 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue-alpha.v1alpha1.md
sed -r 's/v[0-9]+\.[0-9]+\.[0-9]+/v0.6.1/g' -i README.md -i site/config.toml
/home/vandit/Github-Forks/kueue/bin/yq e '.appVersion = "v0.6.1"' -i charts/kueue/Chart.yaml
git --no-pager diff --exit-code config/components apis charts/kueue/templates client-go site/

**real	2m34.515s
user	3m42.774s
sys	0m55.102s**

the time came down significantly

after the change, the output time of make verify was ⬇️

root@vandit-box:/home/vandit/Github-Forks/kueue# time make verify
go mod tidy
git --no-pager diff --exit-code go.mod go.sum
/home/vandit/Github-Forks/kueue/bin/golangci-lint run --timeout 15m0s
./hack/verify-shellcheck.sh
Downloading ShellCheck Docker image...
Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp [2600:1f18:2148:bc01:571f:e759:a87a:2961]:443: connect: network is unreachable
Running ShellCheck...
Shellcheck ran successfully
./hack/verify-toc.sh
Checking table of contents are up to date...
Cleaning up...
/home/vandit/Github-Forks/kueue/bin/controller-gen \
	crd:generateEmbeddedObjectMeta=true output:crd:artifacts:config=config/components/crd/bases\
	paths="./apis/..."
/home/vandit/Github-Forks/kueue/bin/controller-gen \
	rbac:roleName=manager-role output:rbac:artifacts:config=config/components/rbac\
	webhook output:webhook:artifacts:config=config/components/webhook\
	paths="./pkg/controller/...;./pkg/webhooks/...;./pkg/util/cert/...;./pkg/visibility/..."
go mod download
/home/vandit/Github-Forks/kueue/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./apis/..."
./hack/update-codegen.sh go
Generating defaulter code for 2 targets
Generating conversion code for 2 targets
Generating openapi code for 3 targets
Generating applyconfig code for 3 targets
Generating client code for 3 targets
Generating lister code for 3 targets
Generating informer code for 3 targets
SED=/usr/bin/sed ./hack/update-helm.sh
cd  /home/vandit/Github-Forks/kueue/site/genref/ && /home/vandit/Github-Forks/kueue/bin/genref  -o /home/vandit/Github-Forks/kueue/site/content/en/docs/reference
I0317 21:59:20.484140   67613 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/kueue/v1beta1
I0317 21:59:23.050714   67613 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue.v1beta1.md
I0317 21:59:23.050729   67613 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/config/v1beta1
I0317 21:59:44.958189   67613 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue-config.v1beta1.md
I0317 21:59:44.958211   67613 main.go:142] Parsing go packages in sigs.k8s.io/kueue/apis/kueue/v1alpha1
I0317 21:59:47.173487   67613 main.go:325] Output written to /home/vandit/Github-Forks/kueue/site/content/en/docs/reference/kueue-alpha.v1alpha1.md
sed -r 's/v[0-9]+\.[0-9]+\.[0-9]+/v0.6.1/g' -i README.md -i site/config.toml
/home/vandit/Github-Forks/kueue/bin/yq e '.appVersion = "v0.6.1"' -i charts/kueue/Chart.yaml
git --no-pager diff --exit-code config/components apis charts/kueue/templates client-go site/

**real	1m33.119s
user	1m54.290s
sys	0m28.404s**

@alculquicondor
Copy link
Contributor Author

Hi, any progress on this?

Given how slow it is, I would even be in favor of dropping some linters.

@Vandit1604
Copy link
Contributor

Vandit1604 commented Mar 26, 2024

I have been investigating the rules in make verify but no luck to find which rule is taking how much time. But we have the build time of verify job around ~15 mins even before February.
The CPU usage is 0% in the link provided in the comments https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&from=now-7d&to=now&var-org=kubernetes-sigs&var-repo=kueue&var-job=pull-kueue-test-integration-main&var-build=All&refresh=30s. According to that, increasing the number of cores might not be beneficial or help decreasing the time.

Which linters should we remove?

@alculquicondor
Copy link
Contributor Author

Could you try disabling some, running and seeing which ones make the most impact?

@Vandit1604
Copy link
Contributor

.PHONY: verify
verify: gomod-verify ci-lint fmt-verify shell-lint toc-verify manifests generate update-helm generate-apiref prepare-release-branch
	git --no-pager diff --exit-code config/components apis charts/kueue/templates client-go site/

gomod-verify : it go mod tidy's the project
ci-lint : it install and use the golangci-linter on the project
fmt-verify : formats golang code
shell-lint : checks any errors or warnings in shell-lint
toc-verify : It checks and generates if the TOC is latest. (I don't know where this TOC exists maybe https://kueue.sigs.k8s.io/docs/)
manifests: Need to dive deeper in this one.
generate: generate: Generate code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations and client-go libraries.
update-helm: updates helm,Automatically sync RBAC files in helm chart.
generate-apiref: Generate and publish API reference documentation for Kueue.
prepare-release-branch: We use this to update version in README.md, site, kustomize and chart. I think this can be removed from verify.

@Vandit1604
Copy link
Contributor

toc-verify -> It checks and generates if the TOC is latest. (I don't know where this TOC exists, maybe https://kueue.sigs.k8s.io/docs/)

generate-apiref-> Generate and publish API reference documentation for Kueue.

I think these two can be removed from verify since we only need to check these before every release. I'm saying this on the basis of assumption the TOC and API reference gets updated after every release of Kueue.

@alculquicondor
Copy link
Contributor Author

How long do you estimate each of them to take?

Running them for each PR has the intent of preventing changes from merging if the authors forgot to run those commands before committing.

TOC is for /keps

@alculquicondor
Copy link
Contributor Author

OTOH, it looks like some builds are using 2 CPUs (even a little bit more)

image

maybe it does make sense to increase the limits and GOMAXPROCS?

@Vandit1604
Copy link
Contributor

That's what I was thinking we should increase the limits and GOMAXPROCS and see if it makes any difference.

@alculquicondor
Copy link
Contributor Author

I'm ok with a brute force approach: send the PR, let's see if the time changes. If it doesn't, we revert and look into other options.

@Vandit1604
Copy link
Contributor

Vandit1604 commented Mar 29, 2024

https://github.com/kubernetes/test-infra/blob/master/docs/eks-jobs-migration.md#known-issues

Should we set GOMAXPROCS to automaxprocs to see if it makes any difference.

@kannon92
Copy link
Contributor

kannon92 commented Apr 4, 2024

Nice detective work @Vandit1604!

I looked at the job now and it looks to be around 8-9 minutes.

@alculquicondor
Copy link
Contributor Author

Since we have achieved the goal of being under 10 min, I'm happy to close this.

Thanks @Vandit1604!

/close

@k8s-ci-robot
Copy link
Contributor

@alculquicondor: Closing this issue.

In response to this:

Since we have achieved the goal of being under 10 min, I'm happy to close this.

Thanks @Vandit1604!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Vandit1604
Copy link
Contributor

Your Welcome @alculquicondor
Since, I have been contributing to Kueue for some time. Now, I would like to dive in a Kueue-specific Golang issue.
All I see in issue tabs are marked as feature-requests. Is there anything In which I can contribute.
If you could only point me to something.

@alculquicondor
Copy link
Contributor Author

If you have any experience with helm, perhaps you can take this #1798

@Vandit1604
Copy link
Contributor

Thanks, I do have worked with helm charts previously.
I'll take a look into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
Projects
None yet
Development

No branches or pull requests

5 participants