
[ci] clean source directory at the beginning of every Azure DevOps build #6416

Merged
jameslamb merged 26 commits into master from ci/more-fixes on Apr 19, 2024

Conversation

@jameslamb (Collaborator) commented Apr 15, 2024

Related to #6316

#6407 appeared to fix the CI issues with self-hosted agents that are used for all of the Linux * CI jobs running on Azure DevOps.

But I've been seeing significant failures on those runners over the last few days. Not just the "docker not found" issue detailed at the top of #6316, but some others as well.

From investigation in this PR, it seems to me that there are 2 issues (described in the comments below).

This PR fixes those things in the following ways:

  • switching tasks that run shell scripts to the Bash@3 task from Azure DevOps (docs)
  • adding a task that runs git clean in the source directory of every run

It also adds a set -e -E -o pipefail at the top of some of the CI scripts, to make debugging of such issues a bit easier in the future. More complete make-the-scripts-stricter work is in progress over in #6266.
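For illustration, here is a rough sketch (not the exact diff in this PR) of the shell commands those two changes boil down to:

set -e -E -o pipefail   # stricter error handling: exit on any failing command (-e),
                        # let ERR traps fire in functions and subshells too (-E),
                        # and make a pipeline fail if any command in it fails (pipefail)

git clean -d -f -x      # remove untracked directories (-d), force the deletion (-f),
                        # and also delete ignored files such as leftover build output (-x)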

Notes for Reviewers

What exactly caused this?

I don't know.

I don't see anything that seems relevant in the release notes from the April 10, 2024 Azure DevOps release (the most recent one as of this writing): https://learn.microsoft.com/en-us/azure/devops/release-notes/2024/sprint-237-update.

I saw several fully-successful runs after we switched to the new pool of runners, e.g. on #6407.

I found the root cause of "there are build artifacts being left behind" by adding a ls -alF ./build at the top of .ci/test.sh here. You can see that files are being left behind by looking at one of the "Clean source directory" tasks in Azure DevOps for Linux jobs.

(Screenshot 2024-04-16 at 12:26:53 AM)

(build link)

LightGBM's CI jobs write files into the same directory the source code is checked out to, and that directory's contents are preserved on the agent across runs via a Docker volume mount. I'm only observing that on Linux jobs that run on self-hosted runners, so I think the root cause must be specific to those runners.
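To illustrate the persistence mechanism, here is a minimal, hypothetical demonstration that a bind-mounted directory outlives the container that wrote to it (/tmp/agent-work is an illustrative path, not the real agent workspace):

mkdir -p /tmp/agent-work
docker run --rm -v /tmp/agent-work:/__w/1 ubuntu:22.04 \
    bash -c 'mkdir -p /__w/1/s/build && touch /__w/1/s/build/CMakeCache.txt'
# the container is gone, but the file written under the mount is still on the host
ls /tmp/agent-work/s/build/   # -> CMakeCache.txt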

The timing makes me think this is somehow related to the introduction of the new set of runners in #6407. For example, maybe the previous pool of runners had some cleanup logic that ran between jobs that isn't there on the new runners.

@jameslamb (Collaborator Author) commented Apr 15, 2024

I think maybe there's been a breaking change in Azure DevOps YAML syntax? It seems like nothing in .ci/setup.sh is actually being run.

I see this in the logs:

Generating script.
Script contents:
/__w/1/s/.ci/setup.sh

And I don't see any of the log statements I tried to write with echo in that script.

(build link)

@shiyu1994 (Collaborator)
I can see that all agents are properly running the jobs scheduled to them. At least the reasonable job run times do not indicate any issue with the agents.

(Screenshot 2024-04-16 00 24 06)

@jameslamb (Collaborator Author)
Thanks @shiyu1994! I think maybe this is not related to the new agents at all, but rather to some other recent change in Azure DevOps (the service itself, not our configuration for this specific project). I'll continue investigating.

@jameslamb (Collaborator Author)
😱 it looks like a CMake build directory is being left behind between builds!

Added this at the top of .ci/test.sh:

ls -alF ./build

And was surprised to see output like this in logs.

drwxr-xr-x  6 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 16 03:14 ./
drwxr-xr-x 24 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 16 03:20 ../
drwxr-xr-x  2 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 11 07:05 bin/
-rw-r--r--  1 AzDevOps_azpcontainer AzDevOps_azpcontainer 22397 Apr 16 03:14 CMakeCache.txt
drwxr-xr-x 10 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 11 07:53 CMakeFiles/
-rw-r--r--  1 AzDevOps_azpcontainer AzDevOps_azpcontainer  4697 Apr 11 07:05 cmake_install.cmake
drwxr-xr-x  5 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 11 07:05 _deps/
drwxr-xr-x  2 AzDevOps_azpcontainer AzDevOps_azpcontainer  4096 Apr 11 07:05 lib/
-rw-r--r--  1 AzDevOps_azpcontainer AzDevOps_azpcontainer 63678 Apr 11 07:05 Makefile

(build link)

I think that CMakeCache.txt being left behind is at least part of the root cause of the CI failures we've been seeing! Here's the top of it:

# This is the CMakeCache file.
# For build in directory: /__w/1/s/build
# It was generated by CMake: /usr/bin/cmake

Looking at the "Initialize Containers" stage, I see Azure is doing this:

/usr/bin/docker create \
    --name ubuntu-latest_ubuntu2204_07fc25 \
    --label b0e97f \
    --network vsts_network_277dfc2d2e79409fab81db297f5d5ed5 \
    --name ci-container \
    -v /usr/bin/docker:/tmp/docker:ro \
    -v "/var/run/docker.sock":"/var/run/docker.sock" \
    -v "/agent/_work/1":"/__w/1" \
    -v "/agent/_work/_temp":"/__w/_temp" \
    -v "/agent/_work/_tasks":"/__w/_tasks" \
    -v "/agent/_work/_tool":"/__t" \
    -v "/agent/externals":"/__a/externals":ro \
    -v "/agent/_work/.taskkey":"/__w/.taskkey" \
    ubuntu:22.04 \
    "/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);"

(build link)

So I think this volume mount is the problem:

-v "/agent/_work/1":"/__w/1" \

Azure DevOps is cloning the repo to /__w/1/ in the container's filesystem, which is /agent/_work/1 on the agent's filesystem. That leaves open an opportunity for the CMake cache file(s) to be left behind and survive across jobs. That explains how, for example, a cpp-tests job which doesn't define -DUSE_MPI=ON is getting errors about MPI not being found.

-- Could NOT find MPI_C (missing: MPI_C_LIB_NAMES MPI_C_HEADER_DIR MPI_C_WORKS) 
-- Could NOT find MPI_CXX (missing: MPI_CXX_LIB_NAMES MPI_CXX_HEADER_DIR MPI_CXX_WORKS) 
CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND)
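Roughly, the failure mode is a cache variable surviving across jobs; a hypothetical sequence (paths and commands are illustrative, not the exact CI steps):

# job A (an MPI build) configures and caches USE_MPI=ON in build/CMakeCache.txt
cmake -S . -B build -DUSE_MPI=ON

# the volume mount keeps build/ around after job A's container exits

# job B (cpp-tests) configures without -DUSE_MPI=ON, but CMake reuses the cached
# value, so it still tries to find MPI and fails in a container with no MPI installed
cmake -S . -B build

# deleting the stale directory (which is what `git clean -d -f -x` effectively does
# for untracked build output) lets job B configure from scratch
rm -rf build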

Here's an example of what that looked like in a previous, successful run from #6407 on April 9. Looks like all the same volume mounts, so I don't think the root cause is something like "Azure changed what is mounted by default on these runs".

/usr/bin/docker create \
    --name linux-artifact-builder_lightgbmvstsagentmanylinux_2_28_x86_64_d292d1 \
    --label b0e97f \
    --network vsts_network_fe607b351cfb4d48ab85a0f9ba8dd62e  \
    -v "/var/run/docker.sock":"/var/run/docker.sock" \
    -v "/agent/_work/1":"/__w/1" \
    -v "/agent/_work/_temp":"/__w/_temp" \
    -v "/agent/_work/_tasks":"/__w/_tasks" \
    -v "/agent/_work/_tool":"/__t" \
    -v "/agent/externals":"/__a/externals":ro \
    -v "/agent/_work/.taskkey":"/__w/.taskkey" \
    lightgbm/vsts-agent:manylinux_2_28_x86_64 \
    "/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);"

@jameslamb jameslamb changed the title investigate Linux CI failures [ci] clean source directory at the beginning of every Azure DevOps build Apr 16, 2024
@jameslamb jameslamb marked this pull request as ready for review April 16, 2024 05:47
@jameslamb (Collaborator Author)
Alright @shiyu1994 I think I found the issue and have an approach that'll fix it 😁

@@ -65,13 +86,22 @@ jobs:
echo "##vso[task.prependpath]/usr/lib64/openmpi/bin"
echo "##vso[task.prependpath]$CONDA/bin"
displayName: 'Set variables'
- script: |
git clean -d -f -x
@jameslamb (Collaborator Author) commented on this diff:
@shiyu1994 we are only running one agent per machine in the linux pool for Azure DevOps, right?

I saw that it's possible to run multiple (Azure Pipelines Agents docs).

If we were running multiple agents on the same virtual machine, then destructive actions like this git clean might cause conflicts between different jobs.

@jameslamb (Collaborator Author)
@shiyu1994 could you please review this so we can unblock CI?

@jameslamb (Collaborator Author)
@jmoralez or @borchero could you review this?

I would still like to get @shiyu1994 's opinion (especially on #6416 (comment)), but we don't have to wait on that to merge this.

I really want to get CI working again here so we can merge already-approved stuff and so we don't lose momentum with other contributors. @mayer79 in particular has been stuck for weeks on #6397 and related work on the R package because of the long periods of CI unavailability here.

I think these changes would be easy to revert and the risk of being wrong about them is very low.

@jmoralez (Collaborator) left a comment:
Approving to unblock, but I don't have any experience with Azure DevOps. Hopefully @shiyu1994 can review as well.

@jameslamb (Collaborator Author)
Thanks @jmoralez !

I'm very confident in this change, and even more confident that it'd be easy to revert if @shiyu1994 comes back to this conversation and sees some issue with the changes.

I'm going to merge this and start getting the other already-approved PRs merged.

@jameslamb jameslamb merged commit 0901aa6 into master Apr 19, 2024
42 checks passed
@jameslamb jameslamb deleted the ci/more-fixes branch April 19, 2024 18:06