Tests fail on ARM runner #450

Open
danielhollas opened this issue May 8, 2024 · 8 comments
danielhollas commented May 8, 2024

The test_create_conda_environment test seems to be failing quite often on the ARM build. Not sure what is happening.

@unkcpz can you investigate if you can reproduce this locally? I am seeing this both on main and in #439.

See e.g. https://github.com/aiidalab/aiidalab-docker-stack/actions/runs/9006100849/job/24744034987


aiidalab_exec = <function aiidalab_exec.<locals>.execute at 0x10696aa20>
nb_user = 'jovyan'

    def test_create_conda_environment(aiidalab_exec, nb_user):
>       output = aiidalab_exec("conda create -y -n tmp", user=nb_user).strip()

tests/test_base.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/conftest.py:73: in execute
    out = docker_compose.execute(command, **kwargs)
../../../../.venv/aiidalab-runner/lib/python3.11/site-packages/pytest_docker/plugin.py:140: in execute
    return execute(command)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

command = 'docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp'
success_codes = (0,)

    def execute(command: str, success_codes: Iterable[int] = (0,)) -> Union[bytes, Any]:
        """Run a shell command."""
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            status = 0
        except subprocess.CalledProcessError as error:
            output = error.output or b""
            status = error.returncode
            command = error.cmd
    
        if status not in success_codes:
>           raise Exception(
                'Command {} returned {}: """{}""".'.format(command, status, output.decode("utf-8"))
            )
E           Exception: Command docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp returned 137: """Collecting package metadata (current_repodata.json): ...working... done
E           Solving environment: ...working... done
E           """.

../../../../.venv/aiidalab-runner/lib/python3.11/site-packages/pytest_docker/plugin.py:35: Exception
=========================== short test summary info ============================
FAILED tests/test_base.py::test_create_conda_environment - Exception: Command docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp returned 137: """Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
""".
@danielhollas

Maybe a memory issue? https://www.google.com/search?client=firefox-b-lm&q=command+return+137

This is currently blocking the release. Maybe some memory needs to be freed on the Mac runner?
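For context on the exit code: a shell reports a process killed by signal N as exit status 128 + N, so 137 corresponds to SIGKILL (9), which is exactly what the Linux kernel's OOM killer sends when the machine runs out of memory. A minimal Python sketch of the convention (note that the subprocess module reports a signal death as a negative return code, while shells add 128):

```python
import signal
import subprocess

# Shells encode "killed by signal N" as exit status 128 + N,
# so SIGKILL (9) -- the signal the OOM killer sends -- appears as 137.
print(128 + signal.SIGKILL)  # 137

# subprocess reports a child killed by a signal as a negative return code.
proc = subprocess.run(["sh", "-c", "kill -KILL $$"])
print(proc.returncode)  # -9 (killed by SIGKILL)
```

So the `conda create` process inside the container was most likely killed by the kernel, not failing on its own.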


unkcpz commented May 8, 2024

This is currently blocking the release. Maybe some memory needs to be freed on the Mac runner?

I guess so, but I don't know how to do that. I'll check on my laptop.

@danielhollas

Just a note that this is still blocking the release; the ARM tests are now failing consistently.

@danielhollas danielhollas changed the title Flaky test Flaky tests on ARM May 9, 2024

unkcpz commented May 9, 2024

It is more of a qeapp test failure on arm64 than a pure architecture issue. So I'll bring up again that running integration tests for qeapp here may not be a good idea.
I feel the same as you; I am not comfortable with a release of the AiiDAlab docker stacks failing because of qeapp. But in the end the problem usually comes from the QeApp side rather than from here, so the fixes and changes usually have to be made downstream; there is not much that can be done from this repo.
We have encountered this problem twice:

  1. When we wanted to move to aiida-core==2.5 with pydantic v2. The changes were eventually made in qeapp, which switched to ipyoptimade to support pydantic v2.
  2. The problem happening now. I think it is a dependency issue in qeapp (apparently the compilation of pymatgen) that makes the arm64 installation fail the full-stack test here.

Logically, the full stack is upstream of qeapp, so it makes little sense for failing qeapp tests to block changes to the docker stack.

@danielhollas

It is more of a qeapp test failure on arm64 than a pure architecture issue.

This is not true, though; the qeapp integration tests are not the only ones failing now, see e.g. https://github.com/aiidalab/aiidalab-docker-stack/actions/runs/9005715615/job/24742373269

I don't think the tests are at fault here; it's an issue with the ARM64 runner. @mbercx could you try restarting the machine?
(Or ideally, investigate what is happening there. Is there enough free RAM?)

Logically, the full stack is upstream of qeapp, so it makes little sense for failing qeapp tests to block changes to the docker stack.

Yes, those tests should not block a release, which is why I moved them into a separate CI job in #439 (which was the original design). If that job fails, it will not block the others.


unkcpz commented May 9, 2024

Yes, those tests should not block a release, which is why I moved them into a separate CI job in #439 (which was the original design). If that job fails, it will not block the others.

It is nice that the CI job is decoupled. But the publish job still depends on test-arm64, see here:

publish-ghcr:
  needs: [build, test-amd64, test-arm64]

If I understand correctly, this means that when we make a new release, the image will not be pushed to the registries, since the publish job will be blocked by the failed test. Am I missing something?
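For reference, one way to stop a failing ARM job from gating publishing would be to drop it from the `needs` list of the publish job. This is only a sketch based on the snippet quoted above; the actual workflow file may differ:

```yaml
publish-ghcr:
  # Gate publishing only on the build and the amd64 tests; the arm64 job
  # still runs and reports its status, but no longer blocks the release.
  needs: [build, test-amd64]
```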

@danielhollas

The test-arm64 job does not include the integration tests. Those run separately in the test-integration job, selected with the -m integration pytest marker.
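For anyone unfamiliar with the mechanism: pytest markers tag tests so that CI jobs can select or exclude them on the command line. A hedged sketch (the test name below is illustrative, not from the actual test suite):

```python
import pytest

# Tag slow end-to-end tests with a custom marker. The marker name is
# typically registered under "markers" in the pytest configuration to
# avoid unknown-marker warnings.
@pytest.mark.integration
def test_qe_app_workflow():  # hypothetical example test
    ...

# Selection happens on the command line:
#   pytest -m integration        # only the tagged tests (test-integration job)
#   pytest -m "not integration"  # everything else (test-amd64 / test-arm64 jobs)
```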

@danielhollas danielhollas changed the title Flaky tests on ARM Tests fail on ARM runner May 9, 2024

mbercx commented May 10, 2024

I don't think the tests are at fault here; it's an issue with the ARM64 runner. @mbercx could you try restarting the machine?
(Or ideally, investigate what is happening there. Is there enough free RAM?)

I'm currently on holiday until the 21st, so I won't be able to look into this anytime soon. I doubt there is a memory issue on my workstation, though; I'm not running anything there and it has 128 GB of RAM. @unkcpz should also have access to the ARM runner.
