Trigger downstream liboqs-python CI is failing #1789

Closed
dstebila opened this issue May 10, 2024 · 25 comments

@dstebila
Member

Describe the bug
In recent liboqs CI builds on CircleCI, the "Trigger liboqs-python CI" step is failing.

To Reproduce
See https://app.circleci.com/pipelines/github/open-quantum-safe/liboqs/3710/workflows/34731b55-1e34-4510-bb20-bfdd484fa5d6/jobs/29103

@dstebila dstebila changed the title Trigger downstreadm liboqs-python CI is failing Trigger downstream liboqs-python CI is failing May 10, 2024
@dstebila
Member Author

I'm guessing it's somehow related to the changes involving oqs-bot and things not being configured correctly in https://github.com/open-quantum-safe/liboqs/blob/main/.circleci/config.yml#L264.

@ryjones Do you have any ideas about this?

Possibly it would be easier if we switched this job (to trigger downstream CI) over to GitHub Actions...?

@dstebila dstebila added the bug Something isn't working; high priority to fix label May 10, 2024
@ryjones ryjones self-assigned this May 10, 2024
@ryjones
Contributor

ryjones commented May 10, 2024

The issue is enterprises don't allow PATs to work like they used to. You have to create a GitHub app with a webhook. I'm looking into how to get this done.

@dstebila
Member Author

Thanks Ry!

@ryjones
Contributor

ryjones commented May 10, 2024

Also, if circle-ci doesn't offer anything over GitHub actions, it would make life easier if you moved it over.

@baentsch
Member

> Also, if circle-ci doesn't offer anything over GitHub actions, it would make life easier if you moved it over.

Agreed: We have a long-standing issue on this that no-one found time to work on (particularly, getting us ARM runners that were the sole reason why we didn't move off CCI): open-quantum-safe/oqs-provider#248 (oqs-provider typically leads liboqs in infrastructure updates which is why I create such issues first in that sub project as a "proving ground"). If you'd have time to work on this, we'd surely be happy. In that case, please also take a look at #1780 and all dependents.

@ryjones
Contributor

ryjones commented May 11, 2024

OQS doesn't (yet) have access to ARM64. I don't have authorization to spend money on large runners, so I will need to huddle with Naomi and Hart to figure out what is authorized for this.

@baentsch
Member

> I don't have authorization to spend money

Thanks for the clear statement of limitation, @ryjones.

@dstebila Please help to have the PQCA-powers-that-be authorize this before the 0.11.0 release (created open-quantum-safe/tsc#25 to track): the missing authorization stops the project from streamlining to GH Actions (as recommended by an LF employee and desired by OQS for a long time, to be more efficient) and otherwise requires unnecessary work:

In order to deliver the 0.11.0 milestone, #1780 will need to support ARM64 CI as per PLATFORMS.md. Given the missing authorization above, the only way to facilitate that seems to be again investing in bespoke ARM64 CCI code.

Unpaid volunteers could consider it unfair or unsavory to do such inefficient or "throw-away work" to save money to an alliance funded by multi-billion-dollar-profit companies.

I personally found it OK to do such "work-around code" while OQS was a pure research project carried by voluntary contributors, but am unwilling to put in such effort to retain a mirage of a well-funded professional alliance, particularly as I'm personally annoyed seeing LF/PQCA processes forced onto OQS without any immediately visible offsetting benefits such as suitable CI funding authorizations: I'd really be happy if PQCA were willing to spend a healthy portion of its funding on supporting development rather than most of it on lawyers, marketing and executive travel.

FWIW, I did complete open-quantum-safe/ci-containers#84 to lay the foundation for hitting the 0.11.0 goal but for the reasons above will not write further CCI code going forward (beyond the one in the PR above to test the Dockerfile).

@ryjones
Contributor

ryjones commented May 12, 2024

To be clear, access to the ARM64 runners is blocked by two things: money and approval from GitHub. I will push on the GitHub angle.

@planetf1
Contributor

planetf1 commented May 13, 2024

My understanding is that with our current PQCA structure we could raise the request for funding of ARM runners at the PQCA TAC, and they could then potentially raise a request with the governing board for funding?

I don't know exactly what the scope is here, but there should be some budget? For our projects, access to supported arm64 runners would seem to be very beneficial in reducing workload, and I wouldn't imagine the usage is too intense.

Is it worth figuring out how much resource we think we might need so that we could provide some kind of cost estimate based on GitHub's published figures?

Given we have arm code in pq-code-package too it could be useful there (currently using QEMU) - I can float the idea there.

Is using a regular runner with QEMU a viable fallback? (can be very slow...)

@baentsch
Member

> Given we have arm code in pq-code-package too it could be useful there (currently using QEMU) - I can float the idea there.

We also have the goal to not destroy earth's resources uselessly. Using QEMU is a clear case of that: Why run CPUs for hours if you can do the same thing in seconds on "proper" CPUs?

For the purposes of showing that all would work on GH, I already implemented this as "proof of concept", e.g., see test run in action here -- but with a very bad ecological conscience as per the above.

> For our projects, access to supported arm64 runners would seem to be very beneficial in reducing workload, and I wouldn't imagine the usage is too intense.

Completely agree. Should be a no-brainer. (The promise for) Getting this (access to such resources) was also one of the reasons why I withdrew my objections to the LF take-over of OQS.

@ryjones
Contributor

ryjones commented May 13, 2024

I have requested that pqcp and oqs get access to the ARM runners. The issue is they enter public beta in a few weeks, so they have been slow to approve new access requests.
Here is a copy of the request I raised yesterday.

Please add two orgs to the beta; please add three users to support them

Please add these orgs of which I am an owner:
https://github.com/pq-code-package
https://github.com/open-quantum-safe

Please add these users to the beta org:
baentsch
bhess
SWilson4
planetf1

@bhess
Member

bhess commented May 14, 2024

The same issue appears when triggering oqs-provider downstream tests (using GitHub Actions):
https://github.com/open-quantum-safe/liboqs/actions/runs/9076079554/job/24938031071

@planetf1
Contributor

@ryjones thanks for requesting access again. I had assumed there would still be fees for using the ARM runners once they are public. Maybe that concern is misplaced and some usage will be supported on the free tier. Do we know any more yet?

@baentsch
Member

> some usage will be supported on the free tier

As I wrote above, "some usage" may already be working for non-commercial projects. It's just taking ages to complete: 10 min for x64 and 100 min for aarch64 as per the log I referenced. Possibly that is because of the QEMU setup I added to be safe should the ARM64 runners not, well, run. But conceptually the "test GH job" I created for that purpose should use real HW (unless I did something really wrong -- please check).

@ryjones
Contributor

ryjones commented May 14, 2024

In a stroke of good fortune, the PQCA board call is right after the PQCA TAC call next week. Given the data @baentsch has provided, I should be able to have a reasonable request to make.

For example, at Hyperledger, we spend about $2000 a month (more or less) on GitHub large runners, including arm. I imagine PQCA as a whole will be less than that for at least a year or two.

@ryjones
Contributor

ryjones commented May 14, 2024

Having looked at all available CircleCI data, OQS would have spent ~$82 since June of 2023 on ARM64 runners, had they been available. All of the other usage seems to fall in the free tier for GitHub.
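
An estimate of this kind boils down to summing the billable minutes of the ARM64 jobs and multiplying by a per-minute rate. The sketch below is only an illustration of that arithmetic: the job durations and the rate are made-up placeholders, not OQS data or GitHub's published prices, and `estimate_cost` is a hypothetical helper. It assumes each job is billed in whole minutes, rounded up.

```python
from decimal import Decimal


def estimate_cost(job_seconds, rate_per_minute):
    """Return (billable_minutes, cost) for a list of job durations in seconds.

    Assumption: each job is rounded up to the next whole minute before billing.
    """
    # Ceiling division per job, using integers only to avoid float rounding.
    billable_minutes = sum(-(-seconds // 60) for seconds in job_seconds)
    cost = Decimal(billable_minutes) * Decimal(rate_per_minute)
    return billable_minutes, cost


# Placeholder inputs: three jobs and an invented rate of 0.01 per minute.
minutes, cost = estimate_cost([600, 95, 1230], "0.01")
print(f"{minutes} billable minutes, estimated cost {cost}")
```

Feeding in real job durations exported from CI and the currently published runner rate would give a figure comparable to the one above.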

@baentsch
Member

Thanks for this assessment @ryjones -- but please note that OQS has been skipping constant time testing on ARM64. This is a very debatable limitation that IMO should be improved on given ARM64 is now a formally supported tier 1 platform and --unlike Hyperledger-- OQS conceptually is a security software library that should have such (time-intensive) testing, particularly as/if people should begin to trust it in real world applications also on that platform. In addition, OQS is currently not doing a lot of other time-intensive testing that it should (fuzzing, etc.).

All told, I hope you can put (substantially) more than $82 into your annual budget for this: It would save (at least myself) quite a bit of effort to continue to work around this limitation. Also please do not (have LF/PQCA) consider offsetting my work at 0-cost given I am "0-cost"/a volunteer....

@ryjones
Contributor

ryjones commented May 14, 2024

I plan to ask for $2000 a month, to cover workload expansion. With the exception of the ARM64 jobs, I think GitHub's current free runners should be able to do substantially all of the CI work; you could move them over at your leisure.

@ryjones
Contributor

ryjones commented May 15, 2024

Even if we don't get into the beta, one option would be to sign up for BuildJet, which Hyperledger used for a while.

@planetf1
Contributor

planetf1 commented May 17, 2024

My interpretation of the sequence leading up to the CI failure (github): (cc: @ryjones )

The test that fails is triggered by

oqs-provider-release-test:
(well, in main).

this then seems to generate an event on the liboqs repo

https://github.com/open-quantum-safe/liboqs/blob/a5ec23cf19763d36a558b8358345823ae45d57e5/scripts/provider-test-trigger.sh

This is a manual ‘dispatches’ event, but against the oqs-provider repo — so it’s effectively triggering tests there

The workflow https://github.com/search?q=repo%3Aopen-quantum-safe%2Foqs-provider%20liboqs-release&type=code is then run

which then runs the tests in https://github.com/open-quantum-safe/oqs-provider/blob/main/scripts/release-test-ci.sh
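
The "dispatches" step in the sequence above can be sketched as follows. This is not the content of `provider-test-trigger.sh`; it is a minimal Python illustration of a `repository_dispatch` request against GitHub's REST API. The event name `liboqs-release` is an assumption taken from the search link above, the token is a placeholder, and `build_dispatch_request` is a hypothetical helper. The request is only built, not sent.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, event_type, token, client_payload=None):
    """Build (but do not send) a repository_dispatch request for owner/repo."""
    body = {"event_type": event_type}
    if client_payload is not None:
        body["client_payload"] = client_payload
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_dispatch_request(
    owner="open-quantum-safe",
    repo="oqs-provider",
    event_type="liboqs-release",  # assumed event name, see lead-in
    token="<token>",              # placeholder, never hard-code a real token
)
print(request.get_method(), request.full_url)
```

A workflow in the target repository that listens for `repository_dispatch` with a matching event type would then start; the sending side only learns whether the request was accepted.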

@SWilson4
Member

> My interpretation of the sequence leading up to the CI failure (github): (cc: @ryjones )
>
> The test that fails is triggered by
>
> oqs-provider-release-test:
>
> (well, in main).
>
> this then seems to generate an event on the liboqs repo

@planetf1 Apologies if I'm misinterpreting what you wrote, but just to clarify: the downstream tests are not failing. The failures are due to permissions issues with the token that we use to trigger the downstream tests. Even if the downstream tests were failing, it would not cause the upstream workflows to "go red": the upstream workflow checks the GitHub API response code, which only indicates whether the downstream workflow was triggered successfully, not whether it completed successfully.

The infrastructure that's currently failing is mostly my work (#1507, open-quantum-safe/liboqs-python#65, open-quantum-safe/oqs-provider#345, #1654). My understanding is that it broke when the OQS GitHub account was upgraded to "Enterprise", which changed what we can and can't do with personal access tokens. @ryjones Please let me know if there's anything I can do (within the permissions I have) to help with getting this to work again. I think I have a pretty good understanding of the moving parts involved with the different workflows.
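
The point about the API response code can be sketched as below. GitHub's REST API answers a successful repository dispatch with `204 No Content`, which says only that the event was accepted. Grouping 401, 403 and 404 as "token rejected" is an assumption made for illustration, and `describe_trigger_result` is a hypothetical helper, not code from the liboqs workflows.

```python
from http import HTTPStatus


def describe_trigger_result(status_code: int) -> str:
    """Classify the HTTP status returned when requesting a downstream run."""
    if status_code == HTTPStatus.NO_CONTENT:
        # Event accepted; says nothing about whether downstream tests pass.
        return "triggered"
    if status_code in (
        HTTPStatus.UNAUTHORIZED,
        HTTPStatus.FORBIDDEN,
        HTTPStatus.NOT_FOUND,
    ):
        # Assumed to indicate a missing or under-privileged token.
        return "token rejected"
    return "other failure"


for code in (204, 403, 500):
    print(code, describe_trigger_result(code))
```

Under this reading, the upstream job goes red only in the second and third cases, which matches a permissions problem rather than a failing downstream test suite.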

@ryjones
Contributor

ryjones commented May 17, 2024

@bhess @dstebila would it be OK if I forked the two repos within the oqs org so I can test out some actions? they would have different names, and be deleted after I'm done with them

@dstebila
Member Author

Go for it!

@ryjones
Contributor

ryjones commented May 18, 2024

@SWilson4
Member

The CI failures were occurring because oqs-bot didn't have sufficient permissions. (I'm guessing its permissions were lowered silently during the move to Enterprise or some other recent change.)

After https://github.com/open-quantum-safe/tsc/pull/30/files, liboqs main CI is green and the oqs-provider release test trigger works.
