Are s390x/ppc jobs still valuable? #344

Open
dprotaso opened this issue Feb 6, 2024 · 38 comments
@dprotaso
Member

dprotaso commented Feb 6, 2024

I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.

I thought this was for the benefit of the IBM CodeEngine folks. Confirming with @psschwei: CodeEngine doesn't use these architectures (anymore?).

The other bit is that we don't have anyone really looking at the tests and fixing them:
https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests

Furthermore - it's not clear if users can even run Knative on s390x with OSS - e.g. the kourier & istio envoy images are only built for arm and amd64.

I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.

@upodroid
Member

upodroid commented Feb 6, 2024

+1 to removing s390x/ppc64le jobs

@dprotaso
Member Author

dprotaso commented Feb 6, 2024

Sorta related: I'll bring this up with the TOC - to even consider dropping s390x/ppc support in our releases.

I don't think our releases work on those archs anyway - given the kourier/istio envoy images don't support them (https://explore.ggcr.dev/?image=envoyproxy%2Fenvoy%3Av1.29.0)

I'm sourcing some data from the mailing lists
https://groups.google.com/g/knative-users/c/ORwp3KlFbds
https://groups.google.com/g/knative-dev/c/D-UkD3xPtFA
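
As an aside, the architecture check described in this comment (does a published image include an s390x or ppc64le variant?) can be sketched in Python. This is a minimal sketch, assuming the image's OCI image index (the multi-arch manifest list) has already been fetched as JSON. `SAMPLE_INDEX` is illustrative hand-written data, not real registry output, and `supported_platforms` and `missing_platforms` are hypothetical helper names.

```python
from __future__ import annotations

import json

# Illustrative, hand-written data in the OCI "image index" shape.
# It is NOT real registry output; digests and sizes are omitted on purpose.
SAMPLE_INDEX = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"os": "linux", "architecture": "amd64"}},
    {"platform": {"os": "linux", "architecture": "arm64"}}
  ]
}
""")


def supported_platforms(index: dict) -> set[str]:
    """Return the set of "os/arch" strings listed in an image index."""
    platforms = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        os_name = platform.get("os")
        arch = platform.get("architecture")
        if os_name and arch:
            platforms.add(f"{os_name}/{arch}")
    return platforms


def missing_platforms(index: dict, wanted: list[str]) -> list[str]:
    """Return the wanted platforms that the index does not provide."""
    available = supported_platforms(index)
    return [p for p in wanted if p not in available]


if __name__ == "__main__":
    wanted = ["linux/amd64", "linux/s390x", "linux/ppc64le"]
    print(missing_platforms(SAMPLE_INDEX, wanted))
```

Running the same check against the index of each image a release depends on would show which dependencies block a given architecture.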

@rishikakedia

rishikakedia commented Feb 7, 2024

We as part of enabling OpenShift Serverless for s390x and ppc64le architectures are actively working on the Knative upstream releases to keep them updated. The members actively working on this are @dilipgb (for s390x) and @valen-mascarenhas14 (for ppc64le).

With respect to istio envoy images - we leverage maistra/envoy packages (midstream of istio envoy packages) for testing knative functionalities. There is active work happening on maintaining maistra/envoy for s390x and ppc64le architectures.

@upodroid
Member

upodroid commented Feb 7, 2024

@rishikakedia Were you able to upstream the changes required to support the s390x/ppc64le architectures to Envoy and Istio?

I believe IBM/RH are key maintainers of Istio (not sure about Envoy).

@dilipgb
Contributor

dilipgb commented Feb 7, 2024

@upodroid we pick the maistra/envoy images that are needed for knative upstream and patch the code through our ci scripts before we run the tests (refer here: https://github.com/knative/infra/blob/main/prow/jobs_config/knative/serving.yaml#L186).

Some tests are failing intermittently for Contour and Kourier, and we are trying to debug those issues. It will take some more time for us to figure this out. For example, on s390x the Kourier job passed today but failed yesterday; similarly, the Contour run succeeded on Monday. We need some more time to fix these issues.

Also, when the cron schedules for latest and main conflict (when a release happens), we see a lot of failures in our CI because the jobs compete for the same resources to run tests. We may adjust the cron schedules to fix this.

@valen-mascarenhas14
Contributor

@upodroid
Recently, we've implemented significant changes to our testing infrastructure on the ppc64le side. This included migrating all Knative workloads to a different workspace within IBM Cloud. As a result, modifications were necessary, such as updating secrets, adjusting cronjob timings, and refining ppc64le-specific scripts. These changes led to a few failures during the transition period.
However, we have successfully addressed these issues, and the system is now functioning smoothly.
Although we encountered some intermittent failures during the transition, we have diligently resolved them, ensuring that the platform is now performing as expected.

@cardil
Contributor

cardil commented Feb 7, 2024

I agree with @upodroid. This effort would be better spent on Istio/Envoy directly, by adding proper support for the P/Z architectures there.

Doing it on Knative level is always going to be chasing a moving target...

@rishikakedia

So, discussions have recently started on having the P/Z teams enable upstream CI to publish images.

@ghatwala

ghatwala commented Feb 7, 2024

Seems like this openshift CI - https://github.com/openshift/release/tree/master/ci-operator/step-registry/servicemesh is being used to run e2e tests.

@rishikakedia

We are enabling istio/envoy under the hood of maistra/envoy for the s390x and ppc64le architectures. There is a roadmap discussion to enable an OpenSSL-based Envoy for these architectures so that it is compatible with upstream.

@dprotaso
Member Author

dprotaso commented Feb 7, 2024

I agree with @upodroid

From my perspective none of our releases work on ppc/s390x without these patches. So I don't really see the utility of these jobs being in our CI from an OSS perspective. There's no benefit to end-users of Knative who consume the releases we produce.

We as part of enabling OpenShift Serverless for s390x and ppc64le architectures

Would it make more sense to add these tests to the RH/IBM midstream repos rather than here?

@rishikakedia

Here is the associated PR for enabling envoyproxy/envoy to be openssl based for s390x: envoyproxy/envoy-openssl#128

@upodroid
Member

upodroid commented Feb 7, 2024

Fyi, what you need to do is get s390x/ppc64le binaries added to https://github.com/envoyproxy/envoy/releases/tag/v1.29.0

@clnperez

clnperez commented Feb 7, 2024

FYI @upodroid I'd love to, but, Google dropped us from their CI platform, so we can't get boring-ssl support back -- hence @rishikakedia's mention of the openssl roadmap. (She's on the s390x side of the IBM house. I'm on the ppc64le side.)

For reference: envoyproxy/envoy#28363

Given @valen-mascarenhas14's comment -- these issues seem to be worked on the ppc64le side. Does that mean there are no issues on Power? I'm trying both to understand the situation and to line up all the folks with who they are and who they're referring to when they say "we." :D

@upodroid
Member

upodroid commented Feb 7, 2024

I read the envoy PR and the solution is to fix it properly in BoringSSL.

It seems patches do exist but you need to upstream them and give maintainer/vendor X real IBM hardware to test against those architectures.

https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6435
https://github.com/linux-on-ibm-z/docs/wiki/Building-TensorFlow

All of this stuff needs to be upstreamed

@upodroid
Member

upodroid commented Feb 7, 2024

FYI, I don't have anything against the s390x/ppc64le platforms, but I have to repeat these important best practices (which might already be followed but aren't visible to me or the public).

I wonder how much hassle we'll go through with RISC-V when it becomes a thing in the future.

@rishikakedia

@upodroid: we did an internal assessment and we believe that https://github.com/envoyproxy/envoy-openssl should be enabled for the s390x and ppc64le architectures by the first half of 2024. So I suggest we discuss this issue of Knative upstream CI once that is available?

@rhuss
Contributor

rhuss commented Feb 8, 2024

One idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official Envoy ports for those archs. We can set a date, let's say 2024-08-01, and if there is still no P/Z port for Envoy by then, we can remove the jobs completely.

@rishikakedia

@upodroid FYI: we use prow to trigger jobs, but the infra for testing is provided by the P/Z teams, who provision capacity on IBM Cloud.

@dprotaso
Member Author

dprotaso commented Feb 8, 2024

One idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official Envoy ports for those archs. We can set a date, let's say 2024-08-01, and if there is still no P/Z port for Envoy by then, we can remove the jobs completely.

Yeah this sounds good

@clnperez

clnperez commented Feb 8, 2024

@upodroid -- Google is the maintainer of boringssl, and they removed support for power explicitly (see google/boringssl@7d2338d). I asked one of the maintainers about adding our hardware back. It's not just a matter of upstreaming, or giving them hardware. It's complicated, but they let people know not to rely on it in their README:

BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. 

So we're stuck between a rock and a hard place here because third parties are depending on it.

All that said, thanks to everyone for the consideration and flexibility.

@psschwei
Contributor

If I'm understanding correctly, what prompted this issue is that it wasn't clear if these tests were being maintained. To my understanding they are being maintained, failures/flakes are being fixed, etc. although that maintenance may not have been communicated especially well. So given that, I don't think we need to drop them as long as they're being actively maintained.

@dprotaso
Member Author

dprotaso commented Feb 12, 2024

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
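
For readers unfamiliar with this failure mode, here is a generic Python sketch of how a wrapper can report success even though the tests it ran failed. It is not taken from the Knative or IBM CI scripts, and it does not identify the actual cause of the green s390x runs. `FAILING_TESTS`, `CLEANUP`, and both wrapper functions are hypothetical stand-ins.

```python
import subprocess
import sys

# Child steps run under the current Python interpreter, so the sketch does
# not depend on any particular shell being installed.
FAILING_TESTS = "import sys; sys.exit(3)"  # stands in for a failing test run
CLEANUP = "pass"                           # stands in for a cleanup step


def run_step(code: str) -> int:
    """Run one step in a child process and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def wrapper_masking() -> int:
    """Buggy wrapper: the test status is dropped; the last step's status wins."""
    run_step(FAILING_TESTS)
    return run_step(CLEANUP)


def wrapper_propagating() -> int:
    """Fixed wrapper: remember the test status and return it after cleanup."""
    status = run_step(FAILING_TESTS)
    run_step(CLEANUP)
    return status


if __name__ == "__main__":
    print("masking wrapper exit status:", wrapper_masking())
    print("propagating wrapper exit status:", wrapper_propagating())
```

A CI system that only looks at the wrapper's exit status would show the first variant as passing.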

@valen-mascarenhas14
Contributor

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso I can see all the tests are running & passing for ppc64le eventing tests (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#ppc64le-e2e-tests&width=20)

@dilipgb
Contributor

dilipgb commented Feb 13, 2024

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso we are actively debugging the eventing job failures on s390x. Since we are sending two flags (--platform=linux/s390x --insecure_registry) in KO_FLAGS, the platform is not getting recognised (I hope you can recreate the issue and check on your end if needed). If we send only --platform=linux/s390x, the tests run fine for all branches. We have a similar setup in release-1.11, where we send both flags in KO_FLAGS and it's working fine. Hence it's taking time to troubleshoot and understand why the issue happens in releases later than 1.11.

Since we have a self-signed certificate on our registry, we need the flag to exist. As @psschwei rightly pointed out, there is certainly a communication gap, and we will address it in future.
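
One general way a multi-flag string like the one above can misbehave is when it reaches the tool as a single argument instead of being split into words. The Python sketch below illustrates only that general pitfall. It is not a diagnosis of the s390x failure, and `platform_from_args` is a hypothetical stand-in for a command-line parser, not ko's actual parsing logic.

```python
from __future__ import annotations

import shlex
from typing import Optional

# The flag string from the comment above, with the flag spelled --platform.
KO_FLAGS = "--platform=linux/s390x --insecure_registry"


def platform_from_args(argv: list[str]) -> Optional[str]:
    """Hypothetical stand-in for a CLI parser: read the --platform value."""
    for arg in argv:
        if arg.startswith("--platform="):
            return arg.split("=", 1)[1]
    return None


# Word-split, as an unquoted shell expansion would do: two separate arguments.
split_args = shlex.split(KO_FLAGS)

# Not split, as a quoted expansion would do: one argument containing a space.
single_arg = [KO_FLAGS]

if __name__ == "__main__":
    print(split_args, "->", repr(platform_from_args(split_args)))
    print(single_arg, "->", repr(platform_from_args(single_arg)))
```

In the unsplit case the second flag becomes part of the platform value, so a platform lookup would fail even though the string looks correct when printed.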

@dilipgb
Contributor

dilipgb commented Feb 13, 2024

@dprotaso I have moved the KO_DOCKER_REPO from the self-hosted Artifactory instance to IBM Cloud. Please approve PR #351; it resolves the eventing issues we were facing.

@davidhadas
Contributor

@dprotaso,

My understanding of the current status is:

  1. There are teams working on the 2 additional architectures that clearly ask for the tests to continue.
  2. They are committed to supporting these tests and ensuring Knative works on the additional architectures
  3. The cost of the additional hardware needed for testing is covered by IBM
  4. Significant parts of Knative can be used as is by the community on these additional HW architectures, but there are some identified gaps (envoy) that are presently being worked on by the teams.

Did I miss anything?

A reasonable path forward here is to keep testing on the two additional HW architectures and allow the teams to remove such gaps and ensure that the community can use Knative as is on the additional architectures.

We can reevaluate this in six months' time to see the progress made.

@dprotaso
Member Author

I've talked to the productivity folks and they're in agreement with Roland's suggestion (here). PRs are out and are written in a way to make it easy to revert in the future.

Until the necessary dependencies support ppc/s390x out of the box, we're effectively testing code that no end user can check out and run on their cluster without custom IBM patches. Like @upodroid mentioned, these should be worked on in their respective projects.

Until then if these architectures are valuable for end-users it seems like IBM should create a Knative distribution for those architectures and we can link out to them on the Knative website. We do this for other vendors and their distributions.

For continuous testing you can take a look at Red Hat as an example - they have midstream repos for Knative and run their own prow instance. Given IBM already has prow clusters it would seem pretty incremental to host your own control plane - and use the resources in this and Red Hat repos as a guide.

@davidhadas
Contributor

davidhadas commented Feb 14, 2024

@dprotaso
IBM (as a HW vendor in this case) is not creating its own downstream distribution, it is supporting the use of the Knative distribution on additional HW architectures, like it does with other OS. Therefore, the midstream RedHat example is not a good one.

As a community, it makes no sense to reject one architecture over another, especially when we have no good reason.

In this case, the community already decided in the past to support the additional architectures, and there are community users using Knative on these architectures today. So this is not a decision that can be taken lightly for a community to stop supporting or to stop testing such architectures on new releases.

Note that we always pride ourselves on the fact that one can use different parts of Knative independently, and we as a community support open APIs and allow users to use what they need out of Knative. So the argument that one dependency of one piece of the entire Knative distribution is still being worked on for this architecture is not a reason to stop supporting community users on this architecture by stopping the release cycle.

I have added this to be discussed in the TOC.

(The PR @dprotaso was referring to is #357)

@rishikakedia

Yes, as @davidhadas mentioned, we at IBM are maintaining the Knative enablement for the s390x and ppc64le architectures. If we have issues with test cases, we will prioritize fixing them. We will open a new issue to re-enable s390x and ppc64le, and we need the Knative community's support. Thanks @davidhadas @psschwei for your comments.

@davidhadas
Contributor

Apparently #356 slipped in although there is no agreement on this.
I have started #360 to revert it until an agreement is reached.

@dsimansk
Contributor

dsimansk commented Feb 14, 2024

Summary of today's TOC call:

  • The periodic CI jobs for P and Z architectures stay in place in current format

    • The ask for respective maintainers/teams to stabilize the runs
    • Regularly monitor the results and proactively fix or raise issues with Knative bits or infra
    • Contacts for (I'll capture that in community or infra readme appropriately)
  • Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks

  • In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z.

    • Goal is to have community releases of Knative usable on the respective architectures, with 3rd party dependencies available as well.

Stretch goal: introduce an e2e test setup with Istio for Serving. With the current limitation there's no coverage. In addition to the Envoy efforts, if there's an alternative open source proxy that can be used for Istio, a new job to cover that scenario should be introduced.

I've tried to capture main points from the discussion.

@knative/technical-oversight-committee
@knative/productivity-wg-leads

@dprotaso dprotaso added this to the v1.16.0 milestone Feb 14, 2024
@dprotaso
Member Author

In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z.

This falls into the v1.16 milestone - will circle back when that sprint starts.

@davidhadas
Contributor

I assume Istio is just one option.
Kourier is another.
Contour a third.

We aim to ensure that community users can install Knative with a corresponding OS networking layer supported by Knative, such that users can follow the Knative documentation to get up and running with Knative.

It is nice if all networking layers are supported but not necessary.

@xnox

xnox commented Feb 21, 2024

@pleia2 can you please check if Z or Power teams will be affected, or how they can take this in-house if needed? Just in case it affects plans for IBM Secure Service Container.

@davidhadas
Contributor

davidhadas commented Feb 21, 2024

@xnox, you can contact the Z or Power teams via slack knative-s390x-ppc

@pleia2

pleia2 commented Feb 21, 2024

@xnox Thanks for the heads up. I'll check internally with my teams at IBM, but I'll also follow the lead of @davidhadas here regarding the knative-s390x-ppc channel, since there are some key folks publicly engaged there from both Power and Z (I've also just joined).

@davidhadas davidhadas changed the title Are s390x/pcc jobs still valuable? Are s390x/ppc jobs still valuable? Feb 21, 2024
@dilipgb
Contributor

dilipgb commented Mar 25, 2024
