Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements Gateway Controller #3805

Closed
wants to merge 13 commits into from
Closed

Conversation

danehans
Copy link
Contributor

@danehans danehans commented Jun 13, 2021

Adds an initial Gateway controller implementation.

  • cmd/contour/serve.go: Updates NewGatewayController() to accept a gatewayclass controller string. ControllerName is now a required param, so the nil check is removed.
  • examples/*: Updates examples for Gateway support. Note that gateway/gatewayclass RBAC added the 'update' verb to support adding/removing the finalizer.
  • internal/controller/gateway.go: Adds a controller to reconcile Gateway objects owned by Contour. The controller validates, finalizes the referenced gatewayclass (if needed), and updates the status of managed Gateway objects.
  • internal/gateway/*: Adds functions for managing gateway objects.
  • internal/k8s/informers.go: Updates RBAC for managing status of Gateway objects.
  • internal/slice/*: Introduces a slice package for adding/removing a string from a []string.
  • internal/status/*: Adds support for managing Gateway conditions.
  • internal/validation/*: Adds initial support for validating Gateway objects.
  • internal/validation/gatewayclass_test.go Changes the scope of existing variables to support Gateway tests.
  • pkg/config/parameters_test.go Updates controllerName from optional to required.
  • test/scripts/install-contour-working.sh: Adds the controller name of the test gatewayclass to the contour config.

Signed-off-by: Daneyon Hansen daneyonhansen@gmail.com

@danehans danehans requested a review from a team as a code owner June 13, 2021 19:03
@danehans danehans requested review from stevesloka and sunjayBhatia and removed request for a team June 13, 2021 19:03
@danehans danehans self-assigned this Jun 13, 2021
@danehans danehans added area/gateway-api Issues or PRs related to the Gateway (Gateway API working group) API. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 13, 2021
@danehans danehans added this to the 1.17.0 milestone Jun 13, 2021
@danehans danehans added this to 1.17 release in Contour Project Board Jun 13, 2021
@codecov
Copy link

codecov bot commented Jun 13, 2021

Codecov Report

Merging #3805 (74caf31) into main (b2c7b00) will decrease coverage by 1.09%.
The diff coverage is 41.28%.

❗ Current head 74caf31 differs from pull request most recent head 14464ad. Consider uploading reports for the commit 14464ad to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3805      +/-   ##
==========================================
- Coverage   75.81%   74.72%   -1.10%     
==========================================
  Files         107      110       +3     
  Lines        7340     7537     +197     
==========================================
+ Hits         5565     5632      +67     
- Misses       1654     1780     +126     
- Partials      121      125       +4     
Impacted Files Coverage Δ
cmd/contour/serve.go 10.74% <0.00%> (ø)
internal/controller/gateway.go 0.00% <0.00%> (ø)
internal/controller/gatewayclass.go 0.00% <ø> (ø)
internal/k8s/informers.go 0.00% <ø> (ø)
internal/status/gatewayconditions.go 42.85% <42.85%> (ø)
internal/status/routeconditions.go 31.88% <66.66%> (ø)
internal/status/cache.go 89.13% <82.35%> (-4.99%) ⬇️
internal/status/gatewayclassconditions.go 89.28% <89.28%> (ø)
internal/dag/gatewayapi_processor.go 93.57% <90.24%> (-0.33%) ⬇️
internal/slice/slice.go 100.00% <100.00%> (ø)
... and 3 more

@danehans danehans force-pushed the gw_impl branch 3 times, most recently from 6b45a4e to 83d4841 Compare June 15, 2021 18:42
@danehans danehans marked this pull request as draft June 16, 2021 00:19
@danehans danehans marked this pull request as ready for review June 16, 2021 17:04
@danehans
Copy link
Contributor Author

#3812 created to track the need for gateway controller integration tests.

Copy link
Member

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly looking pretty good, but I think that we have to use the Scheduled Condition, not just remove it. Scheduled means that the resource is valid, has been accepted for processing, and infrastructure is being created. Ready means that infrastructure is created and ready to route traffic.

Obviously, until we are handling Envoy management, we can't guarantee Ready, but we should still be updating Scheduled. I guess the upstream docs could use some improvement to make this flow clearer?

// The gateway is safe to process, so check if it's valid.
errs := validation.ValidateGateway(ctx, r.client, gw)
if errs != nil {
r.log.WithField("name", request.Name).WithField("namespace", request.Namespace).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also be setting the Scheduled Condition to status: false here, with a NotReconciled Reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Gateway has been reconciled, but no infra is being provisioned since it's invalid. I will set Scheduled=false and use an InvalidGateway reason with the message indicating why the Gateway is invalid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work if not provisioning infra? Need to keep that use-case in mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not provisioning infra, Scheduled is roughly equivalent to the Valid condition we use on HTTPProxy, it indicates that whether the object is syntactically valid. Should this be more than one Condition? Possibly, but I wasn't able to make that argument convincingly enough upstream.

r.log.WithField("name", request.Name).WithField("namespace", request.Namespace).
Error("invalid gateway: ", int_errors.ParseFieldErrors(errs))
} else {
// TODO: Finalize the gateway when managed infrastructure is added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set Scheduled to status: true at this point, with a Reason of Scheduled, I think? Reference is https://github.com/kubernetes-sigs/gateway-api/blob/16876c7362d58cae211e5599a6ef939ed1b859e9/apis/v1alpha2/gateway_types.go#L574-L614

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently status.SyncGateway() sets Ready=true, but I'm updating that to Scheduled=true since the controller is not provisioning Envoy infra yet. I have also updated ensureClassFinalizer() to addClassFinalizer() so the func does not update the object. The object's status will now be sync'ed before being updated on the API server.

// TODO: Ensure the gateway by creating manage infrastructure, i.e. the Envoy service.
}

if err := status.SyncGateway(ctx, r.client, gw, errs); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this status Sync pattern common for controller-runtime controllers? In any case, we need to ensure that the Scheduled Condition is populated correctly (it signals that Contour has accepted the Gateway for processing and, eventually, started the process of creating infrastructure). We should add Ready with status: false when we add scheduled.

Then, Ready can be set to status: true when the infrastructure is provisioned.

I think we may need to change SyncGateway() to update Scheduled first maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this status Sync pattern common for controller-runtime controllers?

It's a pattern I have seen with other controllers, essentially reconcile the object and calculate status before updating the object on the API server.

I think we may need to change SyncGateway() to update Scheduled first maybe?

Ack. I was still mentally operating in the operator world where the infra is provisioned. I have updated computeGatewayReadyCondition() to computeGatewayScheduledCondition() to manage the "Scheduled" condition type. When Envoy infra management is added, we can re-introduce computeGatewayReadyCondition() to calculate the "Ready" condition type.

internal/status/gateway.go Outdated Show resolved Hide resolved
Copy link
Member

@stevesloka stevesloka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we manage setting status from the result of the DAG build vs the controller? Seems like we need to move the status setting to the DAG processor like it is today, and leave the gateway controller to manage the Envoy instances (if desired).

}

// validateGatewayNamespace validates whether other gateways exist in the namespace of gw.
func validateGatewayNamespace(ctx context.Context, cli client.Client, gw *gatewayapi_v1alpha1.Gateway, path *field.Path) field.ErrorList {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now all validation is done in the DAG builder processor. How do we manage validation here on top of validation there? There are a bunch of bits that could be pulled out and pushed upstream, however, other things, like specific configuration items can't have generic validation and need to set status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO validation is one of the first steps that a controller should perform on the object it reconciles. If utilizing the controller-runtime caches is the long-term direction, I think it's best to not add any more functionality to the DAG. As-is, the DAG will continue to perform validation, status mgt, etc. for xRoutes and Gateway listeners. The controller-runtime controllers will do the same for the objects they reconcile. When the xRoute controllers are implemented, we migrate the validation, status mgt, etc. from the Gateway API DAG processor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't go down that road right now, I'd let the DAG builder validate its stuff and keep everything together.

To me, this new controller should just manage Envoy instances, everything else should stay consistent as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But don't let me distract, if other's don't agree then go with that. =)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good chance to start pulling functionality out of the DAG, and decomposing it a bit. I think of Contour as having three stages: Ingestion of objects into the DAG, the DAG run, and then export to Envoy. I think that it's okay to have some validation at the ingestion stage, as long as we keep in mind what that means for the DAG. It's just a data model, and if we can split complexity out of it, it will be easier to code and maintain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to either process a full dag to find some of the validation errors. One simple one is a missing/invalid service reference should set status on the Gateway. Probably for now we'll need to have the Gateway Controller get feedback from the result of a dag run and see if errors exist that it cares about to set proper status.

@danehans
Copy link
Contributor Author

danehans commented Jun 17, 2021

How do we manage setting status from the result of the DAG build vs the controller? Seems like we need to move the status setting to the DAG processor like it is today, and leave the gateway controller to manage the Envoy instances (if desired).

@stevesloka I thought we discussed moving away from the DAG and start utilizing the runtime controllers cache? @skriss did you mention that you were going to explore this in more detail? If the intent is to transition to the runtime cache, I thought it makes sense to have the controller manage GatewayClass/Gateway status and migrate the DAG's existing xRoute status management to xRoute controllers in the future.

@stevesloka
Copy link
Member

stevesloka commented Jun 17, 2021

I thought we discussed moving away from the DAG and start utilizing the runtime controllers cache?

The current DAG build is critical to how Contour works. Moving away from that is a core fundamental change to Contour. It's doable but needs a ton of work + overhaul to the core internals.

We talked I think about using the controller cache as opposed to our current custom caching, but still, the DAG needs to stay.

@danehans danehans force-pushed the gw_impl branch 2 times, most recently from fac6fbd to dcac033 Compare June 17, 2021 18:15
@youngnick
Copy link
Member

I agree that the DAG needs to stay, but I think that filtering objects before they go in is pretty reasonable (which is what pre-DAG validation logic will do).

@stevesloka
Copy link
Member

prefiltering

We also need to be careful to make sure if a cluster is running and something causes the Gateway to become invalid for one specific reason (say an invalid service), that we still keep the Gateway in the Contour cache so that we don't break all other routes that would be valid.

@danehans danehans marked this pull request as draft June 19, 2021 00:18
Copy link
Member

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small things, but overall I think this is okay. I'll leave to @stevesloka to give the final approval.

}

// See if the referenced gatewayclass is admitted.
gcAdmitted := false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this needs to be here because we don't update status on admission? If so, could you put a comment to that effect, maybe with a TODO to do something about it?

if p.source.gateway == nil {
p.Error("Gateway is not defined!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the error should say something like "No gateway pointing to the GatewayClass whatever could be found"?

internal/featuretests/v3/gatewayapi_test.go Show resolved Hide resolved
entries: make(map[string]map[types.NamespacedName]CacheEntry),
proxyUpdates: make(map[types.NamespacedName]*ProxyUpdate),
gatewayRef: gateway,
gatewayUpdates: make(map[types.NamespacedName]*GatewayConditionsUpdate),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base Gateway status is actually pretty simple, but the Listener status is not. It's where all the Route stuff shows up, currently, so I think that having a way to distinguish between those is good.

entries: make(map[string]map[types.NamespacedName]CacheEntry),
proxyUpdates: make(map[types.NamespacedName]*ProxyUpdate),
gatewayRef: gateway,
gatewayUpdates: make(map[types.NamespacedName]*GatewayConditionsUpdate),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should get this working and come back for code duplication later.

@danehans
Copy link
Contributor Author

So, this needs to be here because we don't update status on admission?

@youngnick that is the gatewayclass referenced by the gateway. If a gateway references a gatewayclass that is not admitted, then the gateway is considered invalid and the appropriate status condition is surfaced.

@danehans
Copy link
Contributor Author

e2e (kubernetes:n-2 failure due to:

object is being deleted: gateways.networking.x-k8s.io "contour" already exists

e2e tests should consider appending a UID to gateway names for each test to avoid this issue. Gateways/GatewayClasses will use finalizers so deletion may take time, especially as Contour manages Envoy infra.

Copy link
Member

@stevesloka stevesloka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm to get this in and unblock a bunch of other issues once we get CI passing.

@skriss
Copy link
Member

skriss commented Jun 24, 2021

e2e (kubernetes:n-2 failure due to:

object is being deleted: gateways.networking.x-k8s.io "contour" already exists

e2e tests should consider appending a UID to gateway names for each test to avoid this issue. Gateways/GatewayClasses will use finalizers so deletion may take time, especially as Contour manages Envoy infra.

The plan is to move to having each test create a Gateway in its own namespace, to avoid test pollution like this. Requires #3803 to be able to easily update Contour config to point to the right gateway.

@youngnick
Copy link
Member

Only failing CI is the codecov one, I think this is good to merge @danehans.

danehans and others added 13 commits June 25, 2021 09:02
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
Signed-off-by: Steve Sloka <slokas@vmware.com>
Signed-off-by: Steve Sloka <slokas@vmware.com>
Signed-off-by: Steve Sloka <slokas@vmware.com>
Signed-off-by: Daneyon Hansen <daneyonhansen@gmail.com>
@skriss
Copy link
Member

skriss commented Jun 28, 2021

Looks like we've got some merge conflicts; @stevesloka were you going to take this one over?

@stevesloka
Copy link
Member

Yup I can figure this one out @skriss.

@youngnick
Copy link
Member

@stevesloka if you can clear the merge conflicts and CI is still passing, then I think this one is good to go for v1.17?

@stevesloka
Copy link
Member

@youngnick nope there's an issue with finalizers. I'd like to pick this apart into pieces so we can make sure to test everything and find the flake. Personally, I'd delay the release, but need to chat about that. Otherwise would land in 1.18.

@stevesloka
Copy link
Member

Implemented in #3858, closing and will follow the new PR for updates.

@stevesloka stevesloka closed this Jul 2, 2021
@stevesloka stevesloka mentioned this pull request Jul 6, 2021
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/gateway-api Issues or PRs related to the Gateway (Gateway API working group) API. release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
No open projects
Contour Project Board
  
1.17 release (completed)
Development

Successfully merging this pull request may close these issues.

None yet

5 participants