[DRAFT] Service load-balancing with StateDB #32185

Draft · wants to merge 1 commit into base: main

Conversation

@joamaki (Contributor) commented Apr 25, 2024

This is the first of many PRs refactoring Cilium's load-balancing control plane to use StateDB to store
the service and backend information.

Since we want to mature this implementation gradually and answer the many remaining questions
around it (topology-aware services, clustermesh, NodePort frontend IP expansion, etc.), this PR
introduces the support behind a feature flag (enable-new-services) so it can be tested and developed
incrementally. This hidden feature flag allows using the Services API instead of ServiceManager.
The flag enables a reconciler that reconciles the state in the tables to ServiceManager.

Nothing uses the new API yet and the flag is disabled by default, so this code has no impact
on normal operation, but it allows further collaborative development on the API and on the final
reconciler that writes directly to the load-balancing BPF maps.

The implementation is structured as follows:

pkg/loadbalancer/experimental is chosen as the package for these. This clearly separates
the new code from the production code. Eventually it will make sense to move the code into the
loadbalancer package, which already defines the data structures related to load-balancing.

pkg/loadbalancer/experimental/tables.go:
Defines Table[Frontend] and Table[Backend]. The Frontend type is very similar to SVC but does
not include the backends. It is at the same level of detail, however, in that it represents a single
service frontend (unlike e.g. k8s.Service). Compared to SVC, Frontend has additional fields for the
reconciliation status and the backend revision (the link between the two tables). To avoid modifying
existing code I've opted to copy the type definition and convert between them.

experimental.Backend is similarly almost the same as loadbalancer.Backend, but adds source and
frontend references. Long term we would likely add more fields to these to carry more metadata
(e.g. ServiceAffinity, cluster name, etc.).

The fields that can be modified externally are split into FrontendParams to make clear the division
between fields managed by the Services API and fields managed by users.
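The split described above can be sketched with simplified stand-in types (the real definitions live in pkg/loadbalancer/experimental; the field names and types here are approximations, not the actual code):

```go
package main

import "fmt"

// L3n4Addr stands in for loadbalancer.L3n4Addr: an (address, protocol, port) tuple.
type L3n4Addr struct {
	IP       string
	Protocol string
	Port     uint16
}

// ServiceName groups frontends together and associates backends with them.
type ServiceName struct {
	Namespace, Name string
}

// FrontendParams holds the externally writable fields, set via the Services API.
type FrontendParams struct {
	Name    ServiceName
	Address L3n4Addr
	Type    string
	Source  string
}

// Frontend embeds the user-managed params and adds the internally managed
// fields: reconciliation status and the backend revision linking the tables.
type Frontend struct {
	FrontendParams
	Status          string // e.g. "Pending" / "Done"
	BackendRevision uint64
}

func main() {
	fe := Frontend{
		FrontendParams: FrontendParams{
			Name:    ServiceName{Namespace: "default", Name: "test"},
			Address: L3n4Addr{IP: "10.0.0.1", Protocol: "TCP", Port: 80},
			Type:    "ClusterIP",
			Source:  "k8s",
		},
		Status: "Pending",
	}
	fmt.Println(fe.Name.Name, fe.Address.Port, fe.Status)
}
```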

pkg/loadbalancer/experimental/services.go:
Implements the "writer" to Table[Frontend] and Table[Backend]. We don't want to give write access
to the tables directly: we want to validate and merge changes, and we need to mark the service's
status as pending to ask the reconciler to process it. Also, any change to backends requires bumping
the backend revision of each referenced service to trigger reconciliation.

Inserting a service and backends looks as follows:

var s *experimental.Services
// Create a WriteTxn against frontend and backend tables.
txn := s.WriteTxn() 

// Insert a new frontend.
created, err := s.UpsertFrontend(
  txn,
  experimental.FrontendParams{
    Name: loadbalancer.ServiceName{Name: "test", Namespace: "default"},
    Frontend: loadbalancer.NewL3n4Addr(...),
    Type: loadbalancer.SVCTypeClusterIP,
    Source: source.Kubernetes,
  },
)
if err != nil {
  // error can be non-nil if there are validation problems or conflicts with existing services
  // we can choose to abort the whole transaction and throw away changes.
  txn.Abort()
  return
}

// Insert backends for the given service.
// If a backend already exists, a reference to the service is added, but its
// existing state (e.g. the backend's health state) is kept.
err = s.UpsertBackends(
  txn,
  loadbalancer.ServiceName{Name: "test", Namespace: "default"},
  &loadbalancer.Backend{L3n4Addr: addr1, State: loadbalancer.BackendStateActive},
  &loadbalancer.Backend{L3n4Addr: addr2, State: loadbalancer.BackendStateActive},
  &loadbalancer.Backend{L3n4Addr: addr3, State: loadbalancer.BackendStateActive},
)

// Commit the transaction and notify watchers of Table[Frontend] and Table[Backend].
txn.Commit()

The changes to these tables can be watched as usual with StateDB LowerBound queries
(or with a DeleteTracker to also observe delete events):

var db *statedb.DB
var frontends statedb.Table[*experimental.Frontend]

// Iterate through changes in revision order.
latestRev := statedb.Revision(0)
for {
  iter, watch := frontends.LowerBound(db.ReadTxn(), statedb.ByRevision(latestRev+1))
  for svc, rev, ok := iter.Next(); ok; svc, rev, ok = iter.Next() {
    latestRev = rev
    // process changed service
  }
  // Wait for table to update
  <-watch
}

// To wait for datapath reconciliation we can query and watch the reconciliation status:
for {
  fe, _, watch, found := frontends.FirstWatch(db.ReadTxn(), experimental.FrontendIndexL3n4Addr.Query(addr))
  if !found { break }

  if fe.Status.Kind == reconciler.StatusKindDone {
    break
  }

  // Wait for the service to update.
  <-watch
}

More examples of the API can be found in pkg/loadbalancer/experimental/services_test.go.

Read access to these tables is directly through Table[Frontend] and Table[Backend]. This is the general approach with StateDB: we keep the StateDB API directly available for reading to reinforce the use of this common API and to reduce overall API complexity by not wrapping queries.
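As a toy illustration of this direct-read style (simplified stand-ins, not the real statedb API), a table exposes its indexes to readers instead of hiding them behind query wrappers:

```go
package main

import "fmt"

// A minimal stand-in for a StateDB table with a primary index. The real API
// is statedb.Table[*Frontend] with First/FirstWatch/LowerBound; this sketch
// only shows the shape of a direct read.
type table[T any] struct {
	byKey map[string]T
}

// First looks up the first (and here only) object matching the key.
func (t table[T]) First(key string) (obj T, found bool) {
	obj, found = t.byKey[key]
	return
}

func main() {
	frontends := table[string]{byKey: map[string]string{
		"10.0.0.1:80/TCP": "default/test",
	}}
	// Readers query the table directly rather than via a wrapper API.
	name, ok := frontends.First("10.0.0.1:80/TCP")
	fmt.Println(name, ok)
}
```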

The feature flag can be enabled and support tested with:

$ cilium config set enable-new-services true
# wait for restart

$ kubectl exec -it -n kube-system ds/cilium -- bash
# cilium-dbg statedb dump | jq .frontends
# cilium-dbg statedb dump | jq .backends
# cilium statedb health|grep -i new-services-reconciler

This PR does not yet include "cilium-dbg statedb" commands for these tables, as we first need to
switch cilium/cilium to use cilium/statedb (#32125). That switch moves the StateDB REST API from
gob to JSON encoding, which we need in order to transmit structs that have private fields.

The implementation currently slots between ServiceCache and ServiceManager, which means it still
relies on e.g. the service expansion done in K8sWatcher. Backend filtering (topology awareness) and
correlation are also still kept in ServiceCache/K8sWatcher, but some of this can definitely be moved
into the experimental.Services API later on. pkg/service also performs backend selection that could
be made part of GetBackendsForService later.

This PR is meant to have zero impact on normal production use of Cilium by having all new code behind
a hidden and disabled feature flag.

Rough follow-ups to this PR will be:

  • Rebase this once cilium/cilium has moved to cilium/statedb. This allows reconcilers to modify the objects they're reconciling which gives a natural way to push information upwards, e.g. to set the allocated ID or pull other data from BPF towards these objects.
  • Replace ServiceCache with a proxy implementing the same API on top of Services API and the tables
  • Introduce an alternative to ServiceManager for reconciling the changes to the tables directly to BPF maps.
  • Replace uses of ServiceManager/Resource[Service]/Resource[Endpoints]/ServiceCache etc. with Table[Service]/Table[ServiceBackend].
  • Remove Resource[Service] and Resource[Endpoints] and directly write to the tables with a client-go Reflector without the additional cache.Store.
  • Remove ServiceCache and ServiceManager

@robscott left a comment

Thanks for all the work on this @joamaki! This looks really promising, mostly just have a bunch of questions to make sure I'm understanding things correctly.

Comment on lines 89 to 92
// Do not allow services with same address but different names to override each other.
if !existing.Name.Equal(params.Name) {
return false, ErrServiceConflict
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't guarantee the order of operations from the Kubernetes API server, so this could be preventing an entirely valid change, i.e. Services A and B both claim the same IP because the create event for B comes in before the delete event for A. I don't think it's likely, but if we're rejecting this here, we'll need to ensure we have retries with backoff for anything that's calling UpsertService.
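A retry with backoff around the upsert could look roughly like this (a self-contained sketch; ErrServiceConflict, upsertWithBackoff, and the retry parameters are illustrative stand-ins, not Cilium code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrServiceConflict mirrors the conflict error returned when two services
// claim the same frontend address.
var ErrServiceConflict = errors.New("service conflict")

// upsertWithBackoff retries a conflicting upsert a few times, sleeping
// exponentially longer between attempts. upsert is a stand-in for the real
// UpsertFrontend call wrapped in its own transaction.
func upsertWithBackoff(upsert func() error, attempts int, base time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = upsert(); !errors.Is(err, ErrServiceConflict) {
			return err // success, or a non-retryable error
		}
		time.Sleep(base << i) // base, 2*base, 4*base, ...
	}
	return err
}

func main() {
	// Simulate the out-of-order case: the delete for service A lands only
	// after two failed attempts to create service B on the same address.
	calls := 0
	err := upsertWithBackoff(func() error {
		calls++
		if calls < 3 {
			return ErrServiceConflict
		}
		return nil
	}, 5, time.Millisecond)
	fmt.Println(calls, err)
}
```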

@joamaki commented Apr 29, 2024

Ah, that's very interesting. I don't think Cilium currently handles such a possibility. But if it is indeed possible then yes, we would need to have retries in the data source implementation.

An alternative is to allow overlapping services so this can happen in any order, e.g. a Shadowed []*Service field to store the overlapping but inactive services. That seems fairly complicated, so I wouldn't go there unless the alternatives are worse.


I'd suggest asking ourselves what is a unique and immutable key for Frontends and then querying by this key. Does Frontend correspond to a pair (Service, Port) in Kubernetes? If so, then perhaps we should query by the fully qualified Service name (namespace+name in Kubernetes) plus the Port (including protocol).

@joamaki commented May 8, 2024

The unique immutable key is the tuple (address, protocol, port) (L3n4Addr). The service name is there to group multiple frontends together and to associate the backends with the frontends (it's a secondary index).

We cannot have the service name in the key since then we may end up with multiple frontends with the same address and that's impossible to reconcile (the assumption is that there's multiple data sources (k8s, REST API, kvstore, ...) writing into this so collisions are possible).

I guess what we could do is allow a frontend to have many names, and thus multiple "backend sets" which we'd merge together. This way multiple data sources could create the same frontend (by IP:port) but call it different things. Not sure if that's more confusing than useful...

Comment on lines 94 to 98
// Do not allow overriding a service that has been created from
// another source. If the service is 'Unspec'
if existing.Source != params.Source {
return false, ErrServiceSourceMismatch
}


This does seem like a reasonable approach, but any reason not to just overwrite here instead? Would it be worth having some kind of configurable prioritization of Sources, where data from Source A always has precedence over data from Source B? What does "Unspec" mean in this context?

@joamaki commented

Mostly I was trying to start out with some reasonable, but strict defaults. I have not fully thought through yet how sources that have overlapping services should interact as it's not really something we've supported before.

Not sure why I have an unfinished comment here nor what I was thinking about with "Unspec" sources. Will clean it up.

}

func (s *Services) deleteService(txn ServiceWriteTxn, svc *Service) error {
// Release references to the backends


Nit: Maybe make it clear that this will also result in backend deletion

Suggested change
// Release references to the backends
// Release references to the backends and delete them if they're orphaned

return err
}

func (s *Services) DeleteServicesBySource(txn ServiceWriteTxn, source source.Source) error {


When do you expect this would be used? Do you expect that it will be possible to add or remove sources while Cilium is actively running?

@joamaki commented

It's needed when a source needs to resync. The plan is to use the client-go Reflector directly, without an Informer, in which case we need to implement the Replace() operation; that's where we'd use this.

@joamaki commented

case reflector.CacheStoreReplace:

return nil
}

func (s *Services) UpsertBackends(txn ServiceWriteTxn, serviceName ServiceName, source source.Source, bes ...*Backend) error {


I'm trying to think through how we would translate this from Kubernetes informers and/or as an xDS client.

It seems like we'd need to do something like this:

  1. Get existing set of endpoints in Cilium state with loadbalancer.GetBackendsForService()
  2. Compare that list with the set of endpoints you've received from Kubernetes (likely by merging the EndpointSlices you've received from k8s API Server) to determine which endpoints need to be added, updated, or deleted
  3. Creates go to UpsertBackends(), updates go to UpdateBackendState(), and deletes go to multiple ReleaseBackend() calls

All of the above then needs to be wrapped in some form of transaction where you're blocking writes to endpoints tied to that Service for the duration of those steps. Is that roughly the correct intended usage?

@joamaki commented

I would probably instead hold the state inside the Kubernetes or other data source to figure out from the endpoint update what the delta is. E.g.

currentBackends := map[string]sets.Set[loadbalancer.L3n4Addr]{}
.
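A self-contained sketch of that delta computation (using plain map-based sets in place of sets.Set[loadbalancer.L3n4Addr]; names are illustrative, not the real data-source code):

```go
package main

import "fmt"

// addrSet is a minimal stand-in for sets.Set[loadbalancer.L3n4Addr],
// keyed here by the backend address string for simplicity.
type addrSet map[string]struct{}

// diff computes which backends to upsert and which to release, given the
// previously seen set for a service and the new set from an endpoint update.
// The data source holds the "current" set itself rather than reading it back
// from the tables.
func diff(current, updated addrSet) (toUpsert, toRelease []string) {
	for addr := range updated {
		if _, ok := current[addr]; !ok {
			toUpsert = append(toUpsert, addr)
		}
	}
	for addr := range current {
		if _, ok := updated[addr]; !ok {
			toRelease = append(toRelease, addr)
		}
	}
	return
}

func main() {
	current := addrSet{"10.0.0.1:80": {}, "10.0.0.2:80": {}}
	updated := addrSet{"10.0.0.2:80": {}, "10.0.0.3:80": {}}
	up, rel := diff(current, updated)
	fmt.Println(up, rel)
}
```

The upserts would then go to UpsertBackends and the releases to ReleaseBackend, all within one WriteTxn.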

@@ -341,12 +342,39 @@ type ServiceID uint16

// ServiceName represents the fully-qualified reference to the service by name,
// including both the namespace and name of the service (and optionally the cluster).
// +k8s:deepcopy-gen=true
type ServiceName struct {


This is probably not worth changing given how widely it's already used, but I think a name like ServiceIdentity would more clearly describe what this is used for. The ServiceName references throughout keep on throwing me off.

Source: source,
}
if old, _, ok := s.bes.First(txn, BackendAddrIndex.Query(be.L3n4Addr)); ok {
// UpsertBackends cannot overwrite the backend state or allocated ID.


I'm probably just looking at the world through Kubernetes-tinted glasses, but wouldn't you usually have endpoint state if you had the other endpoint info? Why not allow state to be updated here instead of requiring a follow up call to UpsertBackendState?

@joamaki commented

Because we might have the same backend added from multiple sources or from multiple EndpointSlices and we want to retain the learned state/health of the backend even when it's being added multiple times. Though now that I look at this again, we do want to propagate the terminating state from K8s. This definitely needs a rethink.


params := ServiceParams{
Name: svc.Name,
Frontend: svc.Frontend.L3n4Addr,


How do you handle multiple addresses/IP families? Are these just considered different Services in this context? Are there separate "Services" for every port as well?

@joamaki commented

Yes, there's a separate service for each type, port, address. I'm starting to lean towards calling this a Frontend instead of Service as it's confusing when this is not semantically the same as the K8s one.

@joamaki commented

Renamed Service to Frontend

Comment on lines 114 to 119
iter := loadbalancer.GetBackendsForService(txn, ops.Backends, svc)
bes := statedb.Collect(
statedb.Filter(iter, func(be *loadbalancer.ServiceBackend) bool {
return svc.Frontend.ProtoEqual(&be.L3n4Addr)
}))
_, _, err := ops.Manager.UpsertService(toSVC(svc.Frontend, svc, bes))


How are stale endpoints pruned? Not seeing any evidence of that happening here, but I'm probably just missing something obvious.

@joamaki commented

ServiceManager keeps track of backends that are referenced by services and removes the ones that do not exist. The UpsertService call here tells ServiceManager about all the backends currently associated to a service.

I would like to replace ServiceManager later on with a StateDB reconciler that takes the Service and queries for associated backends and similarly figures out which backends are now gone and need to be removed from the backends BPF map.

Looks roughly like this: https://github.com/cilium/cilium/pull/31755/files#diff-1aab408d0c33a6dfa08997f13c3a94f372eb95a331b731c09752240a960be346R45

// Release references to the backends
iter, _ := s.bes.Get(txn, BackendServiceIndex.Query(svc.Name))
for be, _, ok := iter.Next(); ok; be, _, ok = iter.Next() {
be, orphan := be.removeRef(svc.Name)


It seems at least theoretically possible that some race conditions or errors here could leave orphaned backends around indefinitely. I'm not really sure what the solution is here, but have you already been thinking about how to handle this?

@joamaki commented

The idea is that all writes to the tables go via the Services API that will take care of managing the references. The solution is to implement that correctly :-)

type FrontendParams struct {
Name loadbalancer.ServiceName // Fully qualified service name
Source source.Source
Frontend loadbalancer.L3n4Addr // Frontend address

Struct Frontend has a field of type FrontendParams, which in turn has a field called Frontend again yet of a different type. I'd suggest renaming the last one to something like Address, which would eliminate the need for a comment in this line.

@joamaki commented

Ah good point, will rename.

@DamianSawicki commented

Service type in comparison to SVC has additional fields related to reconciliation status and backend revision (the link between the two tables).

I believe Service should be Frontend.

Implements the "writer" to Table[Service] and Table[ServiceBackend].

And here probably these should be Table[Frontend] and Table[Backend].

Add the experimental Services API for managing load-balancing frontends
and backends. This is added as a new experimental package to avoid
confusing it with the production implementation.

When the hidden "--enable-new-services" flag is set the reconciler is started to
reconcile Table[Frontend] and Table[Backend] with the ServiceManager.

There is no code yet that inserts into the tables, so this does nothing
by itself.

The tables can be inspected with "cilium-dbg statedb dump".

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Labels: release-note/misc (This PR makes changes that have no direct user impact.)