
design doc: Emissary's CRD conversion logic

Luke Shumaker edited this page Oct 7, 2022 · 2 revisions

date: 2022-10-07

This is an update to the older (2022-08-26) Edgissary's CRD conversion logic document in Notion.

This document describes things as of PR #4055. If working with legacy versions that don't include that PR, things will be different.

Emissary needs to convert things

Emissary accepts several different getambassador.io API versions: getambassador.io/v1, getambassador.io/v2, and getambassador.io/v3alpha1. It accepts these both as Kubernetes resources (defined via CRDs) and as YAML annotations on certain other Kubernetes resources:

  • For Kubernetes resources, the CRDs point to a conversion webhook that we call "apiext", which converts between whichever versions the Kubernetes apiserver needs. Emissary itself asks the apiserver for resources as the latest version, so that the main part of the Emissary code only needs to worry about one version.

    controller-runtime helps with this: it provides code that handles all the webhook-y stuff, so we only need to provide the conversion functions.

  • For annotations, Emissary does have to handle multiple versions. The "watcher" part of Emissary constructs a snapshot that it passes to the rest of the system; this watcher also converts the resources in the snapshot to the latest version, so that the main part of the Emissary code only needs to worry about one version.

    We borrow some of controller-runtime's webhook conversion code in order to do this (since controller-runtime doesn't publicly expose the parts we need); see borrowed_webhook.go. This is perhaps not the design that would be best on its own, but consistency with controller-runtime is good.

How does that controller-runtime stuff work?

There are 2 upstream conversion mechanisms at play here:

  1. The core k8s.io/apimachinery/pkg/runtime.Scheme, where you call scheme.AddConversionFunc(srcType, dstType, conversionFunc) and it builds a mesh of possible conversions. These functions are not methods on the resource types, which may be a downside, but it does mean that you can write conversions for 3rd-party types that you can't add methods to. Plus, there's a conversion-gen tool that will mostly write these functions for you, along with the functions that make all of those scheme.AddConversionFunc calls. The big downside is that the scheme doesn't know how to traverse its conversions as a graph; you have to register a separate function for each (srcType, dstType) combination, so to do arbitrary conversions you'd have to generate a full mesh of functions.

  2. sigs.k8s.io/controller-runtime/pkg/conversion attempts to address the tedium of needing N^2 conversion functions, cutting that to 2N, by building a layer on top of the core runtime.Scheme. And by "on top of" I actually mostly mean "missing the point of": it uses the scheme as a simple listing of known versions, ignoring any conversion funcs in the scheme. Instead of supplying N^2 conversion funcs (one between every pair of versions), you designate one version as the "hub", and all of the other versions ("spokes") implement a ConvertFrom method to convert from the hub to that version and a ConvertTo method to convert from that version to the hub. The webhook conversion then takes the pair of types and, if either is the hub, just calls the appropriate method directly; otherwise it iterates over the scheme looking for the hub version, calls the source type's ConvertTo to convert to the hub, and then calls the destination type's ConvertFrom to convert from the hub. Also, IMO this is only POC quality in controller-runtime; it's definitely not prime-time ready. One of the biggest drawbacks is that controller-gen can't help you generate these conversion methods, and it's clear not many folks are using it For Real yet (I submitted a KubeCon talk about it...).

(You can ask a scheme what types it knows, but not what conversions it knows, at least not without just trying a conversion and checking whether it returns an error. I've thought that this would be a good addition; it would allow controller-runtime to be enhanced to build a graph of what the scheme itself knows how to do, and then use basic graph-traversal algorithms instead of needing an explicit hub and all of those spoke methods.)

How Emissary uses that runtime stuff and controller-runtime stuff

So, what do we do? Both! Emissary uses conversion-gen to generate runtime.Scheme-compatible conversion functions (well, a patched conversion-gen; upstream conversion-gen has turned out to be quite buggy, and is also missing a few highly convenient features). Then we have awk scripts in generate.mk that generate adapters between the two systems: .Hub(), .ConvertTo(), and .ConvertFrom() methods that call into the scheme to do the actual conversions. Best of both worlds!

Emissary designates the latest version (currently v3alpha1) as the hub version, and v1 and v2 as spoke versions; this is done by these lines in generate.mk:

generate-fast/files += $(OSS_HOME)/pkg/api/getambassador.io/v1/zz_generated.conversion.go
generate-fast/files += $(OSS_HOME)/pkg/api/getambassador.io/v1/zz_generated.conversion-spoke.go
generate-fast/files += $(OSS_HOME)/pkg/api/getambassador.io/v2/zz_generated.conversion.go
generate-fast/files += $(OSS_HOME)/pkg/api/getambassador.io/v2/zz_generated.conversion-spoke.go
generate-fast/files += $(OSS_HOME)/pkg/api/getambassador.io/v3alpha1/zz_generated.conversion-hub.go

Emissary's special sauce: daisy chaining

However, there's a trick here! generate.mk doesn't have conversion-gen generate functions between spoke←→hub; it has conversion-gen generate functions between one version and the next. So v1 gets functions to convert between it and v2, and v2 gets functions to convert between it and v3alpha1; the .ConvertTo and .ConvertFrom adapter methods then iterate over that daisy-chain of conversions in the scheme in order to implement the spoke←→hub conversion.

So, the questions: Why did I choose v3alpha1 as the hub version? Why not use the storage version as the hub version? Why have conversion-gen generate this daisy-chain instead of directly generating the spoke←→hub conversions? The answers to these are all somewhat intertwined:

  • Precedent: Most of Kubernetes' built-in types handle this internally by converting to an "internal" version as the hub (which is also the form the backend code receives); this version isn't ever exposed to users. Sometimes it's a sort of "bleeding-edge" version that might get frozen as the next stable version; sometimes it has some subtle type changes in it that make it nicer to work with in Go but less nice in YAML. Since we don't have an "internal" version with not-yet-released changes (just "TODO" comments about needing to make those changes), using the latest version (v3alpha1) is closest to what upstream Kubernetes does, which is the best indication of "best practice" we have right now.

  • It's about future-readiness:

    • If we designate an old version (v2) as the hub version, then in the future, when we drop that version, all of the conversion funcs would need to be rewritten for the new hub version. The funcs are mostly generated, but some parts need to be hand-written too; having to rewrite those all at once is at best a huge amount of churn, and at worst a good source of bugs/regressions.

    • If we designate an old version (v2) as the hub version, that means that all of the future conversions will need to deal with weird quirks of v2, and we'll have implementation baggage slowing things down (tech debt) as we move further and further from v2. For example, if we change some semantic from v3alpha1 to v3alpha2, the logical thing to do would be to just code up that conversion from v3alpha1 to v3alpha2, without having to think about "how does this round-trip through v2?"

    • While v3alpha1 is the current latest version, it will eventually be an old version, and the above will apply to it. So do we say "when we introduce a new version, we must immediately make that version be the hub version?"

    • If we evergreen-designate the latest version as the hub version, that means we're rewriting all of the conversions every time we introduce a new version; see the first point above about churn and bugs.

    • If we create a new internal-only version that gets to be the hub version (the same as how most kube builtins do it), that mostly solves all of the above, but means that every time we want to change it we need to go update all of the conversions. This adds significant friction to making changes, and creates a strong pressure to prune old versions so that they aren't weighing us down. The friction is bad for developers, and over-eagerly removing versions is bad for users.

    • If, instead of just following controller-runtime's hub/spoke model (because, again, I don't think controller-runtime's conversion stuff is terribly well thought out or fleshed out at this point; it's more of a POC), we implement our own daisy-chain model, I think that solves all of these problems. An old tail version can be pruned without affecting anything else. You only need to think about the 2 most recent versions when making a change. And when adding a new version, the only change to conversions that you need to make is adding functions to 1 version (the now-formerly most recent version).

So IMO the daisy-chain model is the only maintainable model. And why use the most-recent version as the hub? Well, for the chain you want all of the links going the same direction, so the hub (the one version that doesn't get its own conversion functions) has to be either the lowest version or the highest. And I suppose the runtime.Scheme-to-controller-runtime adapters don't need to follow that, and could let any version be the hub; but that'd be more complicated, and I don't really see the point.