
issues with schemaId / storage-schemas.conf in place cluster upgrades #1137

Open
Dieterbe opened this issue Nov 13, 2018 · 1 comment

Dieterbe commented Nov 13, 2018

We have a few bugs that manifest when doing a rolling upgrade of storage-schemas.conf across a cluster. During the upgrade, different nodes will have different storage-schemas.conf rules, so the same schemaId value means different things on different nodes (see the sketch below the list). This means:

  1. instances receiving queries use schemaId to look up retention and resolution in alignRequests. Thus "our last rollout added some new retentions and queries were messed up until the rollout completed (at which point everything was ok)" per @shanson7. (Note: this can be worked around by doing a blue/green deployment; we may choose to treat this as merely a documentation bug.)

  2. write nodes will panic if they receive a chunk persist message for a span/rollup they don't recognize, which may happen if you change storage-schemas.conf or storage-aggregations.conf

  3. not an issue yet, but if we ever add speculative execution for render requests (or a failover mechanism that retries a failed render on the other replica), schemaId may be off (a similar note as in 1 applies here)

TODO: track IrId and AggId through the source code and see if there are similar issues with them
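To make the failure mode concrete: schemaId is only meaningful relative to the storage-schemas.conf that the stamping node parsed. Below is a minimal, hypothetical sketch (the types here are illustrative, not Metrictank's real ones) of an index-based schemaId resolving to different retentions on two nodes whose configs differ mid-rollout.

```go
// Minimal sketch, assuming simplified stand-in types: schemaId is treated as
// an index into the ordered list of rules parsed from this node's
// storage-schemas.conf, so two nodes with different files map the same
// schemaId to different rules.
package main

import "fmt"

// Rule is a simplified stand-in for one storage-schemas.conf entry.
type Rule struct {
	Pattern    string
	Retentions string
}

// Schemas holds one node's rules in file order; schemaId is the index.
type Schemas []Rule

func (s Schemas) Get(schemaId uint16) Rule {
	return s[schemaId] // an out-of-range id would panic, similar in spirit to issue 2
}

func main() {
	// Node A: config before the rolling upgrade.
	oldNode := Schemas{
		{"carbon.*", "10s:1d"},
		{"default", "60s:30d"},
	}
	// Node B: upgraded config with a new rule inserted at the top,
	// which shifts the index of every rule below it.
	newNode := Schemas{
		{"important.*", "1s:7d"},
		{"carbon.*", "10s:1d"},
		{"default", "60s:30d"},
	}

	id := uint16(1) // a schemaId stamped by one node and interpreted by another
	fmt.Println("node A thinks schemaId 1 is:", oldNode.Get(id)) // default 60s:30d
	fmt.Println("node B thinks schemaId 1 is:", newNode.Get(id)) // carbon.* 10s:1d
}
```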


Dieterbe commented Nov 13, 2018

dieter [7:12 PM]
to address the alignRequests one, each node could return its version of storage-schemas.conf, and alignRequests could theoretically work with that (sketched below). but it sounds like more hassle than it's worth. i'm inclined to say that if you need to make changes to storage-schemas.conf, use a blue/green deployment or switch traffic over to the other cluster, which is pretty much the same thing
dieter [7:12 PM]
the write node issue seems pretty critical though
dieter [7:14 PM]
I think we should also add a section to the operations guide about when to do a blue/green style deploy vs an in-place upgrade
dieter [7:20 PM]
the main downside of blue/green (or running a 2nd cluster) is that you need to at least temporarily double your read instances, or your entire cluster respectively, which is not something we realistically plan to solve anytime soon
so as long as we know that there will be scenarios in which this type of upgrade is needed anyway (e.g. major clustering changes), there's not much to be gained by avoiding it for scenarios like this one.
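For reference, a rough sketch of the two mitigations discussed above, using assumed types rather than Metrictank's real API: (a) the responding node reports the retentions it actually resolved, so the query node does not interpret a schemaId against its own (possibly different) table, and (b) a write node validates the span of an incoming chunk persist message and drops it with a log line instead of panicking.

```go
// Sketch only: SeriesResp, PersistMsg and knownSpans are hypothetical,
// not Metrictank's real types.
package main

import (
	"fmt"
	"log"
)

// SeriesResp is a hypothetical peer response: instead of only a schemaId,
// it spells out the interval and retentions the responding node resolved.
type SeriesResp struct {
	Metric     string
	Interval   int    // seconds per point, as resolved by the responding node
	Retentions string // e.g. "10s:1d,1m:30d", as resolved by the responding node
}

// knownSpans stands in for the set of chunk spans this write node supports.
var knownSpans = map[uint32]bool{600: true, 1800: true, 3600: true}

// PersistMsg is a hypothetical chunk persist notification between replicas.
type PersistMsg struct {
	Key  string
	Span uint32
}

// handlePersist drops messages with an unrecognized span instead of panicking,
// so a mid-rollout mismatch degrades to a warning rather than a crash.
func handlePersist(m PersistMsg) {
	if !knownSpans[m.Span] {
		log.Printf("ignoring persist for %s: unknown chunk span %d (schemas mismatch during rollout?)", m.Key, m.Span)
		return
	}
	fmt.Printf("persisting %s with span %d\n", m.Key, m.Span)
}

func main() {
	// The query node plans the request purely from what the peer reported.
	resp := SeriesResp{Metric: "some.series", Interval: 10, Retentions: "10s:1d,1m:30d"}
	fmt.Printf("aligning %s at %ds using peer-reported retentions %s\n",
		resp.Metric, resp.Interval, resp.Retentions)

	handlePersist(PersistMsg{Key: "some.series_600", Span: 7200}) // unknown span: logged and dropped
	handlePersist(PersistMsg{Key: "some.series_600", Span: 600})  // known span: accepted
}
```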

Dieterbe added this to the 1.0 milestone Nov 13, 2018