
Scaling the Routing Service - A Discussion About How to Start Up Faster #3953

kevinkreiser opened this issue Feb 7, 2023 · 2 comments


kevinkreiser commented Feb 7, 2023

Editor's Note: We've been struggling with this problem since the creation of the project. Indeed, most other projects that serve queries of this size over data of this size will have struggled with the same problem. What follows are some of the various thoughts and topics that have come up over the years regarding it.

Problem Description

In typical backend services one generally relies on horizontal scaling to match changing request load. In short, your service is configured to add or remove capacity (instances of the service on separate machines/VMs/k8s pods etc.) depending on the value of some metric (e.g. CPU utilization). A planet-wide routing tileset can be anywhere from 70 to 140 GB in size. For best performance (in terms of latency) the service is run with all the data local to its process (typically on an NVMe disk). What this means is that scaling up (adding capacity) takes time; typically 10+ minutes to get the data onto the disk before the new service instance can start. In that time incoming request volume can overwhelm the capacity of your existing instances, which will result in your service returning 504 (timeout) for some of the queued requests.

Conventional Scaling is Rarely Practical

At the end of the day we have a database and we want to scale it horizontally. Depending on your use case, the shape of your request traffic can vastly limit your options for horizontal scaling. Many factors play a role, but the variance in request rate and the variance in complexity of those requests over time are the typical deciding factors in whether or not it's possible for you to horizontally scale valhalla as it works today. If you have a nice sinusoidal request rate over the course of the day (corresponding to when people are awake and not) and your requests don't vary that much in complexity, then you can probably get away with scaling valhalla in the typical way; just set a really low CPU threshold and let the service add and remove machines over the course of the day.

Sadly though, the world is usually only ever that predictable in the limit. With huge request volumes you can expect some dependability, but it's far more common to have a request rate that is at least somewhat "bursty," meaning a higher degree of variance in request rate and complexity over time. Strategies or theories for responding to bursty request patterns are therefore what we will concern ourselves with in the rest of this writeup.

Typical Configuration Has Obvious Waste

The typical backend service (as opposed to embedded phone app or in-car dash system) configuration for valhalla consists of:

  1. running the service in multiprocess mode for operational stability due to isolation of requests
  2. making the data accessible via a memory mapped tar file so that OS filesystem cache (ram) can be shared across processes that need lock-free concurrent access to the data. This is essentially a multi-reader (the service processes) single-writer (the OS) cache

As mentioned above we download the whole of the data (one big tar file) to the disk before we start the service processes, which memory map the file. The download takes a while, even with compression/decompression tricks, and memory mapping the file can also take some time (though this can be shortened by using the tar index support which we added in #3281).
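For reference, this mode is driven entirely by configuration; a minimal sketch of the relevant piece of valhalla.json might look like the following (the path is a placeholder and the rest of a working config is omitted; double-check the key names against the default config):

```json
{
  "mjolnir": {
    "tile_extract": "/data/valhalla_tiles.tar"
  }
}
```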

So the obvious thing we want to do is stop wasting time getting data to the service that it doesn't immediately need. We download the whole planet before starting the service, but a given request can be serviced by only a handful of tiles (a fraction of the data).

How to Reduce the Wasted Data

So now that we know we don't need all the data to start handling requests (that's the benefit of having a tiled dataset), how do we determine what data to cut out? There are basically two approaches to solving this problem, and the two are not mutually exclusive.

The first option is to reduce the total scope of the data by concentrating on routing in only one part of the world. Some people refer to this approach as "sharding", an idea that's been around forever in distributed systems, in which no single instance has a copy of the entire dataset but rather just a single piece of the whole: a shard.

The second option is lazy loading, an equally old idea in which data is fetched/loaded on-the-fly as the algorithm calls for it. That is, the data accessor is lazy: it doesn't preload any data, it waits for the code to ask it for data and loads it only then, or in the background while other work is already being done.

Sharding: Regional Extracts

The sharding approach would practically result in at least several top level instance configurations based on their data coverage:

  1. North America
  2. South America
  3. Afro-Eurasia
  4. Oceania + Rest of World

This is primarily because these are the major connected regions of the world. Of course these regions could be further reduced to smaller and smaller regions. For example you could keep the Afro-Eurasia region in the case that you'd need to route someone from Cape Town to Beijing, but you could also have separate instances running with just Western Europe or just North Africa and so on. The shards can overlap, in other words. We do this with the hope that you don't see a lot of bursty request patterns for the larger shards (Afro-Eurasia has to be something like 70% of the planet data) so that most of the horizontal scaling is done on the smaller shards, which start faster by virtue of having much less data.

Pros to this approach:

  1. Cut up the data as much as you like to match your traffic the best you can
  2. The approach works without any code changes to valhalla
  3. The idea is straightforward

Cons to this approach:

  1. You're only as scalable as your largest shard; it's possible it's still too large to practically start in time
  2. You need a complicated request proxy that knows about the shards and doesn't pick the wrong one (this is non-trivial, shard shapes must be based on connectivity somehow)
  3. Data production is complicated for the same reason; we need to learn how to cut up the data based on what we know about request location and volume
  4. Managing heterogeneous clusters is complicated, they all have to be tracked and alarmed separately

Lazy Loading

Lazy loading in a practical sense means just-in-time tile fetching from some location that isn't local to the process. Valhalla already supports this via its http drivers. Even today, one can configure valhalla to fetch tiles on-the-fly from s3 for example (using pre-signed urls). What this means is that even today you can get a new instance of the service started instantly and start servicing requests instantly. The problem with this approach though is that the latency of such a solution is far too high. An http round trip for something like s3 is already on the order of 50-100ms. For it to work we need low latency access to the tile data. NVMe drives have a latency of something like 50-100μs; this is our cache miss penalty in the memory mapped tar mode described above. In other words our worst case is 3 orders of magnitude faster than lazy loaded http data access.

On top of that, the stuff we lazy load isn't shared in a memory map across processes; because we are grabbing individual tiles we end up having an in-memory tile cache local to every process. This means we need a lot more ram to handle the same request load as before, or we need to do a lot more per-process cache invalidation, which also slows things down. So yeah, a pretty dismal situation.
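For what it's worth, the http-driver setup mentioned above is also just configuration; a rough sketch might look like the following (the key names and URL are placeholders/assumptions, so treat the default config as authoritative):

```json
{
  "mjolnir": {
    "tile_url": "https://example-bucket.s3.amazonaws.com/tiles",
    "tile_dir": "/data/tile_cache"
  }
}
```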

So those are the two issues we need to solve to make lazy loading practical.

  1. Low latency access to remote tiles -> A global distributed cache
  2. Low ram requirements per instance -> A local shared cache

The good news is we already have both of these readily available to us. For the former we can use Redis/Memcached (or hosted variants) and for the latter we can continue using our memory mapped tar. How do we do that?

At Data Build Time

  1. Optional: modify the tar generation script to do better sorting of the tiles within the tar; cluster level 0, 1 and 2 tiles together in blocks rather than using a lexical sort. We could also try the infamous Hilbert curve (though it's not clear how different levels fit in that scheme). This should improve cache coherency somewhat when memory mapping
  2. Populate redis with tileset/tileid as keys and the corresponding binary gph tiles as values
  3. Also add the tar file index, index.bin, with tileset/index.bin as the key and the actual file as the value (see the sketch after this list)
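A minimal sketch of steps 2 and 3, assuming a plain redis-py client and a tileset name/version used as a key prefix (the key scheme and file names here are illustrative, not an established convention):

```python
import os
import tarfile

import redis  # pip install redis

TILESET = "planet-2023-02-07"  # illustrative tileset name/version used as the key prefix
r = redis.Redis(host="localhost", port=6379)

with tarfile.open("valhalla_tiles.tar") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        data = tar.extractfile(member).read()
        if member.name.endswith(".gph"):
            # key by tileset + tile path, value is the raw binary tile
            r.set(f"{TILESET}/{member.name}", data)
        elif os.path.basename(member.name) == "index.bin":
            # the tar index gets its own well-known key
            r.set(f"{TILESET}/index.bin", data)
```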

Single-Writer or Multi-Writer, Either Way Don't Race

Here we have two options: we can start a sidecar to handle operations with redis (or whatever lower latency storage) and the filesystem tar, or we can do this all inside graphreader and libvalhalla itself. Below I wrote an outline with the former in mind but the latter is equally doable.

  1. download the index.bin file from redis (should be tiny)
  2. write the beginning of the tar with the entry and contents of index.bin and then write 0s until the rest of the tar is the desired length (based on last entry of index.bin)
  3. memory map the tar we just created
  4. open up a (domain) socket to listen for requests from the valhalla graph reader processes; if they see something is missing in the tar they will ask this process to get it
  5. when they ask, this sidecar process will:
    1. go to redis to get the tile data
    2. write all of the bytes, except the graph_tile_size part of the graphtileheader, into the location the tile belongs within the tar based on index.bin
    3. then write the graph_tile_size part of the graphtileheader, signaling (to the readers in the service processes) that the write is complete
  6. return an ok over the socket, signaling to the caller that it can now use the memory mapped address in the tar (a sketch of this loop follows below)
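A very rough sketch of the sidecar's request loop, in Python only to illustrate the ordering of the writes (the real thing would more likely live next to the graphreader in C++). The key scheme, the index parsing, and the SIZE_FIELD_OFFSET/SIZE_FIELD_LEN constants are assumptions standing in for the actual graphtileheader layout:

```python
import mmap
import socket

import redis

TILESET = "planet-2023-02-07"  # must match the key prefix used at data build time
SIZE_FIELD_OFFSET = 0          # hypothetical byte offset of graph_tile_size within the header
SIZE_FIELD_LEN = 4             # hypothetical width of that field in bytes

r = redis.Redis(host="localhost", port=6379)
tar_file = open("valhalla_tiles.tar", "r+b")
mm = mmap.mmap(tar_file.fileno(), 0)  # the sparse tar we pre-sized from index.bin

# tile name -> byte offset of that tile's contents inside the tar,
# parsed from index.bin at startup (parsing omitted here)
tile_offsets = {}

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind("/tmp/valhalla_sidecar.sock")
srv.listen()

while True:
    conn, _ = srv.accept()
    tile_name = conn.recv(4096).decode()  # e.g. "2/000/756/425.gph"
    offset = tile_offsets[tile_name]
    data = r.get(f"{TILESET}/{tile_name}")
    field_end = SIZE_FIELD_OFFSET + SIZE_FIELD_LEN
    # write everything except the graph_tile_size field first...
    mm[offset:offset + SIZE_FIELD_OFFSET] = data[:SIZE_FIELD_OFFSET]
    mm[offset + field_end:offset + len(data)] = data[field_end:]
    # ...then the size field last, which is what tells readers the tile is complete
    mm[offset + SIZE_FIELD_OFFSET:offset + field_end] = data[SIZE_FIELD_OFFSET:field_end]
    conn.sendall(b"ok")
    conn.close()
```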

Multi-Reader, Check then Fallback

On the graphreader side, where we need to get access to the data, we essentially need to add a little bit of checking to the tar "driver".

  1. Update the tar driver to check the graph_tile_size field for a non-zero value in the graphtileheader
  2. If the value is non-zero we are good to go; return the address of the tile inside the mem-mapped tar as normal
  3. If the value is zero, fall back to sending a request for the tile over the open socket
  4. Once the socket returns ok we can return the address as normal (a sketch of this check follows below)
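The real change belongs in the C++ tar driver inside graphreader; this Python sketch just illustrates the check-then-ask ordering, with the same hypothetical SIZE_FIELD_OFFSET/SIZE_FIELD_LEN assumptions as in the sidecar sketch above:

```python
import socket

SIZE_FIELD_OFFSET = 0  # hypothetical byte offset of graph_tile_size within the header
SIZE_FIELD_LEN = 4     # hypothetical width of that field in bytes

def get_tile_view(mm, tile_name, tile_offsets):
    """Return a view over the tile inside the memory-mapped tar,
    asking the sidecar to fill it in first if it isn't there yet."""
    offset = tile_offsets[tile_name]
    size_bytes = mm[offset + SIZE_FIELD_OFFSET:offset + SIZE_FIELD_OFFSET + SIZE_FIELD_LEN]
    if int.from_bytes(size_bytes, "little") == 0:
        # cache miss: ask the sidecar over the domain socket and wait for its ok
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect("/tmp/valhalla_sidecar.sock")
            s.sendall(tile_name.encode())
            assert s.recv(2) == b"ok"
    # the tile bytes now start at this offset, exactly as in the normal mmap path
    return memoryview(mm)[offset:]
```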

Synopsis

A practical approach to solving the horizontal scaling problem likely requires a combination of strategies, shrinking data and delaying data fetching. I would suggest starting with lazy loading support, measuring its efficacy and then making the decision on whether or not sharding is required to further reduce the scope of data needed at start up.

Risks

  1. If the request load is so varied that it covers the whole globe, there is nothing we can do. Look at your request history and the geographic distribution of the queries to determine how localized bursty traffic is
  2. OS filesystem cache is not magic, it takes time to move bytes from disk into ram. A redis round trip goes through ram, to disk and back to ram. This is the price we pay to avoid having machines with a ton of ram. It also means that request latency will be high while a server is starting up. We hope that it will still be usable, but if there is an experiment we can design to derisk this it should be done. There is no point in building the whole system if it still takes 5 minutes for the fresh instance to reach semi-normal request latencies
  3. I'm sure there are more, I'll add them when it's not 2am...
@nilsnolde nilsnolde pinned this issue Feb 7, 2023

j-bbr commented May 31, 2023

I created a proxy for valhalla a while ago, something very similar to the sharding approach you described above, so I wanted to add my 2 cents to the discussion.

We're using Valhalla mainly for matrix and isochrone requests that are fairly local (less than 100 km between points) but spread out all over the globe. Handling and scaling Valhalla the same way as our other services is a benefit in itself for us.

Requirements

  • Should run in the Cloud with Kubernetes (we have had some bad experiences especially with building but also with serving large tile sets in containers)
  • Scale automatically
  • Provide routing coverage for most of the globe

Our Solution

A Kubernetes deployment with mostly medium-sized valhalla containers that each serve a tileset covering one larger country or multiple small countries (so France could be one, and Benelux another, for example). A fairly simple proxy intercepts requests, builds a bounding box from the relevant coordinate parameters and forwards to the appropriate container. The proxy is initialized with a FeatureCollection (a 5 MB file) of optimized border outlines at different levels (continent, country, region) and uses a hierarchical lookup to match, preferring smaller containers where possible (so if a route that starts and ends in Germany could be served by a Europe container or by a Germany container, then the Germany container will be picked).
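Not the actual proxy, just a minimal sketch of that hierarchical lookup, assuming shapely and a GeoJSON FeatureCollection whose features carry illustrative "level" and "upstream" properties:

```python
import json

from shapely.geometry import MultiPoint, shape  # pip install shapely

# features are assumed to carry "level" (0=region, 1=country, 2=continent) and
# "upstream" (the container's URL) properties; the property names are illustrative
with open("shard_outlines.geojson") as f:
    features = json.load(f)["features"]
features.sort(key=lambda feat: feat["properties"]["level"])  # try the smallest coverage first

def pick_upstream(locations):
    """Pick the smallest shard whose outline covers the bounding box of the request coordinates."""
    bbox = MultiPoint([(loc["lon"], loc["lat"]) for loc in locations]).envelope
    for feat in features:
        if shape(feat["geometry"]).covers(bbox):
            return feat["properties"]["upstream"]
    return None  # no single shard covers the request; reject or fall back to a planet instance
```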

PROs:

  • Small start-up times -> reasonable reaction to request bursts
  • Easily and automatically scalable with cpu/memory limits

CONs:

  • Limits to optimal routing: if the route starts and ends in a covered region but the optimal route goes through a third region then it's not guaranteed to be found. Certainly the biggest obstacle to being used generally

I think the cons for the sharding approach greatly depend on how much the assumption that the overwhelming majority of requests is local (subnational for us) applies in your use case, and whether you can live with less-than-optimal routing in the other cases. In our case it's 100 percent local queries so we have it easy, but I would expect the assumption to be generally true. I would expect it to hold (subnationally) for most pedestrian and cyclist routing, for most public transit routing, for most isochrone and matrix requests, and even for planning a road trip from Cape Town to Beijing I would expect many users to query different legs of the trip more frequently than the complete route. This is of course just my assumption and our experience, so it would be interesting to hear different use cases. With that said, I see the drawbacks of the sharding approach slightly differently to what you mentioned:

  1. Scalable as the largest shard: If you cover say 95% of request volume in 5 seconds and the remaining 5% in 5 minutes I think that’s still very desirable. Large shards aren’t swamped with as many requests on start-up as well
  2. This is very true if you need the same routing guarantees from this sharded deployment as you would from a global valhalla planet deployment. With more limited routing guarantees the proxy can be fairly simple
  3. I think you could put in a lot of work optimizing here but I doubt you really need to. The main importance from a deployment perspective is that the startup times are low, so even if your containers' coverage doesn't exactly match your request patterns you still have a system that adapts and scales automatically. I think deploying containers that match national borders (combining smaller adjacent countries where possible) makes a lot of sense, as it makes for fairly understandable routing guarantees: routing between destinations within a country is optimal except in cases where the optimal route goes through a third country.
  4. Managing heterogeneous clusters: I don’t quite understand this point. I think bringing shard sizes down on average makes the management in a cluster much more approachable and lets you use widely used cluster orchestration/management/scaling tools.

Future work:
We're happy with the proxy for our purposes and could also publish it if there is interest, but I think what would be interesting for the future is whether there is a way to feed some high level information (the connectivity information) from valhalla into the proxy to better determine which container a request should be directed to.

TL;DR:
Our experience is that if you're willing/able to live with certain trade-offs regarding routing guarantees, a sharding approach lets you easily deploy a scalable valhalla cluster that covers much of the globe.


nilsnolde commented May 31, 2023

Thanks for the detailed write-up! Good that you made it work for your use case and that you can live with those exact trade-offs. For truly global setups, sharding is likely still too slow to react to bursty traffic, and the penalty paid by lazy loading will likely become more attractive.

Re connectivity: we build it with loki.use_connectivity which is enabled by default. 2 ways of accessing it: /status with verbose: true (it'll be cached after first request) or valhalla_build_connectivity.
