Scaling the Routing Service - A Discussion About How to Start Up Faster #3953
Comments
I created a proxy for valhalla a while ago, something very similar to the sharding approach you described above, so I wanted to add my 2 cents to the discussion. We're using Valhalla mainly for matrix and isochrone requests that are fairly local (less than 100 km between points) but spread out all over the globe. Handling and scaling Valhalla the same way as our other services is a benefit in itself for us.

Requirements
Our Solution

A Kubernetes deployment with mostly medium-sized valhalla containers that serve tilesets consisting of one larger or multiple small countries (so France could be one, and Benelux another, for example). A fairly simple proxy intercepts requests, builds a bounding box from the relevant coordinate parameters and forwards to the appropriate container. The proxy is initialized with a FeatureCollection (5 MB file) of optimized border outlines at different levels (continent, country, region) and uses a hierarchical lookup to match, preferring smaller containers where possible (so if a route that starts and ends in Germany could be served by a Europe container or by a Germany container, then the Germany container will be picked). A rough sketch of this lookup follows the pros and cons below.

PROs:
CONs:
I think the cons for the sharding approach greatly depend on how much the assumption that the overwhelming majority of requests is local (subnational for us) applies in your use case, and whether you can live with less than optimal routing in the other cases. In our case the queries are 100 percent local so we have it easy, but I would expect the assumption to be generally true. I would expect it to hold (subnational) for most pedestrian and cyclist routing, for most public transit routing, for most isochrone and matrix requests, and even for planning a road trip from Cape Town to Beijing I would expect many users to query different legs of the trip more frequently than the complete route. This is of course just my assumption and our experience, so it would be interesting to hear about different use cases. With that in mind I see the drawbacks of the sharding approach slightly differently from what you mentioned.
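To make the container-selection step in the proxy described above more concrete, here is a minimal sketch of the hierarchical lookup. It is only an illustration under assumed names: the `regions.geojson` file, the `level`/`name` properties, and the `valhalla-<region>` container naming scheme are hypothetical, not part of the actual deployment described in this comment.

```python
# Hypothetical sketch of the proxy's container-selection step.
import json
from shapely.geometry import shape, box

# Load optimized border outlines once at proxy start-up.
with open("regions.geojson") as f:
    regions = [
        (feat["properties"]["level"],   # e.g. "region", "country", "continent"
         feat["properties"]["name"],
         shape(feat["geometry"]))
        for feat in json.load(f)["features"]
    ]

# Smaller administrative levels are preferred so requests land on the
# smallest (and therefore fastest to scale) container that can serve them.
LEVEL_ORDER = {"region": 0, "country": 1, "continent": 2}
regions.sort(key=lambda r: LEVEL_ORDER[r[0]])

def pick_container(locations):
    """locations: [(lon, lat), ...] taken from the request parameters."""
    lons, lats = zip(*locations)
    request_bbox = box(min(lons), min(lats), max(lons), max(lats))
    for _level, name, geom in regions:
        if geom.contains(request_bbox):
            # Forward to the deployment serving this extract (naming is made up).
            return f"http://valhalla-{name}:8002"
    return None  # no single extract covers the request

# A route within Germany prefers the Germany container over the Europe one.
print(pick_container([(13.40, 52.52), (11.58, 48.14)]))
```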
Future work:

TL;DR:
Thanks for the detailed write-up! Good that you made it work for your use case and that you can live with those exact trade-offs. For truly global setups, sharding is likely still too slow to react to bursty traffic, and paying the penalty of lazy loading will likely become more attractive. Re connectivity: we build it with
Editor's Note: We've been struggling with this problem since the creation of the project. Indeed, most other projects with data of this size, serving queries of this scope, will have struggled with the same problem. These are some of the various thoughts and topics that have come up over the years regarding it.
Problem Description
In typical backend services one generally relies on horizontal scaling to match changing request load. In short, your service is configured to add or remove capacity (instances of the service on separate machines/VMs/k8s pods etc.) depending on the value of some metric (e.g. CPU utilization). A planet-wide routing tileset can be anywhere from 70 to 140 GB in size. For best performance (in terms of latency) the service is run with all the data local to its process (typically on an NVMe disk). What this means is that scaling up (adding capacity) takes time: typically 10+ minutes to get the data onto the disk before the new service instance can start. In that time incoming request volume can overwhelm the capacity of your existing instances, which will result in your service returning 504 (timeout) for some of the queued requests.
Conventional Scaling is Rarely Practical
At the end of the day we have a database and we want to scale it horizontally. Depending on your use case, the shape of your request traffic can vastly limit your options for horizontal scaling. Many factors play a role, but the variance in request rate and the variance in complexity of those requests over time are the typical deciding factors in whether or not it's possible for you to horizontally scale valhalla as it works today. If you have a nice sinusoidal request rate over the course of the day (corresponding to when people are awake and not) and your requests don't vary that much in complexity, then you can probably get away with scaling valhalla in the typical way: just set a really low CPU threshold and let the service add and remove machines over the course of the day.
Sadly though, the world is usually only ever that predictable in the limit. With huge request volumes you can expect some dependability, but it's far more common to have a request rate that is at least somewhat "bursty," meaning a higher degree of variance in request rate and complexity over time. Strategies or theories for responding to bursty request patterns are therefore what we will concern ourselves with in the rest of this write-up.
Typical Configuration has Obvious Waste
The typical backend service (as opposed to embedded phone app or in-car dash system) configuration for valhalla consists of:
As mentioned above we download the whole of the data (one big tar file) to the disk before we start the service processes, which memory map the file. The download takes a while, even with compression/decompression tricks, and memory mapping the file can also take some time (though this can be shortened by using the tar index support which we added in #3281).
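As an aside, here is a minimal sketch of what a memory-mapped tar with an index looks like from a reader's point of view. It is illustrative only: the `tiles.index` file and its one-line-per-tile format are made up, and the real index.bin layout inside the tar is defined by Valhalla.

```python
# Illustrative only: the index format here is hypothetical.
import mmap

def load_index(path="tiles.index"):
    index = {}
    with open(path) as f:
        for line in f:                      # "tile_id offset size" per line (made up)
            tile_id, offset, size = line.split()
            index[int(tile_id)] = (int(offset), int(size))
    return index

class MappedTar:
    def __init__(self, tar_path="planet.tar", index_path="tiles.index"):
        self.f = open(tar_path, "rb")
        # The OS pages bytes in on demand; nothing is read until a tile is touched.
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.index = load_index(index_path)

    def tile_bytes(self, tile_id):
        # Jump straight to the tile's bytes instead of scanning the archive.
        offset, size = self.index[tile_id]
        return self.mm[offset:offset + size]
```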
So the obvious thing we want to do is stop wasting time getting data to the service that it doesn't immediately need. We download the whole planet before starting the service, but a given request can be serviced by only a handful of tiles (a small fraction of the data).
How to Reduce the Wasted Data
So now that we know we don't need all the data to start handling requests (that's the benefit of having a tiled dataset), how do we determine what data to cut out? There are basically two approaches to solving this problem, and the two are not mutually exclusive.
The first option is to reduce the total scope of the data by concentrating on routing in only one part of the world. Some people refer to this approach as "sharding": an idea that's been around forever in distributed systems, in which no single instance has a copy of the entire dataset but rather just a single piece of the whole, a shard.
The second option is lazy loading, an equally old idea in which data is fetched/loaded on-the-fly as the algorithm calls for it. That is, the data accessor is lazy: it doesn't preload any data, it waits for the code to ask for data and loads it only then, or in the background while other work is already being done.
Sharding: Regional Extracts
In practice, the sharding approach would result in at least a handful of top-level instance configurations based on their data coverage:
This is primarily because these are the major connected regions of the world. Of course these regions could be further reduced to smaller and smaller regions. For example you could keep the Afro-Eurasia region in case you need to route someone from Cape Town to Beijing, but you could also have separate instances running with just Western Europe or just North Africa and so on. The shards can overlap, in other words. We do this with the hope that you don't see a lot of bursty request patterns for the larger shards (Afro-Eurasia has to be like 70% of the planet data), so that most of the horizontal scaling is done on the smaller shards, which start faster by virtue of having much less data.
Pros to this approach:
Cons to this approach:
Lazy Loading
Lazy loading in a practical sense means just-in-time tile fetching from some location that isn't local to the process. Valhalla already supports this via its http drivers. Even today, one can configure valhalla to fetch tiles on-the-fly from s3 for example (using pre-signed urls). What this means is that even today you can get a new instance of the service started instantly and begin servicing requests instantly. The problem with this approach is that the latency of such a solution is far too high. An HTTP round trip to something like s3 is already on the order of 50-100ms. For it to work we need low latency access to the tile data. NVMe drives have a latency of something like 50-100μs; this is our cache miss penalty in the memory mapped tar mode described above. In other words our worst case is 3 orders of magnitude faster than lazy loaded http data access.

On top of that, the data we lazy load isn't shared in a memory map across processes; because we are grabbing individual tiles we end up with an in-memory tile cache local to every process. This means we need a lot more ram to handle the same request load as before, or we need to do a lot more per-process cache invalidation, which also slows things down. So yeah, a pretty dismal situation.
So those are the two issues we need to solve to make lazy loading practical: the latency of fetching tile data and the lack of a shared, memory mapped cache across processes.
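To illustrate the shape of the problem (this is not Valhalla's actual http driver), a naive lazy loader in every worker process looks roughly like the sketch below; the bucket URL and tile naming are hypothetical. Every process pays the HTTP round trip on a miss and keeps its own copy of the tile in RAM.

```python
# Naive lazy loading, for illustration only; URL scheme and tile paths are made up.
import requests

TILE_BASE_URL = "https://example-bucket.s3.amazonaws.com/tiles"  # hypothetical

cache = {}  # per-process; nothing is shared between worker processes

def get_tile(tile_path):
    if tile_path not in cache:
        # Cache miss costs an HTTP round trip (~50-100 ms) instead of ~50-100 us on NVMe.
        resp = requests.get(f"{TILE_BASE_URL}/{tile_path}", timeout=5)
        resp.raise_for_status()
        cache[tile_path] = resp.content  # duplicated in every process's RAM
    return cache[tile_path]
```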
The good news is we already have both of these readily available to us. For the former we can use Redis/Memcached (or hosted variants) and for the latter we can continue using our memory mapped tar. How do we do that?
At Data Build Time
Single-Writer or Multi-Writer, Either Way Don't Race
Here we have two options: we can start a sidecar to handle operations with Redis (or whatever lower latency storage) and the filesystem tar, or we can do this all inside graphreader and libvalhalla itself. Below I wrote an outline with the former in mind, but the latter is equally doable. A rough sketch of both the writer and the reader side follows the reader subsection below.
- Write the tile, all but the `graph_tile_size` part of the `graphtileheader`, into the location it belongs within the tar based on index.bin
- Write the `graph_tile_size` part of the `graphtileheader`, signaling that the write is complete (to the readers in the service process)

Multi-Reader, Check then Fallback
On the graphreader side, where we need to get access to the data, we essentially need to add a little bit of checking to the tar "driver":
- Check the `graph_tile_size` field for a non-zero value in the `graphtileheader`
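A minimal sketch of both sides of this protocol, under heavy assumptions: the byte offset and width of `graph_tile_size` within the `graphtileheader`, the `fetch_remote` fallback, and the function names are placeholders for illustration only; the real header layout and fetch path come from libvalhalla and the sidecar.

```python
# Sketch of the single-writer / multi-reader protocol outlined above.
import mmap
import struct

GRAPH_TILE_SIZE_OFFSET = 8        # hypothetical position of graph_tile_size in the header
SIZE_FIELD = struct.Struct("<Q")  # hypothetical width of that field

def write_tile(mm, tile_offset, tile_bytes):
    """Writer/sidecar side: make the tile visible to readers only at the very end."""
    size_pos = tile_offset + GRAPH_TILE_SIZE_OFFSET
    body_start = GRAPH_TILE_SIZE_OFFSET + SIZE_FIELD.size
    # 1. write everything except the graph_tile_size field (which stays zero)
    mm[tile_offset:size_pos] = tile_bytes[:GRAPH_TILE_SIZE_OFFSET]
    mm[size_pos + SIZE_FIELD.size:tile_offset + len(tile_bytes)] = tile_bytes[body_start:]
    mm.flush()
    # 2. write graph_tile_size last; a non-zero value tells readers the write is complete
    mm[size_pos:size_pos + SIZE_FIELD.size] = tile_bytes[GRAPH_TILE_SIZE_OFFSET:body_start]
    mm.flush()

def read_tile(mm, tile_offset, fetch_remote):
    """Reader side: check graph_tile_size, fall back to a remote fetch if it's still zero."""
    size_pos = tile_offset + GRAPH_TILE_SIZE_OFFSET
    (size,) = SIZE_FIELD.unpack_from(mm, size_pos)
    if size:                                  # tile is present and fully written locally
        return mm[tile_offset:tile_offset + size]
    return fetch_remote()                     # not local yet: ask redis/s3 via the sidecar
```

The important property is simply the write ordering: `graph_tile_size` is flushed last by the writer and checked first by readers, so a reader can never mistake a partially written tile for a complete one.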
Synopsis
A practical approach to solving the horizontal scaling problem likely requires a combination of strategies: shrinking the data and delaying data fetching. I would suggest starting with lazy loading support, measuring its efficacy, and then deciding whether or not sharding is required to further reduce the scope of data needed at startup.
Risks