Significant performance improvements for WMS metatiling #7457
Conversation
For this work, I decided to use nested locks:
But after running some really intensive JMeter tests with hundreds of threads, I discovered that the MemoryLockProvider in GeoWebCache will cause deadlocks if it is used in a nested way and you happen to get unfortunate collisions caused by the striping. I created a ticket in that project, with a pull request to resolve it. This pull request will now depend on that pull request.
Hi there, thanks for sharing your work! I don't have time right now to review the PRs, but I see a pattern that cannot be accepted: unbounded resourcing. The thread pools in particular (and possibly the locks, not sure here) need some level of control; the hardware GeoServer is running on has finite capabilities. Btw, I'm pointing this out so that you can act on it without waiting for the actual review; the rest looks great (at least at a cursory look).
Very interesting stuff. See comments below.
I've run it with some interactive tests, making sure I was using the GWC changes and the new imageio-ext, using the "tile layer" preview. In some cases I'm not getting all the images, in others, like topp:states, the map remains blank. Not sure why though.
```java
if (tryCache && tryCacheFetch(tile)) {
```

```java
// After getting the lock on the meta tile and individual tile, try again
if (tryCache && tryCacheFetch(conveyorTile)) {
```
I think something is amiss in this method.
The existing code is inefficient when the server load is limited, but it provides guarantees that the same work is not done multiple times:
- First, it gets a lock on the meta-tile
- If the reference tile is missing, it dispatches the GetMap
- It clips and saves all the tiles in the meta-tile before the lock is released
Now, let's place this in a real-world load scenario. A user walks in a bar and opens a tile based viewer!
The viewer is browser-based, and it sends 6 requests in parallel to the server, each requesting separate tiles.
Some of these tiles belong to the same meta-tile. With the above algorithm, one wins and computes; the others sit doing nothing, and when they are released, they do the tryCacheFetch, find the tile they want, and thus do not dispatch the GetMap once more.
Here is what I see in the current implementation (please tell me where I'm wrong):
- The first request that manages to get in locks the meta-tile and dispatches the GetMap
- That same request computes the tile it wants on the request thread, while the others are sent to the background pool for async computation
- The first one waits for the tile latch to be released, i.e., when all the threads computing the sub-images have managed to get their tile lock, but they still haven't computed and saved the tiles on disk.
- The first request releases the meta-tile lock, allowing the other overlapping requests (different tile, same meta-tile) to execute
- From here on, the concurrent dance takes place and guarantees are lost. In particular, a request thread for another tile might find that the tile it wants still hasn't landed on disk, and decide to dispatch the GetMap again. This is a duplicated, expensive computation that involves grabbing connections from the pool and so on. Eventually this leads to trying to compute the tile and getting stuck on the tile lock... and when the tile lock is released... it's not even trying to fetch the tile from disk, it's just computing it again?
This would seem preferable:
- Grab the meta-tile lock, try to fetch the tile
- If not found, dispatch the getmap
- Compute the tile inline, execute the others async
- Wait to release the meta-tile lock until all the other tiles are saved on disk (using the latch, the thread releasing the last bit of the latch releases the meta-tile lock).
With this approach, you get parallel and async computation, the first tile gets released soon, the others have to wait for the full meta to be computed. More parallel than the existing code, does not duplicate computation. But not as parallel as your current proposal.
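A minimal sketch of this "latch releases the meta-tile lock" hand-off, under one assumption worth calling out: a plain `ReentrantLock` cannot be released by a thread other than the one that acquired it, so a `Semaphore` with one permit stands in for the meta-tile lock here. All names are illustrative, not the PR's actual code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchReleasedMetaLock {

    // Semaphore instead of ReentrantLock: the meta-tile "lock" must be
    // releasable by whichever saving thread finishes last, which is exactly
    // what "the thread releasing the last bit of the latch releases the
    // meta-tile lock" requires.
    private final Semaphore metaLock = new Semaphore(1);

    private final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "tile-saver");
        t.setDaemon(true);
        return t;
    });

    /** Renders a meta-tile of n tiles; the meta-tile lock stays held until the last save completes. */
    public void computeMetaTile(int n, Runnable saveOneTile) throws InterruptedException {
        metaLock.acquire();                       // taken by the request thread
        AtomicInteger remaining = new AtomicInteger(n);
        for (int i = 0; i < n; i++) {
            pool.submit(() -> {
                try {
                    saveOneTile.run();            // clip + encode + save one sub-tile
                } finally {
                    if (remaining.decrementAndGet() == 0) {
                        metaLock.release();       // last saver releases the meta-tile lock
                    }
                }
            });
        }
        // the request thread returns immediately; overlapping requests block in acquire()
    }

    public boolean metaTileInProgress() {
        return metaLock.availablePermits() == 0;
    }
}
```

The request thread can return its own tile as soon as it is encoded, while overlapping requests for the same meta-tile block on `acquire()` until every sub-tile is on disk.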
Maybe one can do things a bit differently then? How about:
- Grab the meta-tile lock, try to fetch the tile
- If not found, grab the tile lock
- Once the tile lock is acquired, check if the tile has already been computed, otherwise dispatch the getmap
- Once the getmap is computed, we can release the meta-tile lock (the getmap is the metatile, that's what we want to protect)
- Do the same as your proposal here for the rest, releasing the tile locks once the saving is done. The threads waiting will check if the tile is there, will find them, and return immediately.
I'm not 100% sure about the above, but the idea is that we need to avoid re-computing the getmap and avoid re-saving the tiles.
My current implementation attempts to ensure that the GetMap request for the meta-tile is only dispatched a single time. My implementation is actually very close to your last proposal.
You mentioned the following, which is not quite correct:
In particular, a request thread for another tile might find the tile it wants still hasn't landed on disk, and decides to dispatch the getmap again
The concurrent user request thread will block on the tile lock (obtained by the ongoing async thread) before checking the cache, and when the async thread is complete and unlocks the tile lock, the request thread will find the tile on disk. To help outline that, here is the perspective of the primary request that is able to dispatch the getmap:
- Check cache before any locks
- Obtain the meta-tile lock.
- After obtaining meta-tile lock, obtain the tile lock
- Check cache again, after obtaining both the meta-tile and sub-tile lock.
- If not found, dispatch the getmap
- Compute the tile inline, execute the others async
- Async threads obtain locks on all the other tiles
- Primary thread waits for all tile locks to be obtained before releasing meta-tile lock (using a countdown latch).
The primary request was able to obtain the meta-tile lock and each individual tile lock before releasing the meta-tile lock. The async threads will not release their tile locks until they've saved the tiles to disk. The concurrent user request will eventually obtain the meta-tile lock but then block on their respective tile locks while the async thread is still going. Here is the perspective of a concurrent request that is for a different tile of the same meta-tile:
- Check cache before any locks.
- Blocks on meta-tile lock initially, but is eventually obtained.
- Blocks on tile lock because an async thread is still working on it, but is eventually obtained.
- Check cache again, this time it's found because the async thread put it there. No need to dispatch getmap.
This is very similar to your last proposal, with only one key difference from what I can tell. I intentionally don't release the meta-tile lock until I'm certain that all the individual tiles have been locked by the async threads. If there's a concurrent user request, we want it to wait for the async thread, and not the other way around, because the async threads already have the image, whereas the concurrent user request does not (and would need to dispatch a GetMap). Therefore, the async thread must be able to obtain the tile lock, and the concurrent user request (after obtaining the meta-tile lock) will then block and wait for the async thread.
Pseudo code that could help:

```
getTile(tile) {
    if (isCached(tile)) {
        return getFromCache(tile)
    }
    lock(metatile)  // concurrent request will block here if an ongoing getmap is dispatched
    lock(tile)      // concurrent request will block here if an async thread is running
    if (isCached(tile)) {
        return getFromCache(tile)
    } else {
        dispatchGetMap(metatile)
        response = saveTile(tile)
        for (otherTile : metatile.getTiles()) {
            async(() -> {
                lock(otherTile)
                saveTile(otherTile)
                release(otherTile)
            })
        }
        waitForAllTileLocksToBeObtained()
        release(tile)
        release(metatile)
        return response
    }
}
```
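For illustration, here is a hedged, runnable Java translation of that pseudo code. The string keys, in-memory `cache` map, and `rendered` map are stand-ins for GWC's blob store and the actual GetMap dispatch, and the per-key lock map replaces the striped lock provider; none of these names are the PR's real types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class MetaTileLocking {

    private final ConcurrentMap<String, byte[]> cache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "tile-saver");
        t.setDaemon(true);
        return t;
    });

    private ReentrantLock lockFor(String key) {
        // one lock per key for simplicity; GWC's real provider stripes locks
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    /** Returns the requested tile, computing the whole meta-tile on a cache miss. */
    public byte[] getTile(String metaKey, String tileKey, List<String> allTiles,
                          Map<String, byte[]> rendered) throws InterruptedException {
        byte[] hit = cache.get(tileKey);
        if (hit != null) {
            return hit;                          // cheap check before taking any locks
        }
        ReentrantLock metaLock = lockFor(metaKey);
        metaLock.lock();                         // blocks while another request runs the GetMap
        ReentrantLock tileLock = lockFor(tileKey);
        tileLock.lock();                         // blocks while an async thread saves this tile
        try {
            hit = cache.get(tileKey);
            if (hit != null) {
                return hit;                      // computed while we were waiting
            }
            // "dispatchGetMap": the rendered map stands in for decoding the meta-tile
            byte[] response = rendered.get(tileKey);
            cache.put(tileKey, response);        // save the requested tile inline

            // Save the other tiles asynchronously. The latch guarantees that every
            // tile lock is held before the meta-tile lock is released below.
            List<String> others = new ArrayList<>(allTiles);
            others.remove(tileKey);
            CountDownLatch tileLocksHeld = new CountDownLatch(others.size());
            for (String other : others) {
                pool.submit(() -> {
                    ReentrantLock otherLock = lockFor(other);
                    otherLock.lock();
                    tileLocksHeld.countDown();
                    try {
                        cache.put(other, rendered.get(other));
                    } finally {
                        otherLock.unlock();
                    }
                });
            }
            tileLocksHeld.await();
            return response;
        } finally {
            tileLock.unlock();
            metaLock.unlock();
        }
    }
}
```

A later request for a sibling tile passes the meta-tile lock freely but blocks on its tile lock until the async save finishes, at which point the second cache check succeeds and no GetMap is dispatched.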
I also attached a diagram to help me sort out my own thoughts on the matter.
Please continue to be critical in case I'm overlooking something!
As a reviewer, being critical is my job 🤣
The current code looks good, and the diagram is indeed useful.
I've made a second round of interactive tests.
The lockup I've observed yesterday has been fixed by your last commit, nice.
I've now tested a layer "seed" with 8 threads to see how the new code interacts with it. The interaction looks poor. After a bit of seeding, I've looked into a jstack dump, finding 1500+ "gwc-tile-saving-xyz" threads (I have a 10x10 meta-tile configuration, which is one of the largest we use for production purposes, in order to stress the system).
On one side, this is due to having an unbounded thread pool (my first feedback, which still needs to be addressed)... I was expecting to see more like 800 background threads (8 seeding threads, 100 tiles each), but either way, that's too many threads to be acceptable.
I've also noticed that while the seeding appeared to be complete (no more seeding threads alive), the background threads were still working, saving tiles. This will break common automation cases where a machine is dedicated to seeding offline and the result is then moved to the production cluster for usage: the machinery will think the seed is done and start moving the tiles while they are still being produced.
Also, during a seed the level of concurrency is already controlled by the seed form, having a hidden extra concurrency is probably not great, as it multiplies the CPU load in a way that's not entirely predictable. Perhaps the extra thread pool should not be used in that case, and be limited to tile creation on user demand? However, that will require passing down more information, to inform the tile layer about its operational context.
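For the unbounded-pool concern, one conventional fix is a bounded executor whose rejection policy pushes work back onto the submitter. This is only a sketch: the "number of cores * 2" sizing is the default mentioned later in this thread, while the queue capacity and keep-alive are illustrative guesses, not the PR's actual settings.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TileSavingPool {

    /**
     * Builds a bounded pool for background tile encoding/saving. When the
     * queue is full, CallerRunsPolicy makes the submitting (request or
     * seeding) thread do the work itself, which throttles producers instead
     * of growing threads without bound.
     */
    public static ThreadPoolExecutor create() {
        int workers = Runtime.getRuntime().availableProcessors() * 2;
        return new ThreadPoolExecutor(
                workers, workers,
                30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(workers * 4),    // bounded backlog
                r -> {
                    Thread t = new Thread(r, "gwc-tile-saving");
                    t.setDaemon(true);
                    return t;
                },
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

Caller-runs also caps the seeding-time thread count at what the seed form configured, though by itself it does not guarantee the queue is drained when the last seeding thread exits.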
Final note: the "log" method in GeoServerTileLayer will have to disappear or use the LOGGER at FINE/FINER level, and be guarded so that it does not try to build strings/allocate arrays during a seed (I get that you used it for debugging purposes; just noting the PR cannot be merged with it still in).
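One way to satisfy that guarded-logging requirement is `java.util.logging`'s `Supplier`-based overload, which defers message construction until the level is actually enabled. A sketch; the method and logger names here are illustrative, not GeoServerTileLayer's actual code.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedTileLogging {

    private static final Logger LOGGER = Logger.getLogger(GuardedTileLogging.class.getName());

    /**
     * Supplier-based logging: the string concatenation and array formatting
     * run only when FINE is enabled on the logger, so a seed running at INFO
     * pays no string-building or allocation cost.
     */
    static void logTileSaved(String layerName, long[] tileIndex) {
        LOGGER.log(Level.FINE, () ->
                "Saved tile " + Arrays.toString(tileIndex) + " for layer " + layerName);
    }
}
```

An explicit `if (LOGGER.isLoggable(Level.FINE)) { ... }` guard achieves the same effect where a lambda is inconvenient.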
I've committed some changes:
I separated encoding and saving into independent runnables, and the primary ConveyorTile requested by the user is now encoded on the main thread but saved asynchronously. The additional tiles belonging to the same meta-tile will be encoded and saved asynchronously. Since the concurrency is now a configurable setting, I updated the benchmark to compare results both with and without concurrency. Here are updated results:

Now that saving to disk is asynchronous, I'm a little puzzled as to why there is still such a discrepancy between disabling GWC and using GWC with no cache hits.
@mbosecke I'm not sure, but I'm wondering why the numbers in "run without GWC" have roughly doubled compared to the charts in your first comment (from 400/500 ops per second to above 1000).

Also mind the async saving (and I might be misunderstanding your comments here)... it can work for the interactive use case, but for seeding, when the last seeding thread is done, the work must be complete. Otherwise, scripts that wait for the seed to end before moving tiles to another machine will work against a storage that is not yet fully populated.

Thinking out loud, IO can also be a source of issues... it makes a world of difference whether you're saving on a local SSD, on network storage, or on an object storage; trying to push too much towards network disks (especially if they are cheaper spinning disks) can be counter-productive. The 2*cores rule of thumb is for WMS requests, when encoding goes directly to the network; I'm not sure about the case where we write to persistent storage. That might require different tuning. I don't have a ready answer here, just pointing out a potentially problematic point.
Also thinking out loud, maybe this work can be split into two parts. One is just contributing the faster PNG encoding for sub-images, while the other can build on top of it and provide better concurrency for the interactive seeding case.
Finally found a minute to look at the code. If I read correctly, in case of seeding there is no async behavior whatsoever, the asynchronous bits are just for interactive usage. So, looking good from this point of view. Going back to the performance, it is puzzling indeed. I've run the benchmark twice locally (Ryzen 1700x, 8 cores), with these results:
Some numbers are odd, but I can reproduce the overall "WMS without GWC is faster". I've tried to look at the workload, making it print the first 100 tiles that are getting requested, and in the cases without cache hits I see the following sequence:
So one tile out of four is being requested... Update: come to think of it, setting
Thanks for looking into that further. Just FYI, I've been distracted lately with other things, but I'm hoping to revisit this in the coming week or two! I haven't abandoned this initiative yet.
Integrated JMH with Junit to easily run.
…BufferedImageAdapter. Both the library and the adapter were making deep copies of image raster data in the case of sub-images (i.e. when metatiling). The imageio-ext PNG writer has since been improved to handle sub-images natively without requiring a complete copy of its underlying data.
…rk for metatiles to additional threads. The main thread will only encode the desired ConveyorTile while other tiles will be done asynchronously. For a 4x4 metatile, which has 16 inner tiles, this saves the main thread from having to encode the 15 additional tiles in addition to the requested tile.
…s on their individual tiles. Once all locks are obtained, then it's safe to release the primary metatile lock.
… outside of the coverage.
- Exposed a configuration option for users to adjust it
- Default size of "number of cores * 2"
- Disabled concurrency for seeding tasks
…uring the old one is shutdown.
…verTileLayer instead of "ConveyorTile.servletReq". This appears to work for both WMS and WMTS whereas "ConveyorTile.servletReq" only worked for WMTS. Seeding does not appear to set "Dispatcher.REQUEST" so it avoids the additional concurrency, as desired. Also upgraded the benchmark test to test both with and without concurrency set.
… saved in the cache asynchronously.
…QUEST instead of ConveyorTile.servletReq.
Updates to the PNG encoder have made their way into ImageIO-EXT 1.4.10, and main has been updated to use it.
Overview
My intention was specifically to improve WMS performance (throughput in particular) with 4x4 metatiles, using the integrated GWC, encoding to PNG. I believe this is likely a pretty common use case in GeoServer.
I committed a benchmark class and two independent performance improvements. Although the improvements are independent of each other, I'm including them in one pull request since they both leverage the new benchmark class.
Measurable JMH Benchmark
The first thing I did was create a JMH benchmark so that we could get repeatable measurements. My understanding is that it's pretty common amongst the development team to use JMeter but this new benchmark has the following advantages:
The benchmark tests a few different scenarios:
- `runWithoutGwc`: Integrated GWC is disabled
- `runWithGwcAndNoCacheHits`: Integrated GWC is enabled, but incoming requests don't overlap each other, so there are no cache hits
- `runWithGwcAnd50PercentCacheHits`: Approximately 50% of the incoming requests will be cache hits
- `runWithGwcAnd75PercentCacheHits`: Approximately 75% of the incoming requests will be cache hits
- `runWithGwcAnd90PercentCacheHits`: Approximately 90% of the incoming requests will be cache hits

Notes:
- The benchmark is launched via `WmsMetatileBenchmarkTest.runBenchmark()`. The test has the `@Ignore` annotation so it is not included by default when running tests. My IDE (IntelliJ) allows me to just run this one "test" with a single click of a button.
- Results are saved to `src/gwc/target/benchmark-results.json` and can be visualized by uploading the JSON file here: https://jmh.morethan.io/

Improvement 1: Improved performance of PNGWriter for sub-images
This improvement was primarily performed in this pull request for the imageio-ext library:
geosolutions-it/imageio-ext#299
All that needed to be done in GeoServer was to upgrade to the latest version of that library and stop using `BufferedImageAdapter`. I'm currently using a SNAPSHOT version of that library until my pull request over there gets accepted and a new version of the library is released.

Improvement 2: Increased GWC Metatiling concurrency
Originally, if a 4x4 metatile was requested, GWC would encode all 16 tiles before returning the requested tile to the end user. With this change, the requested tile will be encoded on the main thread but all additional 15 tiles will be encoded on separate asynchronous threads. This has a significant impact on throughput.
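The inline-plus-async split described above can be sketched as follows. The tile and encoder types are illustrative stand-ins for `ConveyorTile` and the PNG encoder, not the PR's actual API; the returned futures let a caller that needs seed-like semantics wait for full completion.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Function;

public class MetaTileEncoding {

    /**
     * Encodes the requested tile on the calling thread and the remaining
     * tiles of the meta-tile on the given executor. The background futures
     * are collected so an interactive caller can respond as soon as the
     * inline result is ready, while a batch caller can wait on them all.
     */
    public static byte[] encode(int requestedIndex, List<int[]> tiles,
                                Function<int[], byte[]> encoder,
                                ExecutorService pool,
                                List<Future<byte[]>> background) {
        byte[] requested = null;
        for (int i = 0; i < tiles.size(); i++) {
            int[] tile = tiles.get(i);
            if (i == requestedIndex) {
                requested = encoder.apply(tile);                        // inline: 1 tile
            } else {
                background.add(pool.submit(() -> encoder.apply(tile))); // async: n-1 tiles
            }
        }
        return requested;
    }
}
```

For a 4x4 meta-tile, the caller's thread encodes one tile and the other 15 land on the executor, which is where the throughput gain for interactive requests comes from.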
Results
The JMH benchmark results (higher is better):
I also used JMeter to run similar scenarios, this time with a layer group with a PostGIS backend and actual meaningful PNG tiles, to validate whether the results are similar, and they are:
Notes / Caveats
Checklist
- The PR targets the `main` branch (backports managed later; ignore for branch specific issues).

For core and extension modules:
- The PR title is of the form `[GEOS-XYZWV] Title of the Jira ticket`.