
Significant performance improvements for WMS metatiling #7457

Open
wants to merge 20 commits into base: main
Conversation

@mbosecke mbosecke commented Mar 4, 2024

Overview

My intention was to improve WMS performance (specifically throughput) with 4x4 metatiles, using the integrated GWC and encoding to PNG. I feel this is likely a pretty common use case in GeoServer.

I committed a benchmark class and two independent performance improvements. Although the improvements are independent of each other, I'm including them in one pull request since they both leverage the new benchmark class.

Measurable JMH Benchmark

The first thing I did was create a JMH benchmark so that we could get repeatable measurements. My understanding is that it's pretty common amongst the development team to use JMeter but this new benchmark has the following advantages:

  • It is committed within the code base, so it has all the advantages of source control and versioning
  • It can be run directly within the IDE, just like any JUnit test

The benchmark tests a few different scenarios:

  • runWithoutGwc: Integrated GWC is disabled
  • runWithGwcAndNoCacheHits: Integrated GWC is enabled but incoming requests don't overlap each other so there are no cache hits
  • runWithGwcAnd50PercentCacheHits: Approximately 50% of the incoming requests will be cache hits
  • runWithGwcAnd75PercentCacheHits: Approximately 75% of the incoming requests will be cache hits
  • runWithGwcAnd90PercentCacheHits: Approximately 90% of the incoming requests will be cache hits

Notes:

  • I integrated JMH with JUnit by creating a bogus test called WmsMetatileBenchmarkTest.runBenchmark(). The test has the @Ignore annotation so it is not included by default when running tests. My IDE (IntelliJ) allows me to run this one "test" with a single click of a button.
  • The benchmark saves its results in src/gwc/target/benchmark-results.json and can be visualized by uploading the JSON file here: https://jmh.morethan.io/
  • A limitation of the benchmark is that it doesn't do a good job of mimicking real-world data-fetching (i.e. there's no PostGIS database or anything). It is really just measuring the GeoServer-specific code, which is mostly PNG encoding.
  • The benchmark is currently using the pre-made "Basic Polygon" layer, and ultimately the majority of tiles it renders are completely empty PNGs. Despite this, the results still display discernible patterns that are useful in making decisions, but it means we can't extrapolate and assume that the results represent real-world throughput.

Improvement 1: Improved performance of PNGWriter for sub-images

This improvement was primarily performed in this pull request for the imageio-ext library:

geosolutions-it/imageio-ext#299

All that needed to be done in GeoServer was to upgrade to the latest version of that library and stop using BufferedImageAdapter. I'm currently using a SNAPSHOT version of that library until my pull request over there gets accepted and a new version of the library is released.

Improvement 2: Increased GWC Metatiling concurrency

Originally, if a 4x4 metatile was requested, GWC would encode all 16 tiles before returning the requested tile to the end user. With this change, the requested tile will be encoded on the main thread but all additional 15 tiles will be encoded on separate asynchronous threads. This has a significant impact on throughput.
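The division of work described above can be sketched in plain Java. This is only an illustration of the idea, not the actual GeoServer/GWC code: `encode` stands in for the PNG encode-and-save step, and tiles are modelled as plain integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MetatileEncoding {

    // Encode the requested tile on the calling thread; hand the remaining
    // tiles of the metatile to a background pool, then collect them all.
    public static List<Integer> encodeMetatile(int requested, int tileCount)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tileCount; i++) {
            if (i == requested) continue;            // encoded inline below
            final int tile = i;
            futures.add(pool.submit(() -> encode(tile)));
        }
        List<Integer> encoded = new ArrayList<>();
        encoded.add(encode(requested));              // user's tile, answered first
        for (Future<Integer> f : futures) encoded.add(f.get());
        pool.shutdown();
        return encoded;
    }

    // Stand-in for PNG encoding + saving of one tile.
    static int encode(int tile) { return tile; }
}
```

The point of the structure is that the caller never waits on the other 15 tiles before its own tile is encoded; in the PR the equivalent of `f.get()` is replaced by lock/latch coordination rather than futures.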

Results

The JMH benchmark results (higher is better):

(chart: jmh-results)

I also used JMeter to run similar scenarios, this time with a layer group with a PostGIS backend and actual meaningful PNG tiles, to validate whether the results are similar. They are:

(chart: jmeter-results)

Notes / Caveats

  • This pull request is not ready to be merged, simply because it relies on my other pull request for the imageio-ext library to be merged first. I just wanted to get the conversation started at this point.

Checklist

For core and extension modules:

  • New unit tests have been added covering the changes.
  • Documentation has been updated (if change is visible to end users).
  • The REST API docs have been updated (when changing configuration objects or the REST controllers).
  • There is an issue in the GeoServer Jira (except for changes that do not affect administrators or end users in any way).
  • Commit message(s) must be in the form [GEOS-XYZWV] Title of the Jira ticket.
  • Bug fixes and small new features are presented as a single commit.
  • Each commit has a single objective (if there are multiple commits, each has a separate JIRA ticket describing its goal).

@mbosecke

mbosecke commented Mar 7, 2024

For this work, I decided to use nested locks:

lock(metatile);
try {
    for (tile : tiles) {
        executor.submit(() -> {
            lock(tile);
            try {
                // ... do work
            } finally {
                unlock(tile);
            }
        });
    }
} finally {
    unlock(metatile);
}

But after running some really intensive JMeter tests with hundreds of threads, I discovered that the MemoryLockProvider in GeoWebCache will cause deadlocks if it's used in a nested way and you just so happen to get unfortunate collisions caused by the striping. I created a ticket in that project, with a pull request to resolve it. This pull request will now be dependent upon that pull request.
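The collision that triggers this can be illustrated with a toy striped lock provider (names and stripe counts here are illustrative, not GeoWebCache's actual MemoryLockProvider): with striping, two distinct keys, say a metatile key and a tile key, can map to the same underlying lock, so nested locking can block on a lock the system effectively already holds.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class StripedLocks {
    private final Lock[] stripes;

    public StripedLocks(int n) {
        stripes = new Lock[n];
        for (int i = 0; i < n; i++) stripes[i] = new ReentrantLock();
    }

    // A key is mapped to a stripe by hashing: distinct keys can collide.
    public Lock get(String key) {
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    // True when two different keys share the same underlying lock, which is
    // the precondition for the nested-lock deadlock described above.
    public boolean collide(String a, String b) {
        return get(a) == get(b);
    }
}
```

With few stripes and many concurrent metatile/tile keys, such collisions become likely rather than exceptional, which matches the "hundreds of threads" conditions under which the deadlock appeared.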

@aaime

aaime commented Mar 8, 2024

Hi there, thanks for sharing your work! I don't have time right now to review the PRs, but I see a pattern that cannot be accepted: unbounded resourcing. The thread pools in particular (and possibly the locks, not sure here) need some level of control; the hardware GeoServer is running on has finite capabilities.
The thread pool maximum size must be controlled by a configuration parameter in the GWC panel, so that an administrator can size it to match the actual hardware.
The rule of thumb when it comes to rendering is that throughput peaks somewhere between 2 and 4 times the number of cores, so that would be a useful default value, if none is set by the admin.
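A bounded pool along those lines, configured size if set, otherwise 2x the core count, could look roughly like this (class and method names are illustrative, not the actual GWC configuration API):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TileSavingPool {
    // Build a pool with a hard upper bound on threads: the admin-configured
    // size when positive, otherwise the 2x-cores rule-of-thumb default.
    public static ThreadPoolExecutor create(int configuredSize) {
        int size = configuredSize > 0
                ? configuredSize
                : Runtime.getRuntime().availableProcessors() * 2;
        return new ThreadPoolExecutor(size, size, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
    }
}
```

A fixed core/max size with an unbounded queue caps thread count while letting excess tile-saving work queue up instead of spawning threads.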

Btw, I'm pointing this out so that you can act on it without waiting for the actual review, the rest looks great (at least at a cursory look).

@aaime aaime left a comment (Member)

Very interesting stuff. See comments below.

I've run it with some interactive tests, making sure I was using the GWC changes and the new imageio-ext, using the "tile layer" preview. In some cases I'm not getting all the images, in others, like topp:states, the map remains blank. Not sure why though.

if (tryCache && tryCacheFetch(tile)) {

// After getting the lock on the meta tile and individual tile, try again
if (tryCache && tryCacheFetch(conveyorTile)) {
aaime (Member):

I think something is amiss in this method.

The existing code is inefficient when the server load is limited, but it provides guarantees that the same work is not done multiple times:

  1. First it gets a lock on the meta tile
  2. If the reference tile is missing, it dispatches the GetMap
  3. It clips and saves all the tiles in the meta-tile before the lock is released

Now, let's place this in a real-world load scenario. A user walks into a bar and opens a tile-based viewer!
The viewer is browser-based, and it sends 6 requests in parallel to the server, each requesting separate tiles.
Some of these tiles belong to the same meta tile. With the above algorithm, one wins and computes, the others sit doing nothing, and when they are released, they do the tryCacheFetch, find the tile they want, and thus do not dispatch the GetMap once more.

Here is what I see in the current implementation (please tell me where I'm wrong):

  • The first request that manages to get in locks the meta-tile and dispatches the GetMap
  • That same request computes the tile it wants on the request thread, while the others are sent to the background pool for async computation
  • The first one waits for the tile latch to be released, when all the threads computing the sub-images have managed to get the tile lock, but they still haven't computed and saved the tiles to disk.
  • The first request releases the metatile lock, allowing the other overlapping requests (different tile, same metatile) to execute
  • From here on, the concurrent dance takes place and guarantees are lost. In particular, a request thread for another tile might find the tile it wants still hasn't landed on disk, and decide to dispatch the GetMap again. This is a duplicated expensive computation that involves grabbing connections from the pool and so on. Eventually this leads to trying to compute the tile and getting stuck on the tile lock... and when the tile lock is released... it's not even trying to fetch the tile from disk, it's just computing it again?

This would seem preferable:

  • Grab the meta-tile lock, try to fetch the tile
  • If not found, dispatch the getmap
  • Compute the tile inline, execute the others async
  • Wait to release the meta-tile lock until all the other tiles are saved on disk (using the latch, the thread releasing the last bit of the latch releases the meta-tile lock).

With this approach, you get parallel and async computation, the first tile gets released soon, the others have to wait for the full meta to be computed. More parallel than the existing code, does not duplicate computation. But not as parallel as your current proposal.

Maybe one can do things a bit differently then? How about:

  • Grab the meta-tile lock, try to fetch the tile
  • If not found, grab the tile lock
  • Once the tile lock is acquired, check if the tile has already been computed, otherwise dispatch the getmap
  • Once the getmap is computed, we can release the meta-tile lock (the getmap is the metatile, that's what we want to protect)
  • Do the same as your proposal here for the rest, releasing the tile locks once the saving is done. The waiting threads will check if the tile is there, find it, and return immediately.

I'm not 100% sure about the above, but the idea is that we need to avoid re-computing the getmap and avoid re-saving the tiles.

mbosecke (Author):

My current implementation attempts to ensure that the GetMap request for the metatile is only dispatched a single time. It is actually very close to your last proposal.

You mentioned the following, which is not quite correct:

In particular, a request thread for another tile might find the tile it wants still hasn't landed on disk, and decides to dispatch the getmap again

The concurrent user request thread will block on the tile lock (obtained by the ongoing async thread) before checking the cache, and when the async thread is complete and unlocks the tile lock, the request thread will find the tile on disk. To help outline that, here is the perspective of the primary request that is able to dispatch the getmap:

  1. Check cache before any locks
  2. Obtain the meta-tile lock.
  3. After obtaining meta-tile lock, obtain the tile lock
  4. Check cache again, after obtaining both the meta-tile and sub-tile lock.
  5. If not found, dispatch the getmap
  6. Compute the tile inline, execute the others async
  7. Async threads obtain locks on all the other tiles
  8. Primary thread waits for all tile locks to be obtained before releasing the meta-tile lock (using a countdown latch).

The primary request was able to obtain the meta-tile lock and each individual tile lock before releasing the meta-tile lock. The async threads will not release their tile locks until they've saved the tiles to disk. The concurrent user request will eventually obtain the meta-tile lock but then block on their respective tile locks while the async thread is still going. Here is the perspective of a concurrent request that is for a different tile of the same meta-tile:

  1. Check cache before any locks.
  2. Blocks on meta-tile lock initially, but is eventually obtained.
  3. Blocks on tile lock because an async thread is still working on it, but is eventually obtained.
  4. Check cache again, this time it's found because the async thread put it there. No need to dispatch getmap.

This is very similar to your last proposal with, I think, only one key difference. I intentionally don't release the meta-tile lock until I have certainty that all the individual tiles have been locked by the async threads. If there's a concurrent user request, we want it to wait for the async thread and not the other way around, because the async threads already have the image whereas the concurrent user request does not (and would need to dispatch a GetMap). Therefore, the async thread must be able to obtain the tile lock, and the concurrent user request (after obtaining the meta-tile lock) will then block and wait for the async thread.

Pseudo code that could help:


getTile(tile) {

   if (isCached(tile)) {
      return getFromCache(tile)
   }

   lock(metatile) // concurrent request will block here if an ongoing getmap is dispatched
   lock(tile)     // concurrent request will block here if an async thread is running

   if (isCached(tile)) {
      release(tile)
      release(metatile)
      return getFromCache(tile)
   }

   dispatchGetMap(metatile)

   response = saveTile(tile)

   for (otherTile : metatile.getTiles()) {
      async(() -> {
         lock(otherTile)
         saveTile(otherTile)
         release(otherTile)
      })
   }

   waitForAllTileLocksToBeObtained()

   release(tile)
   release(metatile)

   return response
}
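The waitForAllTileLocksToBeObtained() step in the pseudo code above can be sketched with a CountDownLatch: each async task counts the latch down as soon as it holds its tile lock, and the primary thread waits on the latch before releasing the meta-tile lock. This is a simplified stand-in (locks elided, names illustrative), not the actual GeoServerTileLayer code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TileLockLatch {
    // Submit one async save per remaining tile; return once every task has
    // signalled that it holds its tile lock, i.e. once it is safe for the
    // caller to release the meta-tile lock.
    public static boolean awaitAllTileLocks(int otherTiles)
            throws InterruptedException {
        CountDownLatch locksObtained = new CountDownLatch(otherTiles);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < otherTiles; i++) {
            pool.submit(() -> {
                // lock(tile) would be acquired here
                locksObtained.countDown(); // signal: tile lock held
                // saveTile(tile); release(tile) continue in background
            });
        }
        boolean ok = locksObtained.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return ok; // true: safe to release the meta-tile lock
    }
}
```

Counting down at lock acquisition (rather than at save completion) is exactly what lets the primary thread release the meta-tile lock early while guaranteeing that late-arriving requests block on the per-tile locks.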

I also attached a diagram to help me sort out my own thoughts on the matter.

(diagram: Geoserver Data Flow - WMS PNG Tile with GWC Integration - Metatile Locking)

Please continue to be critical in case I'm overlooking something!

aaime (Member):

As a reviewer, being critical is my job 🤣
The current code looks good, and the diagram is indeed useful.

src/wms/src/main/java/org/geoserver/wms/RasterCleaner.java (outdated thread, resolved)
@aaime aaime left a comment (Member)

I've made a second round of interactive tests.
The lockup I've observed yesterday has been fixed by your last commit, nice.

I've now tested a layer "seed" with 8 threads to see how the new code interacts with it. The interaction looks poor: after a bit of seeding, I looked into a jstack dump and found 1500+ "gwc-tile-saving-xyz" threads (I have a 10x10 meta-tile configuration, one of the largest we use for production purposes, in order to stress the system).

On one side, this is due to having an unbounded thread pool (my first feedback, which still needs to be addressed). And while I was expecting to see more like 800 background threads (8 seeding threads, 100 tiles each), even that many threads would not be acceptable.

I've also noticed that while the seeding appeared to be complete (no more seeding threads alive), the background threads were still working, saving tiles. This will break common automation cases where a machine is dedicated to seeding offline and the result is then moved to the production cluster for usage: the machinery will think the seed is done and start moving the tiles while they are still being produced.

Also, during a seed the level of concurrency is already controlled by the seed form; having hidden extra concurrency is probably not great, as it multiplies the CPU load in a way that's not entirely predictable. Perhaps the extra thread pool should not be used in that case, and be limited to tile creation on user demand? However, that will require passing down more information, to inform the tile layer about its operational context.

Final note, the "log" method in GeoServerTileLayer will have to disappear or use the LOGGER at FINE/FINER level, and be guarded so that it does not try to build strings/allocate arrays during a seed (I get you used it for debugging purposes, just noting the PR cannot be merged with it still in).


@mbosecke

mbosecke commented Mar 12, 2024

I've committed some changes:

  • The thread pool now defaults to "number of cores * 2"
  • I added a setting on the "Caching Defaults" page to allow the user to override the default size of the thread pool. They can specify "0" to disable it completely, and all tiles will be encoded/saved on the main thread.
  • The GeoServerTileLayer will run everything on the main thread if ConveyorTile.servletReq is null, with the intention of disabling this feature during seeding tasks. I hope it semantically makes sense to use this to differentiate between a user-initiated request and any other type of request (e.g. seeding).
  • Added unit tests in GWCTest in regards to the building and configuring of the ExecutorService.
  • Added unit tests in GeoServerTileLayerTest for scenarios with/without an ExecutorService and with/without a servlet request.
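The seeding check in the list above can be modelled with a plain ThreadLocal standing in for the request marker (whether ConveyorTile.servletReq or the Dispatcher.REQUEST thread-local the PR later switched to). This is a toy model for illustration, not GeoServer's actual classes.

```java
public class SeedingGuard {
    // Stand-in for the per-request marker: set for user-initiated HTTP
    // requests, left null for internal work such as seeding tasks.
    static final ThreadLocal<Object> REQUEST = new ThreadLocal<>();

    // Use async tile encoding only for user-initiated requests, and only
    // when the pool hasn't been disabled by configuring its size as 0.
    public static boolean useAsyncEncoding(int poolSize) {
        return REQUEST.get() != null && poolSize > 0;
    }
}
```

The appeal of this approach is that seeding never sets the marker, so it falls back to synchronous encoding without the tile layer needing any explicit "am I seeding?" flag passed down.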

@mbosecke

I separated encoding and saving into independent runnables, and the primary ConveyorTile that was requested by the user is now encoded on the main thread but saved asynchronously. The additional tiles belonging to the same meta-tile will be encoded and saved asynchronously.

Since the concurrency is now a configurable setting, I updated the benchmark to compare results both with and without concurrency. Here are updated results:

(charts: Without Concurrency and With Concurrency)

Now that saving to disk is asynchronous, I'm a little puzzled as to why there is still such a discrepancy between disabling GWC and using GWC with no cache hits.

@aaime

aaime commented Mar 13, 2024

@mbosecke I'm not sure, but I'm wondering why the numbers in "run without GWC" have roughly doubled compared to the charts in your first comment (from 400/500 ops per second to above 1000).
Is the testing different now?

Also mind the async saving (and I might be misunderstanding your comments here)... it can work for the interactive use case, but for seeding, when the last seeding thread is done, the work must be complete. Otherwise, scripts that wait for the seed to end before moving tiles to another machine will work against a storage that is not yet fully populated.

Thinking out loud, IO can also be a source of issues... it makes a world of difference whether you're saving on a local SSD, on network storage, or on object storage; trying to push too much towards network disks (especially if they are cheaper spinning disks) can be counter-productive. The 2*cores rule of thumb is for WMS requests, where encoding goes directly to the network; I'm not sure about the case where we write to persistent storage. That might require different tuning. I don't have a ready answer here, just pointing out a potentially problematic point.

@aaime

aaime commented Mar 14, 2024

Also thinking out loud, maybe this work can be split into two parts. One is just contributing the faster PNG encoding for sub-images, while the other can build on top of it and provide better concurrency for the interactive seeding case.

@aaime

aaime commented Mar 30, 2024

Finally found a minute to look at the code. If I read correctly, in case of seeding there is no async behavior whatsoever; the asynchronous bits are just for interactive usage. So, looking good from this point of view.

Going back to the performance, it is puzzling indeed. I've run the benchmark twice locally (Ryzen 1700x, 8 cores), with these results:

Benchmark                                                                                         Mode  Cnt     Score     Error  Units
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd50PercentCacheHits     thrpt    5   386,966 ± 125,898  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd75PercentCacheHits     thrpt    5   448,730 ± 111,005  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd90PercentCacheHits     thrpt    5  1524,361 ± 276,900  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAndNoCacheHits            thrpt    5   259,517 ±  78,680  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd50PercentCacheHits  thrpt    5   145,958 ±  14,191  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd75PercentCacheHits  thrpt    5   144,779 ±  35,341  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd90PercentCacheHits  thrpt    5   366,582 ±  61,034  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAndNoCacheHits         thrpt    5   146,138 ±  31,021  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithoutGwc                                      thrpt    5   673,572 ±  74,369  ops/s

Some numbers are odd, but I can reproduce the overall "WMS without GWC is faster". I've tried to look at the workload, making it print the first 100 tiles that are getting requested, and in the cases without cache hits I see the following sequence:

[1024, 0, 11], [1028, 0, 11], [1032, 0, 11], [1036, 0, 11], [1040, 0, 11], [1044, 0, 11], [1048, 0, 11], [1052, 0, 11],...

So one tile out of four is being requested...
For the GWC cache, that still means producing all the tiles in a meta-tile and saving them, even if only one tile is requested. But for the WMS case, it just requires producing that single tile, without the rest of the meta-tile.
If I'm understanding this correctly, for a fair comparison you'd need to also change the meta-tiling factors of the layer down to 1:1, so that GWC and stand-alone WMS are doing the same job, or else make the WMS case ask for all the tiles in a meta-tile.

Update: come to think of it, setting tiled=true&tilesorigin=... enables the usage of the "QuickTileCache", which does a bit of on the fly meta-tiling, but with a very different strategy: meta tiles are 3x3, the cache stores RenderedImages, not saved PNGs, while only the requested tile is encoded in PNG. If it's actually in use (some debugging should be used to confirm) then the results are very much "apples and oranges".

@mbosecke

mbosecke commented Apr 1, 2024

Thanks for looking into that further. Just FYI, I've been distracted lately with other things but I'm hoping to revisit this further in the coming week or two! I haven't abandoned this initiative yet.

Mitchell Bösecke added 17 commits April 23, 2024 08:36
Integrated JMH with Junit to easily run.
…BufferedImageAdapter.

Both the library and the adapter were making deep copies of image raster data in the case of sub-images (i.e. when metatiling). The imageio-ext png writer has since been improved to be able to handle sub-images natively without requiring a complete copy of its underlying data.
…rk for metatiles to additional threads. The main thread will only encode the desired ConveyorTile while other tiles will be done asynchronously.

For a 4x4 metatile, which has 16 inner tiles, this saves the main thread from having to encode the 15 additional tiles in addition to the requested tile.
…s on their individual tiles. Once all locks are obtained, then it's safe to release the primary metatile lock.
- Exposed a configuration option for users to adjust it
- Default size of "number of cores * 2"
- Disabling concurrency for seeding tasks
Mitchell Bösecke added 3 commits April 23, 2024 08:36
…verTileLayer instead of "ConveyorTile.servletReq".

This appears to work for both WMS and WMTS whereas "ConveyorTile.servletReq" only worked for WMTS.

Seeding does not appear to set "Dispatcher.REQUEST" so it avoids the additional concurrency, as desired.

Also upgraded the benchmark test to test both with and without concurrency set.
@aaime

aaime commented May 8, 2024

Updates to the PNG encoder have made their way into ImageIO-EXT 1.4.10 and main has been updated to use it.
All the bits this PR needs should be in line, if you can merge/rebase it with the current main.
