
Significant performance improvements for WMS metatiling #7457

Open
wants to merge 20 commits into base: main
Conversation

@mbosecke mbosecke commented Mar 4, 2024

Overview

My intention was to improve WMS performance (specifically throughput) with 4x4 metatiles, using the integrated GWC and encoding to PNG. I feel this is likely a pretty common use case in GeoServer.

I committed a benchmark class and two independent performance improvements. Although the improvements are independent of each other, I'm including them in one pull request since they both leverage the new benchmark class.

Measurable JMH Benchmark

The first thing I did was create a JMH benchmark so that we could get repeatable measurements. My understanding is that it's pretty common amongst the development team to use JMeter but this new benchmark has the following advantages:

  • It is committed within the code base, so it has all the advantages of source control and versioning
  • It can be run directly within the IDE, just like any JUnit test

The benchmark tests a few different scenarios:

  • runWithoutGwc: Integrated GWC is disabled
  • runWithGwcAndNoCacheHits: Integrated GWC is enabled but incoming requests don't overlap each other so there are no cache hits
  • runWithGwcAnd50PercentCacheHits: Approximately 50% of the incoming requests will be cache hits
  • runWithGwcAnd75PercentCacheHits: Approximately 75% of the incoming requests will be cache hits
  • runWithGwcAnd90PercentCacheHits: Approximately 90% of the incoming requests will be cache hits

Notes:

  • I integrated JMH with JUnit by creating a bogus test called WmsMetatileBenchmarkTest.runBenchmark(). The test has the @Ignore annotation so it is not included by default when running tests. My IDE (IntelliJ) allows me to run this one "test" with a single click of a button.
  • The benchmark saves its results in src/gwc/target/benchmark-results.json and can be visualized by uploading the JSON file here: https://jmh.morethan.io/
  • A limitation of the benchmark is that it doesn't do a good job of mimicking real-world data-fetching (i.e. there's no PostGIS database or anything). It is really just measuring the GeoServer-specific code, which is mostly PNG encoding.
  • The benchmark is currently using the pre-made "Basic Polygon" layer, and ultimately the majority of tiles it renders are completely empty PNGs. Despite this, the results still display discernible patterns that are useful in making decisions, but it means we can't extrapolate and assume that the results represent real-world throughput.

Improvement 1: Improved performance of PNGWriter for sub-images

This improvement was primarily performed in this pull request for the imageio-ext library:

geosolutions-it/imageio-ext#299

All that needed to be done in GeoServer was to upgrade to the latest version of that library and stop using BufferedImageAdapter. I'm currently using a SNAPSHOT version of that library until my pull request over there gets accepted and a new version of the library is released.

Improvement 2: Increased GWC Metatiling concurrency

Originally, if a 4x4 metatile was requested, GWC would encode all 16 tiles before returning the requested tile to the end user. With this change, the requested tile will be encoded on the main thread but all additional 15 tiles will be encoded on separate asynchronous threads. This has a significant impact on throughput.
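The division of work described above can be sketched in plain Java. This is only an illustration of the idea, not the actual GeoServer/GWC code: `encode` stands in for the PNG encode-and-save step, and tiles are modelled as plain integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MetatileEncoding {

    // Encode the requested tile on the calling thread; hand the remaining
    // tiles of the metatile to a background pool, then collect them all.
    public static List<Integer> encodeMetatile(int requested, int tileCount)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tileCount; i++) {
            if (i == requested) continue;            // encoded inline below
            final int tile = i;
            futures.add(pool.submit(() -> encode(tile)));
        }
        List<Integer> encoded = new ArrayList<>();
        encoded.add(encode(requested));              // user's tile, answered first
        for (Future<Integer> f : futures) encoded.add(f.get());
        pool.shutdown();
        return encoded;
    }

    // Stand-in for PNG encoding + saving of one tile.
    static int encode(int tile) { return tile; }
}
```

The point of the structure is that the caller never waits on the other 15 tiles before its own tile is encoded; in the PR the equivalent of `f.get()` is replaced by lock/latch coordination rather than futures.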

Results

The JMH benchmark results (higher is better):

(chart: jmh-results)

I also used JMeter to run similar scenarios, this time with a layer group with a PostGIS backend and actual meaningful PNG tiles, to validate whether the results are similar. They are:

(chart: jmeter-results)

Notes / Caveats

  • This pull request is not ready to be merged, simply because it relies on my other pull request for the imageio-ext library to be merged first. I just wanted to get the conversation started at this point.

Checklist

For core and extension modules:

  • New unit tests have been added covering the changes.
  • Documentation has been updated (if change is visible to end users).
  • The REST API docs have been updated (when changing configuration objects or the REST controllers).
  • There is an issue in the GeoServer Jira (except for changes that do not affect administrators or end users in any way).
  • Commit message(s) must be in the form [GEOS-XYZWV] Title of the Jira ticket.
  • Bug fixes and small new features are presented as a single commit.
  • Each commit has a single objective (if there are multiple commits, each has a separate JIRA ticket describing its goal).

@mbosecke

mbosecke commented Mar 7, 2024

For this work, I decided to use nested locks:

lock(metatile);
try {
    for (tile : tiles) {
        executor.submit(() -> {
            lock(tile);
            try {
                // ... do work
            } finally {
                unlock(tile);
            }
        });
    }
} finally {
    unlock(metatile);
}

But after running some really intensive JMeter tests with hundreds of threads, I discovered that the MemoryLockProvider in GeoWebCache will cause deadlocks if it's used in a nested way and you just so happen to get unfortunate collisions caused by the striping. I created a ticket in that project, with a pull request to resolve it. This pull request will now be dependent upon that pull request.
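The collision that triggers this can be illustrated with a toy striped lock provider (names and stripe counts here are illustrative, not GeoWebCache's actual MemoryLockProvider): with striping, two distinct keys, say a metatile key and a tile key, can map to the same underlying lock, so nested locking can block on a lock the system effectively already holds.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class StripedLocks {
    private final Lock[] stripes;

    public StripedLocks(int n) {
        stripes = new Lock[n];
        for (int i = 0; i < n; i++) stripes[i] = new ReentrantLock();
    }

    // A key is mapped to a stripe by hashing: distinct keys can collide.
    public Lock get(String key) {
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    // True when two different keys share the same underlying lock, which is
    // the precondition for the nested-lock deadlock described above.
    public boolean collide(String a, String b) {
        return get(a) == get(b);
    }
}
```

With few stripes and many concurrent metatile/tile keys, such collisions become likely rather than exceptional, which matches the "hundreds of threads" conditions under which the deadlock appeared.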

@aaime

aaime commented Mar 8, 2024

Hi there, thanks for sharing your work! I don't have time right now to review the PRs, but I see a pattern that cannot be accepted: unbounded resourcing. The thread pools in particular (and possibly the locks, not sure here) need some level of control; the hardware GeoServer is running on has finite capabilities.
The thread pool maximum size must be controlled by a configuration parameter in the GWC panel, so that an administrator can size it to match the actual hardware.
The rule of thumb when it comes to rendering is that throughput peaks somewhere between 2 and 4 times the number of cores, so that would be a useful default value, if none is set by the admin.
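A bounded pool along those lines, configured size if set, otherwise 2x the core count, could look roughly like this (class and method names are illustrative, not the actual GWC configuration API):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TileSavingPool {
    // Build a pool with a hard upper bound on threads: the admin-configured
    // size when positive, otherwise the 2x-cores rule-of-thumb default.
    public static ThreadPoolExecutor create(int configuredSize) {
        int size = configuredSize > 0
                ? configuredSize
                : Runtime.getRuntime().availableProcessors() * 2;
        return new ThreadPoolExecutor(size, size, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
    }
}
```

A fixed core/max size with an unbounded queue caps thread count while letting excess tile-saving work queue up instead of spawning threads.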

Btw, I'm pointing this out so that you can act on it without waiting for the actual review, the rest looks great (at least at a cursory look).

@aaime aaime left a comment (Member)

Very interesting stuff. See comments below.

I've run it with some interactive tests, making sure I was using the GWC changes and the new imageio-ext, using the "tile layer" preview. In some cases I'm not getting all the images, in others, like topp:states, the map remains blank. Not sure why though.

if (tryCache && tryCacheFetch(tile)) {

// After getting the lock on the meta tile and individual tile, try again
if (tryCache && tryCacheFetch(conveyorTile)) {
aaime (Member):

I think something is amiss in this method.

The existing code is inefficient when the server load is limited, but it provides guarantees that the same work is not done multiple times:

  1. First it gets a lock on the meta tile
  2. If the reference tile is missing, it dispatches the GetMap
  3. It clips and saves all the tiles in the meta-tile before the lock is released

Now, let's place this in a real-world load scenario. A user walks into a bar and opens a tile-based viewer!
The viewer is browser-based, and it sends 6 requests in parallel to the server, each requesting separate tiles.
Some of these tiles belong to the same meta tile. With the above algorithm, one wins and computes, the others sit doing nothing, and when they are released, they do the tryCacheFetch, find the tile they want, and thus do not dispatch the GetMap once more.

Here is what I see in the current implementation (please tell me where I'm wrong):

  • The first request that manages to get in locks the meta-tile and dispatches the GetMap
  • That same request computes the tile it wants on the request thread, while the others are sent to the background pool for async computation
  • The first one waits for the tile latch to be released, when all the threads computing the sub-images have managed to get the tile lock, but they still haven't computed and saved the tiles to disk.
  • The first request releases the metatile lock, allowing the other overlapping requests (different tile, same metatile) to execute
  • From here on, the concurrent dance takes place and guarantees are lost. In particular, a request thread for another tile might find the tile it wants still hasn't landed on disk, and decide to dispatch the GetMap again. This is a duplicated expensive computation that involves grabbing connections from the pool and so on. Eventually this leads to trying to compute the tile and getting stuck on the tile lock... and when the tile lock is released... it's not even trying to fetch the tile from disk, it's just computing it again?

This would seem preferable:

  • Grab the meta-tile lock, try to fetch the tile
  • If not found, dispatch the getmap
  • Compute the tile inline, execute the others async
  • Wait to release the meta-tile lock until all the other tiles are saved on disk (using the latch, the thread releasing the last bit of the latch releases the meta-tile lock).

With this approach, you get parallel and async computation, the first tile gets released soon, the others have to wait for the full meta to be computed. More parallel than the existing code, does not duplicate computation. But not as parallel as your current proposal.

Maybe one can do things a bit differently then? How about:

  • Grab the meta-tile lock, try to fetch the tile
  • If not found, grab the tile lock
  • Once the tile lock is acquired, check if the tile has already been computed, otherwise dispatch the getmap
  • Once the getmap is computed, we can release the meta-tile lock (the getmap is the metatile, that's what we want to protect)
  • Do the same as your proposal here for the rest, releasing the tile locks once the saving is done. The waiting threads will check if the tile is there, find it, and return immediately.

I'm not 100% sure about the above, but the idea is that we need to avoid re-computing the getmap and avoid re-saving the tiles.

mbosecke (Author):

My current implementation attempts to ensure that the GetMap request for the metatile is only dispatched a single time. It is actually very close to your last proposal.

You mentioned the following, which is not quite correct:

In particular, a request thread for another tile might find the tile it wants still hasn't landed on disk, and decides to dispatch the getmap again

The concurrent user request thread will block on the tile lock (obtained by the ongoing async thread) before checking the cache, and when the async thread is complete and unlocks the tile lock, the request thread will find the tile on disk. To help outline that, here is the perspective of the primary request that is able to dispatch the getmap:

  1. Check cache before any locks
  2. Obtain the meta-tile lock.
  3. After obtaining meta-tile lock, obtain the tile lock
  4. Check cache again, after obtaining both the meta-tile and sub-tile lock.
  5. If not found, dispatch the getmap
  6. Compute the tile inline, execute the others async
  7. Async threads obtain locks on all the other tiles
  8. Primary thread waits for all tile locks to be obtained before releasing the meta-tile lock (using a countdown latch).

The primary request was able to obtain the meta-tile lock and each individual tile lock before releasing the meta-tile lock. The async threads will not release their tile locks until they've saved the tiles to disk. The concurrent user request will eventually obtain the meta-tile lock but then block on their respective tile locks while the async thread is still going. Here is the perspective of a concurrent request that is for a different tile of the same meta-tile:

  1. Check cache before any locks.
  2. Blocks on meta-tile lock initially, but is eventually obtained.
  3. Blocks on tile lock because an async thread is still working on it, but is eventually obtained.
  4. Check cache again, this time it's found because the async thread put it there. No need to dispatch getmap.

This is very similar to your last proposal with, I think, only one key difference. I intentionally don't release the meta-tile lock until I have certainty that all the individual tiles have been locked by the async threads. If there's a concurrent user request, we want it to wait for the async thread and not the other way around, because the async threads already have the image whereas the concurrent user request does not (and would need to dispatch a GetMap). Therefore, the async thread must be able to obtain the tile lock, and the concurrent user request (after obtaining the meta-tile lock) will then block and wait for the async thread.

Pseudo code that could help:


getTile(tile) {

   if (isCached(tile)) {
      return getFromCache(tile)
   }

   lock(metatile) // concurrent request will block here if an ongoing getmap is dispatched
   lock(tile)     // concurrent request will block here if an async thread is running

   if (isCached(tile)) {
      release(tile)
      release(metatile)
      return getFromCache(tile)
   }

   dispatchGetMap(metatile)

   response = saveTile(tile)

   for (otherTile : metatile.getTiles()) {
      async(() -> {
         lock(otherTile)
         saveTile(otherTile)
         release(otherTile)
      })
   }

   waitForAllTileLocksToBeObtained()

   release(tile)
   release(metatile)

   return response
}
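The waitForAllTileLocksToBeObtained() step in the pseudo code above can be sketched with a CountDownLatch: each async task counts the latch down as soon as it holds its tile lock, and the primary thread waits on the latch before releasing the meta-tile lock. This is a simplified stand-in (locks elided, names illustrative), not the actual GeoServerTileLayer code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TileLockLatch {
    // Submit one async save per remaining tile; return once every task has
    // signalled that it holds its tile lock, i.e. once it is safe for the
    // caller to release the meta-tile lock.
    public static boolean awaitAllTileLocks(int otherTiles)
            throws InterruptedException {
        CountDownLatch locksObtained = new CountDownLatch(otherTiles);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < otherTiles; i++) {
            pool.submit(() -> {
                // lock(tile) would be acquired here
                locksObtained.countDown(); // signal: tile lock held
                // saveTile(tile); release(tile) continue in background
            });
        }
        boolean ok = locksObtained.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return ok; // true: safe to release the meta-tile lock
    }
}
```

Counting down at lock acquisition (rather than at save completion) is exactly what lets the primary thread release the meta-tile lock early while guaranteeing that late-arriving requests block on the per-tile locks.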

I also attached a diagram to help me sort out my own thoughts on the matter.

(diagram: Geoserver Data Flow - WMS PNG Tile with GWC Integration - Metatile Locking)

Please continue to be critical in case I'm overlooking something!

aaime (Member):

As a reviewer, being critical is my job 🤣
The current code looks good, and the diagram is indeed useful.

src/wms/src/main/java/org/geoserver/wms/RasterCleaner.java (outdated thread, resolved)
@aaime aaime left a comment (Member)

I've made a second round of interactive tests.
The lockup I've observed yesterday has been fixed by your last commit, nice.

I've now tested a layer "seed" with 8 threads to see how the new code interacts with it. The interaction looks poor: after a bit of seeding, I looked into a jstack dump and found 1500+ "gwc-tile-saving-xyz" threads (I have a 10x10 meta-tile configuration, one of the largest we use for production purposes, in order to stress the system).

On one side, this is due to having an unbounded thread pool (my first feedback, which still needs to be addressed). And while I was expecting to see more like 800 background threads (8 seeding threads, 100 tiles each), even that many threads would not be acceptable.

I've also noticed that while the seeding appeared to be complete (no more seeding threads alive), the background threads were still working, saving tiles. This will break common automation cases where a machine is dedicated to seeding offline and the result is then moved to the production cluster for usage: the machinery will think the seed is done and start moving the tiles while they are still being produced.

Also, during a seed the level of concurrency is already controlled by the seed form; having hidden extra concurrency is probably not great, as it multiplies the CPU load in a way that's not entirely predictable. Perhaps the extra thread pool should not be used in that case, and be limited to tile creation on user demand? However, that will require passing down more information, to inform the tile layer about its operational context.

Final note, the "log" method in GeoServerTileLayer will have to disappear or use the LOGGER at FINE/FINER level, and be guarded so that it does not try to build strings/allocate arrays during a seed (I get you used it for debugging purposes, just noting the PR cannot be merged with it still in).


@mbosecke

mbosecke commented Mar 12, 2024

I've committed some changes:

  • The thread pool now defaults to "number of cores * 2"
  • I added a setting on the "Caching Defaults" page to allow the user to override the default size of the thread pool. They can specify "0" to disable it completely, and all tiles will be encoded/saved on the main thread.
  • The GeoServerTileLayer will run everything on the main thread if ConveyorTile.servletReq is null, with the intention of disabling this feature during seeding tasks. I hope it semantically makes sense to use this to differentiate between a user-initiated request and any other type of request (e.g. seeding).
  • Added unit tests in GWCTest in regards to the building and configuring of the ExecutorService.
  • Added unit tests in GeoServerTileLayerTest for scenarios with/without an ExecutorService and with/without a servlet request.
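The seeding check in the list above can be modelled with a plain ThreadLocal standing in for the request marker (whether ConveyorTile.servletReq or the Dispatcher.REQUEST thread-local the PR later switched to). This is a toy model for illustration, not GeoServer's actual classes.

```java
public class SeedingGuard {
    // Stand-in for the per-request marker: set for user-initiated HTTP
    // requests, left null for internal work such as seeding tasks.
    static final ThreadLocal<Object> REQUEST = new ThreadLocal<>();

    // Use async tile encoding only for user-initiated requests, and only
    // when the pool hasn't been disabled by configuring its size as 0.
    public static boolean useAsyncEncoding(int poolSize) {
        return REQUEST.get() != null && poolSize > 0;
    }
}
```

The appeal of this approach is that seeding never sets the marker, so it falls back to synchronous encoding without the tile layer needing any explicit "am I seeding?" flag passed down.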

@mbosecke

I separated encoding and saving into independent runnables, and the primary ConveyorTile that was requested by the user is now encoded on the main thread but saved asynchronously. The additional tiles belonging to the same meta-tile will be encoded and saved asynchronously.

Since the concurrency is now a configurable setting, I updated the benchmark to compare results both with and without concurrency. Here are updated results:

(charts: Without Concurrency and With Concurrency)

Now that saving to disk is asynchronous, I'm a little puzzled as to why there is still such a discrepancy between disabling GWC and using GWC with no cache hits.

@aaime

aaime commented Mar 13, 2024

@mbosecke I'm not sure, but I'm wondering why the numbers in "run without GWC" have roughly doubled compared to the charts in your first comment (from 400/500 ops per second to above 1000).
Is the testing different now?

Also mind the async saving (and I might be misunderstanding your comments here)... it can work for the interactive use case, but for seeding, when the last seeding thread is done, the work must be complete. Otherwise, scripts that wait for the seed to end before moving tiles to another machine will work against a storage that is not yet fully populated.

Thinking out loud, IO can also be a source of issues... it makes a world of difference whether you're saving on a local SSD, on network storage, or on object storage; trying to push too much towards network disks (especially if they are cheaper spinning disks) can be counter-productive. The 2*cores rule of thumb is for WMS requests, where encoding goes directly to the network; I'm not sure about the case where we write to persistent storage. That might require different tuning. I don't have a ready answer here, just pointing out a potentially problematic point.

@aaime

aaime commented Mar 14, 2024

Also thinking out loud, maybe this work can be split into two parts. One is just contributing the faster PNG encoding for sub-images, while the other can build on top of it and provide better concurrency for the interactive seeding case.

@aaime

aaime commented Mar 30, 2024

Finally found a minute to look at the code. If I read correctly, in case of seeding there is no async behavior whatsoever; the asynchronous bits are just for interactive usage. So, looking good from this point of view.

Going back to the performance, it is puzzling indeed. I've run the benchmark twice locally (Ryzen 1700x, 8 cores), with these results:

Benchmark                                                                                         Mode  Cnt     Score     Error  Units
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd50PercentCacheHits     thrpt    5   386,966 ± 125,898  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd75PercentCacheHits     thrpt    5   448,730 ± 111,005  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAnd90PercentCacheHits     thrpt    5  1524,361 ± 276,900  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithConcurrencyAndNoCacheHits            thrpt    5   259,517 ±  78,680  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd50PercentCacheHits  thrpt    5   145,958 ±  14,191  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd75PercentCacheHits  thrpt    5   144,779 ±  35,341  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAnd90PercentCacheHits  thrpt    5   366,582 ±  61,034  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithGwcWithoutConcurrencyAndNoCacheHits         thrpt    5   146,138 ±  31,021  ops/s
WmsMetatileBenchmarkTest.WmsMetatileBenchmark.runWithoutGwc                                      thrpt    5   673,572 ±  74,369  ops/s

Some numbers are odd, but I can reproduce the overall "WMS without GWC is faster". I've tried to look at the workload, making it print the first 100 tiles that are getting requested, and in the cases without cache hits I see the following sequence:

[1024, 0, 11], [1028, 0, 11], [1032, 0, 11], [1036, 0, 11], [1040, 0, 11], [1044, 0, 11], [1048, 0, 11], [1052, 0, 11],...

So one tile out of four is being requested...
For the GWC cache, that still means producing all the tiles in a meta-tile and saving them, even if only one tile is requested. But for the WMS case, it just requires producing that single tile, without the rest of the meta-tile.
If I'm understanding this correctly, for a fair comparison you'd need to also change the meta-tiling factors of the layer down to 1:1, so that GWC and stand-alone WMS are doing the same job, or else make the WMS case ask for all the tiles in a meta-tile.

Update: come to think of it, setting tiled=true&tilesorigin=... enables the usage of the "QuickTileCache", which does a bit of on the fly meta-tiling, but with a very different strategy: meta tiles are 3x3, the cache stores RenderedImages, not saved PNGs, while only the requested tile is encoded in PNG. If it's actually in use (some debugging should be used to confirm) then the results are very much "apples and oranges".

@mbosecke

mbosecke commented Apr 1, 2024

Thanks for looking into that further. Just FYI, I've been distracted lately with other things but I'm hoping to revisit this further in the coming week or two! I haven't abandoned this initiative yet.

Mitchell Bösecke added 17 commits April 23, 2024 08:36
Integrated JMH with Junit to easily run.
…BufferedImageAdapter.

Both the library and the adapter were making deep copies of image raster data in the case of sub-images (i.e. when metatiling). The imageio-ext png writer has since been improved to be able to handle sub-images natively without requiring a complete copy of its underlying data.
…rk for metatiles to additional threads. The main thread will only encode the desired ConveyorTile while other tiles will be done asynchronously.

For a 4x4 metatile, which has 16 inner tiles, this saves the main thread from having to encode the 15 additional tiles in addition to the requested tile.
…s on their individual tiles. Once all locks are obtained, then it's safe to release the primary metatile lock.
- Exposed a configuration option for users to adjust it
- Default size of "number of cores * 2"
- Disabling concurrency for seeding tasks
Mitchell Bösecke added 3 commits April 23, 2024 08:36
…verTileLayer instead of "ConveyorTile.servletReq".

This appears to work for both WMS and WMTS whereas "ConveyorTile.servletReq" only worked for WMTS.

Seeding does not appear to set "Dispatcher.REQUEST" so it avoids the additional concurrency, as desired.

Also upgraded the benchmark test to test both with and without concurrency set.
@aaime

aaime commented May 8, 2024

Updates to the PNG encoder have made their way into ImageIO-EXT 1.4.10 and main has been updated to use it.
All the bits this PR needs should be in line, if you can merge/rebase it with the current main.
