[Tiered Caching] Additional ITs for cache stats #13655
Conversation
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 5192014: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 727f1a9: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for cd145bc: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #13655 +/- ##
============================================
+ Coverage 71.42% 71.68% +0.26%
- Complexity 59978 61363 +1385
============================================
Files 4985 5064 +79
Lines 282275 288120 +5845
Branches 40946 41721 +775
============================================
+ Hits 201603 206545 +4942
- Misses 63999 64524 +525
- Partials 16673 17051 +378
❌ Gradle check result for 54e396c: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
org.opensearch.indices.IndicesRequestCacheIT.testStaleKeysCleanupWithMultipleIndices {p0={"search.concurrent_segment_search.enabled":"true"}}
...c/internalClusterTest/java/org.opensearch.cache.common.tier/TieredSpilloverCacheStatsIT.java
int itemsOnHeap = HEAP_CACHE_SIZE / (int) singleSearchSize;
int itemsOnDisk = 1 + randomInt(30); // The first one we search (to get the size) always goes to disk
I am guessing these signify max entries that can be supported on heap/disk right? Can we name this accordingly then?
They were meant to be the number of items that end up on the heap/disk tier at the end of the test. For the heap tier this is the same as max entries. (We don't do disk evictions since this is a MockDiskCache, not an actual ehcache one.) Will change to itemsOnDiskAfterTest to make it clearer.
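For reference, the tier-distribution arithmetic discussed above can be sketched as follows. The class name and constant values are hypothetical, not taken from the actual test; the sketch assumes randomInt(30) in the OpenSearch test framework is inclusive of its bound.

```java
import java.util.concurrent.ThreadLocalRandom;

public class TierDistributionSketch {
    public static void main(String[] args) {
        final long heapCacheSizeBytes = 1024; // hypothetical heap tier capacity
        final long singleSearchSize = 256;    // hypothetical size of one cached search response
        // Number of items the heap tier can hold before spilling to disk;
        // for the heap tier this is also the max-entries limit.
        int itemsOnHeap = (int) (heapCacheSizeBytes / singleSearchSize);
        // Items expected on the disk tier when the test ends: the first search
        // (used to measure entry size) always spills, plus a random extra count.
        int itemsOnDiskAfterTest = 1 + ThreadLocalRandom.current().nextInt(31); // 1..31
        System.out.println(itemsOnHeap + " heap items, " + itemsOnDiskAfterTest + " disk items");
    }
}
```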
return Arrays.<Object[]>asList(new Object[] { Settings.builder().put(FeatureFlags.PLUGGABLE_CACHE, "true").build() });
}
public void testMultiLevelAggregation() throws Exception {
We should add more tests here to cover the below:
- Delete an index shard and verify the dropStatsForDimension flow.
- Explicitly clear the cache and verify stats are functioning correctly, i.e. memory_size and entries are cleared up. (Seems already covered in the CacheStatsAPI IT test below?)
- Disable/enable the disk cache dynamically and verify stats work fine.
- Verify existing IRC stats match the overall cache stats.
- Check negative scenarios, i.e. send invalid levels and verify an exception is thrown.
- Check stats for different dimension combinations: indices/tier, indices, indices/shards/tier. Some of them might already be covered here.
- I think this one is already covered in IndicesRequestCacheTests.testClosingIndexWipesStats().
- Agreed, I think this is covered in the test you mentioned.
- I think this is covered in the last part of TieredSpilloverCacheTests.testTierStatsAddCorrectly(). We populate some stats, check they're right, disable the disk cache, do some more requests, confirm the stats work right, and turn the disk cache back on.
- This is done in CacheStatsAPIIndicesRequestCacheIT.testStatsMatchOldApi(), but it would be a good idea to do a similar one for the tiered cache as well. Will add that.
- Good idea, will add that.
- I think we're only missing indices/shards/tier and shards/tier; I figured they were redundant since we already covered the indices cases, and tier is the only "special" dimension that wasn't tested already in CacheStatsAPIIndicesRequestCacheIT. Can add these as well, though.
@@ -51,13 +52,15 @@ public void writeTo(StreamOutput out) throws IOException {

@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject(NodesStatsRequest.Metric.CACHE_STATS.metricName());
I guess this was broken in 2.14? How did the stats end up looking?
Yes, this was my mistake. In the response object, instead of having the stats in nodes.[node_id].caches.request_cache, it was at nodes.[node_id].request_cache. This change puts it in line with other node stats responses.
if (cleanupKeysFromClosedShards.contains(shardIdInfo)) {
    // Since the shard is closed, the cache should drop stats for this shard.
    // This should not happen on a full cache cleanup.
I am guessing we are not dropping stats during cache cleanup because the indexShard might still be present on that node, so we still need to count stats for future incoming requests for the same shard?
Can we also add an IT/UT around this scenario, if there isn't one already?
Yeah, on cache cleanup we don't drop hits/misses/evictions, to match the behavior of the existing stats API. I suspect this is why they decided to do it that way. I think CacheStatsAPIIndicesRequestCacheIT.testCacheClear() should cover this scenario, right?
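The behavior described here (a cleanup wipes the cache contents but keeps the lifetime hit/miss/eviction counters, matching the existing stats API) can be sketched with a minimal, hypothetical stats holder. This is an illustration only, not the actual OpenSearch implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-cache stats holder.
public class CacheStatsSketch {
    final AtomicLong hits = new AtomicLong();
    final AtomicLong misses = new AtomicLong();
    final AtomicLong evictions = new AtomicLong();
    final AtomicLong entries = new AtomicLong();
    final AtomicLong memorySizeBytes = new AtomicLong();

    void onClear() {
        // Drop what is physically in the cache...
        entries.set(0);
        memorySizeBytes.set(0);
        // ...but keep hits/misses/evictions: the shard may still be open and
        // serving requests, and the old stats API behaves the same way.
    }

    public static void main(String[] args) {
        CacheStatsSketch stats = new CacheStatsSketch();
        stats.hits.set(5);
        stats.misses.set(2);
        stats.entries.set(2);
        stats.memorySizeBytes.set(4096);
        stats.onClear();
        // Lifetime counters survive the clear; size counters do not.
        System.out.println(stats.hits.get() + "," + stats.entries.get()); // prints 5,0
    }
}
```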
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for af5a9dc: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for f12ec0f: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
// Contains CleanupKey objects for a full cache cleanup.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromFullClean = new HashSet<>();
What is preventing us from deleting the cache itself for a full cache cleanup?
I'm not entirely sure why we can't do that; this is the way it was done by @kiranprakash154 and @sgup432 for the original cache cleanup logic. They had just combined cleanup keys for a full cleanup or for a single closed shard into cleanupKeysFromClosedShards.
Here we're just separating them out because we only want to drop the dimensions from the stats object if the shard is closed, not if the whole cache is cleaned. The actual logic of when we call iterator.remove() should be the same as before.
// Contains CleanupKey objects for a full cache cleanup.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromFullClean = new HashSet<>();
// Contains CleanupKey objects for a closed shard.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromClosedShards = new HashSet<>();
for (Iterator<CleanupKey> iterator = keysToClean.iterator(); iterator.hasNext();) {
    CleanupKey cleanupKey = iterator.next();
    iterator.remove();
Should we remove from the iterator towards the end of the block? Also, does removal during iteration not cause any issues?
I think removal during iteration is ok, since it was already happening in the original cache cleanup logic.
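To illustrate why this is safe: removing through the iterator itself keeps the iterator's cursor consistent, whereas calling remove() on the collection mid-iteration risks a ConcurrentModificationException. A minimal standalone demonstration (names here are illustrative, not from the PR):

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class IteratorRemovalDemo {
    public static void main(String[] args) {
        Set<String> keysToClean = new HashSet<>(Set.of("a", "b", "c"));
        for (Iterator<String> it = keysToClean.iterator(); it.hasNext();) {
            String key = it.next();
            it.remove(); // safe: removal goes through the iterator
            // keysToClean.remove(key) here would make the fail-fast iterator
            // throw ConcurrentModificationException on the next it.next() call.
        }
        System.out.println(keysToClean.size()); // prints 0
    }
}
```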
if (cleanupKey.readerCacheKeyId == null) {
    // null indicates full cleanup
    IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
    // Add both shardId and indexShardHashCode to uniquely identify an indexShard.
    cleanupKeysFromFullClean.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else if (!cleanupKey.entity.isOpen()) {
    // The shard is closed
    IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
    cleanupKeysFromClosedShards.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else {
    cleanupKeysFromOutdatedReaders.add(cleanupKey);
Slightly cleaner:
final IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
if (cleanupKey.readerCacheKeyId == null) {
    // null indicates full cleanup
    // Add both shardId and indexShardHashCode to uniquely identify an indexShard.
    cleanupKeysFromFullClean.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else if (!cleanupKey.entity.isOpen()) {
    // The shard is closed
    cleanupKeysFromClosedShards.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else {
    cleanupKeysFromOutdatedReaders.add(cleanupKey);
}
// Get the immutable cache stats for a given cache, used to avoid having to process XContent in tests.
// Safe to expose publicly as the ImmutableCacheStatsHolder can't be modified after its creation.
public ImmutableCacheStatsHolder getStatsByCache(CacheType cacheType) {
    return statsByCache.get(cacheType);
}
Is this method required for verification within the tests? It's probably not the right thing to do, although I'm not sure if there is a better way of achieving this.
It's not technically required, but I think it's worth adding because it simplifies the tests a lot.
Before I added this, I accomplished the same thing by calling NodeCacheStats.toXContent(), converting that to a nested Map<String, Object>, and traversing through that recursively. It worked but was pretty gross. But I noticed no other tests seem to directly handle XContent like this, and instead they use methods to access whatever data is used to produce XContent in the first place. So I thought we could simplify the tests by adding this.
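The two approaches can be sketched with plain nested maps standing in for the parsed XContent response. All names and the response shape below are illustrative assumptions, not the real OpenSearch types:

```java
import java.util.Map;

public class StatsAccessSketch {
    // The "gross" approach: recursively walk a Map produced from the
    // serialized XContent response, casting at every level.
    @SuppressWarnings("unchecked")
    static Object getByPath(Map<String, Object> xContentAsMap, String... path) {
        Object current = xContentAsMap;
        for (String key : path) {
            current = ((Map<String, Object>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical parsed response for one node.
        Map<String, Object> response = Map.of(
            "caches", Map.of("request_cache", Map.of("hit_count", 7L))
        );
        // Traversal through the serialized structure:
        long hits = (Long) getByPath(response, "caches", "request_cache", "hit_count");
        System.out.println(hits); // prints 7
        // A direct accessor like getStatsByCache(CacheType) skips the
        // serialize-then-parse round trip entirely.
    }
}
```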
if (aggregationLevels != null && !aggregationLevels.isEmpty()) {
    paramMap.put("level", String.join(",", aggregationLevels));
}
Why are the aggregationLevels no longer needed as part of params?
I believe during review for the original cache stats API PR, when these original ITs were written, there was a change to how params worked so that levels got passed into the constructor for ImmutableCacheStatsHolder. Before that, we had to also pass them in params to ImmutableCacheStatsHolder.toXContent(). But I forgot to remove these lines from the ITs even though they were no longer needed.
checkCacheStatsAPIResponse(statsHolder, List.of(), expectedTotal, true, true);
}

public void testNodesStatsResponse() throws Exception {
Can you add a detailed description of the specific test case covered above each test method?
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 08da716: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
public void testMultiLevelAggregation() throws Exception {
    internalCluster().startNodes(
        1,
        Settings.builder()
            .put(TieredSpilloverCacheIT.defaultSettings(HEAP_CACHE_SIZE_STRING))
            .put(
                TieredSpilloverCacheSettings.TOOK_TIME_POLICY_CONCRETE_SETTINGS_MAP.get(CacheType.INDICES_REQUEST_CACHE).getKey(),
                new TimeValue(0, TimeUnit.SECONDS)
            )
            .build()
    );
    Client client = client();
    String index1Name = "index1";
    String index2Name = "index2";
    startIndex(client, index1Name);
    startIndex(client, index2Name);
Ideally it is better to have multiple test cases as different test methods. While this might make us repeat some of the setup overhead, the gains are worth it for readability/clarity and when debugging regressions.
Maybe you can have a common setup function to be used across multiple test cases.
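One possible shape for that suggestion, sketched with hypothetical names rather than the real OpenSearch test classes: a shared setup helper invoked by each independent test method, so each scenario keeps its own method while repeating only a cheap call.

```java
// Hypothetical sketch of a shared setup helper; the real IT would call
// internalCluster().startNodes(...) and create the indices instead.
public class TieredCacheStatsITSketch {

    // Common setup used by every scenario, so each test method stays focused.
    static String setupNodeAndIndices() {
        // Stand-in for: start one node with tiered-cache settings, create index1/index2.
        return "node[index1,index2]";
    }

    void testMultiLevelAggregation() {
        String cluster = setupNodeAndIndices();
        // ...assertions specific to multi-level aggregation...
    }

    void testTierDimensionOnly() {
        String cluster = setupNodeAndIndices();
        // ...assertions specific to the tier dimension...
    }

    public static void main(String[] args) {
        System.out.println(setupNodeAndIndices());
    }
}
```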
❌ Gradle check result for 6860739: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 90aa5fd: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
LGTM!
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❕ Gradle check result for d03988a: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 4464fd9: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Adds more IT and UT coverage for cache stats in the tiered spillover cache. Also has 3 small bugfixes which were found during testing, including:
- The request_cache object in the XContent response for the cache stats API was incorrectly at nodes.[node_id].request_cache instead of nodes.[node_id].caches.request_cache.
Related Issues
Resolves #13455
Check List
- [N/A] API changes companion pull request created.
- [N/A] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- [N/A] Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.