[Tiered Caching] Additional ITs for cache stats #13655

Open · wants to merge 11 commits into base: main

Conversation

peteralfonsi (Contributor) commented May 13, 2024

Description

Adds more IT and UT coverage for cache stats in the tiered spillover cache. Also includes three small bug fixes found during testing:

  • The cache clear API incorrectly wiped hit, miss, and eviction stats.
  • Items evicted from the heap tier but rejected from the disk tier by its policies were incorrectly not counted toward the cache's total evictions.
  • The request_cache object in the XContent response for the cache stats API was incorrectly located at nodes.[node_id].request_cache instead of nodes.[node_id].caches.request_cache.

Related Issues

Resolves #13455

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • [N/A] API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • [N/A] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [N/A] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Peter Alfonsi added 2 commits May 10, 2024 09:39
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 5192014: FAILURE

❌ Gradle check result for 727f1a9: FAILURE

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for cd145bc: FAILURE

Peter Alfonsi added 2 commits May 14, 2024 12:53
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
Signed-off-by: Peter Alfonsi <petealft@amazon.com>
✅ Gradle check result for c365f30: SUCCESS

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 78.26087%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 71.68%. Comparing base (b15cb0c) to head (d03988a).
Report is 328 commits behind head on main.

Current head d03988a differs from pull request most recent head 4464fd9

Please upload reports for the commit 4464fd9 to get more accurate results.

Files Patch % Lines
...pensearch/common/cache/service/NodeCacheStats.java 0.00% 3 Missing ⚠️
...e/common/tier/TieredSpilloverCacheStatsHolder.java 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13655      +/-   ##
============================================
+ Coverage     71.42%   71.68%   +0.26%     
- Complexity    59978    61363    +1385     
============================================
  Files          4985     5064      +79     
  Lines        282275   288120    +5845     
  Branches      40946    41721     +775     
============================================
+ Hits         201603   206545    +4942     
- Misses        63999    64524     +525     
- Partials      16673    17051     +378     


❌ Gradle check result for 54e396c: FAILURE

dblock (Member) commented May 15, 2024

❌ Gradle check result for 54e396c: FAILURE

org.opensearch.indices.IndicesRequestCacheIT.testStaleKeysCleanupWithMultipleIndices {p0={"search.concurrent_segment_search.enabled":"true"}}

#13437


Comment on lines 86 to 87
int itemsOnHeap = HEAP_CACHE_SIZE / (int) singleSearchSize;
int itemsOnDisk = 1 + randomInt(30); // The first one we search (to get the size) always goes to disk
Contributor

I am guessing these signify max entries that can be supported on heap/disk right? Can we name this accordingly then?

Contributor Author

They were meant to be the number of items that end up on the heap/disk tier at the end of the test. For the heap tier this is the same as the max entries. (We don't do disk evictions since this is a MockDiskCache, not an actual ehcache one.) Will change to itemsOnDiskAfterTest to make it clearer.
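
A rough sketch of that renaming, based on the snippet above (the final names in the PR may differ):

int itemsOnHeapAfterTest = HEAP_CACHE_SIZE / (int) singleSearchSize; // the heap tier ends the test at capacity
int itemsOnDiskAfterTest = 1 + randomInt(30); // the first search (used to measure entry size) always spills to disk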

return Arrays.<Object[]>asList(new Object[] { Settings.builder().put(FeatureFlags.PLUGGABLE_CACHE, "true").build() });
}

public void testMultiLevelAggregation() throws Exception {
Contributor

We should add more tests here to cover the following:

  1. Delete an index shard and verify the dropStatsForDimension flow.
  2. Explicitly clear the cache and verify stats are functioning correctly, i.e. memory_size and entries are cleared up. (Seems already covered in the CacheStatsAPI IT test below?)
  3. Disable/enable the disk cache dynamically and verify stats work fine.
  4. Verify existing IRC stats match the overall cache stats.
  5. Check negative scenarios, i.e. send invalid levels and verify an exception is thrown.
  6. Check stats for different dimension combinations: indices/tier, indices, indices/shards/tier. Some of them might already be covered here.

Contributor Author

  1. I think this one is already covered in IndicesRequestCacheTests.testClosingIndexWipesStats().
  2. Agreed, I think this is covered in the test you mentioned.
  3. I think this is covered in the last part of TieredSpilloverCacheTests.testTierStatsAddCorrectly(). We populate some stats, check they're right, disable the disk cache, do some more requests, confirm the stats are still right, and turn the disk cache back on.
  4. This is done in CacheStatsAPIIndicesRequestCacheIT.testStatsMatchOldApi(), but it would be a good idea to do a similar one for the tiered cache as well. Will add that.
  5. Good idea, will add that (see the sketch after this list).
  6. I think we're only missing indices/shards/tier and shards/tier. I figured they were redundant since we already covered the indices cases, and tier is the only "special" dimension that wasn't tested already in CacheStatsAPIIndicesRequestCacheIT. Can add these as well though.
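
For item 5, a minimal sketch of what the negative-levels test could look like; getNodeCacheStatsResult and the exception type are assumptions for illustration, not necessarily what this PR adds:

public void testInvalidLevelsAreRejected() throws Exception {
    Client client = client();
    startIndex(client, "index1"); // helper used by the existing ITs above
    // Ask for stats aggregated by a level that does not exist and expect the request to be rejected.
    expectThrows(
        IllegalArgumentException.class,
        () -> getNodeCacheStatsResult(client, List.of("bogus_level")) // hypothetical helper
    );
}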

@@ -51,13 +52,15 @@ public void writeTo(StreamOutput out) throws IOException {

@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject(NodesStatsRequest.Metric.CACHE_STATS.metricName());
Contributor

I guess this was broken in 2.14? What did the stats end up looking like?

Contributor Author

Yes, this was my mistake. In the response object, instead of having the stats in nodes.[node_id].caches.request_cache, they were at nodes.[node_id].request_cache. This change puts it in line with other node stats responses.
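
For reference, a rough sketch of where the stats sit in the response after this change; the individual stat field names and values here are illustrative only:

{
  "nodes": {
    "<node_id>": {
      "caches": {
        "request_cache": {
          "size_in_bytes": 1234,
          "evictions": 0,
          "hit_count": 10,
          "miss_count": 5
        }
      }
    }
  }
}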


if (cleanupKeysFromClosedShards.contains(shardIdInfo)) {
// Since the shard is closed, the cache should drop stats for this shard.
// This should not happen on a full cache cleanup.
Contributor

I am guessing we are not dropping stats during cache cleanup because the indexShard might still be present on that node, so we still need to count stats for future incoming requests for the same shard?
Can we also add an IT/UT around this scenario if there isn't one already?

Contributor Author

Yeah, on cache cleanup we don't drop hits/misses/evictions, to match the behavior of the existing stats API. I suspect this is why they decided to do it that way. I think CacheStatsAPIIndicesRequestCacheIT.testCacheClear() should cover this scenario, right?

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
github-actions bot added the v2.15.0 label (Issues and PRs related to version 2.15.0) on May 28, 2024
❌ Gradle check result for af5a9dc: FAILURE

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for f12ec0f: FAILURE

Comment on lines +718 to +719
// Contains CleanupKey objects for a full cache cleanup.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromFullClean = new HashSet<>();
Contributor

What is preventing us from deleting cache itself for full cache cleanup?

Contributor Author

I'm not entirely sure why we can't do that; this is the way it was done by @kiranprakash154 and @sgup432 for the original cache cleanup logic. They had just combined cleanup keys for a full cleanup or for a single closed shard into cleanupKeysFromClosedShards.

Here we're just separating them out because we only want to drop the dimensions from the stats object if the shard is closed, not if the whole cache is cleaned. The actual logic of when we call iterator.remove() should be the same as before.

// Contains CleanupKey objects for a full cache cleanup.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromFullClean = new HashSet<>();
// Contains CleanupKey objects for a closed shard.
final Set<Tuple<ShardId, Integer>> cleanupKeysFromClosedShards = new HashSet<>();

for (Iterator<CleanupKey> iterator = keysToClean.iterator(); iterator.hasNext();) {
CleanupKey cleanupKey = iterator.next();
iterator.remove();
Contributor

Should we remove from the iterator towards the end of the block? Also, does removal during iteration not cause any issues?

Contributor Author

I think removal during iteration is ok, since it was already happening in the original cache cleanup logic.
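
For context, Iterator-based removal is the supported way to mutate a collection mid-iteration; removing through the collection itself is what risks ConcurrentModificationException. A minimal standalone example:

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class IteratorRemoveExample {
    public static void main(String[] args) {
        Set<String> keys = new HashSet<>(Set.of("a", "b", "c"));
        for (Iterator<String> it = keys.iterator(); it.hasNext();) {
            it.next();   // advance first; remove() requires a preceding next()
            it.remove(); // safe: removal goes through the iterator, not the underlying set
        }
        System.out.println(keys.isEmpty()); // prints true
    }
}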

Comment on lines 726 to 736
if (cleanupKey.readerCacheKeyId == null) {
// null indicates full cleanup
IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
// Add both shardId and indexShardHashCode to uniquely identify an indexShard.
cleanupKeysFromFullClean.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else if (!cleanupKey.entity.isOpen()) {
// The shard is closed
IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
cleanupKeysFromClosedShards.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else {
cleanupKeysFromOutdatedReaders.add(cleanupKey);
Contributor

Slightly cleaner:

final IndexShard indexShard = (IndexShard) cleanupKey.entity.getCacheIdentity();
if (cleanupKey.readerCacheKeyId == null) {
  // null indicates full cleanup
  // Add both shardId and indexShardHashCode to uniquely identify an indexShard.
  cleanupKeysFromFullClean.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else if (!cleanupKey.entity.isOpen()) {
  // The shard is closed
  cleanupKeysFromClosedShards.add(new Tuple<>(indexShard.shardId(), indexShard.hashCode()));
} else {
  cleanupKeysFromOutdatedReaders.add(cleanupKey);
}

Comment on lines +84 to +88
// Get the immutable cache stats for a given cache, used to avoid having to process XContent in tests.
// Safe to expose publicly as the ImmutableCacheStatsHolder can't be modified after its creation.
public ImmutableCacheStatsHolder getStatsByCache(CacheType cacheType) {
return statsByCache.get(cacheType);
}
Contributor

Is this method required for verification within the tests? Probably not the right thing to do, although I'm not sure if there is a better way of achieving this.

Contributor Author

It's not technically required, but I think it's worth adding because it simplifies the tests a lot.

Before I added this, I accomplished the same thing by calling NodeCacheStats.toXContent(), converting that to a nested Map<String, Object>, and traversing through that recursively. It worked but was pretty gross. I also noticed no other tests seem to directly handle XContent like this; instead they use methods to access whatever data is used to produce the XContent in the first place. So I thought we could simplify the tests by adding this.
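
For comparison, a rough sketch of the traversal half of that XContent-based approach (the toXContent-to-Map conversion step is omitted here, and the example path is illustrative):

import java.util.List;
import java.util.Map;

public final class NestedMapUtil {
    // Walk a path of keys (e.g. ["indices", "index1", "tier", "on_heap"]) through the nested
    // Map<String, Object> produced by converting an XContent response.
    @SuppressWarnings("unchecked")
    public static Object getValueAtPath(Map<String, Object> root, List<String> path) {
        Object current = root;
        for (String key : path) {
            current = ((Map<String, Object>) current).get(key);
        }
        return current;
    }
}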

Comment on lines -254 to -256
if (aggregationLevels != null && !aggregationLevels.isEmpty()) {
paramMap.put("level", String.join(",", aggregationLevels));
}
Contributor

Why are the aggregationLevels not needed anymore as part of params?

Contributor Author

I believe during review for the original cache stats API PR, when these original ITs were written, there was a change to how params worked so that levels got passed into the constructor for ImmutableCacheStatsHolder. Before that, we had to also pass them in params to ImmutableCacheStatsHolder.toXContent(). But I forgot to remove these lines from the ITs even though they were no longer needed.

checkCacheStatsAPIResponse(statsHolder, List.of(), expectedTotal, true, true);
}

public void testNodesStatsResponse() throws Exception {
Contributor

Can you add a detailed description of the specific test case covered above each test method?

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 08da716: FAILURE

Comment on lines 70 to 85
public void testMultiLevelAggregation() throws Exception {
internalCluster().startNodes(
1,
Settings.builder()
.put(TieredSpilloverCacheIT.defaultSettings(HEAP_CACHE_SIZE_STRING))
.put(
TieredSpilloverCacheSettings.TOOK_TIME_POLICY_CONCRETE_SETTINGS_MAP.get(CacheType.INDICES_REQUEST_CACHE).getKey(),
new TimeValue(0, TimeUnit.SECONDS)
)
.build()
);
Client client = client();
String index1Name = "index1";
String index2Name = "index2";
startIndex(client, index1Name);
startIndex(client, index2Name);
Contributor

Ideally it is better to have multiple test cases as different test methods. While this might make us repeat some of the setup overhead, the gains in readability/clarity are worth it, especially when debugging regressions.

Contributor

Maybe you can have a common setup function to be used across multiple test cases.
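
A rough sketch of what that shared setup helper could look like, pulled from the setup shown above (the method name and signature are illustrative):

private Client setUpClusterAndIndices(String... indexNames) {
    internalCluster().startNodes(
        1,
        Settings.builder()
            .put(TieredSpilloverCacheIT.defaultSettings(HEAP_CACHE_SIZE_STRING))
            .put(
                TieredSpilloverCacheSettings.TOOK_TIME_POLICY_CONCRETE_SETTINGS_MAP.get(CacheType.INDICES_REQUEST_CACHE).getKey(),
                new TimeValue(0, TimeUnit.SECONDS)
            )
            .build()
    );
    Client client = client();
    for (String indexName : indexNames) {
        startIndex(client, indexName); // helper from the existing ITs
    }
    return client;
}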

❌ Gradle check result for 6860739: FAILURE

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 90aa5fd: FAILURE

jainankitk (Contributor) left a comment

LGTM!

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❕ Gradle check result for d03988a: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Peter Alfonsi <petealft@amazon.com>
❌ Gradle check result for 4464fd9: FAILURE

Labels
enhancement (Enhancement or improvement to existing feature or request), Search:Performance, skip-changelog, :test (Adding or fixing a test), v2.15.0 (Issues and PRs related to version 2.15.0)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add more ITs around tiered caching stats
4 participants