Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add checks before index creation #34598

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

AmitPhulera
Copy link
Contributor

@AmitPhulera AmitPhulera commented May 10, 2024

Product Description

While we tried to deploy #34246 to get indices out, we faced several issues because of the state of the cluster.
This PR adds some checks in order to make index creation process more safer for future -

  • It adds a check to verify disk watermark settings
  • Fails if all the data nodes are beyond the watermark settings
  • If after index creation, cluster gets into red state i.e primary replicas are not assigned, it gives users a better error message.

Review by 🐡

Feature Flag

NA

Safety Assurance

Added adequate tests and have tested it locally.

Safety story

Only affects new index creation which is very supervised thing and would happen mostly during deploy. So it should not affect any user in any way.

Automated test coverage

Added tests for the changes.

QA Plan

NA

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@AmitPhulera AmitPhulera force-pushed the ap/es/add-checks-before-index-creation branch from 1cb7c0d to 2df4b3a Compare May 10, 2024 12:02
Copy link
Contributor

@mjriley mjriley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR overall looks very good. I love the tests, but did leave a lot of generic testing guidance that doesn't need to be responded to as part of this PR. My main concern is whether multiple unassigned shards should be possible, and then some minor documentation issues

@@ -131,6 +131,10 @@ def cluster_routing(self, *, enabled):
value = "all" if enabled else "none"
self._cluster_put_settings({"cluster.routing.allocation.enable": value})

def cluster_get_settings(self, is_flat=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a test below for is_flat=False. Is this option necessary?

self.adapter._cluster_put_settings(settings_obj)
settings = self.adapter.cluster_get_settings()
self.assertEqual(settings['transient'], settings_obj)
self._clear_cluster_settings(verify=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is clearing the cluster settings necessary for what this test is trying to test? If it's just cleanup, it might be better to move it into a cleanup block that ensures the tests clean up after themselves.

reason = err.info['error']['reason']
if 'unable to find any unassigned shards to explain' in reason:
return {'unassigned_shards': []}
raise err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if it would be more appropriate to use just raise rather than raise err here. I don't think we likely don't need the extra stack trace for this except block.

def cluster_allocation_explain(self):
"""
Returns cluster allocation for the first failed shard with their node allocation decisions.
It can be extended to return allocation of a specific shard or index if provided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify what this comment means? My interpretation is that you're suggesting we could modify this code in the future?

})
shard_info["rejection_explanation"].append(node_info)
return {
"unassigned_shards": [shard_info]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unassigned_shards is either a single index array or an empty array -- is the array necessary here? Could we instead return the unassigned shard or None?

)

if not has_one_node_with_available_space:
raise Exception("""All data nodes are above low watermark capacity.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better class of Exception we can raise here? Raising generic Exception would force catching by generic Exception, which is we try to avoid

Comment on lines +148 to +149
sleep_time = 0 if settings.UNIT_TESTING else 3
sleep(sleep_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love putting test considerations into production code. I'm assuming this makes writing tests easier, but I also just wanted to point out that tests can mock out the sleep call on their own

if cluster_health['status'] == 'red':
self._explain_shard_allocation_failure()
manager.index_delete(self.name)
raise Exception(f"Failed to create {self.name} failed. Deleted {self.name}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer a non-generic Exception to be raised here

"""
unassigned_shards = manager.cluster_allocation_explain()
if unassigned_shards:
for shard in unassigned_shards:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written, unassigned_shards will only have a single items or be empty. Is this a bug? If not, do we need the iteration here? Or, at the minimum...it seems like we have the possibility of an "unassigned_shard", but never "unassigned_shards", right?

self.assertIndexMappingMatches(self.index, self.type, self.mapping)
self.assertIndexHasAnalysis(self.index, self.analysis)

def test_create_index_with_creation_checks_with_high_disk_watermark(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably another test better written as a unit test. Because we're hitting the actual cluster, we have to specify a very low watermark to hopefully ensure that all the nodes are above the watermark. If we instead were testing against some fake node configuration, we could make it obvious that that configuration was currently all above the watermark. That would also allow us to write tests asserting the warnings were issued for all hosts above the watermark, even if one node was available below the watermark.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants