-
-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add checks before index creation #34598
base: master
Are you sure you want to change the base?
Conversation
…fore-index-creation
…d give a better error message to user when cluster gets to red state
1cb7c0d
to
2df4b3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR overall looks very good. I love the tests, but did leave a lot of generic testing guidance that doesn't need to be responded to as part of this PR. My main concern is whether multiple unassigned shards should be possible, and then some minor documentation issues
@@ -131,6 +131,10 @@ def cluster_routing(self, *, enabled): | |||
value = "all" if enabled else "none" | |||
self._cluster_put_settings({"cluster.routing.allocation.enable": value}) | |||
|
|||
def cluster_get_settings(self, is_flat=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a test below for is_flat=False
. Is this option necessary?
self.adapter._cluster_put_settings(settings_obj) | ||
settings = self.adapter.cluster_get_settings() | ||
self.assertEqual(settings['transient'], settings_obj) | ||
self._clear_cluster_settings(verify=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is clearing the cluster settings necessary for what this test is trying to test? If it's just cleanup, it might be better to move it into a cleanup block that ensures the tests clean up after themselves.
reason = err.info['error']['reason'] | ||
if 'unable to find any unassigned shards to explain' in reason: | ||
return {'unassigned_shards': []} | ||
raise err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if it would be more appropriate to use just raise
rather than raise err
here. I don't think we likely don't need the extra stack trace for this except block.
def cluster_allocation_explain(self): | ||
""" | ||
Returns cluster allocation for the first failed shard with their node allocation decisions. | ||
It can be extended to return allocation of a specific shard or index if provided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what this comment means? My interpretation is that you're suggesting we could modify this code in the future?
}) | ||
shard_info["rejection_explanation"].append(node_info) | ||
return { | ||
"unassigned_shards": [shard_info] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unassigned_shards
is either a single index array or an empty array -- is the array necessary here? Could we instead return the unassigned shard or None
?
) | ||
|
||
if not has_one_node_with_available_space: | ||
raise Exception("""All data nodes are above low watermark capacity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better class of Exception
we can raise here? Raising generic Exception
would force catching by generic Exception
, which is we try to avoid
sleep_time = 0 if settings.UNIT_TESTING else 3 | ||
sleep(sleep_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love putting test considerations into production code. I'm assuming this makes writing tests easier, but I also just wanted to point out that tests can mock out the sleep
call on their own
if cluster_health['status'] == 'red': | ||
self._explain_shard_allocation_failure() | ||
manager.index_delete(self.name) | ||
raise Exception(f"Failed to create {self.name} failed. Deleted {self.name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer a non-generic Exception
to be raised here
""" | ||
unassigned_shards = manager.cluster_allocation_explain() | ||
if unassigned_shards: | ||
for shard in unassigned_shards: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written, unassigned_shards
will only have a single items or be empty. Is this a bug? If not, do we need the iteration here? Or, at the minimum...it seems like we have the possibility of an "unassigned_shard", but never "unassigned_shards", right?
self.assertIndexMappingMatches(self.index, self.type, self.mapping) | ||
self.assertIndexHasAnalysis(self.index, self.analysis) | ||
|
||
def test_create_index_with_creation_checks_with_high_disk_watermark(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably another test better written as a unit test. Because we're hitting the actual cluster, we have to specify a very low watermark to hopefully ensure that all the nodes are above the watermark. If we instead were testing against some fake node configuration, we could make it obvious that that configuration was currently all above the watermark. That would also allow us to write tests asserting the warnings were issued for all hosts above the watermark, even if one node was available below the watermark.
Product Description
While we tried to deploy #34246 to get indices out, we faced several issues because of the state of the cluster.
This PR adds some checks in order to make index creation process more safer for future -
Review by 🐡
Feature Flag
NA
Safety Assurance
Added adequate tests and have tested it locally.
Safety story
Only affects new index creation which is very supervised thing and would happen mostly during deploy. So it should not affect any user in any way.
Automated test coverage
Added tests for the changes.
QA Plan
NA
Rollback instructions
Labels & Review