Add checks before index creation #34598

AmitPhulera · 2024-05-10T11:14:56Z

Product Description

While deploying #34246 to roll out the indices, we ran into several issues caused by the state of the cluster.
This PR adds checks to make the index creation process safer in the future:

It adds a check to verify the disk watermark settings.
It fails if all the data nodes are beyond the watermark settings.
If the cluster goes into a red state after index creation, i.e. primary shards are not assigned, it gives users a better error message.

Review by 🐡

Feature Flag

NA

Safety Assurance

Added adequate tests and have tested it locally.

Safety story

This only affects new index creation, which is a closely supervised operation that happens mostly during deploys, so it should not affect any user.

Automated test coverage

Added tests for the changes.

QA Plan

NA

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change

mjriley

This PR overall looks very good. I love the tests, but did leave a lot of generic testing guidance that doesn't need to be responded to as part of this PR. My main concern is whether multiple unassigned shards should be possible, and then some minor documentation issues

mjriley · 2024-05-10T14:04:09Z

corehq/apps/es/client.py

@@ -131,6 +131,10 @@ def cluster_routing(self, *, enabled):
        value = "all" if enabled else "none"
        self._cluster_put_settings({"cluster.routing.allocation.enable": value})

+    def cluster_get_settings(self, is_flat=True):


I don't see a test below for is_flat=False. Is this option necessary?

mjriley · 2024-05-10T14:05:27Z

corehq/apps/es/tests/test_client.py

+        self.adapter._cluster_put_settings(settings_obj)
+        settings = self.adapter.cluster_get_settings()
+        self.assertEqual(settings['transient'], settings_obj)
+        self._clear_cluster_settings(verify=True)


Is clearing the cluster settings necessary for what this test is trying to test? If it's just cleanup, it might be better to move it into a cleanup block that ensures the tests clean up after themselves.
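
One way to arrange this is `unittest`'s `addCleanup`, which runs even when an assertion fails. The sketch below is only illustrative: `FakeAdapter` and its method names are hypothetical stand-ins, not the adapter from this PR.

```python
import unittest


class FakeAdapter:
    """Hypothetical in-memory stand-in for the cluster adapter."""

    def __init__(self):
        self.transient = {}

    def put_settings(self, settings):
        self.transient.update(settings)

    def get_settings(self):
        return {"transient": dict(self.transient)}

    def clear_settings(self):
        self.transient.clear()


class TestClusterSettings(unittest.TestCase):

    adapter = FakeAdapter()  # shared state, as a real cluster would be

    def test_get_settings(self):
        settings_obj = {"cluster.routing.allocation.enable": "none"}
        # Register the cleanup before mutating shared state so that it
        # still runs if an assertion below fails.
        self.addCleanup(self.adapter.clear_settings)
        self.adapter.put_settings(settings_obj)
        self.assertEqual(self.adapter.get_settings()["transient"], settings_obj)
```

With this shape the test body only contains what is being verified, and the cleanup cannot be skipped by an early failure.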

mjriley · 2024-05-10T14:10:23Z

corehq/apps/es/client.py

+            reason = err.info['error']['reason']
+            if 'unable to find any unassigned shards to explain' in reason:
+                return {'unassigned_shards': []}
+            raise err


I'm curious whether it would be more appropriate to use a bare raise rather than raise err here. I don't think we need the extra stack trace entry for this except block.
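
For illustration, a minimal sketch of the bare `raise` form. `TransportError` here is a hypothetical stand-in for the client's error class, and `explain_allocation` and `fetch` are invented names; only the `err.info['error']['reason']` shape and the matched message come from the diff.

```python
class TransportError(Exception):
    """Hypothetical stand-in for the client's transport error."""

    def __init__(self, info):
        super().__init__(info)
        self.info = info


NO_UNASSIGNED = "unable to find any unassigned shards to explain"


def explain_allocation(fetch):
    """Call ``fetch`` and translate the 'nothing unassigned' error."""
    try:
        return fetch()
    except TransportError as err:
        reason = err.info["error"]["reason"]
        if NO_UNASSIGNED in reason:
            return {"unassigned_shards": []}
        # A bare ``raise`` re-raises the exception currently being handled.
        raise
```

Both forms propagate the same exception object; the bare form simply makes it explicit that nothing new is being raised.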

mjriley · 2024-05-10T14:11:33Z

corehq/apps/es/client.py

+    def cluster_allocation_explain(self):
+        """
+        Returns cluster allocation for the first failed shard with their node allocation decisions.
+        It can be extended to return allocation of a specific shard or index if provided.


Can you clarify what this comment means? My interpretation is that you're suggesting we could modify this code in the future?

mjriley · 2024-05-10T14:19:35Z

corehq/apps/es/client.py

+                    })
+            shard_info["rejection_explanation"].append(node_info)
+        return {
+            "unassigned_shards": [shard_info]


It seems unassigned_shards is either a single index array or an empty array -- is the array necessary here? Could we instead return the unassigned shard or None?

mjriley · 2024-05-10T15:20:43Z

corehq/apps/es/migration_operations.py

+                )
+
+        if not has_one_node_with_available_space:
+            raise Exception("""All data nodes are above low watermark capacity.


Is there a better class of Exception we can raise here? Raising a generic Exception would force callers to catch a generic Exception, which is something we try to avoid.
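
A minimal sketch of what a dedicated exception could look like. The class names `IndexCreationError` and `DiskWatermarkExceeded` and the helper `ensure_node_available` are hypothetical, chosen only for illustration.

```python
class IndexCreationError(Exception):
    """Hypothetical base class for index-creation failures."""


class DiskWatermarkExceeded(IndexCreationError):
    """Hypothetical: raised when no data node is below the low watermark."""


def ensure_node_available(has_one_node_with_available_space):
    """Raise a specific error when no data node has space available."""
    if not has_one_node_with_available_space:
        raise DiskWatermarkExceeded(
            "All data nodes are above low watermark capacity."
        )
```

Callers can then catch `IndexCreationError` (or the narrower subclass) without also swallowing unrelated errors.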

mjriley · 2024-05-10T15:26:53Z

corehq/apps/es/migration_operations.py

+        sleep_time = 0 if settings.UNIT_TESTING else 3
+        sleep(sleep_time)


I don't love putting test considerations into production code. I'm assuming this makes writing tests easier, but I also just wanted to point out that tests can mock out the sleep call on their own
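
As a sketch of the mocking approach, assuming a hypothetical helper `wait_for_cluster` that calls `time.sleep`. If the production module does `from time import sleep`, the patch target would be that module's `sleep` name rather than `time.sleep`.

```python
import time
from unittest import mock


def wait_for_cluster(check_health, delay=3):
    """Hypothetical helper: pause, then read the cluster health."""
    time.sleep(delay)
    return check_health()


def test_wait_for_cluster_does_not_really_sleep():
    # Replace time.sleep for the duration of the call, so the test
    # neither waits nor needs a UNIT_TESTING switch in production code.
    with mock.patch("time.sleep") as fake_sleep:
        status = wait_for_cluster(lambda: "green")
    fake_sleep.assert_called_once_with(3)
    assert status == "green"
```

The production code keeps a single code path, and the test still verifies that the delay was requested.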

mjriley · 2024-05-10T15:30:18Z

corehq/apps/es/migration_operations.py

+        if cluster_health['status'] == 'red':
+            self._explain_shard_allocation_failure()
+            manager.index_delete(self.name)
+            raise Exception(f"Failed to create {self.name} failed. Deleted {self.name}")


Would prefer a non-generic Exception to be raised here

mjriley · 2024-05-10T15:38:13Z

corehq/apps/es/migration_operations.py

+        """
+        unassigned_shards = manager.cluster_allocation_explain()
+        if unassigned_shards:
+            for shard in unassigned_shards:


As written, unassigned_shards will only ever contain a single item or be empty. Is this a bug? If not, do we need the iteration here? Or, at a minimum, it seems like we have the possibility of an "unassigned_shard", but never "unassigned_shards", right?
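
A sketch of the "single shard or None" shape, for comparison with the list-based version. The function names and the dictionary keys (`index`, `shard`, `decisions`) are assumptions made for illustration, not the structure used in this PR.

```python
from typing import Optional


def summarize_unassigned_shard(explanation: Optional[dict]) -> Optional[dict]:
    """Return info on the one unassigned shard, or None if there is none.

    ``explanation`` is a hypothetical, already-fetched response.
    """
    if not explanation:
        return None
    return {
        "index": explanation["index"],
        "shard": explanation["shard"],
        "rejection_explanation": explanation.get("decisions", []),
    }


def explain_failure(explanation: Optional[dict]) -> str:
    """Build a message for the caller without iterating over a list."""
    shard = summarize_unassigned_shard(explanation)
    if shard is None:
        return "No unassigned shard found."
    return f"Shard {shard['shard']} of index {shard['index']} is unassigned."
```

Returning `Optional[dict]` makes the "at most one" contract visible in the signature, so the caller needs a single `None` check instead of a loop.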

mjriley · 2024-05-10T15:41:57Z

corehq/apps/es/tests/test_migration_operations.py

+        self.assertIndexMappingMatches(self.index, self.type, self.mapping)
+        self.assertIndexHasAnalysis(self.index, self.analysis)
+
+    def test_create_index_with_creation_checks_with_high_disk_watermark(self):


This is probably another test better written as a unit test. Because we're hitting the actual cluster, we have to specify a very low watermark to hopefully ensure that all the nodes are above the watermark. If we instead were testing against some fake node configuration, we could make it obvious that that configuration was currently all above the watermark. That would also allow us to write tests asserting the warnings were issued for all hosts above the watermark, even if one node was available below the watermark.
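
To make the idea concrete, here is a sketch of the check as a pure function over a fake node configuration. The function names, the `{node_name: percent_used}` input shape, and the strict `>` comparison are assumptions for illustration.

```python
def nodes_above_watermark(disk_usage_by_node, low_watermark_percent):
    """Return the sorted names of nodes whose disk usage exceeds the watermark."""
    return sorted(
        name for name, used in disk_usage_by_node.items()
        if used > low_watermark_percent
    )


def has_node_with_available_space(disk_usage_by_node, low_watermark_percent):
    """True if at least one node is at or below the watermark."""
    above = nodes_above_watermark(disk_usage_by_node, low_watermark_percent)
    return len(above) < len(disk_usage_by_node)


# Fake configuration: two nodes over an 85% watermark, one under it.
FAKE_NODES = {"node-a": 91.0, "node-b": 88.5, "node-c": 40.0}
```

With `FAKE_NODES`, a test can assert that `node-a` and `node-b` are reported while creation is still allowed, and a second fixture with every node over the watermark covers the failure case, all without touching a real cluster.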

AmitPhulera added 8 commits April 11, 2024 16:20

rename and refactor _clear_cluster_routing to _clear_cluster_settings

9e81cdb

add a function to cluster settings in the adapter

ee1021c

add fn to get cluster allocation of unassigned shards

912de30

add ability in manage adapter to get disk utilization info from cluster

d8db5a6

add fn to parse watermark settings to percentage

943bbd4

Merge remote-tracking branch 'origin/master' into ap/es/add-checks-before-index-creation

a183d31

add checks for disk usage on all the nodes before creating indices and give a better error message to user when cluster gets to red state

be72322

isort

9cb9ea4

AmitPhulera requested review from millerdev, mjriley, esoergel, gherceg and jingcheng16 May 10, 2024 11:14

AmitPhulera added 2 commits May 10, 2024 17:30

ensure es migrations that create the index should have creation checks enabled

5fabf0e

enable and disable cluster routing before and after index creation

2df4b3a

AmitPhulera force-pushed the ap/es/add-checks-before-index-creation branch from 1cb7c0d to 2df4b3a Compare May 10, 2024 12:02

mjriley reviewed May 10, 2024
