Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hintedhandoff_additional_test.TestHintedHandoff.test_hintedhandoff_sync_point_api failure: sync_point DONE instead of IN_PROGRESS; and hang during shutdown #18733

Open
kbr-scylla opened this issue May 17, 2024 · 0 comments
Assignees
Labels
area/hinted handoff Issues related to Hinted Handoff tests/dtest

Comments

@kbr-scylla
Copy link
Contributor

kbr-scylla commented May 17, 2024

Failure spotted in dtest-debug job
https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/295/testReport/junit/hintedhandoff_additional_test/TestHintedHandoff/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split006___test_hintedhandoff_sync_point_api/

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/295/testReport/junit/hintedhandoff_additional_test/TestHintedHandoff/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split006___test_hintedhandoff_sync_point_api_2/

The two jenkins pages refer to the same test, but different causes of failure.

One failure was because there was an assertion failure

        #####
        logger.debug("SUBTEST 5: Create a hint sync point and decommission the target node - waiting should succeed")
    
        logger.debug("Stopping node2...")
        node2.stop(wait_other_notice=True)
    
        logger.debug("Pause hint replay on node1")
        self.enable_error("hinted_handoff_pause_hint_replay", node1)
    
        logger.debug(f"Inserting {len(keys5)} keys...")
        insert_c1c2(session, keys=keys5, consistency=ConsistencyLevel.ANY)
    
        logger.debug("Starting node2...")
        node2.start(wait_other_notice=True)
    
        logger.debug("Create hint sync point on node1")
        sync_point_id = create_sync_point(node=node1)
    
        logger.debug("Decommissioning node2...")
        node2.decommission()
    
        logger.debug("Check that the sync point is not yet resolved (because hint replay is paused)")
>       wait_for_sync_point(sync_point_id, 0, expect="IN_PROGRESS")
>           assert status.json() == expect
E           AssertionError: assert 'DONE' == 'IN_PROGRESS'
E             - IN_PROGRESS
E             + DONE

The second failure was because the dtest framework found an unexpected coredump from node1 on shutdown.
The logs of node1 at the end look like this:

INFO  2024-05-12 06:56:34,529 [shard 1:main] task_manager - Unregistered module repair
INFO  2024-05-12 06:56:34,530 [shard 0:main] init - Shutting down repair service was successful
INFO  2024-05-12 06:56:34,530 [shard 0:main] init - Shutting down drain storage proxy
INFO  2024-05-12 06:56:34,530 [shard 0:main] hints_manager - Asked to stop a shard hint manager
INFO  2024-05-12 06:56:34,530 [shard 0:main] hints_manager - Asked to stop a shard hint manager
INFO  2024-05-12 06:56:34,530 [shard 0:main] hints_manager - Shard hint manager has stopped
INFO  2024-05-12 06:56:34,531 [shard 1:main] hints_manager - Asked to stop a shard hint manager
INFO  2024-05-12 06:56:34,531 [shard 1:main] hints_manager - Asked to stop a shard hint manager
INFO  2024-05-12 06:56:34,531 [shard 1:main] hints_manager - Shard hint manager has stopped

And dtest-gw0.log says this:

07:06:32,397 750     ccm                            WARNING  cluster.py          :760  | test_hintedhandoff_sync_point_api: node1: node1 is still running after 600 seconds. Trying to generate coredump using kill(58388, SIGQUIT)...
07:06:33,398 750     ccm                            WARNING  cluster.py          :760  | test_hintedhandoff_sync_point_api: node1: node1 scylla-jmx is still running. Killing process using kill(600, SIGKILL)...
07:06:33,400 750     dtest_setup                    DEBUG    dtest_setup.py      :628  | test_hintedhandoff_sync_point_api: exclude_errors: []
07:06:33,407 750     dtest_setup                    DEBUG    dtest_setup.py      :628  | test_hintedhandoff_sync_point_api: exclude_errors: []
07:06:33,432 750     dtest_setup                    WARNING  dtest_setup.py      :164  | test_hintedhandoff_sync_point_api: Moving core file /jenkins/workspace/scylla-master/dtest-debug/scylla-dtest/scylla.58388.1715497592.core to /jenkins/workspace/scylla-master/dtest-debug/scylla-dtest/logs-full.debug.006/1715497593421_hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_sync_point_api/node1-scylla.58388.1715497592.core

So what happened is that shutdown of the node hanged (somewhere inside hints_manager), dtest framework timed out, told the process to generate a coredump, and then killed it --- and then failed because coredump was found. (The failure report is not perfect, dtest framework should say that it timed out waiting for node shutdown, but I digress)

GitHub doesn't allow me to upload the coredump because it's over 25MB (even after compression). Could be useful to debug the issue though, to understand where the process hanged -- so whoever sees this issue I recommend downloading the coredump locally before artifacts get cleared.

Anyway uploading logs:
dtest-gw0.log
node1.log
node2.log

Could it be related to the recent huge change in hints? 64ba620

@piodul piodul added the area/hinted handoff Issues related to Hinted Handoff label May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/hinted handoff Issues related to Hinted Handoff tests/dtest
Projects
None yet
Development

No branches or pull requests

5 participants