Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: fluvio cluster resume stuck in trying to connect to SC #3989

Open
ajhunyady opened this issue May 7, 2024 · 12 comments
Open

[Bug]: fluvio cluster resume stuck in trying to connect to SC #3989

ajhunyady opened this issue May 7, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@ajhunyady
Copy link
Contributor

ajhunyady commented May 7, 2024

I shutdown the machine without closing down the cluster.
After restart, I just fluvio cluster resume as instructed by the message in fluvio cluster start, but it is stuck in a loop:

 % fluvio cluster start
📝 Running pre-flight checks
    ✅ Local Fluvio is not installed
    ❌ Check Clean Fluvio Local Installation failed Local Fluvio cluster wasn't deleted. Use 'resume' to resume created cluster or 'delete' before starting💔 Some pre-flight check failed!
Preflight check failed

 % fluvio cluster resume
✅ Local Fluvio is not installed
🎉 All checks passed!
✅ Local Cluster initialized
👤 Profile set
🖥️  Trying to connect to SC: 127.0.0.1:9003 94 seconds elapsed 

To remediate this, I needed to delete the cluster and create a new one.
I also got an error during delete:

% fluvio cluster delete
SPU monitoring socket, can't be removed: No such file or directory (os error 2)
Uninstalled fluvio local components

Environment: Using Mac with Intel CPU.

Note: The message flashes continuously while the spinner updates.

@ajhunyady ajhunyady added the bug Something isn't working label May 7, 2024
@ajhunyady
Copy link
Contributor Author

@avikam can you please take a look at this one?

@avikam
Copy link
Contributor

avikam commented May 7, 2024

@ajhunyady can you reproduce it with RUST_LOG=debug?
specifically, I'd like to know if it happened because of a version discrepancy.

For example, if you had fluvio running, and then restarted your machine and cargo built master before resuming, you might see this:

2024-05-07T21:00:15.505156Z  WARN ... fluvio_cluster::start::common: Current Version 0.11.8-dev is not same as expected: 0.11.7-dev

edit - more context:
this may happen because the platform version is loaded from the local-config file, as it's a value that is deduced from the CLI. The start/resume commands verify that the version is as expected. When it's not, the SPU verification fails despite a successful connection, and loops infinitely.
I think it's worth addressing it, that is, allow you to be platform version agnostic in the local case (and have a proper warning, instead of an infinite loop) - but I do want to understand first if that is the thing that happened to you

@sehz
Copy link
Contributor

sehz commented May 7, 2024

restart logic should check version number and warn user of incompatibility

@ajhunyady
Copy link
Contributor Author

@avikam the debug log is attached.
debug.log

@avikam
Copy link
Contributor

avikam commented May 7, 2024

indeed it's the same issue.
your log shows:

2024-05-07T13:55:39.967622Z  WARN install_only:launch_sc{host_name="127.0.0.1:9003" port=9003 pb=Indicatiff(ProgressBar)}:try_connect_to_sc{config=FluvioConfig { endpoint: "127.0.0.1:9003", use_spu_local_address: false, tls: Disabled, metadata: {}, client_id: None } platform_version=Version { major: 0, minor: 11, patch: 6 } pb=Indicatiff(ProgressBar)}: fluvio_cluster::start::common: Current Version 0.11.7 is not same as expected: 0.11.6

I would agree with @sehz the solution show be showing a warning to the user, and proceeding with the start/resume process

@ajhunyady
Copy link
Contributor Author

ajhunyady commented May 7, 2024

I actually tried both variants.

% fluvio cluster resume
 ✅ Local Fluvio is not installed
🎉 All checks passed!
✅ Local Cluster initialized
👤 Profile set
🖥️  Trying to connect to SC: 127.0.0.1:9003 19 seconds elapsed /                                                                                            
^C

% fluvio cluster start
📝 Running pre-flight checks
    ✅ Local Fluvio is not installed
    ❌ Check Clean Fluvio Local Installation failed Local Fluvio cluster wasn't deleted. Use 'resume' to resume created cluster or 'delete' before starting
    💔 Some pre-flight check failed!
Preflight check failed

I was forced to delete to recover.

@ajhunyady
Copy link
Contributor Author

ajhunyady commented May 7, 2024

Ok, I think I understand what's going on. We are currently updating smartmodules, so I'm constantly upgrading & downgrading to test. Since the smart engine changed, I must restart the cluster every time.

Now I realize that we may actually need a fluvio cluster upgrade command that should do whatever is needed to ensure the new version is compatible & permitted. In this case, nothing metadata-related changed, so the upgrade should ensure that the "previously saved cluster" accepts the new version. In other cases, a proper upgrade may be needed, but the upgrade translation should be handled by the developer who changed the metadata.

The current fluvio cluster upgrade only handles K8.

@sehz @avikam
Did I get this right?

@sehz
Copy link
Contributor

sehz commented May 7, 2024

Yup. restart should check to ensure nothing changed. if does, should ask user to perform upgrade command

@avikam
Copy link
Contributor

avikam commented May 9, 2024

Iv'e been thinking to address it in two PRs:

  1. stopgap: add a pre-check to the resume command to verify the versions match. This will remediate the bad user experience of looping forever.
  2. refactor upgrade, so that in the local installation type, instead of shut-down and start (which is currently broken, because we must use "resume"), we will update the platform-version only and resume the cluster.

an alternative for (2) is to remove the entire local-config file before starting the cluster. However, we'd lose configuration like number of SPUs, TLS policy, logs, etc.
@ajhunyady , @digikata does that sound reasonable?

@sehz
Copy link
Contributor

sehz commented May 9, 2024

Idea of having local-config is sound (although we have something similar in K8 config, so they should be consistent or converge).

First should have anyway as required pre-flight check.

Then figure out process of upgrade as this needs to be common across multiple cluster types

@digikata
Copy link
Contributor

digikata commented May 9, 2024

  1. definitely sounds good as a stopgap.

  2. sounds like good mechanics for an upgrade of the local cluster, and we can also figure out if there is a commonality between a local start upgrade and k8.

@avikam
Copy link
Contributor

avikam commented May 9, 2024

here is 1 before bats:
#3999

seems like there is an alignment so I'll keep pursuing that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants