Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] etcdbrctl fails to make use of the shared credential file for AWS S3, and the process does not exit gracefully on error #724

Open
renormalize opened this issue Apr 4, 2024 · 0 comments
Assignees
Labels

Comments

@renormalize
Copy link
Member

Describe the bug:
When etcdbrctl is run with neither AWS_APPLICATION_CREDENTIALS nor AWS_APPLICATION_CREDENTIALS_JSON set, it errors right when it attempts to take the first full snapshot after the command is run with the server subcommand. The process does not exit and instead retries forever with no backoff to take a snapshot, and fails.

Expected behavior:
etcdbrctl should pick up AWS credentials from the shared credential file in the .aws directory, and should be able to take the first full snapshot, and all the subsequent delta snapshots successfully, without erroring.
If no credentials are present in the shared credential file, the process must exit gracefully instead of repeatedly trying to take snapshots and failing. Another option is that it exponentially backs off the frequency at which it retries taking snapshots.

How To Reproduce (as minimally and precisely as possible):
Do not set the AWS_APPLICATION_CREDENTIALS or AWS_APPLICATION_CREDENTIALS_JSON environment variables, to ensure etcdbrctl picks up the shared credential file in the .aws directory for authentication.
Build the binary with the master branch at 17e447b (or any commit after 6958118), and run the following command:

➜  etcd-backup-restore git:(master) ✗ ./bin/etcdbrctl server --storage-provider=S3 --store-container=<YOUR_STORAGE_BUCKET> --store-prefix=etcd-test

Logs:

INFO[0005] Creating snapstore from provider: S3          actor=backup-restore-server
INFO[0005] Creating snapshotter...                       actor=backup-restore-server
INFO[0006] Starting the handleSsrStopRequest handler...  actor=backup-restore-server
INFO[0006] Probing etcd...                               actor=backup-restore-server
INFO[0006] Probing Etcd by checking etcd status ...      actor=backup-restore-server
INFO[0006] Setting status to : 200                       actor=backup-restore-server
INFO[0006] Taking scheduled full snapshot for time: 2024-04-04 18:53:38.537995 +0530 IST  actor=snapshotter
WARN[0006] Taking scheduled full snapshot failed: error checking if the credentials were updated error checking the modification time of the access credentials  no environment variable set for the AWS credential file  actor=snapshotter
ERRO[0006] Failed to take substitute first full snapshot: error checking if the credentials were updated error checking the modification time of the access credentials  no environment variable set for the AWS credential file  actor=backup-restore-server
INFO[0006] Probing etcd...                               actor=backup-restore-server
INFO[0006] Probing Etcd by checking etcd status ...      actor=backup-restore-server
INFO[0006] Setting status to : 200                       actor=backup-restore-server
INFO[0006] Taking scheduled full snapshot for time: 2024-04-04 18:53:38.54093 +0530 IST  actor=snapshotter
WARN[0006] Taking scheduled full snapshot failed: error checking if the credentials were updated error checking the modification time of the access credentials  no environment variable set for the AWS credential file  actor=snapshotter
ERRO[0006] Failed to take substitute first full snapshot: error checking if the credentials were updated error checking the modification time of the access credentials  no environment variable set for the AWS credential file  actor=backup-restore-server

Environment (please complete the following information):

  • Etcd version/commit ID: Any (tested with v3.5.11), since the misbehavior is in etcd-backup-restore.
  • Etcd-backup-restore version/commit ID: 17e447b, or any commit after 6958118.
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:AWS

Anything else we need to know?:

Why this bug occurs?
Prior to #670, the function snapstore.S3SnapStoreHash() returned nil as the error when neither AWS_APPLICATION_CREDENTIALS or AWS_APPLICATION_CREDENTIALS_JSON environment variables were set:

// S3SnapStoreHash calculates and returns the hash of aws S3 snapstore secret.
func S3SnapStoreHash(config *brtypes.SnapstoreConfig) (string, error) {
if dir, isSet := os.LookupEnv(awsCredentialFile); isSet {
awsConfig, err := readAWSCredentialFromDir(dir)
if err != nil {
return "", fmt.Errorf("error getting credentials from %v directory", dir)
}
return getS3Hash(awsConfig), nil
}
if filename, isSet := os.LookupEnv(awsCredentialJSONFile); isSet {
awsConfig, err := credentialsFromJSON(filename)
if err != nil {
return "", fmt.Errorf("error getting credentials using %v file", filename)
}
return getS3Hash(awsConfig), nil
}
return "", nil
}

This nil error was found to be non-idiomatic, and was changed to return a non-nil error in #670:
// GetS3CredentialsLastModifiedTime returns the latest modification timestamp of the AWS credential file(s)
func GetS3CredentialsLastModifiedTime() (time.Time, error) {
if dir, isSet := os.LookupEnv(awsCredentialDirectory); isSet {
// credential files which are essential for creating the S3 snapstore
credentialFiles := []string{"accessKeyID", "region", "secretAccessKey"}
for i := range credentialFiles {
credentialFiles[i] = filepath.Join(dir, credentialFiles[i])
}
awsTimeStamp, err := getLatestCredentialsModifiedTime(credentialFiles)
if err != nil {
return time.Time{}, fmt.Errorf("failed to get AWS credential timestamp from the directory %v with error: %v", dir, err)
}
return awsTimeStamp, nil
}
if filename, isSet := os.LookupEnv(awsCredentialJSONFile); isSet {
credentialFiles := []string{filename}
awsTimeStamp, err := getLatestCredentialsModifiedTime(credentialFiles)
if err != nil {
return time.Time{}, fmt.Errorf("failed to fetch file information of the AWS JSON credential file %v with error: %v", filename, err)
}
return awsTimeStamp, nil
}
return time.Time{}, fmt.Errorf("no environment variable set for the AWS credential file")
}

However, this nil error return was relied on by all the functions which are invoked before this in the call stack, and this specific case was not accounted for when the error handling was being rewritten correctly. Thus, all full (and delta) snapshots which check if the credentials have changed will see a non-nil error before taking a snapshot, and will therefore fail.

How to fix it?
The fact that credentials can be fetched from the shared credential file at startup should be accounted for, and if the credentials were fetched from the shared credential file, it should be stored in the state of the program. Until 635dd4d, SnapStore was not updated at all for full and delta snapshots under this authentication, and this could be improved on by recreating SnapStore for every full and delta snapshot. If this is not preferred, then the SnapStore instance is created once and it is never updated; as it was until 635dd4d.

@renormalize renormalize added the kind/bug Bug label Apr 4, 2024
@renormalize renormalize self-assigned this Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant