Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS SDK CurlHttp requests #2886

Closed
sglnd opened this issue Mar 13, 2024 · 6 comments
Closed

AWS SDK CurlHttp requests #2886

sglnd opened this issue Mar 13, 2024 · 6 comments
Labels
bug This issue is a bug. closed-for-staleness p3 This is a minor priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. third-party This issue is related to third-party libraries or applications.

Comments

@sglnd
Copy link

sglnd commented Mar 13, 2024

Describe the bug

We are using CurveFS to mount object storage as local files, and CurveFS relies on AWS SDK for C++ implementation. When we were making a change to an object storage bucket, we first disabled the bucket mounted by CurveFS, and then deleted it. After completing the bucket deletion, AWS SDK for C++ started generating a large number of CurlHttp requests. Upon reviewing the logs and code printed by the AWS SDK for C++ client, we found that after disabling the bucket, all returned httpreturncodes were 403. However, we are puzzled about the increase in the number of CurlHttp requests. As a result of this increase in CurlHttp requests, and due to our DNS configuration issue, DNS issues were also triggered. Therefore, we would like to identify the root cause of this problem. please help ~

Expected Behavior

Help explain the reason, and it's not certain if this is a bug, but it feels like it could be optimized.
The highlighted area in the graph represents the situation after the bucket has been deleted.

Current Behavior

WechatIMG181
WechatIMG182

The highlighted area in the graph represents the situation after the bucket has been deleted.

Reproduction Steps

I reviewed some of the code and it seems that after the bucket is deleted, although the HTTP status code still returns 403, there are some differences in the internal returns, causing a change in the logic of the AWS client. However, I cannot confirm this.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

not sure

Compiler and Version used

not sure

Operating System and version

suse sp15

@sglnd sglnd added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 13, 2024
@jmklix
Copy link
Member

jmklix commented Mar 13, 2024

This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.

@jmklix jmklix added third-party This issue is related to third-party libraries or applications. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Mar 13, 2024
@sglnd
Copy link
Author

sglnd commented Mar 14, 2024

This looks like it might be a problem with CurveFS and trying to make sdk calls after the bucket gets deleted. Please open an issue with that product if you would like help with that. If you can reproduce this error and provide sdk logs then we could try to debug it, but currently there isn't enough information to do that. Please let me know if you have any aws-sdk-cpp specific questions that I can help you with.

aws_del_bef.zip
aws_del_after.zip

I provided logs both before and after the bucket was deleted. Let me describe the background again: during the process, the bucket was Forbidden, and there was no change in the access to CurveFs from the upper layer. However, it was only when the bucket was finally deleted that a large number of CurlHttp accesses occurred. This timing coincided, leading us to believe that it was the different information returned after the bucket was deleted that caused some issues with the SDK's client judgment.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Mar 15, 2024
@jmklix
Copy link
Member

jmklix commented Mar 15, 2024

Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS

@jmklix jmklix added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Mar 15, 2024
@sglnd
Copy link
Author

sglnd commented Mar 16, 2024

Can you please also provide a code sample that reproduces this error with only using the aws-sdk-cpp and s3? We don't guarantee that this sdk will work with curveFS

We've also conducted ongoing analysis. Prior to the bucket deletion, all accesses resulted in HTTP 403 errors. However, after the deletion, there was a significant increase in HTTP 500 errors, accompanied by a large number of Curl HTTP requests. We've also reviewed the code for CurveFS and aws-sdk-cpp.

CurveFS code:

void S3Adapter::PutObjectAsync(std::shared_ptr<PutObjectAsyncContext> context) {
    Aws::S3::Model::PutObjectRequest request;
    request.SetBucket(bucketName_);
    request.SetKey(Aws::String{context->key.c_str(), context->key.size()});

    request.SetBody(Aws::MakeShared<PreallocatedIOStream>(
        AWS_ALLOCATE_TAG, context->buffer, context->bufferSize));

    auto originCallback = context->cb;
    auto wrapperCallback =
        [this,
         originCallback](const std::shared_ptr<PutObjectAsyncContext>& ctx) {
            inflightBytesThrottle_->OnComplete(ctx->bufferSize);
            ctx->cb = originCallback;
            ctx->cb(ctx);
        };

    Aws::S3::PutObjectResponseReceivedHandler handler =
        [context](
            const Aws::S3::S3Client * /*client*/,
            const Aws::S3::Model::PutObjectRequest & /*request*/,
            const Aws::S3::Model::PutObjectOutcome &response,
            const std::shared_ptr<const Aws::Client::AsyncCallerContext>
                &awsCtx) {
            std::shared_ptr<PutObjectAsyncContext> ctx =
                std::const_pointer_cast<PutObjectAsyncContext>(
                    std::dynamic_pointer_cast<const PutObjectAsyncContext>(
                        awsCtx));

            LOG_IF(ERROR, !response.IsSuccess())
                << "PutObjectAsync error: "
                << response.GetError().GetExceptionName()
                << "message: " << response.GetError().GetMessage()
                << "resend: " << ctx->key;

            ctx->retCode = (response.IsSuccess() ? 0 : -1);
            ctx->timer.stop();
            ctx->cb(ctx);
        };

    if (throttle_) {
        throttle_->Add(false, context->bufferSize);
    }

    inflightBytesThrottle_->OnStart(context->bufferSize);
    context->cb = std::move(wrapperCallback);
    s3Client_->PutObjectAsync(request, handler, context);
}

aws-sdk-cpp code:

HttpResponseOutcome AWSClient::AttemptExhaustively(const Aws::Http::URI& uri,
    Aws::Http::HttpMethod method,
    const char* signerName,
    const char* requestName,
    const char* signerRegionOverride,
    const char* signerServiceNameOverride) const
{
    if (!Aws::Utils::IsValidHost(uri.GetAuthority()))
    {
        return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::VALIDATION, "", "Invalid DNS Label found in URI host", false/*retryable*/));
    }

    std::shared_ptr<HttpRequest> httpRequest(CreateHttpRequest(uri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod));
    HttpResponseOutcome outcome;
    AWSError<CoreErrors> lastError;
    Aws::Monitoring::CoreMetricsCollection coreMetrics;
    auto contexts = Aws::Monitoring::OnRequestStarted(this->GetServiceClientName(), requestName, httpRequest);
    const char* signerRegion = signerRegionOverride;
    Aws::String regionFromResponse;

    Aws::String invocationId = Aws::Utils::UUID::PseudoRandomUUID();
    RequestInfo requestInfo;
    requestInfo.attempt = 1;
    requestInfo.maxAttempts = 0;
    httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
    httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
    AppendRecursionDetectionHeader(httpRequest);

    for (long retries = 0;; retries++)
    {
        if(!m_retryStrategy->HasSendToken())
        {
            return HttpResponseOutcome(AWSError<CoreErrors>(CoreErrors::SLOW_DOWN,
                                                            "",
                                                            "Unable to acquire enough send tokens to execute request.",
                                                            false/*retryable*/));

        };
        outcome = AttemptOneRequest(httpRequest, signerName, requestName, signerRegion, signerServiceNameOverride);
        outcome.SetRetryCount(retries);
        if (retries == 0)
        {
            m_retryStrategy->RequestBookkeeping(outcome);
        }
        else
        {
            m_retryStrategy->RequestBookkeeping(outcome, lastError);
        }
        coreMetrics.httpClientMetrics = httpRequest->GetRequestMetrics();
        TracingUtils::EmitCoreHttpMetrics(httpRequest->GetRequestMetrics(),
            *m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
            {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
        if (outcome.IsSuccess())
        {
            Aws::Monitoring::OnRequestSucceeded(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);
            AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request successful returning.");
            break;
        }
        lastError = outcome.GetError();

        DateTime serverTime = GetServerTimeFromError(outcome.GetError());
        auto clockSkew = DateTime::Diff(serverTime, DateTime::Now());

        Aws::Monitoring::OnRequestFailed(this->GetServiceClientName(), requestName, httpRequest, outcome, coreMetrics, contexts);

        if (!m_httpClient->IsRequestProcessingEnabled())
        {
            AWS_LOGSTREAM_TRACE(AWS_CLIENT_LOG_TAG, "Request was cancelled externally.");
            break;
        }

        // Adjust region
        bool retryWithCorrectRegion = false;
        HttpResponseCode httpResponseCode = outcome.GetError().GetResponseCode();
        if (httpResponseCode == HttpResponseCode::MOVED_PERMANENTLY ||  // 301
            httpResponseCode == HttpResponseCode::TEMPORARY_REDIRECT || // 307
            httpResponseCode == HttpResponseCode::BAD_REQUEST ||        // 400
            httpResponseCode == HttpResponseCode::FORBIDDEN)            // 403
        {
            regionFromResponse = GetErrorMarshaller()->ExtractRegion(outcome.GetError());
            if (m_region == Aws::Region::AWS_GLOBAL && !regionFromResponse.empty() && regionFromResponse != signerRegion)
            {
                signerRegion = regionFromResponse.c_str();
                retryWithCorrectRegion = true;
            }
        }

        long sleepMillis = TracingUtils::MakeCallWithTiming<long>(
            [&]() -> long {
                return m_retryStrategy->CalculateDelayBeforeNextRetry(outcome.GetError(), retries);
            },
            TracingUtils::SMITHY_CLIENT_SERVICE_BACKOFF_DELAY_METRIC,
            *m_telemetryProvider->getMeter(this->GetServiceClientName(), {}),
            {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
        //AdjustClockSkew returns true means clock skew was the problem and skew was adjusted, false otherwise.
        //sleep if clock skew and region was NOT the problem. AdjustClockSkew may update error inside outcome.
        bool shouldSleep = !AdjustClockSkew(outcome, signerName) && !retryWithCorrectRegion;

        if (!retryWithCorrectRegion && !m_retryStrategy->ShouldRetry(outcome.GetError(), retries))
        {
            break;
        }

        AWS_LOGSTREAM_WARN(AWS_CLIENT_LOG_TAG, "Request failed, now waiting " << sleepMillis << " ms before attempting again.");

        if (shouldSleep)
        {
            m_httpClient->RetryRequestSleep(std::chrono::milliseconds(sleepMillis));
        }

        Aws::Http::URI newUri = uri;
        Aws::String newEndpoint = GetErrorMarshaller()->ExtractEndpoint(outcome.GetError());
        if (!newEndpoint.empty())
        {
            newUri.SetAuthority(newEndpoint);
        }
        httpRequest = CreateHttpRequest(newUri, method, Aws::Utils::Stream::DefaultResponseStreamFactoryMethod);

        httpRequest->SetHeaderValue(Http::SDK_INVOCATION_ID_HEADER, invocationId);
        if (serverTime.WasParseSuccessful() && serverTime != DateTime())
        {
            requestInfo.ttl = DateTime::Now() + clockSkew + std::chrono::milliseconds(m_requestTimeoutMs);
        }
        requestInfo.attempt ++;
        requestInfo.maxAttempts = m_retryStrategy->GetMaxAttempts();
        httpRequest->SetHeaderValue(Http::SDK_REQUEST_HEADER, requestInfo);
        Aws::Monitoring::OnRequestRetry(this->GetServiceClientName(), requestName, httpRequest, contexts);
    }
    auto meter = m_telemetryProvider->getMeter(this->GetServiceClientName(), {});
    auto counter = meter->CreateCounter(TracingUtils::SMITHY_CLIENT_SERVICE_ATTEMPTS_METRIC, TracingUtils::COUNT_METRIC_TYPE, "");
    counter->add(requestInfo.attempt, {{TracingUtils::SMITHY_METHOD_DIMENSION, requestName},{TracingUtils::SMITHY_SERVICE_DIMENSION, this->GetServiceClientName()}});
    Aws::Monitoring::OnFinish(this->GetServiceClientName(), requestName, httpRequest, contexts);
    return outcome;
}

Summary:
In the logic of AWS SDK for C++, the absence of a sleep buffer mechanism when encountering HTTP 500 errors may lead to the re-creation of HTTP requests, potentially triggering DNS requests. Therefore, there are still several questions we haven't fully understood:

After the bucket is disabled, which requests from object storage will return HTTP 500 error responses before and after deletion?
Why do a sudden surge of Curl requests occur instead of requests related to object storage operations?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Mar 17, 2024
@DmitriyMusatkin
Copy link
Contributor

couple observations:

  1. you seem to be running against a 3p S3 implementation, guessing from the "server : openresty/1.19.3.1" header in the responses and from 403 UserSuspended, which I believe AWS version of S3 never sends?
  2. SDK is seeing 500s returned when trying to make data plane calls against the bucket that no longer exists. SDK cannot parse error response(body does not seem to be formatted correctly?). this also seems like a non-standard behavior. I believe regular s3 returns 404 in this case
  3. CPP SDK retries 5xx errors. Default is up to 10 retries with fairly small increment between each
  4. curvefs seems to override default cpp sdk retry policy to have a tighter timeouts
  5. CPP SDK might retire ips that return 500s from the pool of resolved ips to avoid hotspotting bad ips across retries.

overall sdk will not do requests on its own. Likely whats happening is that s3 implementation you are hitting is returning 500s , which causes a bit of retry storm, which is further amplified by non-ideal retry config either in your app or curveFS. From the above i dont see anything particularly wrong happening on CPP SDK side, its working as designed.

@DmitriyMusatkin DmitriyMusatkin added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label May 8, 2024
Copy link

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. closed-for-staleness p3 This is a minor priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. third-party This issue is related to third-party libraries or applications.
Projects
None yet
Development

No branches or pull requests

3 participants