Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Lambda async invocation queue namespace #10831

Merged
merged 10 commits into from
May 24, 2024
Merged

Conversation

joe4dev
Copy link
Member

@joe4dev joe4dev commented May 15, 2024

Motivation

Addresses #10280

Re-creating a Lambda function immediately after deleting it, triggers a race condition and deletes the internal async invoke queue. This causes continuous error logs and breaks async invokes.
The namespace clash also causes further issues because queued invokes can get deleted.

Changes

  • Introduce a volatile (i.e., non-persisted) function instance id unique to a running LocalStack installation that can be used for the async queue namespace
  • Add test test_recreate_function to reproduce the reported function recreate issue
  • Add test test_function_update_during_invoke to validate the AWS behavior of function updates during a running invocation => Currently broken on LocalStack ⚠️
  • Add test test_async_invoke_queue_upon_function_update to validate the queuing behavior of async invokes when a function update happens => kinda works but the first invoke fails and gets retried due function updates aborting all running invokes (even for $LATEST)

Unrelated changes sneaked in:

  • Fix config default for event rule engine (align with docs)
  • Remove unnecessary lambda_service backlink in Lambda service manager (missing declaration cleanup from this PR)
  • Fix an outdated comment in the Docker runtime executor
  • Fix misleading name in test and snapshot

Testing

State pickling testing whether the instance_id attribute is persisted or not:

  1. Debug tests.aws.services.lambda_.test_lambda.TestLambdaCleanup.test_recreate_function with a breakpoint at the end of the tests
  2. Export the state using localstack-pro state export --format json lambda-state
  3. Load the pickled file in a scratch file:
import pickle
import pprint

with open("/Users/joe/Downloads/lambda-state/api_states/000000000000/lambda/us-east-1/store.state", "rb") as f:
    lambda_store = pickle.load(f)

    pprint.pp(lambda_store)
    print("attr_functions".center(80, "="))
    pprint.pp(lambda_store.attr_functions)
    print("functions".center(80, "="))
    pprint.pp(lambda_store.functions)

Discussion

  • I'm not really happy with the volatile variable hack in the lambda model. Maybe, we can find some better solution here 🤔
  • We should not abort running executions for updates of $LATEST. That seems like a common scenario.

@joe4dev joe4dev added the semver: patch Non-breaking changes which can be included in patch releases label May 15, 2024
@joe4dev joe4dev added this to the 3.5 milestone May 15, 2024
@joe4dev joe4dev self-assigned this May 15, 2024
Copy link

github-actions bot commented May 15, 2024

S3 Image Test Results (AMD64 / ARM64)

  2 files    2 suites   3m 12s ⏱️
399 tests 347 ✅  52 💤 0 ❌
798 runs  694 ✅ 104 💤 0 ❌

Results for commit 6445b95.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented May 15, 2024

LocalStack Community integration with Pro

    2 files  ± 0      2 suites  ±0   1h 40m 16s ⏱️ + 3m 26s
2 988 tests +38  2 677 ✅ +29  311 💤 +9  0 ❌ ±0 
2 990 runs  +38  2 677 ✅ +29  313 💤 +9  0 ❌ ±0 

Results for commit 6445b95. ± Comparison against base commit 9d80628.

This pull request removes 24 and adds 62 tests. Note that renamed tests count towards both.
tests.aws.services.cloudformation.resources.test_secretsmanager ‑ test_cfn_secret_policy
tests.aws.services.events.scheduled_rules.test_events_scheduled_rules_logs ‑ test_scheduled_rule_logs
tests.aws.services.events.scheduled_rules.test_events_scheduled_rules_sqs ‑ test_scheduled_rule_sqs
tests.aws.services.events.test_events.TestEvents ‑ test_list_tags_for_resource
tests.aws.services.events.test_events_rules ‑ test_create_rule_with_one_unit_in_singular_should_succeed[rate(1 day)]
tests.aws.services.events.test_events_rules ‑ test_create_rule_with_one_unit_in_singular_should_succeed[rate(1 hour)]
tests.aws.services.events.test_events_rules ‑ test_create_rule_with_one_unit_in_singular_should_succeed[rate(1 minute)]
tests.aws.services.events.test_events_rules ‑ test_put_rule_invalid_rate_schedule_expression[ rate(10 minutes)]
tests.aws.services.events.test_events_rules ‑ test_put_rule_invalid_rate_schedule_expression[rate( 10 minutes )]
tests.aws.services.events.test_events_rules ‑ test_put_rule_invalid_rate_schedule_expression[rate()]
…
tests.aws.services.apigateway.test_apigateway_ssm ‑ test_ssm_aws_integration
tests.aws.services.cloudformation.resources.test_secretsmanager ‑ test_cfn_secret_policy[default]
tests.aws.services.cloudformation.resources.test_secretsmanager ‑ test_cfn_secret_policy[true]
tests.aws.services.cloudformation.test_template_engine.TestMacros ‑ test_pyplate_param_type_list
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ test_schedule_cron_target_sqs
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ tests_put_rule_with_schedule_cron[cron(0 10 * * ? *)]
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ tests_put_rule_with_schedule_cron[cron(0 18 ? * MON-FRI *)]
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ tests_put_rule_with_schedule_cron[cron(0 8 1 * ? *)]
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ tests_put_rule_with_schedule_cron[cron(0/10 * ? * MON-FRI *)]
tests.aws.services.events.test_events_schedule.TestScheduleCron ‑ tests_put_rule_with_schedule_cron[cron(0/15 * * * ? *)]
…
This pull request skips 4 and un-skips 3 tests.
tests.aws.services.firehose.test_firehose.TestFirehoseIntegration ‑ test_kinesis_firehose_opensearch_s3_backup[domain]
tests.aws.services.firehose.test_firehose.TestFirehoseIntegration ‑ test_kinesis_firehose_opensearch_s3_backup[path]
tests.aws.services.firehose.test_firehose.TestFirehoseIntegration ‑ test_kinesis_firehose_opensearch_s3_backup[port]
tests.aws.services.lambda_.test_lambda.TestLambdaConcurrency ‑ test_reserved_concurrency_async_queue
tests.aws.services.sns.test_sns_filter_policy.TestSNSFilterPolicyBody ‑ test_filter_policy_on_message_body_or_attribute
tests.aws.services.sqs.test_sqs.TestSqsProvider ‑ test_set_unsupported_attribute_fifo[sqs]
tests.aws.services.sqs.test_sqs.TestSqsProvider ‑ test_set_unsupported_attribute_fifo[sqs_query]

♻️ This comment has been updated with latest results.

@joe4dev joe4dev marked this pull request as ready for review May 16, 2024 08:51
@joe4dev joe4dev requested a review from giograno May 16, 2024 08:52
@thrau thrau removed their request for review May 23, 2024 11:38
@thrau
Copy link
Member

thrau commented May 23, 2024

removing myself as reviewer as i was added only because of the config.py change

Copy link
Member

@dfangl dfangl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good solution for now.

We should definitely tackle the lambda function deletion locks and the aborted invocations soon!

I just had some minor questions regarding some comments, otherwise approved from my side :)

@@ -599,6 +599,25 @@ class Function:
def latest(self) -> FunctionVersion:
return self.versions["$LATEST"]

# HACK to model a volatile variable that should be ignored for persistence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not 100% satisfied with this hack, but I think for now it is the best solution. Let's hope we can find a way to get by without this hack in the future :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, pretty ugly hack 😢
We currently don't have a Python object such as the VersionManager at function level. Alternative hacks:

  • Use a dict to track them in the lambda service, but passing it down into the event managers is gonna be awkward
  • Add it to the version manager of a specific version (e.g., v1, $LATEST is annoying because then it needs to migrate)

Comment on lines 2670 to 2672
# Make it very likely that the invocation is being processed
# In LocalStack, Lambda image pulling could potentially affect the exact timing,
# but at least the invocation runtime startup should be in progress.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is misleading. Image pulling should have already taken place during the transition to Active. This is maybe just for the container creation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I updated the comment.
Acquiring the lock is actually pretty fast for invokes 🚀 Container creation happens later and does not matter because we always send invokes to environments we spawn on demand (i.e., no optimization step here).

Payload=json.dumps(payload),
)
# Make it very likely that the first request is being processed before sending further requests without sleeps.
# We could use a side effect (e.g., via S3) to signal a status update if Lambda image pulling causes flakiness,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here regarding pulling vs container creation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated as well 👍

@@ -41,7 +41,7 @@
from localstack.constants import AWS_REGION_US_EAST_1
from localstack.services.lambda_.api_utils import qualified_lambda_arn, unqualified_lambda_arn
from localstack.utils.archives import unzip
from localstack.utils.strings import long_uid
from localstack.utils.strings import long_uid, short_uid
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giograno Does this hack invalidate all Lambda CloudPods due to a changed import 🤔

@joe4dev joe4dev merged commit b8290ff into master May 24, 2024
37 checks passed
@joe4dev joe4dev deleted the fix-lambda-async-queue-error branch May 24, 2024 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: patch Non-breaking changes which can be included in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants