
fix(pubsub): off-by-one error leading to bundlers never being freed #4253

Closed
wants to merge 3 commits into from

Conversation

mzanibelli

Context

When adding an item to the publish scheduler with message ordering enabled, a new bundler is added for this particular key.

The publisher keeps an internal state of the current bundlers and is supposed to remove a bundler when all of its items have been processed.

To know if the bundler is ready for cleanup, an internal counter is incremented on Add() and decremented on handle().

Problem

This counter is erroneously initialized at 2, so it never reaches 0 and the bundler is never freed.

Consequence

This causes a memory leak when a high number of distinct ordering keys is used: the internal bundlers list keeps growing and can reach millions of entries.

Remediation

Don't execute the increment statement when the bundler is first created, since the internal counter is already initialized to 1.
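
For illustration, here is a minimal sketch of the reference-counting pattern at play (placeholder names and types, not the scheduler's actual code). The counter has to start at 1, covering the item being added at creation time; incrementing it again at that point starts it at 2, and handle() can never bring it back to 0.

```go
package sketch

import "sync"

// Placeholder types for illustration only; not the scheduler's real code.
type bundler struct{}

func newBundler() *bundler              { return &bundler{} }
func (b *bundler) Add(item interface{}) {}

type scheduler struct {
	mu          sync.Mutex
	bundlers    map[string]*bundler
	outstanding map[string]int
}

// Add lazily creates one bundler per ordering key. The counter must start at 1,
// accounting for the item being added right now; incrementing it again on
// creation starts it at 2, so handle can never bring it back to 0 and the
// key's entry is never deleted.
func (s *scheduler) Add(key string, item interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()

	b, ok := s.bundlers[key]
	if !ok {
		b = newBundler()
		s.bundlers[key] = b
		s.outstanding[key] = 1 // already counts the current item
	} else {
		s.outstanding[key]++
	}
	b.Add(item)
}

// handle is called once per processed item and frees the bundler when the
// last outstanding item for the key has been handled.
func (s *scheduler) handle(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.outstanding[key]--
	if s.outstanding[key] == 0 {
		delete(s.outstanding, key)
		delete(s.bundlers, key)
	}
}
```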

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the Pub/Sub API. label Jun 13, 2021
@google-cla

google-cla bot commented Jun 13, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@google-cla google-cla bot added the cla: no This human has *not* signed the Contributor License Agreement. label Jun 13, 2021
@mzanibelli
Author

@googlebot I signed it!



@google-cla google-cla bot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Jun 14, 2021
@mzanibelli mzanibelli marked this pull request as ready for review June 14, 2021 15:42
@mzanibelli mzanibelli requested review from a team as code owners June 14, 2021 15:42
Member

@hongalex hongalex left a comment


Thanks for submitting this PR; this is a good catch!

@hongalex hongalex added the kokoro:run Add this label to force Kokoro to re-run the tests. label Jun 15, 2021
@kokoro-team kokoro-team removed the kokoro:run Add this label to force Kokoro to re-run the tests. label Jun 15, 2021
@hongalex
Member

@mzanibelli So this fix actually uncovered an issue with one of our other methods (specifically FlushAndStop), which iterates over the bundlers; Go's race detector flags a data race between that iteration and the deletion of bundlers. I thought locking the mutex in FlushAndStop would fix this, but that causes other tests in publish_scheduler_test to fail. I don't have time to take a look at this now, but I just wanted to give you a heads up as to why the CI is failing.

@hongalex hongalex self-requested a review June 15, 2021 20:06
@codyoss
Member

codyoss commented Jun 15, 2021

@hongalex I believe more mutexes can't be added for flushing, as that could force a handling of the bundle. Switching just bundlers to a sync.Map would solve the issue, but I am not sure whether it would have a negative performance impact; it would need more testing, but I think it would be okay...
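
For illustration, a rough sketch of the sync.Map idea (placeholder names and types, not the scheduler's actual code): Range and Delete may run concurrently without an extra mutex, which is exactly what the plain-map version could not do safely.

```go
package sketch

import "sync"

// Placeholder types for illustration only; not the publish scheduler's real code.
type bundler struct{}

func (b *bundler) Flush() {}

type scheduler struct {
	bundlers sync.Map // ordering key (string) -> *bundler
}

// flushAll visits every live bundler. sync.Map.Range may run concurrently
// with Delete, which is exactly where a plain map guarded only during
// mutation trips the race detector.
func (s *scheduler) flushAll() {
	s.bundlers.Range(func(key, value interface{}) bool {
		value.(*bundler).Flush()
		return true // keep iterating
	})
}

// done removes a key's bundler once all of its items have been handled.
func (s *scheduler) done(key string) {
	s.bundlers.Delete(key)
}
```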

@mzanibelli
Author

Hello! Indeed I did not run the tests with the race detector enabled.
The problem seems deeper. I tried implementing sync.Map usage (for both bundlers and key counters), but it leads to other race problems, not to mention potential performance problems. I think some work is needed at the architecture level: a way to avoid updating the scheduler's state from inside the bundler's handler would streamline the whole thing.
Let me know if I can help! Thanks for looking into this.

@mzanibelli
Author

Hey! I have attempted something which is not very clean, but the general idea is: since the race is on s.bundlers iteration/deletion, when we flush (and stop) we don't need to perform the per-key deletion at all; we can just clean up the whole state right before returning from Flush*.

Now here is the catch: I'm not sure what happens if we Add during a plain Flush, because the guard provided by s.done during FlushAndStop is not effective there. I think it should be okay, since either the bundler is still there or everything has already been cleaned up, so we don't end up with a corrupted state, but be careful.

It's not very clean yet and I must admit I don't like this solution, but the tests pass with the race detector enabled, so I might as well mention it to you.
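
For illustration, a rough, self-contained sketch of the workaround described above (placeholder names and types, not the actual commits on this branch): skip the per-key deletion while flushing and stopping, and reset the whole state right before returning.

```go
package sketch

import "sync"

// Placeholder types for illustration only; not the actual patch.
type bundler struct{}

func (b *bundler) Flush() {}

type scheduler struct {
	mu          sync.Mutex
	bundlers    map[string]*bundler
	outstanding map[string]int
}

// FlushAndStop flushes every remaining bundler, then resets the whole state
// in one go just before returning, instead of deleting entries key by key
// from inside the handlers (the per-key deletion is what raced with this
// iteration). Note that a concurrent Add could still mutate the map during
// the range below; that is the caveat mentioned above.
func (s *scheduler) FlushAndStop() {
	for _, b := range s.bundlers {
		b.Flush()
	}

	s.mu.Lock()
	s.bundlers = make(map[string]*bundler)
	s.outstanding = make(map[string]int)
	s.mu.Unlock()
}
```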

@hongalex
Member

I have a very rough draft of replacing everything with sync.Map and it seems to be fine (as in the tests pass). From some initial benchmarking, replacing map with sync.Map hasn't affected unordered publish performance. I'll send out a PR of this later today after benchmarking with ordered keys as well.

@mzanibelli
Author

I retried the sync.Map implementation just out of curiosity and you are right! I must have been sleeping the other day, but I had TestPublishScheduler_DoesntRaceWithPublisher failing if I remember correctly. I still look forward to checking how you handled a specific corner case that leaves me doubtful, but anyway, good job! It seems like one of the rare occasions where sync.Map is appropriate.

@mzanibelli
Author

Better solution here.

@mzanibelli mzanibelli closed this Jun 22, 2021
@mzanibelli mzanibelli deleted the publish-scheduler-leak branch June 22, 2021 07:07