Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: discover topic/channel paused state on new topic discovery #1274

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

maplessssy
Copy link

Fix #1199

@maplessssy maplessssy changed the title Sync topic channel state nsqd: discover topic/channel paused state on new topic discovery Aug 8, 2020
@maplessssy
Copy link
Author

@jehiah @mreiferson @ploxiln This PR is ready for review. PTAL.

@maplessssy
Copy link
Author

Here is the design:

  1. Supports send topic/channel paused state through command[register], see go-nsq pr.
  2. Nsqd send states to nsqlookupd when topic/channel create,delete and paused,unpaused.
  3. Nsqlookupd store these states, and nsqd can discover topic/channel paused state on new topic discovery through the HTTP API that nsqlookupd provides.

see more details in the code. : )

nsqd/lookup.go Outdated Show resolved Hide resolved
nsqlookupd/http.go Outdated Show resolved Hide resolved
@mreiferson
Copy link
Member

Thanks for taking a first pass at this @thekingofworld! 🙏

@maplessssy
Copy link
Author

@jehiah @mreiferson @ploxiln This PR is ready for review. PTAL.

@maplessssy
Copy link
Author

@mreiferson This PR is ready for review. PTAL.
/cc @ploxiln @jehiah

@maplessssy
Copy link
Author

/cc @mreiferson @ploxiln @jehiah any of you can help to take a look at this?🙏

@maplessssy
Copy link
Author

@mreiferson Ping.

nsqd/lookup.go Outdated Show resolved Hide resolved
nsqd/nsqd.go Outdated Show resolved Hide resolved
nsqlookupd/http.go Outdated Show resolved Hide resolved
@maplessssy
Copy link
Author

/cc @mreiferson
Thank you for suggestions above, it's great. I have finished all, the code seems to more clearly, please take another look.
Thanks again for your patience.

Copy link
Member

@mreiferson mreiferson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing to improve! Thanks for your patience 🙏

My big high-level concern now is that these new paths in nsqlookupd are going to be significantly more expensive. In the /channels case, we're doing multiple passes over data in RegistrationDB to filter and lookup paused state in producers. Considering #1208, we should think about better data structures to use to store this data to make these newer, more complex queries performant.

cc @andyxning

nsqlookupd/lookup_protocol_v1.go Outdated Show resolved Hide resolved
nsqlookupd/registration_db.go Outdated Show resolved Hide resolved
nsqd/context.go Outdated Show resolved Hide resolved
nsqd/nsqd.go Outdated Show resolved Hide resolved
nsqd/nsqd.go Outdated Show resolved Hide resolved
nsqlookupd/http.go Outdated Show resolved Hide resolved
nsqlookupd/http.go Outdated Show resolved Hide resolved
nsqlookupd/http.go Outdated Show resolved Hide resolved
nsqd/lookup.go Outdated Show resolved Hide resolved
nsqd/lookup.go Outdated Show resolved Hide resolved
@maplessssy
Copy link
Author

maplessssy commented Aug 17, 2020

Continuing to improve! Thanks for your patience 🙏

My big high-level concern now is that these new paths in nsqlookupd are going to be significantly more expensive. In the /channels case, we're doing multiple passes over data in RegistrationDB to filter and lookup paused state in producers. Considering #1208, we should think about better data structures to use to store this data to make these newer, more complex queries performant.

cc @andyxning

Agree.

In the /channels case, we're doing multiple passes over data in RegistrationDB to filter and lookup paused state in producers.

After this, seems like handle the "*" case for FindRegistrations is not enough. I have to think about how to solve this whole problem. @mreiferson @andyxning Any ideas?

like we can add two additional maps to handle "*" case for FindRegistrations, we can also add additional map handle the state lookup for topic/channel, doesn't it?

@maplessssy
Copy link
Author

maplessssy commented Aug 29, 2020

@mreiferson
I added some benchmarks for the /channels case, the execution results are as follows:

image

It seems still fast enough, I doubt if the situation of write lock starvation really exists?

@maplessssy
Copy link
Author

@mreiferson Ping.

@maplessssy
Copy link
Author

@mreiferson @jehiah @ploxiln Anyone can help to take a look at this? It has waited for more than 10 days...

@maplessssy
Copy link
Author

Hi folks, let's keep going.[Doge] if there are any problems, please let me know and I will try my best to solve them.
Let us make nsq more powerful.
/cc @mreiferson @jehiah @ploxiln

@mreiferson
Copy link
Member

mreiferson commented Sep 13, 2020

hi @thekingofworld, sorry for the delay, I'll try to find some time this week to take a closer look. We do know that at larger scale there are performance issues that exist (#1208), and this PR will make some queries perform worse.

@maplessssy
Copy link
Author

maplessssy commented Sep 14, 2020

Well, I read the related discussion about the PR #1208.
I'm wondering if there is more detailed data that indicate how slow these queries are, such as how long does a query take, and how many topics, how many channels, how many producers...
After recurring this slow situation, we can verify how much efficiency the new optimization scheme can improve.
/cc @mreiferson

@maplessssy
Copy link
Author

@mreiferson
I added some benchmarks for the /channels case, the execution results are as follows:

image

It seems still fast enough, I doubt if the situation of write lock starvation really exists?

By the way, benchmarks show that these queries are still fast.

@ploxiln
Copy link
Member

ploxiln commented Sep 14, 2020

I think that the situation that #1208 is attempting to optimize, which the existing benchmark setups here do not emulate, is when there are many channels per topic, and they are constantly being added and removed (as ephemeral channels tend to do when consumers connect and disconnect). Each add and remove does a "write" lock/unlock of the RWMutex, and this blocks all the "read" locks of the lookups. The current benchmarks setup just two channels per topic, and do lookups without concurrent updates, and only time the lookups part.

@maplessssy
Copy link
Author

@ploxiln yeah, the situation you described is right.
But I think the key is how FindRegistrations will behave in * case when db is very large.
So I added some benchmarks for this, the results are shown below:
image
When topic=64, channel=512, producer=512, each operation takes 11ms.
When topic=512, channel=512, producer=512, Memory exceeds 10G, I think this is beyond normal.
It seems like we can add additional maps to improve performance for * case.
/cc @ploxiln @mreiferson

@maplessssy maplessssy force-pushed the sync_topic_channel_state branch 2 times, most recently from f513954 to 0ceec31 Compare December 5, 2020 08:11
@maplessssy
Copy link
Author

maplessssy commented Dec 5, 2020

@mreiferson @ploxiln
Let's continue to this.
Our previous discussion stopped at the performance of the case of /channels. Recently I added an additional map to improve the performance of the query, which is essentially optimizing the function FindRegistrations, which is called with * .
PTAL.

Comparison before and after optimization:

Performance increased by 40%.

image

image

Corns:

  • This will add some pressure on the memory since a new memory will be allocated.

@maplessssy
Copy link
Author

@mreiferson @ploxiln Ping.

@mreiferson
Copy link
Member

Thanks for continuing to push this forward @thekingofworld.

The approach here will now be much more costly on the write (registration) side, because each write will now have to walk the whole structure and produce the cached "high performance" map.

Instead, we should concurrently update both the standard and new "high performance" maps during each mutation.

The discussion in #1208 (comment) (from "if sync.Map everywhere is too much overhead" down) and #1208 (comment) has the most relevant context for what needs to happen here.

@maplessssy
Copy link
Author

maplessssy commented Jan 1, 2021

Instead, we should concurrently update both the standard and new "high performance" maps during each mutation.

agree, update highspeed map during each mutation instead walk the whole structure and produce the cached "high performance" map.

@mreiferson done, PTAL.

@maplessssy
Copy link
Author

maplessssy commented Feb 9, 2021

@mreiferson ping.
let's finish it...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nsqd: discover topic/channel paused state on new topic discovery
3 participants