
Can zk support high-frequency operations, and will zk be a bottleneck? #2

Open
long1208 opened this issue Aug 29, 2022 · 2 comments

Comments

@long1208

No description provided.

@bdyx123 (Collaborator) commented Aug 29, 2022

The operations against zk are not very frequent: one is CSS workers updating their status, and the other is the create/delete operations when a shuffle is registered/unregistered. Currently we run 7 zk nodes for a CSS cluster with hundreds of workers.
The zk pressure (memory) mainly comes from the large number of zk watches, which we use to track each shuffleId's lifetime so that shuffle data can be cleaned up on CSS workers. We are working on optimizing this.
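
For readers unfamiliar with the pattern, a watch-based lifetime tracker along these lines could be sketched with the plain org.apache.zookeeper client. This is only an illustration under assumptions: the path layout is borrowed from the log further down this thread, and cleanupLocalShuffleData is a hypothetical hook, not CSS's actual code. It also shows why watches cost server memory, which matches the pressure described above.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ShuffleWatchSketch {
    // Path layout borrowed from the log below; the real layout may differ.
    private static final String SHUFFLE_ROOT = "/css/mycss/shuffles";

    // Register a watch on a shuffle znode so that its deletion triggers
    // local cleanup. ZooKeeper keeps every registered watch in server
    // memory until it fires, so one watch per shuffleId per worker adds up.
    static void watchShuffle(ZooKeeper zk, String shuffleId)
            throws KeeperException, InterruptedException {
        final String path = SHUFFLE_ROOT + "/" + shuffleId;
        zk.exists(path, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeDeleted) {
                    // Hypothetical hook: remove this shuffle's data locally.
                    cleanupLocalShuffleData(shuffleId);
                }
                // Watches are one-shot; a real implementation would
                // re-register here if the node still exists.
            }
        });
    }

    static void cleanupLocalShuffleData(String shuffleId) {
        // hypothetical cleanup; intentionally left empty in this sketch
    }
}
```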

@a140262 commented Oct 30, 2022

@bdyx123, have you seen the following exceptions in Spark application logs? It seems the CSS worker has already deleted the shuffleId node by the time the client tries to update it. Is this behavior normal?

```
ERROR ZookeeperExternalShuffleMeta: Update zk shuffle node spark-00000003119c3rhae9v-117 failed.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /css/mycss/shuffles/spark-00000003119c3rhae9v-117
```
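
One common way to tolerate this kind of race is to treat NoNodeException on an update as "the shuffle was already cleaned up" rather than as a fatal error. A minimal sketch with the raw ZooKeeper client follows; updateShuffleNode is a hypothetical helper, not CSS's actual handling, and whether skipping the update is safe depends on CSS's cleanup semantics.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SafeShuffleUpdate {
    // Hypothetical helper: update a shuffle znode, tolerating the node
    // having been deleted by a concurrent unregister/cleanup.
    static void updateShuffleNode(ZooKeeper zk, String path, byte[] data)
            throws KeeperException, InterruptedException {
        try {
            // version -1 means "set regardless of the node's current version"
            zk.setData(path, data, -1);
        } catch (KeeperException.NoNodeException e) {
            // The shuffle node is already gone (e.g. the application
            // finished and cleanup ran first); skip the update instead
            // of failing, assuming a stale status update is harmless.
        }
    }
}
```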
