cannot run css cluster #5

Open
Lobo2008 opened this issue Sep 28, 2022 · 4 comments
Comments

@Lobo2008

Lobo2008 commented Sep 28, 2022

  1. If I use all default settings and start with sbin/start-all.sh, a Worker and a Master are running,
    but when I submit a Spark app, it throws Caused by: java.lang.RuntimeException: replica num must less than worker num

  2. If I run in zk mode by changing conf/css-default.conf to:

css.zookeeper.address=MyZkIP:2181
css.worker.registry.type=zookeeper

and start with sbin/start-workers.sh, sbin/start-worker.sh, or sbin/start-all.sh,
it throws

com.bytedance.css.service.deploy.worker.Worker --host xxx07v.xxxx.net
failed to launch: nice -n 0 /yy/java8/bin/java -Xmx1024m -XX:MaxDirectMemorySize=4096m -Dcss.log.dir=/home/aa/css/logs -Dcss.log.filename=css-aa-worker-1.out -classpath /yyy/java8/lib:/home/aa/css/lib/* com.bytedance.css.service.deploy.worker.Worker --host  xxx07v.xxxx.net
tail: cannot open "/home/aa/css/logs/css-aa-worker-1.out" for reading: No such file or directory
full log in /home/aa/css/logs/css-aa-worker-1.out
  3. If I deploy as described in README.md, set 3 workers in conf/workers with zk mode, and run start-workers.sh,
    then after entering my password for the 3 workers, it returns permission denied.
    I am sure my password is correct.

Any suggestions? I think the README is ambiguous.

@bdyx123
Collaborator

bdyx123 commented Sep 28, 2022

  1. CSS pushes data with two replicas, so at least two workers must be started.
  2. Does the directory /home/aa/css/logs exist? Or is there a directory permission issue?
  3. The machine that executes start-workers.sh needs passwordless SSH set up to all of the workers.
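The three points above can be checked before launching anything. The following is a minimal sketch, not something that ships with CSS: the function names `count_workers`, `check_log_dir`, and `check_ssh` are hypothetical, and it assumes conf/workers lists one host per line.

```shell
# Hypothetical preflight helpers; none of these are part of CSS.

# Count non-empty, non-comment lines in a workers file
# (assumes one host per line).
count_workers() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" | wc -l | tr -d ' '
}

# Succeed only if the log directory exists and is writable by the current user.
check_log_dir() {
  [ -d "$1" ] && [ -w "$1" ]
}

# Succeed only if key-based (passwordless) SSH to the host works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage (paths and hosts taken from this thread):
#   [ "$(count_workers conf/workers)" -ge 2 ] || echo "fewer than 2 workers listed"
#   check_log_dir /home/aa/css/logs || echo "log dir missing or not writable"
#   for host in IP_A IP_B IP_C; do
#     check_ssh "$host" || echo "passwordless SSH to $host is not set up"
#   done
```

If `check_ssh` fails for a host, copying a public key to it with `ssh-copy-id` is one common way to set up passwordless login.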

@Lobo2008
Author

> 1. CSS pushes data with two replicas, so at least two workers must be started.
> 2. Does the directory /home/aa/css/logs exist? Or is there a directory permission issue?
> 3. The machine that executes start-workers.sh needs passwordless SSH set up to all of the workers.

There are no permission issues: without changing anything, starting with start-all.sh produces the corresponding master and worker logs, but when I switch to zk mode, it fails.

I have 3 nodes with IPs IP_A, IP_B, IP_C and want to use zk mode.

  • conf/workers
    IP_A
    IP_B
    IP_C
  • conf/css-default.conf
    css.cluster.name=bytedance
    css.commit.threads=128
    css.flush.timeout=360s
    css.network.timeout=600s
    css.disk.dir.num.min=5
    css.extMeta.expire.interval=600s
    css.zookeeper.address=My_Zk_IP:2181
    css.worker.registry.type=zookeeper
    css.cluster.name=bytedance
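The listing above sets css.cluster.name twice. Whether CSS treats a repeated key as an error is not stated in this thread, but repeated keys are easy to spot. A minimal sketch, assuming the file uses plain `key=value` lines; the helper name `duplicate_keys` is hypothetical:

```shell
# Print every key that appears more than once in a key=value file.
# Blank lines and lines starting with '#' are ignored.
duplicate_keys() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" \
    | cut -d= -f1 \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | sort \
    | uniq -d
}

# Example usage:
#   duplicate_keys conf/css-default.conf
```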

Then the directory is copied to the 3 nodes.
How should I change the other settings and run the scripts to make this work?

I suppose running start-workers.sh on one of the 3 nodes should work: css will check the workers file and start all 3 workers.
OR
run start-worker.sh on every node so that each node starts its own worker process (in that case, should the other 2 IPs be deleted?)

@bdyx123
Collaborator

bdyx123 commented Oct 8, 2022

Yes, start-workers.sh should work. Haven't you started it?

@a140262

a140262 commented Oct 29, 2022

My CSS cluster is up and running with the zookeeper registry type in k8s now. Everything looks fine until I run a Spark app. The application log shows the same error message:

java.lang.RuntimeException: replica num must less than worker num

The state in my ZooKeeper is:

[zk: localhost:2181(CONNECTED) 4] ls /css/my2css/workers
[css-0:39477:32875:35149, css-0:41865:46557:46199, css-0:43149:36579:33897, css-0:46573:36469:44793, css-0:46679:46533:41791, css-1:35421:36815:43879, css-1:39127:39883:44297, css-1:42185:42751:44815, css-1:43769:41983:33951]

The environment variable is set with export CSS_WORKER_INSTANCES=2.
Could you please let us know which configuration sets the replica number and which one determines the worker number? Is there anything else I have missed?
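When comparing what ZooKeeper holds against what is actually running, it can help to summarize a listing like the one above. The sketch below is a hypothetical helper (`summarize_workers` is not a CSS tool). It assumes each entry has the form `host:port:port:port`, as in the output shown, and counts total entries and distinct host names. It makes no claim about which of these numbers CSS compares against the replica count.

```shell
# Summarize a ZooKeeper children listing such as
#   [css-0:39477:32875:35149, css-1:35421:36815:43879]
# Prints the number of entries and the number of distinct hosts.
summarize_workers() {
  # Strip brackets, put one entry per line, trim surrounding spaces.
  sw_entries="$(printf '%s\n' "$1" \
    | sed 's/[][]//g' \
    | tr ',' '\n' \
    | sed 's/^ *//; s/ *$//')"
  # grep -c . counts non-empty lines.
  sw_total="$(printf '%s\n' "$sw_entries" | grep -c .)"
  sw_hosts="$(printf '%s\n' "$sw_entries" | cut -d: -f1 | sort -u | grep -c .)"
  echo "entries=$sw_total hosts=$sw_hosts"
}

# Example usage: paste the bracketed list printed by `ls /css/my2css/workers`
#   summarize_workers '[css-0:39477:32875:35149, css-1:35421:36815:43879]'
```

More entries than running worker processes could indicate registrations left over from earlier runs, though this thread does not confirm that.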
