
Code is not working when data size is large #1

Open
souvik82 opened this issue Jan 18, 2017 · 3 comments

Comments

@souvik82

I have a 4-node cluster: the master has 56 GB of RAM and the data nodes have 32 GB each. With a small data set of around 200 MB it works fine, but with a 10 GB data set it hangs.

Sorting of the data is also very slow: for 10 GB it takes around 2.9 minutes. Below is my spark-submit script:

spark-submit \
  --class testHbaseRDDUtil \
  --driver-memory 20G \
  --executor-memory 4G \
  --num-executors 32 \
  --jars /usr/lib/hbase-client-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop-compat-1.2.0-IBM-7.jar,/usr/lib/htrace-core-3.1.0-incubating.jar,/usr/lib/hbase-common-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop2-compat-1.2.0-IBM-7.jar,/usr/lib/hbase-protocol.jar,/usr/lib/hbase-server-1.2.0-IBM-7.jar,/usr/lib/metrics-core-2.2.0.jar,/usr/lib/hbase-annotations-1.2.0-IBM-7.jar \
  souvik-0.0.1-SNAPSHOT.jar

@zeyuanxy
Owner

Hello, sorting is accompanied by a repartition, which means a lot of data is moved between machines, and that is definitely a performance killer. Can you share the detailed metrics (or status) of your Spark job?
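
(A rough illustration of the point above, not this library's actual code: HFiles have to be written in row-key order, so a bulk load typically boils down to something like repartitionAndSortWithinPartitions, which sends every record across the network once and sorts it on the receiving side. The sketch below assumes a local Spark session and made-up (rowKey, value) pairs.)

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bulk-load-shuffle-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the (rowKey, value) pairs a bulk load would write.
    val kv = sc.parallelize(1 to 1000000).map(i => (f"row$i%08d", s"value$i"))

    // The expensive step: every record is shipped to the partition that owns
    // its key range, then sorted within that partition before being written.
    val sorted = kv.repartitionAndSortWithinPartitions(new HashPartitioner(32))

    println(sorted.count())
    spark.stop()
  }
}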

@zeyuanxy
Owner

Hi @souvik82, I've added a new interface, toHBaseBulkWithFamilies, that lets you specify the column families up front rather than iterating through all of them, which should greatly improve performance when the data size is huge. Can you try it? Thanks~
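
(A toy sketch of the general idea only: it does not show the real toHBaseBulkWithFamilies signature, and the family names and record layout below are made up. If the loader has to iterate over every family and filter the data for each one, a huge RDD gets scanned once per family; naming the families that are actually present limits the work to just those.)

import org.apache.spark.sql.SparkSession

// Toy illustration only; not the library's implementation or API.
object FamiliesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("families-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up (rowKey, family, value) records; only "cf1" ever holds data.
    val records = sc.parallelize(1 to 1000000).map(i => (f"row$i%08d", "cf1", s"v$i"))

    // Iterating over every family the table defines means one filter pass per
    // family over the whole data set, even for families with no data.
    val allFamilies = Seq("cf1", "cf2", "cf3", "cf4")
    val countsAll = allFamilies.map(cf => cf -> records.filter(_._2 == cf).count())

    // Supplying the families up front does that work only for the families
    // that actually matter.
    val suppliedFamilies = Seq("cf1")
    val countsSupplied = suppliedFamilies.map(cf => cf -> records.filter(_._2 == cf).count())

    println(countsAll.mkString(", "))
    println(countsSupplied.mkString(", "))
    spark.stop()
  }
}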

@ghost

ghost commented Jun 20, 2017

How do I set the numFilesPerRegionPerFamily parameter? All I know is that I have a dozen or so GB of data to save in HBase; how should this value be tuned for that?
