Spark-overflow

A collection of Spark-related information, solutions, debugging tips and tricks, etc. PRs are always welcome! Share what you know about Apache Spark.

Knowledge

Spark executor memory (Reference Link)

spark-submit --verbose (Reference Link)

  • Always add the --verbose option to spark-submit (see the sketch below) to print the following information:
    • All default properties.
    • Command line options.
    • Settings from the Spark conf file.
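A minimal sketch of the flag in use (the class name and jar are hypothetical placeholders):

  spark-submit --verbose \
    --class com.example.Main \
    --master yarn \
    my-app.jar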

Spark Executor on YARN (Reference Link)

The following are the memory-related configs on YARN (a sizing sketch follows this list):

  • YARN container size - yarn.nodemanager.resource.memory-mb.
  • Memory Overhead - spark.yarn.executor.memoryOverhead.

  • An example of how to set up YARN and launch Spark jobs with a specific number of executors (Reference Link).
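As a hedged rule of thumb, the memory YARN requests per executor is roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead, and that total must fit inside yarn.nodemanager.resource.memory-mb. The values below are illustrative only:

  # yarn-site.xml: memory a NodeManager can hand out to containers
  #   yarn.nodemanager.resource.memory-mb = 16384

  spark-submit \
    --master yarn \
    --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=512 \
    my-app.jar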

Tunings

Tune the shuffle partitions

  • Tune the value of spark.sql.shuffle.partitions, the number of partitions used when shuffling data for joins or aggregations (the default is 200).
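A minimal sketch of setting it at runtime (400 is an illustrative value, not a recommendation; assumes a SparkSession named spark):

  // Scala: set the shuffle partition count on an existing SparkSession
  spark.conf.set("spark.sql.shuffle.partitions", "400")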

Avoid using jets3t 1.9 (Reference Link)

  • It's the default jar on Hadoop 2.0.
  • Its performance is inexplicably terrible.

Use reduceByKey() instead of groupByKey()

  • reduceByKey combines values for each key locally on every partition before shuffling, so only partial results cross the network (see the sketch below).
  • groupByKey shuffles every key-value pair across the network before combining, which is far more expensive.
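A minimal word-count sketch contrasting the two (assumes a SparkContext named sc; the input path is hypothetical):

  // reduceByKey: partial sums are computed per partition, then shuffled
  val counts = sc.textFile("hdfs:///tmp/input.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // groupByKey: every (word, 1) pair crosses the network before summing
  val countsSlow = sc.textFile("hdfs:///tmp/input.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .groupByKey()
    .mapValues(_.sum)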

GC policy (Reference Link)

  • G1GC is a newer garbage collector that is worth trying.
  • Enable it with -XX:+UseG1GC (see the sketch below).
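A sketch of enabling G1GC on the executors through spark-submit (my-app.jar is a placeholder):

  spark-submit \
    --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
    my-app.jar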

Join a large table with a small table (Reference Link)

  • By default Spark uses a ShuffledHashJoin; the problem is that all the data of the big table gets shuffled.
  • Use BroadcastHashJoin instead (see the sketch below):
    • It broadcasts the small table to all workers.
    • Set spark.sql.autoBroadcastJoinThreshold.
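A minimal Scala sketch (largeDF, smallDF, and the join key "id" are hypothetical; the 10 MB threshold is illustrative):

  import org.apache.spark.sql.functions.broadcast

  // Any table smaller than this threshold is broadcast automatically
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

  // Or force the broadcast of the small table explicitly
  val joined = largeDF.join(broadcast(smallDF), Seq("id"))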

Use foreachPartition

  • If your task involves expensive per-record setup, use foreachPartition instead of foreach so the setup runs once per partition rather than once per record (see the sketch below).
  • For example: opening a DB connection, making a remote call, etc.
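A hedged sketch, assuming hypothetical user-supplied createConnection() and send() helpers:

  rdd.foreachPartition { records =>
    val conn = createConnection()  // hypothetical helper; opened once per partition
    try {
      records.foreach(record => conn.send(record))  // send() is also hypothetical
    } finally {
      conn.close()
    }
  }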

Data Serialization

  • The default Java serialization is too slow.
  • Use Kryo (see the sketch below):
    • conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
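A minimal Scala sketch; registering your classes is optional but keeps the serialized output compact (MyClass is hypothetical):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyClass]))  // MyClass stands in for your own types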

Solutions

java.io.IOException: No space left on device

  • The /tmp directory is probably full; check spark.local.dir in spark-defaults.conf.
  • How to fix it?
    • Mount more disk space:
      spark.local.dir /data/disk1/tmp,/data/disk2/tmp,/data/disk3/tmp,/data/disk4/tmp

java.lang.OutOfMemoryError: GC overhead limit exceeded (ref)

  • Too much time is being spent in GC; you can check this in the Spark metrics.
  • How to fix it? (see the sketch after this list)
    • Increase the executor heap size with --executor-memory.
    • Increase spark.storage.memoryFraction.
    • Change the GC policy (e.g., use G1GC).
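A sketch combining the three fixes; the values are illustrative only, and spark.storage.memoryFraction applies to the legacy (pre-1.6) memory manager:

  spark-submit \
    --executor-memory 8g \
    --conf spark.storage.memoryFraction=0.8 \
    --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
    my-app.jar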

shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Java heap space (ref)

  • OOM on the Spark driver.
  • This usually happens when you fetch a huge amount of data to the driver (client), e.g., via collect().
  • Spark SQL and Spark Streaming are typical workloads that need a large heap on the driver.
  • How to fix it?
    • Increase --driver-memory (see the sketch below).
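For example (the 4g value and jar name are illustrative):

  spark-submit --driver-memory 4g my-app.jar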

java.lang.NoClassDefFoundError (ref)

  • The code compiles fine but fails at runtime.
  • How to fix it?
    • Use --jars to upload JARs and place them on the classpath of your application.
    • Use --packages to include a comma-separated list of Maven coordinates of JARs.
      EX: --packages com.google.code.gson:gson:2.6.2
      This example adds the gson jar to both the executor and driver classpath.
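A sketch combining both flags (the local jar paths and app jar are hypothetical):

  spark-submit \
    --jars /path/to/dep1.jar,/path/to/dep2.jar \
    --packages com.google.code.gson:gson:2.6.2 \
    my-app.jar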

Serialization stack error

  • Error message:
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: com.spark.demo.MyClass
    Serialization stack:
    • Object is not serializable (class: com.spark.demo.MyClass, value: com.spark.demo.MyClass@6951e281)
    • Element of array (index: 0)
    • Array (class [Ljava.lang.Object;, size 6)
  • How to fix it?
    • Make com.spark.demo.MyClass implement java.io.Serializable (see the sketch below).
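A minimal Scala sketch of the fix (the class body is hypothetical):

  package com.spark.demo

  // Marking the class Serializable lets Spark ship instances to executors
  class MyClass(val value: Int) extends java.io.Serializable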

java.io.FileNotFoundException: spark-assembly.jar does not exist

  • How to fix it?
  1. Upload spark-assembly.jar to HDFS.
  2. Set spark.yarn.jar; there are two ways to configure it (see the sketch below):
    • Add --conf spark.yarn.jar when launching spark-submit.
    • Set spark.yarn.jar on SparkConf in your Spark driver.
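A sketch of both options (the HDFS path and app jar are hypothetical):

  # Option 1: on the command line
  spark-submit --conf spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly.jar my-app.jar

  // Option 2: in the driver (Scala)
  val conf = new SparkConf().set("spark.yarn.jar", "hdfs:///user/spark/share/lib/spark-assembly.jar")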

java.io.IOException: Resource spark-assembly.jar changed on src filesystem (Reference Link)

  • spark-assembly.jar exists in HDFS, but you still get the "changed on src filesystem" error.
  • How to fix it? The fix is the same as above (see the sketch there):
  1. Upload spark-assembly.jar to HDFS.
  2. Set spark.yarn.jar, either via --conf on spark-submit or on SparkConf in your Spark driver.

How to find the size of a DataFrame in Spark

  • In Java or Scala, you can use org.apache.spark.util.SizeEstimator (see the sketch below).
  • In PySpark, one way is to persist the DataFrame to disk, then check its size in the Storage tab of the Spark UI.
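A hedged Scala sketch: SizeEstimator measures the in-memory footprint of an object graph, so one rough heuristic is to estimate a sample of rows and extrapolate (df is a hypothetical DataFrame; the sample size is illustrative):

  import org.apache.spark.util.SizeEstimator

  val sample = df.take(1000)                        // Array[Row] materialized on the driver
  val sampleBytes = SizeEstimator.estimate(sample)  // in-memory size of the sample
  val approxTotalBytes = sampleBytes.toDouble / sample.length * df.count()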
