Creating Spark Data Sources with the Java API

Quickstart

gradle clean build

spark-shell \
--jars=./lib/build/libs/dataguidebook-1.0-SNAPSHOT.jar

Data Sources Version 1 API

This is the original way of defining a data source. It is still available in Spark 3.2.1.

import org.apache.spark.sql.SaveMode

val df = spark.read.format("com.dataguidebook.spark.datasource.v1").load("")

df.printSchema()
df.count()

df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").save("newpath")

// The insertInto command only works when you have a table defined USING your
// custom data source.

// spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int) USING com.dataguidebook.spark.datasource.v1 LOCATION 'custom/insertinto'")
df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").insertInto("myTable")

Working with Data Sources V1

The JavaDoc for Spark SQL Sources lists all the classes you can use in your custom data source to define its behavior. You need to provide at least two classes (a minimal Java sketch follows the list below):

  • DefaultSource which implements RelationProvider
  • YourCustomClass which extends BaseRelation and implements at least TableScan
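
As a concrete example, here is a minimal Java sketch of those two classes. The class name CustomDataRelation and the hard-coded three-integer-column schema are illustrative, not necessarily what this repository uses.

import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.RelationProvider;
import org.apache.spark.sql.sources.TableScan;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// spark.read.format("com.dataguidebook.spark.datasource.v1") resolves to the class
// com.dataguidebook.spark.datasource.v1.DefaultSource, so the name DefaultSource matters.
public class DefaultSource implements RelationProvider {
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters) {
        // parameters holds the options passed to the reader; hand them to the relation as needed.
        return new CustomDataRelation(sqlContext);
    }
}

// The relation describes the schema and, through TableScan, how to produce the rows.
class CustomDataRelation extends BaseRelation implements TableScan {
    private final SQLContext sqlContext;

    CustomDataRelation(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    @Override
    public SQLContext sqlContext() {
        return sqlContext;
    }

    @Override
    public StructType schema() {
        // Hard-coded for illustration; a real source would derive this from its metadata.
        return new StructType()
                .add("column01", DataTypes.IntegerType)
                .add("column02", DataTypes.IntegerType)
                .add("column03", DataTypes.IntegerType);
    }

    @Override
    public RDD<Row> buildScan() {
        // Produce every tuple of the data source as an RDD of Row objects.
        JavaSparkContext jsc = new JavaSparkContext(sqlContext.sparkContext());
        return jsc.parallelize(Arrays.asList(
                RowFactory.create(1, 2, 3),
                RowFactory.create(4, 5, 6)
        )).rdd();
    }
}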

Used for Reading or Writing

  • RelationProvider: Used on the DefaultSource class and defines how to initialize a custom data source without a user-defined schema.
    • Defines createRelation, which is invoked when you call spark.read.format(...).load() and receives the options you provide.
  • SchemaRelationProvider: Used on the DefaultSource class and defines how to initialize a custom data source with a user-defined schema.
    • Defines an overload of createRelation that is invoked when you call spark.read with an explicit schema.
  • DataSourceRegister: Data sources should implement this trait so that they can register an alias for their data source (a sketch combining all three follows this list).
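
A sketch of a DefaultSource that combines all three interfaces. The alias "dataguidebook" is illustrative, and the second createRelation assumes a CustomDataRelation constructor that stores the user-supplied schema. For Spark to resolve the alias, the class also has to be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.

import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.RelationProvider;
import org.apache.spark.sql.sources.SchemaRelationProvider;
import org.apache.spark.sql.types.StructType;

public class DefaultSource implements RelationProvider, SchemaRelationProvider, DataSourceRegister {

    // Enables spark.read.format("dataguidebook") in place of the full package name,
    // once the class is registered under META-INF/services.
    @Override
    public String shortName() {
        return "dataguidebook";
    }

    // Called when the user does NOT supply a schema: the relation infers or hard-codes one.
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters) {
        return new CustomDataRelation(sqlContext);
    }

    // Called when the user supplies a schema via spark.read.schema(...); assumes a second
    // CustomDataRelation constructor that stores the supplied schema.
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters,
                                       StructType schema) {
        return new CustomDataRelation(sqlContext, schema);
    }
}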

Reading Data

  • TableScan: A BaseRelation that can produce all of its tuples as an RDD of Row objects.
  • PrunedScan: A BaseRelation that can eliminate unneeded columns before producing an RDD containing all of its tuples as Row objects.
  • PrunedFilteredScan: A BaseRelation that can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. (A sketch of this interface follows the list.)
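
A minimal sketch of the relation upgraded from TableScan to PrunedFilteredScan, so Spark can push column pruning and simple predicates down into the scan. The class name and the in-memory rows are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.PrunedFilteredScan;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

class PrunedCustomDataRelation extends BaseRelation implements PrunedFilteredScan {
    private final SQLContext sqlContext;

    PrunedCustomDataRelation(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    @Override
    public SQLContext sqlContext() { return sqlContext; }

    @Override
    public StructType schema() {
        return new StructType()
                .add("column01", DataTypes.IntegerType)
                .add("column02", DataTypes.IntegerType)
                .add("column03", DataTypes.IntegerType);
    }

    @Override
    public RDD<Row> buildScan(String[] requiredColumns, Filter[] filters) {
        // requiredColumns: only these columns (in this order) need to appear in the rows.
        // filters: predicates Spark would like handled at the source (EqualTo, GreaterThan, ...);
        // any filter the source ignores is simply re-evaluated by Spark after the scan.
        List<Row> fullRows = Arrays.asList(
                RowFactory.create(1, 2, 3),
                RowFactory.create(4, 5, 6));

        StructType fullSchema = schema();
        List<Row> projected = new ArrayList<>();
        for (Row row : fullRows) {
            Object[] values = new Object[requiredColumns.length];
            for (int i = 0; i < requiredColumns.length; i++) {
                values[i] = row.get(fullSchema.fieldIndex(requiredColumns[i]));
            }
            projected.add(RowFactory.create(values));
        }
        return new JavaSparkContext(sqlContext.sparkContext()).parallelize(projected).rdd();
    }
}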

Writing Data

  • CreatableRelationProvider: Used on the DefaultSource class to define data writing behavior (a sketch of both write interfaces follows this list).
    • Requires you to implement createRelation with an additional SaveMode parameter.
    • You define all the business logic to overwrite, append, etc. a dataframe to your custom data source.
    • Use df.foreachPartition to execute your business logic on each partition when writing.
  • InsertableRelation: Used on the custom data source's class (e.g. CustomDataRelation) that inherits from BaseRelation, to insert into a data source backed by the Hive metastore.
    • Requires you to implement insert, which takes in a dataframe; you apply the business logic to store it (e.g. write it to your proprietary format) or send the dataframe to a different datastore.
    • This only works if you have defined a table in your Hive metastore with the USING keyword specifying your custom data source, for example:
    spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int) USING com.dataguidebook.spark.datasource.v1 LOCATION 'custom/insertinto'")

Spark 2.4 Data Sources V2

In Spark 2.3, the Data Sources V2 API (JavaDoc) was released in beta (Spark JIRA) but was not marked as stable until 2.4. So, we'll only talk about the Spark 2.4.7 Data Sources V2 API (JavaDoc).

Used for Reading or Writing

  • DataSourceV2 is a "marker interface": it tags the class as a V2 data source but doesn't define any behavior on its own.
  • InternalRow (org.apache.spark.sql.catalyst) is the internal row representation used inside Apache Spark; V2 readers produce it instead of Row.
    • TODO: How do you make an InternalRow? (One approach is sketched below.)
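
One way to construct an InternalRow from Java (the class name InternalRowExample is just for illustration): GenericInternalRow wraps an Object[] whose values must already be in Spark's internal representation, for example UTF8String instead of java.lang.String.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

public class InternalRowExample {
    public static InternalRow makeRow() {
        // Columns: (int, int, string) -- note the UTF8String conversion for the string column.
        return new GenericInternalRow(new Object[]{
                1,
                2,
                UTF8String.fromString("three")
        });
    }
}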

Reading Data

  • ReadSupport: Requires you to implement the createReader method, which returns a DataSourceReader object.
  • DataSourceReader: Requires you to implement readSchema (used when no schema is provided) and planInputPartitions, which returns the set of partitions to read. Each partition creates its own reader to handle its slice of the data.
  • InputPartition: A serializable description of a single partition of the data; it is sent to an executor, where its createPartitionReader method builds the reader for that partition.
  • InputPartitionReader: An iterator-style reader (next / get / close) that runs on the executor and produces the InternalRows for one partition. (A minimal end-to-end sketch follows this list.)
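
Putting the four interfaces together, a minimal Spark 2.4 read-path sketch in Java. Class names, the single-column schema, and the integer ranges are all illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DefaultSource implements DataSourceV2, ReadSupport {
    @Override
    public DataSourceReader createReader(DataSourceOptions options) {
        return new CustomReader();
    }
}

class CustomReader implements DataSourceReader {
    @Override
    public StructType readSchema() {
        return new StructType().add("column01", DataTypes.IntegerType);
    }

    @Override
    public List<InputPartition<InternalRow>> planInputPartitions() {
        // One InputPartition per slice of the data; Spark schedules one task per partition.
        return Arrays.asList(new CustomInputPartition(0, 5), new CustomInputPartition(5, 10));
    }
}

// Serializable description of one partition; sent to an executor, where it creates the reader.
class CustomInputPartition implements InputPartition<InternalRow> {
    private final int start;
    private final int end;

    CustomInputPartition(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public InputPartitionReader<InternalRow> createPartitionReader() {
        return new CustomInputPartitionReader(start, end);
    }
}

// Iterator-style reader on the executor: next() advances, get() returns the current row.
class CustomInputPartitionReader implements InputPartitionReader<InternalRow> {
    private final int end;
    private int current;

    CustomInputPartitionReader(int start, int end) {
        this.current = start - 1;
        this.end = end;
    }

    @Override
    public boolean next() throws IOException {
        current++;
        return current < end;
    }

    @Override
    public InternalRow get() {
        return new GenericInternalRow(new Object[]{ current });
    }

    @Override
    public void close() throws IOException {
        // release any connections or file handles here
    }
}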

Writing Data

Spark 3 Data Sources V2

In Spark 3, the Data Sources V2 API was revised AGAIN and should really be called the V3 API.
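
As a preview of how the entry points changed, here is a minimal sketch of the Spark 3 chain TableProvider -> Table -> ScanBuilder. Class names and the hard-coded schema are illustrative, and the Scan/Batch/PartitionReader plumbing (which works much like Spark 2.4) is omitted.

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsRead;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class DefaultSource implements TableProvider {
    @Override
    public StructType inferSchema(CaseInsensitiveStringMap options) {
        return new StructType().add("column01", DataTypes.IntegerType);
    }

    @Override
    public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
        return new CustomTable(schema);
    }
}

class CustomTable implements Table, SupportsRead {
    private final StructType schema;

    CustomTable(StructType schema) {
        this.schema = schema;
    }

    @Override
    public String name() {
        return "custom_table";
    }

    @Override
    public StructType schema() {
        return schema;
    }

    @Override
    public Set<TableCapability> capabilities() {
        // Declare what the table supports; BATCH_WRITE, streaming capabilities, etc. go here too.
        return Collections.singleton(TableCapability.BATCH_READ);
    }

    @Override
    public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
        // ScanBuilder.build() returns a Scan; Scan.toBatch() returns a Batch that plans
        // InputPartitions and supplies a PartitionReaderFactory. Pushdown is now expressed
        // as mix-ins on the ScanBuilder (SupportsPushDownFilters,
        // SupportsPushDownRequiredColumns). Omitted from this sketch.
        throw new UnsupportedOperationException("Scan construction omitted from this sketch");
    }
}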

Used for Reading or Writing

Reading Data

Writing Data

Other References

These blogs, videos, and repos have been extremely helpful in improving my understanding of the history of the Data Source API in Apache Spark.

Spark 3 DataSources V2 References

Spark 2 DataSources V2 References

Spark DataSources V1 References

Example Data Sources
