Creating Spark Data Sources with the Java API

Quickstart

gradle clean build

spark-shell \
--jars=./lib/build/libs/dataguidebook-1.0-SNAPSHOT.jar

Data Sources Version 1 API

This is the original way of defining a data source. It is still available in Spark 3.2.1.

import org.apache.spark.sql.SaveMode

val df = spark.read.format("com.dataguidebook.spark.datasource.v1").load("")

df.printSchema()
df.count()

df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").save("newpath")

// The insertInto command only works when you have a table defined USING your
// custom data source.

// spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int) USING com.dataguidebook.spark.datasource.v1 LOCATION 'custom/insertinto'")
df.write.mode(SaveMode.Append).format("com.dataguidebook.spark.datasource.v1").insertInto("myTable")

Working with Data Sources V1

The JavaDoc for Spark SQL Sources lists all the classes you can use in your custom data source to define its behavior. You need to provide at least two classes (a minimal Java sketch follows the list below):

  • DefaultSource which implements RelationProvider
  • YourCustomClass which extends BaseRelation and implements at least TableScan
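
As a concrete example, here is a minimal Java sketch of those two classes. The class name CustomDataRelation and the hard-coded three-integer-column schema are illustrative, not necessarily what this repository uses.

import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.RelationProvider;
import org.apache.spark.sql.sources.TableScan;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// spark.read.format("com.dataguidebook.spark.datasource.v1") resolves to the class
// com.dataguidebook.spark.datasource.v1.DefaultSource, so the name DefaultSource matters.
public class DefaultSource implements RelationProvider {
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters) {
        // parameters holds the options passed to the reader; hand them to the relation as needed.
        return new CustomDataRelation(sqlContext);
    }
}

// The relation describes the schema and, through TableScan, how to produce the rows.
class CustomDataRelation extends BaseRelation implements TableScan {
    private final SQLContext sqlContext;

    CustomDataRelation(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    @Override
    public SQLContext sqlContext() {
        return sqlContext;
    }

    @Override
    public StructType schema() {
        // Hard-coded for illustration; a real source would derive this from its metadata.
        return new StructType()
                .add("column01", DataTypes.IntegerType)
                .add("column02", DataTypes.IntegerType)
                .add("column03", DataTypes.IntegerType);
    }

    @Override
    public RDD<Row> buildScan() {
        // Produce every tuple of the data source as an RDD of Row objects.
        JavaSparkContext jsc = new JavaSparkContext(sqlContext.sparkContext());
        return jsc.parallelize(Arrays.asList(
                RowFactory.create(1, 2, 3),
                RowFactory.create(4, 5, 6)
        )).rdd();
    }
}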

Used for Reading or Writing

  • RelationProvider: Used on the DefaultSource class and defines how to initialize a custom data source without a user-defined schema.
    • Defines createRelation, which is invoked when you call spark.read.format(...).load() and receives the options you provide.
  • SchemaRelationProvider: Used on the DefaultSource class and defines how to initialize a custom data source with a user-defined schema.
    • Defines an overload of createRelation that is invoked when you call spark.read with an explicit schema.
  • DataSourceRegister: Data sources should implement this trait so that they can register an alias for their data source (a sketch combining all three follows this list).
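
A sketch of a DefaultSource that combines all three interfaces. The alias "dataguidebook" is illustrative, and the second createRelation assumes a CustomDataRelation constructor that stores the user-supplied schema. For Spark to resolve the alias, the class also has to be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.

import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.RelationProvider;
import org.apache.spark.sql.sources.SchemaRelationProvider;
import org.apache.spark.sql.types.StructType;

public class DefaultSource implements RelationProvider, SchemaRelationProvider, DataSourceRegister {

    // Enables spark.read.format("dataguidebook") in place of the full package name,
    // once the class is registered under META-INF/services.
    @Override
    public String shortName() {
        return "dataguidebook";
    }

    // Called when the user does NOT supply a schema: the relation infers or hard-codes one.
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters) {
        return new CustomDataRelation(sqlContext);
    }

    // Called when the user supplies a schema via spark.read.schema(...); assumes a second
    // CustomDataRelation constructor that stores the supplied schema.
    @Override
    public BaseRelation createRelation(SQLContext sqlContext,
                                       scala.collection.immutable.Map<String, String> parameters,
                                       StructType schema) {
        return new CustomDataRelation(sqlContext, schema);
    }
}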

Reading Data

  • TableScan: A BaseRelation that can produce all of its tuples as an RDD of Row objects.
  • PrunedScan: A BaseRelation that can eliminate unneeded columns before producing an RDD containing all of its tuples as Row objects.
  • PrunedFilteredScan: A BaseRelation that can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. (A sketch of this interface follows the list.)
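
A minimal sketch of the relation upgraded from TableScan to PrunedFilteredScan, so Spark can push column pruning and simple predicates down into the scan. The class name and the in-memory rows are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.PrunedFilteredScan;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

class PrunedCustomDataRelation extends BaseRelation implements PrunedFilteredScan {
    private final SQLContext sqlContext;

    PrunedCustomDataRelation(SQLContext sqlContext) {
        this.sqlContext = sqlContext;
    }

    @Override
    public SQLContext sqlContext() { return sqlContext; }

    @Override
    public StructType schema() {
        return new StructType()
                .add("column01", DataTypes.IntegerType)
                .add("column02", DataTypes.IntegerType)
                .add("column03", DataTypes.IntegerType);
    }

    @Override
    public RDD<Row> buildScan(String[] requiredColumns, Filter[] filters) {
        // requiredColumns: only these columns (in this order) need to appear in the rows.
        // filters: predicates Spark would like handled at the source (EqualTo, GreaterThan, ...);
        // any filter the source ignores is simply re-evaluated by Spark after the scan.
        List<Row> fullRows = Arrays.asList(
                RowFactory.create(1, 2, 3),
                RowFactory.create(4, 5, 6));

        StructType fullSchema = schema();
        List<Row> projected = new ArrayList<>();
        for (Row row : fullRows) {
            Object[] values = new Object[requiredColumns.length];
            for (int i = 0; i < requiredColumns.length; i++) {
                values[i] = row.get(fullSchema.fieldIndex(requiredColumns[i]));
            }
            projected.add(RowFactory.create(values));
        }
        return new JavaSparkContext(sqlContext.sparkContext()).parallelize(projected).rdd();
    }
}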

Writing Data

  • CreatableRelationProvider: Used on the DefaultSource class to define data writing behavior (a sketch of both write interfaces follows this list).
    • Requires you to implement createRelation with an additional SaveMode parameter.
    • You define all the business logic to overwrite, append, etc. a dataframe to your custom data source.
    • Use df.foreachPartition to execute your business logic on each partition when writing.
  • InsertableRelation: Used on the custom data source's class (e.g. CustomDataRelation) that inherits from BaseRelation, to insert into a data source backed by the Hive metastore.
    • Requires you to implement insert, which takes in a dataframe; you apply the business logic to store it (e.g. write it to your proprietary format) or send the dataframe to a different datastore.
    • This only works if you have defined a table in your Hive metastore with the USING keyword specifying your custom data source, for example:
    spark.sql("CREATE TABLE myTable(column01 int, column02 int, column03 int) USING com.dataguidebook.spark.datasource.v1 LOCATION 'custom/insertinto'")

Spark 2.4 Data Sources V2

In Spark 2.3, the Data Sources V2 API (JavaDoc) was released in beta (Spark JIRA) but was not marked as stable until 2.4. So, we'll only talk about the Spark 2.4.7 Data Sources V2 API (JavaDoc).

Used for Reading or Writing

  • DataSourceV2 is a "marker interface": it tags the class as a V2 data source but doesn't define any behavior on its own.
  • InternalRow (org.apache.spark.sql.catalyst) is the internal row representation used inside Apache Spark; V2 readers produce it instead of Row.
    • TODO: How do you make an InternalRow? (One approach is sketched below.)
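
One way to construct an InternalRow from Java (the class name InternalRowExample is just for illustration): GenericInternalRow wraps an Object[] whose values must already be in Spark's internal representation, for example UTF8String instead of java.lang.String.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

public class InternalRowExample {
    public static InternalRow makeRow() {
        // Columns: (int, int, string) -- note the UTF8String conversion for the string column.
        return new GenericInternalRow(new Object[]{
                1,
                2,
                UTF8String.fromString("three")
        });
    }
}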

Reading Data

  • ReadSupport: Requires you to implement the createReader method, which returns a DataSourceReader object.
  • DataSourceReader: Requires you to implement readSchema (used when no schema is provided) and planInputPartitions, which returns the set of partitions to read. Each partition creates its own reader to handle its slice of the data.
  • InputPartition: A serializable description of a single partition of the data; it is sent to an executor, where its createPartitionReader method builds the reader for that partition.
  • InputPartitionReader: An iterator-style reader (next / get / close) that runs on the executor and produces the InternalRows for one partition. (A minimal end-to-end sketch follows this list.)
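
Putting the four interfaces together, a minimal Spark 2.4 read-path sketch in Java. Class names, the single-column schema, and the integer ranges are all illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DefaultSource implements DataSourceV2, ReadSupport {
    @Override
    public DataSourceReader createReader(DataSourceOptions options) {
        return new CustomReader();
    }
}

class CustomReader implements DataSourceReader {
    @Override
    public StructType readSchema() {
        return new StructType().add("column01", DataTypes.IntegerType);
    }

    @Override
    public List<InputPartition<InternalRow>> planInputPartitions() {
        // One InputPartition per slice of the data; Spark schedules one task per partition.
        return Arrays.asList(new CustomInputPartition(0, 5), new CustomInputPartition(5, 10));
    }
}

// Serializable description of one partition; sent to an executor, where it creates the reader.
class CustomInputPartition implements InputPartition<InternalRow> {
    private final int start;
    private final int end;

    CustomInputPartition(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public InputPartitionReader<InternalRow> createPartitionReader() {
        return new CustomInputPartitionReader(start, end);
    }
}

// Iterator-style reader on the executor: next() advances, get() returns the current row.
class CustomInputPartitionReader implements InputPartitionReader<InternalRow> {
    private final int end;
    private int current;

    CustomInputPartitionReader(int start, int end) {
        this.current = start - 1;
        this.end = end;
    }

    @Override
    public boolean next() throws IOException {
        current++;
        return current < end;
    }

    @Override
    public InternalRow get() {
        return new GenericInternalRow(new Object[]{ current });
    }

    @Override
    public void close() throws IOException {
        // release any connections or file handles here
    }
}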

Writing Data

Spark 3 Data Sources V2

In Spark 3, the Data Sources V2 API was revised AGAIN and should really be called the V3 API.
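
As a preview of how the entry points changed, here is a minimal sketch of the Spark 3 chain TableProvider -> Table -> ScanBuilder. Class names and the hard-coded schema are illustrative, and the Scan/Batch/PartitionReader plumbing (which works much like Spark 2.4) is omitted.

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsRead;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class DefaultSource implements TableProvider {
    @Override
    public StructType inferSchema(CaseInsensitiveStringMap options) {
        return new StructType().add("column01", DataTypes.IntegerType);
    }

    @Override
    public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
        return new CustomTable(schema);
    }
}

class CustomTable implements Table, SupportsRead {
    private final StructType schema;

    CustomTable(StructType schema) {
        this.schema = schema;
    }

    @Override
    public String name() {
        return "custom_table";
    }

    @Override
    public StructType schema() {
        return schema;
    }

    @Override
    public Set<TableCapability> capabilities() {
        // Declare what the table supports; BATCH_WRITE, streaming capabilities, etc. go here too.
        return Collections.singleton(TableCapability.BATCH_READ);
    }

    @Override
    public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
        // ScanBuilder.build() returns a Scan; Scan.toBatch() returns a Batch that plans
        // InputPartitions and supplies a PartitionReaderFactory. Pushdown is now expressed
        // as mix-ins on the ScanBuilder (SupportsPushDownFilters,
        // SupportsPushDownRequiredColumns). Omitted from this sketch.
        throw new UnsupportedOperationException("Scan construction omitted from this sketch");
    }
}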

Used for Reading or Writing

Reading Data

Writing Data

Other References

These blogs, videos, and repos have been extremely helpful in improving my understanding of the history of the Data Source API in Apache Spark.

Spark 3 DataSources V2 References

Spark 2 DataSources V2 References

Spark DataSources V1 References

Example Data Sources
