Use Avro as schema in TypedDataSet #282

Open · SemanticBeeng opened this issue Apr 14, 2018 · 6 comments

SemanticBeeng commented Apr 14, 2018

Would it make sense to introduce support for using an Avro schema with TypedDataset?

The current code defines the schema based on the Spark SQL "language":

def schema: StructType =
  dataset.schema

On the other hand, frameless uses a schema based on Scala types to define datasets.

Using something like avro4s, the Avro schema can be derived from those types.
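
For illustration, a minimal sketch of that derivation with avro4s (the Stock case class is hypothetical here, mirroring the Parquet example further down):

import com.sksamuel.avro4s.AvroSchema
import org.apache.avro.Schema

// Hypothetical domain type, mirroring the Stock record in the Parquet example below.
case class Stock(
  symbol: String, date: String,
  open: Double, high: Double, low: Double, close: Double,
  volume: Int, adjClose: Double)

// avro4s derives the Avro record schema from the case class at compile time.
val stockSchema: Schema = AvroSchema[Stock]
println(stockSchema.toString(true)) // pretty-printed Avro record schema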

It is quite useful to be able to use Avro as the schema in Parquet files, for example: https://dzone.com/articles/understanding-how-parquet

$ export HADOOP_CLASSPATH=parquet-avro-1.4.3.jar:parquet-column-1.4.3.jar:parquet-common-1.4.3.jar:parquet-encoding-1.4.3.jar:parquet-format-2.0.0.jar:parquet-generator-1.4.3.jar:parquet-hadoop-1.4.3.jar:parquet-hive-bundle-1.4.3.jar:parquet-jackson-1.4.3.jar:parquet-tools-1.4.3.jar

$ hadoop parquet.tools.Main meta stocks.parquet
creator:     parquet-mr (build 3f25ad97f209e7653e9f816508252f850abd635f)
extra:       avro.schema = {"type":"record","name":"Stock","namespace" [more]...

file schema: hip.ch5.avro.gen.Stock
--------------------------------------------------------------------------------
symbol:      REQUIRED BINARY O:UTF8 R:0 D:0
date:        REQUIRED BINARY O:UTF8 R:0 D:0
open:        REQUIRED DOUBLE R:0 D:0
high:        REQUIRED DOUBLE R:0 D:0
low:         REQUIRED DOUBLE R:0 D:0
close:       REQUIRED DOUBLE R:0 D:0
volume:      REQUIRED INT32 R:0 D:0
adjClose:    REQUIRED DOUBLE R:0 D:0

See also "Write Avro records to a Parquet file.":
https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L34
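
A minimal sketch of that, assuming a recent parquet-avro (the fluent builder API; the 1.4.x versions listed above used constructors instead) and reusing the hypothetical stockSchema derived earlier:

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

// The Avro schema passed here ends up in the Parquet footer metadata (avro.schema above).
val writer: ParquetWriter[GenericRecord] =
  AvroParquetWriter.builder[GenericRecord](new Path("stocks.parquet"))
    .withSchema(stockSchema)
    .build()

val record: GenericRecord = new GenericData.Record(stockSchema)
record.put("symbol", "GOOG")
record.put("date", "2018-04-14")
record.put("open", 1.0); record.put("high", 2.0); record.put("low", 0.5)
record.put("close", 1.5); record.put("volume", 100); record.put("adjClose", 1.5)

writer.write(record)
writer.close()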

In spark-bigquery there is already a schema converter that could be used to map to and from the Spark SQL based schema.

See "convert between sparkSQL schemas to avro data schema"
https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131
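
For illustration, a rough sketch of what using it could look like, assuming the method names from the com.databricks.spark.avro SchemaConverters that this file descends from (treat the exact signatures as an assumption):

import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.SchemaBuilder
import org.apache.spark.sql.types.StructType

// Avro -> Spark SQL: the StructType corresponding to the Avro record schema.
val sparkSchema: StructType =
  SchemaConverters.toSqlType(stockSchema).dataType.asInstanceOf[StructType]

// Spark SQL -> Avro: rebuild an Avro record schema from the StructType.
val avroSchemaAgain = SchemaConverters.convertStructToAvro(
  sparkSchema,
  SchemaBuilder.record("Stock").namespace("hip.ch5.avro.gen"),
  "hip.ch5.avro.gen")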

Somewhat related to #280.


imarios commented Apr 15, 2018

Everything around shapeless and implicit derivation revolves around case classes, so we can't really take those out of the picture. The right way to approach this is to keep case classes as the main way to define a schema, but to have different ways of producing those case classes.

There are ways (typically code generators) that take a schema defined externally (say, avro or protocol buffers) and convert it to a case class.

The flow would look something like this:

.avro --> codeGen --> case class --> TypedDataset schema.
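
A minimal sketch of the last two steps, assuming the Stock case class from earlier is the (generated or hand-written) output of that code generation:

import frameless.TypedDataset
import org.apache.spark.sql.SparkSession

val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("avro-to-typed-dataset").getOrCreate()
implicit val sqlContext = spark.sqlContext // frameless needs an implicit context in scope

// The case class is the schema: frameless derives the TypedEncoder from it.
val stocks: TypedDataset[Stock] =
  TypedDataset.create(Seq(Stock("GOOG", "2018-04-14", 1.0, 2.0, 0.5, 1.5, 100, 1.5)))

stocks.dataset.printSchema() // the Spark SQL StructType it maps to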

@SemanticBeeng Is this what you had in mind?


SemanticBeeng commented Apr 15, 2018

"shapeless and implicit (type class) derivation"

Yes, this is central, of course.

"right way to approach this is to keep the case classes as the main way to define a schema"

Yes, indeed, that is the main design principle.
This already seems to be at the basis of frameless, although I have not seen it stated explicitly.
And the direct dependency on the Spark schema seems to call it into question, so it is worth confirming.

"code generators"

Yes, avro4s does that very nicely.

But, on top of this, I am suggesting:

  1. Putting the Avro schema at a somewhat higher level than the Spark schema in the design of frameless.
  2. Providing first-class support for writing the Avro schema into Parquet files; please see again the point above about it being "useful to be able to use Avro as the schema in Parquet files".

We know Avro is an accepted way to define microservice APIs (schemas) and that it handles schema evolution well.

Sometimes the integration is at the data level (not the API level), but the same principles, techniques and benefits apply if we use Avro for data-level integration.

Think of an analytics platform interfacing with a data lake: would it not be much better to depend on an Avro-based data schema than on a Spark-specific one?

There will be components in the analytics area that do not "know" Spark but still need to validate, against the data itself, the schema they depend on.
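
As an illustration of that, a rough sketch of a Spark-free check, assuming only parquet-hadoop and Avro on the classpath (the metadata key is "avro.schema" in the parquet-tools output above; newer parquet-avro versions write "parquet.avro.schema", so the key name is an assumption):

import org.apache.avro.Schema
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val reader = ParquetFileReader.open(
  HadoopInputFile.fromPath(new Path("stocks.parquet"), new Configuration()))

// The Avro schema travels with the data in the footer key-value metadata.
val meta = reader.getFooter.getFileMetaData.getKeyValueMetaData
val avroJson = Option(meta.get("avro.schema")).orElse(Option(meta.get("parquet.avro.schema")))

// A component that does not "know" Spark can still parse and validate the schema.
avroJson.foreach(json => println(new Schema.Parser().parse(json)))
reader.close()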

Thoughts?


imarios commented May 4, 2018

@SemanticBeeng sorry for taking so long to reply. I am on board with making the proper changes, if they need to happen, so that the code generated by avro4s is compatible with frameless schemas (which for now are just plain case classes). I think there was a similar ask from @codeexplorer regarding ScalaPB.


SemanticBeeng commented May 12, 2018

The gist of my suggestion was about design and less about implementation.

"Putting the avro schema at a bit higher level than spark schema in the design of frameless."

At this time frameless is bound to the Spark schema, but it does not have to be, it seems to me.
It could use a data schema expressed at a higher level, closer to the domain, given that there are conversions between the higher-level schemas (Avro, protobuf) and the Spark one.

Did you get a chance to see this? "convert between sparkSQL schemas to avro data schema"
https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131


SemanticBeeng commented May 12, 2018

From an implementation point of view alone, avro4s and ScalaPB are quite different: ScalaPB generates very complex classes, mostly because the inner workings of protobuf require it.

Because of that, a few people consider it a poor way to define the type system behind one's domain models.

Please review

  1. protoless https://github.com/julien-lafont/protoless

"ScalaPB, a protocol buffers compiler for scala, was the only serious library to work with protobuf in Scala, but it comes with: "
"Two step code generation (protobuf -> java, java -> scala)"
"And if you want to map your own model, you need a third wrapping level."
"Heavy builder interface"
"Custom lenses library"

  2. strictify https://github.com/cakesolutions/strictify

"Because Protbuf is not your typesystem"


SemanticBeeng commented Jul 6, 2019

Been thinking about this more.

Any interest in making frameless about more than Spark only, i.e. cross-framework?

Petastorm "enables single / distributed training and evaluation of deep learning models directly from Apache Parquet format".
Its Unischema unifies the data schema across frameworks: Apache Spark StructType, TensorFlow tf.DType, NumPy numpy.dtype.

https://twitter.com/semanticbeeng/status/1142400720324431873
Source https://eng.uber.com/petastorm/, https://github.com/uber/petastorm

Or somehow make the frameless typed schema work with Unischema.

This would advance the "beyond data pipelines" agenda, moving away from current "stringly typed" technologies like Airflow and back towards functional programming, so that this kind of thinking would become mainstream.

At the very least, it would bring the data schema under strongly typed control.

Related : https://twitter.com/semanticbeeng/status/1139789288856571904
