
[feature] DatasetT #704

Open

jpassaro opened this issue Apr 16, 2023 · 1 comment

jpassaro commented Apr 16, 2023

I'm reading about this library and think I'm going to use it in my next Spark project. I'm really motivated by the ability to reduce needless runtime errors that should be detectable at compile time, and equally by the desire for an ergonomic error channel for true runtime errors.

What I see gives me confidence that I can accomplish that using the cats integration with typed datasets. There's one thing that could make it much more ergonomic: one of the biggest places I tend to get runtime errors is at the read/write boundary, say when reading a table that doesn't exist, or reading or writing where the schema on disk is incompatible with the one I expect. I can obviously handle this with the existing TypedDataset API after wrapping the IO boundaries in Sync[F].delay, but it would be nice to wrap not only the manipulation of a Dataset but also its generation and subsequent manipulation in a type-safe DSL.
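For concreteness, here is roughly what I mean by wrapping the boundary today; a minimal sketch, assuming cats-effect's Sync and frameless's TypedDataset.createUnsafe (the helper name and path are just for illustration):

```scala
import cats.effect.Sync
import frameless.{TypedDataset, TypedEncoder}
import org.apache.spark.sql.SparkSession

// Suspend the read boundary in F so that a missing table or an
// incompatible on-disk schema surfaces as an error raised in F
// instead of an exception thrown mid-program.
def readParquet[F[_]: Sync, A: TypedEncoder](
    spark: SparkSession,
    path: String
): F[TypedDataset[A]] =
  Sync[F].delay {
    // both the read and the cast to the expected schema can fail at runtime
    TypedDataset.createUnsafe[A](spark.read.parquet(path))
  }
```

The ideas below are about letting that F[TypedDataset[A]] keep flowing through further typed operations without unwrapping at every step.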

To that end, two more-or-less isomorphic ideas come to mind (a sketch follows the list below). Both expect, at a minimum, evidence of Monad[F] (perhaps only FlatMap for the first one) and TypedEncoder[A]:

  1. additional syntax for F[TypedDataset[A]] that adds all the TypedDataset methods, each wrapped in F[_].

  2. an OptionT-like data class wrapping F[TypedDataset[A]]. Naming can be debated, but for the sake of presentation, call it DatasetT. It has a default constructor

```scala
def apply[F[_]: FlatMap: Ask[*[_], SparkSession], A: TypedEncoder](
    f: SparkSession => F[TypedDataset[A]]
)
```

and various syntactically or situationally preferable variations.
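To make the shape concrete, here is a hypothetical sketch of idea (2); DatasetT and transform are proposed names, not existing frameless API, it assumes cats-mtl for Ask, and the constructor is written with an explicit implicit parameter instead of the kind-projector context bound above:

```scala
import cats.{FlatMap, Functor}
import cats.mtl.Ask
import cats.syntax.all._
import frameless.{TypedDataset, TypedEncoder}
import org.apache.spark.sql.SparkSession

// Hypothetical OptionT-style wrapper around F[TypedDataset[A]].
final case class DatasetT[F[_], A](value: F[TypedDataset[A]]) {

  // Lift any pure TypedDataset transformation into the wrapper;
  // concrete forwarders (filter, select, join, ...) would be thin
  // wrappers over this.
  def transform[B](f: TypedDataset[A] => TypedDataset[B])(
      implicit F: Functor[F]): DatasetT[F, B] =
    DatasetT(value.map(f))
}

object DatasetT {
  // The default constructor from the proposal: fetch the SparkSession
  // from the environment via Ask, then run the effectful builder.
  def apply[F[_]: FlatMap, A: TypedEncoder](
      f: SparkSession => F[TypedDataset[A]])(
      implicit env: Ask[F, SparkSession]): DatasetT[F, A] =
    DatasetT(env.ask.flatMap(f))
}
```

A full version would presumably also want error-channel combinators (attempt, handleErrorWith via MonadError) so that read/write failures stay in F.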

Have either of these patterns been considered? Is there any reason they wouldn't make sense to adopt?

Assuming not, I'll try writing it in a new project, and -- assuming it proves itself -- will open a PR.

@pomadchin (Member) commented

Hey @jpassaro, I'd be happy to see what you come up with! Passing the SparkSession through the context is definitely a good idea and works nicely!

I'll just add that DataFrame / Dataset is itself a DSL and represents an execution description. Ideally, operations on it are not effectful until a reduce is invoked explicitly.
In reality, some operations may still be effectful (e.g. DataFrame metadata interactions), so making a nice API there can be challenging.
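For example (a sketch, assuming frameless's typed column syntax), chaining transformations only extends the plan, while running it is where the effect belongs:

```scala
import cats.effect.Sync
import frameless.TypedDataset

final case class Person(name: String, age: Int)

// Plan-building is pure description; only executing the plan
// (here via the underlying Dataset's count) touches the cluster,
// so that is the part suspended in F.
def countAdults[F[_]: Sync](people: TypedDataset[Person]): F[Long] = {
  val adults = people.filter(people('age) >= 18) // pure: extends the plan
  Sync[F].delay(adults.dataset.count())          // effectful: runs the plan
}
```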
