
Easier withColumn method #453

Open
MrPowers opened this issue Oct 26, 2020 · 3 comments
Comments

@MrPowers

Great work on this lib! It's a great way to write Spark code!

As discussed here and in the docs, withColumn requires a full schema when a column is added.

Here's the example in the docs:

case class CityBedsOther(city: String, bedrooms: Int, other: List[String])

cityBeds.
   withColumn[CityBedsOther](lit(List("a","b","c"))).
   show(1).run()
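For context, the starting `cityBeds` dataset in that docs example has a schema along these lines (the exact fields of `CityBeds` are an assumption matching the docs' naming):

```scala
// Assumed starting schema (per the frameless docs example).
// Note that CityBedsOther must repeat every existing field of
// CityBeds and then add the new `other` field on top.
case class CityBeds(city: String, bedrooms: Int)
```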

Couldn't we just assume that the schema stays the same for the existing columns and only supply the schema for the column that's being added?

cityBeds.
   withColumn[List[String]](lit(List("a","b","c"))).
   show(1).run()

I think this'd be a lot more user-friendly. I often deal with schemas that have tons of columns and add lots of columns with withColumn. Let me know your thoughts!

@imarios
Contributor

imarios commented Dec 9, 2020

Hey @MrPowers, sorry I missed this comment. I hear what you are saying. I definitely see how this would be easier, but unfortunately, it is nearly impossible to do. When you add a new column to CityBeds, you are essentially defining a new class, which does not exist unless you have already defined it. That's why you have to define a new class and pass it as a type parameter to withColumn.

@imarios
Contributor

imarios commented Dec 11, 2020

@MrPowers I think I have a better answer for you. Say you have TypedDataset[X] where case class X(i: Int, j: Int) and you want to add an extra column that adds i with j.

val x: TypedDataset[X] = ???
val xNew: TypedDataset[(X, Int)] = x.select(x.asCol, x('i)+x('j))

As you can see, your schema went from X to Tuple2[X, Int]. This way, you define a new column without losing the structure of X.
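As a plain-Scala analogy (no Spark required), the pairing trick above amounts to this:

```scala
case class X(i: Int, j: Int)

val x = X(1, 2)
// The pair plays the role of the widened schema (X, Int):
// _1 keeps the whole original record, _2 holds the new column's value.
val widened: (X, Int) = (x, x.i + x.j)
```

No new case class is needed because Tuple2 already exists for any pair of types.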

@MrPowers
Author

MrPowers commented Feb 6, 2021

@imarios - Thanks for the detailed responses.

I started brainstorming the idea of typed columns, see here, and think this idea might be useful for frameless as well.

Columns are untyped, and that's a big reason why Spark is so type-unsafe. Typed columns can help us catch a lot more errors at compile time. When we run df("some_date"), it returns an untyped column, but it should really return a DateColumn.
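A minimal sketch of the typed-column idea (the names `TCol` and `addMonths` are hypothetical, not part of frameless or Spark): the column's value type becomes a phantom type parameter, so type-mismatched operations fail at compile time.

```scala
// Hypothetical: a column that carries its value type as a phantom parameter.
final case class TCol[A](name: String)

// A date-only operation: accepts TCol[java.sql.Date] and nothing else.
def addMonths(c: TCol[java.sql.Date], n: Int): TCol[java.sql.Date] =
  TCol(s"add_months(${c.name}, $n)")

val d = TCol[java.sql.Date]("some_date")
addMonths(d, 1)                  // compiles
// addMonths(TCol[Int]("i"), 1)  // would not compile: Int is not Date
```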

We could add a withTypedColumn function that'd take two arguments: a string and a typed column. We could infer the case class for the resulting Dataset (from the starting case class and the typed column) and build the new case class under the hood, without making the user specify it manually. That would make this a lot more usable. Let me know your thoughts!
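To make the proposal concrete, here is a toy sketch (none of these names exist in frameless; this is only an illustration of the inference idea): the result row type is derived from the existing row type and the typed column's value type, much like the (X, Int) trick above, so the user never writes the widened class.

```scala
// Hypothetical: a column that carries its value type.
final case class TypedCol[A](name: String)

// A toy "dataset" whose row type T is tracked statically.
final case class ToyDataset[T](rows: List[T]) {
  // The result schema (T, A) is inferred from the existing row type
  // and the typed column — no user-defined case class needed.
  def withTypedColumn[A](col: TypedCol[A])(value: T => A): ToyDataset[(T, A)] =
    ToyDataset(rows.map(r => (r, value(r))))
}

case class X(i: Int, j: Int)

val ds  = ToyDataset(List(X(1, 2), X(3, 4)))
val out = ds.withTypedColumn(TypedCol[Int]("sum"))(r => r.i + r.j)
```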
