GitHub - pauldeschacht/SparkCustomWindowOperator: Example of custom Spark SQL Window operator

CustomSparkWindowOperator

This is an example of how to implement a custom SQL operator over a Spark SQL window, similar to built-in operators such as rank and ntile.

This implementation detects changes in a number of columns with respect to the previous row, within a window group. If a change is detected in any of the columns, the aggregate column hasChanged is set to true; otherwise it is set to false.

The custom SQL operator changed takes a number of column names over which a change needs to be detected. The result will be stored in the column hasChanged.

 val rowsWithChanged = df.withColumn("hasChanged", changed("status", "title").over(Window.partitionBy("id").orderBy("date")))
 val changedRows = rowsWithChanged.filter("hasChanged == true")
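
To make the intended result concrete, the following is a minimal plain-Scala sketch of the semantics described above, without any Spark dependency. The `Event` case class, the `ChangedModel` object and the sample field types are hypothetical and only mirror the column names used in the example. It also assumes that the first row of each group, which has no previous row, is reported as unchanged; the README does not state what the operator returns in that case.

```scala
// Plain Scala model of the "changed" semantics; this is not the Spark operator itself.
// Event and its fields are hypothetical and mirror the columns used in the example.
final case class Event(id: Int, date: String, status: String, title: String)

object ChangedModel {
  // Pairs every event with its hasChanged flag.
  // Assumption: the first row of a group has no previous row and counts as unchanged.
  def withHasChanged(events: Seq[Event]): Seq[(Event, Boolean)] =
    events
      .groupBy(_.id)                        // corresponds to Window.partitionBy("id")
      .toSeq
      .sortBy(_._1)                         // deterministic group order for the output
      .flatMap { case (_, group) =>
        val ordered = group.sortBy(_.date)  // corresponds to orderBy("date")
        // Compare each row with its predecessor on the tracked columns.
        val flags = false +: ordered.zip(ordered.drop(1)).map { case (prev, cur) =>
          prev.status != cur.status || prev.title != cur.title
        }
        ordered.zip(flags)
      }
}
```

For a group whose rows, ordered by date, have the statuses open, open, closed and an identical title, this model flags only the third row.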

Details

The implementation of the example is inspired by the rank operator. The important difference is that the rank function aggregates over the orderBy expressions, which are already resolved. During the initial calls (plan analysis, code generation phase), ChangedOverPreviousRow contains unresolved children (their data types are not yet known), so the implementation checks whether each child expression is resolved.

The case class ChangesOverPreviousRow extends DeclarativeAggregate, which implements the AggregateFunction contract.
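
A DeclarativeAggregate describes an aggregate through an aggregation buffer: initial buffer values, an update applied for each row, and an expression that evaluates the result from the buffer. The following plain-Scala sketch models that shape for change detection. It is not Spark's API and not the repository's code: `BufferModel`, `Buffer`, `initialBuffer`, `update`, `evaluate` and `run` are hypothetical stand-ins for the roles played by initialValues, updateExpressions and evaluateExpression, and the tracked columns are assumed to be strings.

```scala
// Plain Scala model of a buffer-based aggregate evaluated row by row over one
// ordered window partition. All names here are hypothetical, not Spark's API.
object BufferModel {
  // Buffer state: the tracked column values of the previous row (None before
  // the first row) and the flag computed for the current row.
  final case class Buffer(previous: Option[Seq[String]], changed: Boolean)

  // Role of initialValues: nothing seen yet, nothing changed.
  val initialBuffer: Buffer = Buffer(None, changed = false)

  // Role of updateExpressions: compare the current values with the buffered
  // ones, then remember the current values for the next row.
  def update(buffer: Buffer, current: Seq[String]): Buffer =
    Buffer(Some(current), buffer.previous.exists(_ != current))

  // Role of evaluateExpression: read the result out of the buffer.
  def evaluate(buffer: Buffer): Boolean = buffer.changed

  // Feed the ordered rows of one partition through the buffer, one flag per row.
  def run(rows: Seq[Seq[String]]): Seq[Boolean] =
    rows.scanLeft(initialBuffer)(update).drop(1).map(evaluate)
}
```

Keeping the previous row's values in the buffer is what lets each row be compared with its predecessor in a single pass over the ordered partition.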
