Spark 4.0 / DBR 14.2+ - bleeding edge changes #787

chris-twiner · 2024-01-26T12:29:18Z

Per my comment on #755, DBR 14.2, Spark 4.0 and likely all later versions include the SPARK-44913 StaticInvoke changes.

Whilst this hasn't yet been backported to the 3.5 branch, it could well end up there.

I'm happy to fork and publish a bleeding edge / non-standard frameless if needed, but I also wonder if a compat layer as a separate swappable jar is the best route, similar to #300 for example.

What is the collective preferred route to fixing / working around this?

pomadchin · 2024-02-08T03:47:37Z

Hmm, swappable jar?

I'm super open to adding any necessary compat layers; shoot a PR and we'll get it in if you have any nice ideas!

chris-twiner · 2024-02-22T19:25:15Z

~~13.3 LTS backported the 3.5 isNativeType change as well so that's reflected in the title.~~ (I was mistaken, 0.16 spark33 builds work fine)

fyi - I've created a shim to handle the abstraction ~~barring isNative~~ and the approach seems workable. I'll start on the frameless shims in the next few days; there's lots to do there.

chris-twiner · 2024-02-27T14:06:01Z

fyi - With the 1st shim snapshot push, compiling against 3.5.oss works when running against 14.3.dbr; only StaticInvoke needed doing. So the same frameless jar can be run against both 14.0/14.1 and 14.2/14.3 by swapping the shim to the right dbr version, or users can simply stay with the oss version as per a normal dependency.
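A minimal sketch of the swappable-shim idea, to make the mechanics concrete. Every name here (`ExpressionShim`, `OssShim`, `DbrShim`, `Shims`) is hypothetical and not taken from frameless or the shim library, and the string returned by `staticInvoke` only stands in for the runtime-specific expression construction. In the approach described above the variant is chosen by which shim jar is on the classpath; the lookup by runtime name below is just a stand-in for that so the example runs without Spark.

```scala
// Hypothetical sketch only: these names are illustrative, not the shim library's API.

// The stable interface the main jar would compile against.
trait ExpressionShim {
  def runtime: String
  // Stand-in for building a runtime-specific StaticInvoke expression.
  def staticInvoke(className: String, functionName: String): String
}

// Variant for runtimes with the unchanged StaticInvoke (assumed: OSS 3.5).
object OssShim extends ExpressionShim {
  val runtime = "3.5.oss"
  def staticInvoke(className: String, functionName: String): String =
    s"[$runtime] StaticInvoke($className.$functionName)"
}

// Variant for runtimes carrying the StaticInvoke change (assumed: DBR 14.3).
object DbrShim extends ExpressionShim {
  val runtime = "14.3.dbr"
  def staticInvoke(className: String, functionName: String): String =
    s"[$runtime] StaticInvoke($className.$functionName)"
}

// Stand-in for classpath-based selection: look the variant up by runtime name.
object Shims {
  private val all: Map[String, ExpressionShim] =
    Seq[ExpressionShim](OssShim, DbrShim).map(s => s.runtime -> s).toMap

  def forRuntime(runtime: String): Option[ExpressionShim] = all.get(runtime)
}
```

Calling code depends only on `ExpressionShim`, so the same compiled caller works with either variant, which is the property that lets one frameless jar run on several runtimes.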

The code for StaticInvoke handling, shims etc. is on the branch here, with the diff here.

I'll target the major pain points impacted in each OSS major/minor release next (i.e. TypedEncoder, TypedExpressionEncoder and RecordEncoder) to have each internal api usage pulled out (e.g. [Un]WrapOption, Invoke, NewInstance, GetStructField, ifisnull, GetColumnByOrdinal, MapObjects and probably TypedExpressionEncoder itself). It's probably worth doing them in advance of any pull request.

What I'll attempt with this is to see how much of the encoding logic can be re-used from the current frameless codebase and targeted major versions on older dbrs (e.g. can we get a 3.5 oss frameless jar running on a 3.1.2 Databricks runtime).

If you'd like me to add FramelessInternals.objectTypeFor, ScalaReflection.dataTypeFor etc. as well I think that'd make sense but Reflection had been fairly stable code before they ripped it out :)

chris-twiner · 2024-02-28T19:31:20Z

@pomadchin -
At the time of writing, the current 0.16 based fork branch (rev 7944fe9 is pre-reformatting) builds against the correct 3.5 shim_runtime version. Testing the encoding functionality (used by Quality tests built against 0.16 frameless with a 3.1.3 oss base) with the shim_runtime for 9.1.dbr also works, despite the very different impl.

I'd not want to advertise that it's possible to jump versions so much (there are other issues like kmeans and join interface changes of course) but it proves the approach works at least and may ease 4.x support.

The pre-reformatting functional change diff is here. The key mima change is the removal of frameless.MapGroups; it could of course be kept as a simple forwarder if needed.

chris-twiner · 2024-03-27T11:08:12Z

per b880261, #803 and #804 are confirmed as working on all LTS versions of Databricks, Spark 4 and the latest 15.0 runtime - test combinations are documented here

chris-twiner · 2024-04-09T10:33:09Z

A number of test issues appear when running on a cluster; these do not appear on a single-node server (e.g. GitHub runners, a dev box or even Databricks Community Edition). They affect:

- all double generated values used in tests
- the OrderByTest "derives a CatalystOrdered for case classes when all fields are comparable"

doubles lose precision on serialisation, e.g.:

stddev_samp *** FAILED *** (19 seconds, 196 milliseconds)
  GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
   (AggregateFunctionsTests.scala:591)
    Falsified after 5 successful property evaluations.
    Location: (AggregateFunctionsTests.scala:591)
    Occurred when passed generated values (
      arg0 = List("X2(1,-2147483648)", "X2(1,654883454)", "X2(-1,-2147483648)", "X2(1,0)") // 4 shrinks
    )
    Label of failing property:
      Expected Map(1 -> Some(1.4659365454162877E9), -1 -> None) but got Map(1 -> Some(1.4659365454162874E9), -1 -> None)

The very last digit didn't match; as such, all double gens have to be serializable. The same occurs for BigDecimals in other tests (like AggregateFunctionsTest first/last), but this is likely due to the package arbitraries not being correct in the testless shade (they are correct when used via TestlessSingle in the IDE).
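One way to make such a comparison robust to a last-digit difference is a relative tolerance, which is the direction the later "tolerances" commits on this issue point at. The sketch below is only an illustration under assumptions: `ApproxCompare`, `approxEqual`, `approxEqualMaps` and the `1e-12` default are hypothetical and are not the helpers used in the frameless tests.

```scala
// Hypothetical helper, not the frameless test code: compares doubles with a relative tolerance.
object ApproxCompare {

  // The 1e-12 default is an arbitrary illustrative choice.
  def approxEqual(a: Double, b: Double, relTol: Double = 1e-12): Boolean =
    if (a == b) true // exact match, including equal infinities
    else if (a.isNaN || b.isNaN || a.isInfinite || b.isInfinite) false
    else math.abs(a - b) <= relTol * math.max(math.abs(a), math.abs(b))

  // Compares maps shaped like the stddev_samp results above (key -> optional double).
  def approxEqualMaps[K](
      expected: Map[K, Option[Double]],
      actual: Map[K, Option[Double]],
      relTol: Double = 1e-12
  ): Boolean =
    expected.keySet == actual.keySet && expected.forall { case (k, e) =>
      (e, actual(k)) match {
        case (Some(x), Some(y)) => approxEqual(x, y, relTol)
        case (None, None)       => true
        case _                  => false
      }
    }
}
```

With the two maps from the failure above, `approxEqualMaps` accepts the pair even though plain `==` on the maps does not.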

for the order by:

import frameless.{X2, X3}
import spark.implicits._
val v = Vector(X3(-1,false,X2(586394193,6313416569807298536L)), X3(2147483647,false,X2(1,-1L)), X3(729528245,false,X2(1,-1L)))
v.toDS.orderBy("c").collect().toVector

the error that can occur is:

derives a CatalystOrdered for case classes when all fields are comparable *** FAILED *** (11 seconds, 784 milliseconds)
  GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
   (OrderByTests.scala:177)
    Falsified after 5 successful property evaluations.
    Location: (OrderByTests.scala:177)
    Occurred when passed generated values (
      arg0 = Vector(X3(-1,false,X2(586394193,6313416569807298536)), X3(2147483647,false,X2(1,-1)), X3(729528245,false,X2(1,-1))) // 2 shrinks
    )
    Label of failing property:
      Expected Vector(X3(729528245,false,X2(1,-1)), X3(2147483647,false,X2(1,-1)), X3(-1,false,X2(586394193,6313416569807298536))) but got Vector(X3(2147483647,false,X2(1,-1)), X3(729528245,false,X2(1,-1)), X3(-1,false,X2(586394193,6313416569807298536)))
testless.org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException:

i.e. the two rows whose c is X2(1,-1) can appear in either order and both are acceptable results. The test needs to be rewritten to account for this and compare only the c values.
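A rough sketch of what "compare only the c values" could look like, written without Spark so it runs standalone. The `X2` and `X3` case classes below are local stand-ins assumed to mirror `frameless.X2` and `frameless.X3`, and `OrderByCheck` with its methods is hypothetical rather than the actual test change.

```scala
// Local stand-ins (assumed shape) for frameless.X2 / frameless.X3.
final case class X2[A, B](a: A, b: B)
final case class X3[A, B, C](a: A, b: B, c: C)

// Hypothetical check: rows that tie on the sort key may come back in any order.
object OrderByCheck {
  type Row = X3[Int, Boolean, X2[Int, Long]]

  // The sequence of sort keys must match exactly.
  def sameKeyOrder(expected: Vector[Row], actual: Vector[Row]): Boolean =
    expected.map(_.c) == actual.map(_.c)

  // Both sides must still contain the same rows, ignoring order.
  def sameRows(expected: Vector[Row], actual: Vector[Row]): Boolean = {
    def counts(v: Vector[Row]): Map[Row, Int] =
      v.groupBy(identity).map { case (row, rows) => row -> rows.size }
    counts(expected) == counts(actual)
  }

  def acceptable(expected: Vector[Row], actual: Vector[Row]): Boolean =
    sameKeyOrder(expected, actual) && sameRows(expected, actual)
}
```

Checking the multiset of rows as well as the key order keeps the test from passing when a row is dropped or altered but the keys happen to line up.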

pomadchin added the enhancement label Feb 8, 2024

pomadchin added the feature label Feb 8, 2024

chris-twiner mentioned this issue Feb 21, 2024

DBR 14.3 support sparkutils/quality#57

Open

chris-twiner changed the title ~~Spark 4.0 / DBR 14.2+ - bleeding edge changes~~ DBR 13.3 LTS - Spark 4.0 / DBR 14.2+ - bleeding edge changes Feb 22, 2024

chris-twiner changed the title ~~DBR 13.3 LTS - Spark 4.0 / DBR 14.2+ - bleeding edge changes~~ Spark 4.0 / DBR 14.2+ - bleeding edge changes Feb 23, 2024

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 26, 2024

typelevel#787 - base required for shim and 14.3.dbr

cb259fa

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - [Un]WrapOption, Invoke, NewInstance, GetStructField, …

b8d4f05

…ifisnull, GetColumnByOrdinal, MapObjects and TypedExpressionEncoder shimmed

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - [Un]WrapOption, Invoke, NewInstance, GetStructField, …

7944fe9

…ifisnull, GetColumnByOrdinal, MapObjects and TypedExpressionEncoder shimmed - attempt build

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - forced reformatting

71bb38c

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - forced reformatting

c843c6a

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - forced reformatting

9a0c55b

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Feb 28, 2024

typelevel#787 - mima MapGroups removal

0616953

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 1, 2024

typelevel#787 - Spark 4 starter pack

a70d5c3

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 1, 2024

typelevel#787 - Spark 4 starter pack

1ef1d9b

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 1, 2024

typelevel#787 - Spark 4 starter pack

7a96748

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 1, 2024

typelevel#787 - Spark 4 starter pack, doh

7d0e131

chris-twiner mentioned this issue Mar 1, 2024

#787 - Move encoder implementation details to external shim library (not dependent on the Spark 4 release) #800

Open

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - resolve conflict for auto merge

c6a4341

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - reduce the case class api usage even further and Crea…

4933a90

…teStruct, and allow shims for the deprecated functions

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - disable local maven again

0f9b7cf

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - remove all sql package private code

059a8e6

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - remove all sql package private code

9c506df

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300

11aece0

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300 - u…

089cb3a

…se rc1

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300 - u…

c7fa1c7

…se rc1, so1 not a default repo it seems

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300 - u…

28071ff

…se rc2

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300 - u…

1c1d370

…se rc2

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - ml internals removal - all public - typelevel#300 - u…

728c935

…se rc2

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - rc2

768d467

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - rc2

5a30614

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - rc2 - seems each sub object needs adding

95c66cc

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 8, 2024

typelevel#787 - rc2 - doc is now an issue?

2e11b6d

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 9, 2024

typelevel#787 - rc2 - mc reflection

1888f4e

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 9, 2024

typelevel#787 - rc2 - add test artefacts

d146f00

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 11, 2024

typelevel#787 - allow testing of all frameless logic

692475f

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 11, 2024

typelevel#787 - compilation issue on interface

1008b85

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 12, 2024

typelevel#787 - fix test to run on dbr 14.3

dd10cee

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 14, 2024

typelevel#787 typelevel#803 - rc4 usage and fix udf with expressionproxy

f253d45

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 14, 2024

typelevel#787 typelevel#803 - rc4 usage and fix udf with expressionpr…

7c1e603

…oxy - deeply nested also possible

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 14, 2024

typelevel#787 typelevel#803 - rc4 usage and fix udf with expressionpr…

b161067

…oxy - deeply nested also possible

chris-twiner mentioned this issue Mar 20, 2024

#804 - correct types for Set and Seq derived types with interpreted serde - basis for #803 #805

Open

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 21, 2024

typelevel#787 - merge typelevel#803 / typelevel#804

aa1e6de

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 21, 2024

typelevel#787 - Seq can be stream, fails on dbr, do the same as for arb

be4c35e

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Mar 21, 2024

typelevel#787 typelevel#804 - stream

b880261

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 10, 2024

typelevel#787 - tests have ordering and precision issues when run on …

f793fc7

…clusters

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 10, 2024

typelevel#787 - tests have ordering and precision issues when run on …

e582962

…clusters

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 10, 2024

typelevel#787 - tests have ordering and precision issues when run on …

986891a

…clusters - inifinity protection

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 11, 2024

typelevel#787 - attempt to solve all but covar_pop and kurtosis

66b31e9

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 11, 2024

typelevel#787 - attempt covar_pop and kurtosis through tolerances

80de4f2

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 11, 2024

typelevel#787 - tolerance on map members and on vectors for cluster runs

a89542e

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 12, 2024

typelevel#787 - pivottest was random ordering

271e953

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 12, 2024

typelevel#787 - ensure last/first are run on a single partition - 15.…

fa75889

…0 databricks doesn't process them on ordered dataset

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 12, 2024

typelevel#787 - ensure last/first are run on a single partition - 15.…

b6189b1

…0 databricks doesn't process them on ordered dataset

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Apr 12, 2024

typelevel#787 - ensure last/first are run on a single partition - 15.…

25cc5c3

…0 databricks doesn't process them on ordered dataset
