
[FEAT] Scala 2.13 support? #2132

Open
kg005 opened this issue Apr 4, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

kg005 commented Apr 4, 2024

Is your proposal related to a problem?

I am getting the following error:

24/04/04 14:26:47 WARN TaskSetManager: Lost task 4.0 in stage 538.0 (TID 7052) (10.132.0.177 executor 1): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (UDFRegistration$Lambda$4595/0x00007f30033f42d8: (string, string) => double).
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:217)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.project_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.NoSuchMethodError: 'scala.collection.GenMap scala.collection.mutable.Map$.apply(scala.collection.Seq)'
	at uk.gov.moj.dash.linkage.LevDamerauDistance.call(Similarity.scala:265)
	at uk.gov.moj.dash.linkage.LevDamerauDistance.call(Similarity.scala:254)
	at org.apache.spark.sql.UDFRegistration.$anonfun$register$354(UDFRegistration.scala:767)
	... 18 more

With no prior knowledge of Scala, after some exploration of:

  • https://github.com/moj-analytical-services/splink_scalaudfs
  • the environment I am using (Spark 3.4.0, Scala 2.13)
  • the differences between Scala 2.13 and 2.12

I assume the error comes from a Scala version mismatch: the Splink jars are built with Scala 2.12, while my environment uses 2.13.
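For context, the mismatch is visible in the stack trace itself: `scala.collection.GenMap` and the `mutable.Map$.apply(scala.collection.Seq)` signature exist only in the 2.12 collections library, and both changed in the 2.13 collections redesign, so bytecode compiled against 2.12 fails to link at runtime on 2.13. A minimal sketch of the kind of call that breaks (this is an illustration, not the actual line from Similarity.scala):

```scala
import scala.collection.mutable

// When compiled on Scala 2.12, this call is emitted against
//   mutable.Map$.apply(scala.collection.Seq): scala.collection.GenMap
// GenMap was removed in Scala 2.13 and the apply signature changed,
// so running the 2.12 bytecode on a 2.13 runtime throws
// java.lang.NoSuchMethodError. Recompiling the same source against
// 2.13 links against the new signature and works fine.
val counts = mutable.Map("a" -> 1, "b" -> 2)
```

This is why the fix is to recompile the jars per Scala version rather than to change any source code.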

Describe the solution you'd like

Build the .jar files from https://github.com/moj-analytical-services/splink_scalaudfs for Scala 2.13.
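If the upstream project builds with sbt, cross-publishing for both Scala versions is mostly a one-line change. A sketch, assuming an sbt build (the actual build definition of splink_scalaudfs, and the exact patch versions, may differ):

```scala
// build.sbt (hypothetical sketch; version numbers are assumptions)
ThisBuild / scalaVersion       := "2.13.12"
ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.12")
```

Running `sbt +package` would then produce separate jars under `target/scala-2.12/` and `target/scala-2.13/`, so users can pick the one matching their Spark distribution's Scala version.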

Describe alternatives you've considered

Changing my environment to use Scala 2.12, but I am currently not in a position to change the environment I run Splink on.

kg005 added the enhancement label Apr 4, 2024
RobinL (Member) commented Apr 17, 2024

Thanks for the request. We're pretty stretched at the moment so we're unlikely to be able to get round to this soon. If you're willing/able, feel free to do a PR, which would be gratefully accepted!

kg005 (Author) commented Apr 18, 2024

Hi @RobinL, here is a PR for the changes needed to build the splink_scalaudfs for Scala 2.13. As I am new to Scala, I would be happy to have it reviewed so I can adjust it as needed.

RobinL (Member) commented Apr 19, 2024

@kg005 Thanks very much. Just to say we're taking a look at this. I'm also not a Scala person myself, but the code looks ok to me at least.

One thing we need to be careful with is accepting an external PR that includes the jar, since we have no easy way of knowing whether it contains malicious code. (The diff looks ok, and the code you wrote looks fine btw, so this is no reflection on you, just security policy!)

I'm going to try and get a colleague to build it on their machine. But if you happen to work somewhere 'trusted' (e.g. UK gov), let me know and it'll make it a little easier - robinlinacre@hotmail.com!

kg005 (Author) commented Apr 19, 2024

Thanks for the heads up @RobinL. I understand the policies. Feel free to override the jar with a new version that you manage to build using your infrastructure.
