
AWS Glue Compatibility #6

Closed
VitorNoro opened this issue Jan 29, 2024 · 5 comments

@VitorNoro

Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work.
This attempt failed, resulting in an error when initializing the Spark Context.
I was wondering if this is a known issue, or if anyone managed to get this working.

Environment
spark version: 3.3
platform: Glue 4.0

To Reproduce
Steps to reproduce the behavior:

  1. Download jar from maven repo
  2. Upload to S3
  3. Add to job's dependent jars
  4. Set plugin config in SparkSession builder (or set as --conf property)
  5. Run the script
  6. See the error

Expected behavior
Session and context initialized and job running successfully.

Additional context
Returned error:
File "/tmp/job.py", line 78, in
.getOrCreate()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in init
self._do_init(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in call
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.dataflint.DataflintSparkUILoader$.install(DataflintSparkUILoader.scala:17)
at io.dataflint.spark.SparkDataflintDriverPlugin.registerMetrics(SparkDataflintPlugin.scala:26)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1(PluginContainer.scala:75)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1$adapted(PluginContainer.scala:74)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.internal.plugin.DriverPluginContainer.registerMetrics(PluginContainer.scala:74)
at org.apache.spark.SparkContext.$anonfun$new$41(SparkContext.scala:681)
at org.apache.spark.SparkContext.$anonfun$new$41$adapted(SparkContext.scala:681)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:681)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)

@menishmueli
Contributor

menishmueli commented Jan 31, 2024

Hi @VitorNoro! Sorry for the late response.

The issue with supporting AWS Glue with DataFlint OSS is that the Spark UI is not enabled on the cluster.

When you use Glue's "Spark UI", you are actually using a managed history server (on which there is no way to run custom code such as DataFlint). It reads events from an S3 bucket that the Spark driver writes to every 30 seconds.

See more at https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html

You could host a history server yourself with the DataFlint plugin installed (see instructions at https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server) and point it at the S3 bucket with the events; see https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html. You can also initially host this history server locally on your laptop to test DataFlint out.
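As a rough sketch of that setup (paths are placeholders; the DataFlint-specific settings are in the install guide linked above), the self-hosted history server's `spark-defaults.conf` would point at the bucket your Glue job writes event logs to:

```properties
# Hypothetical spark-defaults.conf fragment for a self-hosted history server.
# The S3 path is a placeholder for the bucket configured on the Glue job via
# --enable-spark-ui / --spark-event-logs-path.
spark.history.fs.logDirectory   s3a://your-bucket/spark-events/
```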

Another option: I'm currently working on a SaaS offering for DataFlint that will send a summary of your Spark job to a SaaS solution with additional features (graphs of job duration/resource usage/input size over time, recommendations, alerts, etc.).
In the SaaS portal, when you select a job run you can also see its Spark UI & DataFlint UI.
This offering will also support AWS Glue.

If this is something that interests you please let me know.

@menishmueli
Contributor

I'm keeping this issue open until I add a better error message when trying to run DataFlint on AWS Glue.

@menishmueli self-assigned this Jan 31, 2024
@VitorNoro
Author

Thank you for the response! We'll consider our options, though it's likelier we move away from Glue in time.

@menishmueli
Contributor

Cool!
If there's anything else I can do to help, you can contact me via the DataFlint Slack community (join link in the README) or via LinkedIn (https://www.linkedin.com/in/meni-shmueli-developer/).

@menishmueli
Contributor

Added an alert ("No UI detected, skipping installation") when the UI is turned off.
