AWS Glue Compatibility #6
Hi @VitorNoro! Sorry for the late response. The issue with supporting AWS Glue in DataFlint OSS is that the Spark UI is not enabled on the cluster. The "Spark UI" you see in Glue is actually a managed history server (on which there is no way to run custom code such as DataFlint) that reads events the Spark driver writes to an S3 bucket every 30 seconds. See https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html for details.

You could host a history server yourself with the DataFlint plugin installed (instructions: https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server) and point it at the S3 bucket containing the events.

Another option: I'm currently working on a SaaS offering for DataFlint that will send a summary of your Spark jobs to a SaaS solution with additional features (graphs of job duration/resource usage/input size over time, recommendations, alerts, etc.). If this is something that interests you, please let me know.
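As a rough sketch, a self-hosted history server could be pointed at the Glue event-log bucket with a spark-defaults fragment like the one below. The bucket path is a placeholder, and the exact DataFlint settings should be taken from the install guide linked above:

```properties
# Hypothetical spark-defaults.conf for a self-hosted Spark History Server.
# s3a://my-glue-spark-logs/ is a placeholder for the bucket the Glue driver
# writes its event logs to every 30 seconds.
spark.history.fs.logDirectory   s3a://my-glue-spark-logs/
# DataFlint plugin class (see the install-on-spark-history-server guide for
# how to place the jar on the history server's classpath).
spark.plugins                   io.dataflint.spark.SparkDataflintPlugin
```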
I'm keeping this issue open until I add a better error message when trying to run DataFlint on AWS Glue.
Thank you for the response! We'll consider our options, though it's likelier we move away from Glue in time.
Cool!
Added an alert ("No UI detected, skipping installation") when the UI is turned off.
Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work.
This attempt failed, resulting in an error when initializing the Spark Context.
I was wondering if this is a known issue, or if anyone managed to get this working.
Environment
Spark version: 3.3
Platform: Glue 4.0
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Session and context initialized and job running successfully.
Additional context
Returned error:
File "/tmp/job.py", line 78, in
.getOrCreate()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in __init__
self._do_init(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in __call__
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.dataflint.DataflintSparkUILoader$.install(DataflintSparkUILoader.scala:17)
at io.dataflint.spark.SparkDataflintDriverPlugin.registerMetrics(SparkDataflintPlugin.scala:26)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1(PluginContainer.scala:75)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1$adapted(PluginContainer.scala:74)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.internal.plugin.DriverPluginContainer.registerMetrics(PluginContainer.scala:74)
at org.apache.spark.SparkContext.$anonfun$new$41(SparkContext.scala:681)
at org.apache.spark.SparkContext.$anonfun$new$41$adapted(SparkContext.scala:681)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:681)
at org.apache.spark.api.java.JavaSparkContext.&lt;init&gt;(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
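The `None.get` inside `DataflintSparkUILoader.install` suggests the plugin assumes the Spark UI is present, which is not the case on Glue. A minimal, hypothetical user-side guard is to register the plugin only when the UI is enabled; the sketch below uses a plain Python dict standing in for the Spark conf, with the plugin class name taken from the stack trace above:

```python
def build_spark_conf(ui_enabled: bool) -> dict:
    """Build Spark config entries, adding the DataFlint plugin only
    when the Spark UI is enabled (it is disabled on AWS Glue)."""
    conf = {"spark.ui.enabled": str(ui_enabled).lower()}
    if ui_enabled:
        # Registering the plugin here would crash on Glue with
        # java.util.NoSuchElementException: None.get, as in the trace above.
        conf["spark.plugins"] = "io.dataflint.spark.SparkDataflintPlugin"
    return conf
```

These entries could then be applied via `SparkConf().setAll(...)` before building the session; on Glue, the plugin key is simply left out.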