
AWS Glue Compatibility #6

Closed
VitorNoro opened this issue Jan 29, 2024 · 5 comments

@VitorNoro

Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work.
This attempt failed, resulting in an error when initializing the Spark Context.
I was wondering if this is a known issue, or if anyone managed to get this working.

Environment
spark version: 3.3
platform: Glue 4.0

To Reproduce
Steps to reproduce the behavior:

  1. Download jar from maven repo
  2. Upload to S3
  3. Add to job's dependent jars
  4. Set plugin config in SparkSession builder (or set as --conf property)
  5. Run the script
  6. See the error

Expected behavior
Session and context initialized and job running successfully.

Additional context
Returned error:
File "/tmp/job.py", line 78, in
.getOrCreate()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in init
self._do_init(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in call
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.dataflint.DataflintSparkUILoader$.install(DataflintSparkUILoader.scala:17)
at io.dataflint.spark.SparkDataflintDriverPlugin.registerMetrics(SparkDataflintPlugin.scala:26)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1(PluginContainer.scala:75)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1$adapted(PluginContainer.scala:74)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.internal.plugin.DriverPluginContainer.registerMetrics(PluginContainer.scala:74)
at org.apache.spark.SparkContext.$anonfun$new$41(SparkContext.scala:681)
at org.apache.spark.SparkContext.$anonfun$new$41$adapted(SparkContext.scala:681)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:681)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)

@menishmueli
Contributor

menishmueli commented Jan 31, 2024

Hi @VitorNoro! Sorry for the late response.

The issue with supporting AWS Glue with DataFlint OSS is that the Spark UI is not enabled on the cluster.

When you use Glue's "Spark UI", you are actually using a managed history server (on which there is no way to run custom code such as DataFlint). It reads events from an S3 bucket that the Spark driver writes to every 30 seconds.

See more at https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html

You could host a history server yourself with the DataFlint plugin installed (see instructions at https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server) and point it at the S3 bucket with the events; see https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html. You can also initially host this history server locally on your laptop to test DataFlint out.
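As a rough sketch of that setup (paths are placeholders; the DataFlint-specific settings are in the install guide linked above), the self-hosted history server's `spark-defaults.conf` would point at the bucket your Glue job writes event logs to:

```properties
# Hypothetical spark-defaults.conf fragment for a self-hosted history server.
# The S3 path is a placeholder for the bucket configured on the Glue job via
# --enable-spark-ui / --spark-event-logs-path.
spark.history.fs.logDirectory   s3a://your-bucket/spark-events/
```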

Another option: I'm currently working on a SaaS offering for DataFlint that will send a summary of your Spark job to a SaaS solution with additional features (graphs of job duration/resource usage/input size over time, recommendations, alerts, etc.).
In the SaaS portal, when you select a job run you can also see its Spark UI & DataFlint UI.
This offering will also support AWS Glue.

If this is something that interests you please let me know.

@menishmueli
Contributor

I'm keeping this issue open until I add a better error message when trying to run DataFlint on AWS Glue.

@menishmueli self-assigned this Jan 31, 2024
@VitorNoro
Author

Thank you for the response! We'll consider our options, though it's likelier we move away from Glue in time.

@menishmueli
Contributor

Cool!
If there's anything else I can do to help, you can contact me via the DataFlint Slack community (join link in the README) or via LinkedIn (https://www.linkedin.com/in/meni-shmueli-developer/).

@menishmueli
Contributor

Added an alert ("No UI detected, skipping installation") when the UI is turned off.
