
[FEATURE] Manage Spark History Server as a deployment via the Spark Operator helm chart. #2028

Open
peter-mcclonski opened this issue May 14, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@peter-mcclonski
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

The Spark History Server is a valuable debugging and process-tracing tool. Currently, the history server must be deployed independently of the operator. It would be a convenience to manage the Spark History Server (SHS) via the Spark Operator Helm chart.

Describe the solution you would like

A new section shall be added to the Spark Operator Helm chart to define parameters for the SHS deployment. We note that a confounding element of this feature is the storage layer: SHS depends on some accessible storage layer where Spark event logs can be found. The simplest implementation is a shared NFS volume, but blob storage such as S3 or an Azure storage account is a common solution that should be easy to use with our implementation. These third-party solutions require additional libraries to be loaded onto the classpath, a task that SHS does not make trivial.
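As a rough sketch of what such a section might look like, the values below use entirely hypothetical key names (no schema has been agreed upon) to show how the deployment, the log storage backend, and the extra classpath packages could be expressed together:

```yaml
# Hypothetical values.yaml section for the Spark Operator chart.
# All key names are illustrative, not an agreed-upon schema.
historyServer:
  enable: true
  replicas: 1
  image:
    repository: spark
    tag: 3.5.1
  # Where SHS reads Spark event logs from. An s3a:// path requires
  # hadoop-aws and its SDK dependencies on the classpath.
  logDirectory: s3a://my-bucket/spark-events
  # Extra packages to resolve into $SPARK_HOME/jars at startup.
  extraPackages:
    - org.apache.hadoop:hadoop-aws:3.3.4
  sparkConf: |
    spark.history.fs.logDirectory=s3a://my-bucket/spark-events
```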

Describe alternatives you have considered

The alternative is for individuals to roll their own deployments for SHS, a non-trivial process.

Additional context

If we choose to pursue this, we may also wish to consider managing deployment of the Hive Thrift Server.

@peter-mcclonski added the enhancement (New feature or request) label May 14, 2024
@peter-mcclonski
Contributor Author

peter-mcclonski commented May 14, 2024

Suggested Architecture

  • SHS will exist as a wholly separate deployment from spark-operator, as a disjoint chart.
  • To resolve the problem of dynamically pulling in dependencies/packages, an init container shall be spun up that populates a volume with the union of the default $SPARK_HOME/jars and the result of java -Divy.cache.dir=$SPARK_HOME -Divy.home=$SPARK_HOME -jar $SPARK_HOME/jars/ivy-2.5.1.jar -dependency [PACKAGE]. This populated volume shall be mounted in the SHS container as $SPARK_HOME/jars.
  • $SPARK_HOME/conf/spark.conf shall be mounted as a volume populated by a raw text block in the Helm chart.
  • Log storage shall default to a PVC.
  • Enabling SHS does not necessarily imply that event logging is enabled in your Spark job configuration.
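The init-container step above could look roughly like the following pod-spec fragment. Volume names, image tags, and the chosen package are placeholders; the Ivy invocation mirrors the command described above, retrieving the resolved artifacts into the shared volume alongside the stock jars:

```yaml
# Illustrative pod-spec fragment; names and image are placeholders.
initContainers:
  - name: resolve-packages
    image: spark:3.5.1
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Start from the stock Spark jars...
        cp -r $SPARK_HOME/jars/* /var/spark-jars/
        # ...then resolve the requested package (plus transitives) via
        # Ivy into the same directory, as described above.
        java -Divy.cache.dir=$SPARK_HOME -Divy.home=$SPARK_HOME \
          -jar $SPARK_HOME/jars/ivy-2.5.1.jar \
          -dependency org.apache.hadoop hadoop-aws 3.3.4 \
          -retrieve "/var/spark-jars/[artifact]-[revision](-[classifier]).[ext]"
    volumeMounts:
      - name: spark-jars
        mountPath: /var/spark-jars
containers:
  - name: history-server
    image: spark:3.5.1
    volumeMounts:
      - name: spark-jars
        mountPath: /opt/spark/jars  # overlays $SPARK_HOME/jars
```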

peter-mcclonski added a commit to peter-mcclonski/spark-on-k8s-operator that referenced this issue May 15, 2024
…ator chart.

Signed-off-by: Peter McClonski <mcclonski.peter@gmail.com>
@peter-mcclonski
Contributor Author

peter-mcclonski commented May 15, 2024

Did some initial work on this just to feel it out. Got automatic resolution of packages working via init containers. It's a bit gross, but it works as a start.

Major TODO items:

  • Add arbitrary volume/volumeMount support
  • Add support for pulling jars, rather than solely packages
  • Add a clean mechanism for mounting spark-defaults.conf
  • Create an example that works out of the box. The hard part is a zero-barrier-to-entry volume accessible across nodes.
  • Docs updates
  • General cleanup / hardening
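For the spark-defaults.conf item in the list above, one clean mechanism (sketched here with hypothetical resource names) would be rendering the chart's raw text block into a ConfigMap:

```yaml
# Illustrative manifest; names and config values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-history-server-conf
data:
  spark-defaults.conf: |
    spark.history.fs.logDirectory=file:/mnt/spark-events
    spark.history.fs.update.interval=10s
```

The pod template would then mount this ConfigMap as a volume over $SPARK_HOME/conf, so the raw text block in values.yaml flows straight through to the file SHS reads on startup.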

@peter-mcclonski
Contributor Author

peter-mcclonski commented May 17, 2024

Alternatively, @yuchaoran2011, do you think it would be worth reviving https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server and the associated chart, and (potentially) having it live here, adjacent to but disconnected from the actual operator chart? I think the real problem here isn't so much that the operator should be managing the history server directly, and more that the history server, a valuable part of the Spark ecosystem, doesn't have any good Helm charts out in the wild. We're working on one as part of boozallen/aissemble#66 (https://github.com/boozallen/aissemble/pull/80/files), covered by our BAPL (not as permissive as, say, Apache), solely because we couldn't find an existing OSS solution that was up to date, maintained, and flexible.

@yuchaoran2011
Contributor

I'm not sure if it's a good idea to have the history server co-deployed with the operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters.

I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.

I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea.

@peter-mcclonski
Contributor Author

peter-mcclonski commented May 17, 2024

I'm not sure if it's a good idea to have history server co-deployed with operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters

I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.

I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea

Sounds reasonable to me. Regarding the Helm chart I linked, I wasn't sure if you had specific thoughts, given that you're listed as the maintainer on ArtifactHub.

@yuchaoran2011
Contributor

Ah, upon a closer look, now I remember that I initially created this chart many years ago. I haven't used it in a long time, though, and wouldn't count on it still being production-ready.

@peter-mcclonski
Contributor Author

I think there's both interest and a clearly unfilled need in the community for a production-ready, standalone Spark history chart that's well maintained. Would Kubeflow and the Spark Operator maintainers be open to one being created in this repo, or would it be better housed somewhere totally separate?

@KhASQ

KhASQ commented May 19, 2024

Kindly make the Spark History Server part of the operator.
I think targeting this operator as a single point for the Spark-on-K8s ecosystem will add much better momentum to its development.

For example, integrating the Spark operator to manage an external shuffle service on K8s.

Sorry for interrupting, but I am so excited about the new development on this operator.
