
[FEATURE] Manage Spark History Server as a deployment via the Spark Operator helm chart. #2028

Open
peter-mcclonski opened this issue May 14, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@peter-mcclonski
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

The Spark History Server is a valuable debugging and process-tracing tool. Currently, the history server must be deployed independently of the operator. It would be a convenience to manage the Spark History Server (SHS) via the Spark Operator Helm chart.

Describe the solution you would like

A new section shall be added to the Spark Operator Helm chart to define parameters for the SHS deployment. We note that a confounding element of this feature is the storage layer: SHS depends on some accessible storage layer where Spark event logs can be found. The simplest implementation is a shared NFS volume, but blob storage such as S3 or an Azure storage account is a common solution that should be easy to use with our implementation. These third-party solutions require additional libraries to be loaded onto the classpath, a task that SHS does not make trivial.
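As a rough sketch of what such a section might look like, the values below use entirely hypothetical key names (no schema has been agreed upon) to show how the deployment, the log storage backend, and the extra classpath packages could be expressed together:

```yaml
# Hypothetical values.yaml section for the Spark Operator chart.
# All key names are illustrative, not an agreed-upon schema.
historyServer:
  enable: true
  replicas: 1
  image:
    repository: spark
    tag: 3.5.1
  # Where SHS reads Spark event logs from. An s3a:// path requires
  # hadoop-aws and its SDK dependencies on the classpath.
  logDirectory: s3a://my-bucket/spark-events
  # Extra packages to resolve into $SPARK_HOME/jars at startup.
  extraPackages:
    - org.apache.hadoop:hadoop-aws:3.3.4
  sparkConf: |
    spark.history.fs.logDirectory=s3a://my-bucket/spark-events
```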

Describe alternatives you have considered

The alternative is for individuals to roll their own deployments for SHS, a non-trivial process.

Additional context

If we choose to pursue this, we may also wish to consider managing deployment of the Hive Thrift Server.

@peter-mcclonski added the enhancement (New feature or request) label May 14, 2024
@peter-mcclonski
Contributor Author

peter-mcclonski commented May 14, 2024

Suggested Architecture

  • SHS will exist as a wholly separate deployment from spark-operator, as a disjoint chart.
  • To resolve the problem of dynamically pulling in dependencies/packages, an init container shall be spun up that populates a volume with the union of the default $SPARK_HOME/jars and the result of java -Divy.cache.dir=$SPARK_HOME -Divy.home=$SPARK_HOME -jar $SPARK_HOME/jars/ivy-2.5.1.jar -dependency [PACKAGE]. This populated volume shall be mounted in the SHS container as $SPARK_HOME/jars.
  • $SPARK_HOME/conf/spark.conf shall be mounted as a volume populated by a raw text block in the Helm chart.
  • Log storage shall default to a PVC.
  • Enabling SHS does not necessarily imply that event logging is enabled in your Spark job configuration.
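The init-container step above could look roughly like the following pod-spec fragment. Volume names, image tags, and the chosen package are placeholders; the Ivy invocation mirrors the command described above, retrieving the resolved artifacts into the shared volume alongside the stock jars:

```yaml
# Illustrative pod-spec fragment; names and image are placeholders.
initContainers:
  - name: resolve-packages
    image: spark:3.5.1
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Start from the stock Spark jars...
        cp -r $SPARK_HOME/jars/* /var/spark-jars/
        # ...then resolve the requested package (plus transitives) via
        # Ivy into the same directory, as described above.
        java -Divy.cache.dir=$SPARK_HOME -Divy.home=$SPARK_HOME \
          -jar $SPARK_HOME/jars/ivy-2.5.1.jar \
          -dependency org.apache.hadoop hadoop-aws 3.3.4 \
          -retrieve "/var/spark-jars/[artifact]-[revision](-[classifier]).[ext]"
    volumeMounts:
      - name: spark-jars
        mountPath: /var/spark-jars
containers:
  - name: history-server
    image: spark:3.5.1
    volumeMounts:
      - name: spark-jars
        mountPath: /opt/spark/jars  # overlays $SPARK_HOME/jars
```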

peter-mcclonski added a commit to peter-mcclonski/spark-on-k8s-operator that referenced this issue May 15, 2024
…ator chart.

Signed-off-by: Peter McClonski <mcclonski.peter@gmail.com>
@peter-mcclonski
Contributor Author

peter-mcclonski commented May 15, 2024

Did some initial work on this just to feel it out. Got automatic resolution of packages working via init containers. It's a bit gross, but it works as a start.

Major TODO items:

  • Add arbitrary volume/volumeMount support
  • Add support for pulling jars, rather than solely packages
  • Add a clean mechanism for mounting spark-defaults.conf
  • Create an example that works out of the box. The hard part is a zero-barrier-to-entry volume accessible across nodes.
  • Docs updates
  • General cleanup / hardening
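For the spark-defaults.conf item in the list above, one clean mechanism (sketched here with hypothetical resource names) would be rendering the chart's raw text block into a ConfigMap:

```yaml
# Illustrative manifest; names and config values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-history-server-conf
data:
  spark-defaults.conf: |
    spark.history.fs.logDirectory=file:/mnt/spark-events
    spark.history.fs.update.interval=10s
```

The pod template would then mount this ConfigMap as a volume over $SPARK_HOME/conf, so the raw text block in values.yaml flows straight through to the file SHS reads on startup.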

@peter-mcclonski
Contributor Author

peter-mcclonski commented May 17, 2024

Alternatively, @yuchaoran2011, do you think it would be worth reviving https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server and the associated chart, and (potentially) having it live here, adjacent to but disconnected from the actual operator chart? I think the real problem here isn't so much that the operator should be managing the history server directly, and more that the history server, a valuable part of the Spark ecosystem, doesn't have any good Helm charts out in the wild. We're working on one as part of boozallen/aissemble#66 (https://github.com/boozallen/aissemble/pull/80/files), covered by our BAPL (not as permissive as, say, Apache), solely because we couldn't find an existing OSS solution that was up to date, maintained, and flexible.

@yuchaoran2011
Contributor

I'm not sure if it's a good idea to have the history server co-deployed with the operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters.

I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.

I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea.

@peter-mcclonski
Contributor Author

peter-mcclonski commented May 17, 2024

I'm not sure if it's a good idea to have history server co-deployed with operator. A single history server can aggregate jobs managed by multiple Spark operator deployments across multiple k8s clusters

I think the real problem here isn't so much that operator should be managing the history server directly, and more that history server, a valuable part of the spark ecosystem, doesn't have any good helm charts out in the wild.

I agree. I haven't looked at the quality of https://artifacthub.io/packages/helm/cloudnativeapp/spark-history-server, but if it's something you have used, I'm for that idea

Sounds reasonable to me. Regarding the Helm chart I linked, I wasn't sure if you had specific thoughts, given that you're listed as the maintainer on ArtifactHub.

@yuchaoran2011
Contributor

Ah, upon a closer look, now I remember that I initially created this chart many years ago. I haven't used it in a long time, though, and wouldn't count on it still being production-ready.

@peter-mcclonski
Contributor Author

I think there's both interest and a clearly unfilled need in the community for a production-ready, standalone Spark history chart that's well maintained. Would Kubeflow and the Spark Operator maintainers be open to one being created in this repo, or would it be better housed somewhere totally separate?

@KhASQ

KhASQ commented May 19, 2024

Kindly make the Spark History Server part of the operator.
I think targeting this operator as a single point for the Spark-on-K8s ecosystem will add much better momentum to its development.

For example, integrating the Spark operator to manage an external shuffle service on K8s.

Sorry for interrupting, but I am so excited about the new development on this operator.
