Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][DO NOT MERGE] Draft implementation of the new PySpark API for support of both Spark Classic and Spark Connect #814

Draft
wants to merge 16 commits into
base: sparkConnect
Choose a base branch
from

Conversation

SemyonSinchenko
Copy link
Contributor

The main idea is to make python part lazy. All the classes, like Argument, Pipe, etc. should be pure python classes that do not contain any interactions with JVM.

Interaction with JVM in this case should be encapsulated into 2-3 methods of the new Client and will contain passing the full JSON generated from python Arguments, Pipes, FieldDefintions, etc.

sonalgoyal and others added 9 commits December 5, 2023 22:22
A temporary solution, based on the following:

- introducing a new env variable ZINGG_DRY_RUN
- if the variable is set:
  + mimic globally used JVM-stuff
  + otherwise do nothing

++ slightly update ignore and docs/Makefile
++ apply formatting to client.py

 On branch 762-fix_sphinx_build
 Changes to be committed:
	modified:   .gitignore
	modified:   python/docs/Makefile
	new file:   python/pyproject.toml
	modified:   python/zingg/client.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   python/requirements.txt
	new file:   python/zingg_v2/__init__.py
	new file:   python/zingg_v2/client.py
	new file:   python/zingg_v2/errors.py
	new file:   python/zingg_v2/structs.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   python/zingg_v2/structs.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	deleted:    python/zingg_v2/client.py
	new file:   python/zingg_v2/pipes.py
	modified:   python/zingg_v2/structs.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   python/pyproject.toml
	new file:   python/zingg_v2/client.py
	modified:   python/zingg_v2/errors.py
	new file:   python/zingg_v2/models.py
	modified:   python/zingg_v2/pipes.py
	deleted:    python/zingg_v2/structs.py
format: FileFormat
preprocessors: Optional[FieldPreprocessor] = None
props: dict[str, Any] = {}
schema: Optional[str] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only a csv pipe has the schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is why the default value is None. If I understand correctly the Java part, even if I pass the schema with format equal parquet this schema will be just ignored.

trainingSamples: Optional[list[Pipe]] = None
fieldDefinition: Optional[list[FieldDefinition]] = None
numPartitions: int = 10
labelDataSampleSize: float = 0.01
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we ensure that stuff we define in java remains the same here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just about default values. I can use Optional[T] = None in every place, but it creates a small overhead and makes code less readable. So, I just took values from Java Code, but user can change any field - all these classes are mutable.

SemyonSinchenko and others added 7 commits April 11, 2024 19:11
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   python/zingg_v2/models.py
	modified:   python/zingg_v2/pipes.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	new file:   buf.gen.yaml
	new file:   buf.work.yaml
	new file:   protobuf/connect_plugins.proto
	modified:   python/zingg_v2/client.py
	new file:   python/zingg_v2/connect.py
	modified:   python/zingg_v2/models.py
	new file:   python/zingg_v2/proto/connect_plugins_pb2.py
	new file:   python/zingg_v2/proto/connect_plugins_pb2.pyi
	new file:   python/zingg_v2/proto/connect_plugins_pb2_grpc.py
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   python/zingg_v2/client.py
 On branch main
 Your branch is ahead of 'origin/main' by 1 commit.
   (use "git push" to publish your local commits)

 Changes to be committed:
	modified:   .gitignore
	modified:   buf.gen.yaml
	modified:   common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
	modified:   protobuf/connect_plugins.proto
	modified:   python/pyproject.toml
	modified:   python/zingg_v2/client.py
	modified:   python/zingg_v2/models.py
	modified:   python/zingg_v2/proto/connect_plugins_pb2.py
	new file:   scripts/get-spark-connect-local.sh
	new file:   scripts/run-spark-connect-local.sh
	new file:   spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
	new file:   spark/client/src/main/java/zingg/spark/connect/proto/ConnectPlugins.java
	new file:   spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJob.java
	new file:   spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJobOrBuilder.java
	modified:   spark/pom.xml
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   buf.gen.yaml
	new file:   python/test_spark_connect.py
	modified:   python/zingg_v2/client.py
	modified:   python/zingg_v2/models.py
	modified:   scripts/run-spark-connect-local.sh
	deleted:    spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
	new file:   spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
	modified:   spark/pom.xml

 Untracked files:
	spark-3.5.1-bin-hadoop3.tgz
	spark-3.5.1-bin-hadoop3/
 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   .gitignore
	modified:   python/requirements.txt
	modified:   python/test_spark_connect.py
	modified:   python/zingg_v2/client.py
	modified:   python/zingg_v2/errors.py
	modified:   scripts/run-spark-connect-local.sh
	modified:   spark/client/pom.xml
	modified:   spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
	modified:   spark/pom.xml
+ drop scala from spark-client
+ rewrite plugin in java
+ update to scala 2.13 and corresponding fixes
+ small changes

 On branch main
 Your branch is up to date with 'origin/main'.

 Changes to be committed:
	modified:   pom.xml
	modified:   spark/client/pom.xml
	new file:   spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
	deleted:    spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
	modified:   spark/core/src/main/java/zingg/spark/core/block/SparkBlockFunction.java
	modified:   spark/core/src/test/java/zingg/TestUDFDoubleWrappedArr.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants