[WIP][DO NOT MERGE] Draft implementation of the new PySpark API for support of both Spark Classic and Spark Connect #814
base: sparkConnect
Conversation
A temporary solution, based on the following:
- introduce a new env variable ZINGG_DRY_RUN
- if the variable is set: mimic globally used JVM stuff; otherwise do nothing
- slightly update .gitignore and docs/Makefile
- apply formatting to client.py

On branch 762-fix_sphinx_build. Changes to be committed:
- modified: .gitignore
- modified: python/docs/Makefile
- new file: python/pyproject.toml
- modified: python/zingg/client.py
Make sphinx work
On branch main. Your branch is up to date with 'origin/main'. Changes to be committed:
- modified: python/requirements.txt
- new file: python/zingg_v2/__init__.py
- new file: python/zingg_v2/client.py
- new file: python/zingg_v2/errors.py
- new file: python/zingg_v2/structs.py
Changes to be committed:
- deleted: python/zingg_v2/client.py
- new file: python/zingg_v2/pipes.py
- modified: python/zingg_v2/structs.py
Changes to be committed:
- modified: python/pyproject.toml
- new file: python/zingg_v2/client.py
- modified: python/zingg_v2/errors.py
- new file: python/zingg_v2/models.py
- modified: python/zingg_v2/pipes.py
- deleted: python/zingg_v2/structs.py
python/zingg_v2/models.py
Outdated
format: FileFormat
preprocessors: Optional[FieldPreprocessor] = None
props: dict[str, Any] = {}
schema: Optional[str] = None
Only a CSV pipe has a schema.
Yes, that is why the default value is None. If I understand the Java part correctly, even if I pass a schema with the format equal to parquet, the schema will simply be ignored.
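The behaviour discussed above can be sketched with a small, hypothetical model (a simplification of the one under review, not the actual code): schema defaults to None and is only included in the serialized output for CSV pipes, mirroring how the Java side ignores it for other formats.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class FileFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"


@dataclass
class Pipe:
    name: str
    format: FileFormat
    props: dict[str, Any] = field(default_factory=dict)
    schema: Optional[str] = None

    def payload(self) -> dict:
        out = {"name": self.name, "format": self.format.value, "props": self.props}
        # Mirror the Java behaviour described above: a schema passed with
        # a non-CSV format is simply dropped.
        if self.format is FileFormat.CSV and self.schema is not None:
            out["schema"] = self.schema
        return out
```

Whether to silently drop the schema or raise a validation error for non-CSV formats is a design choice; the sketch follows the silent-ignore behaviour described in the comment.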
trainingSamples: Optional[list[Pipe]] = None
fieldDefinition: Optional[list[FieldDefinition]] = None
numPartitions: int = 10
labelDataSampleSize: float = 0.01
How do we ensure that the defaults we define in Java remain the same here?
It is just about default values. I could use Optional[T] = None in every place, but that creates a small overhead and makes the code less readable. So I just took the values from the Java code; the user can still change any field, since all these classes are mutable.
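A minimal sketch of this approach (the field names and defaults are taken from the snippet above; the class shape itself is illustrative): defaults are copied by hand from the Java side, and nothing in the code enforces that they stay in sync, which is exactly the concern raised in the question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arguments:
    # Defaults copied by hand from the Java code; mutable on purpose,
    # so users can override any field after construction.
    numPartitions: int = 10
    labelDataSampleSize: float = 0.01
    fieldDefinition: Optional[list] = None
```

One way to answer the sync question would be a dedicated test that compares these Python defaults against the JSON produced by the Java side with no arguments set.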
Changes to be committed:
- modified: python/zingg_v2/models.py
- modified: python/zingg_v2/pipes.py
Changes to be committed:
- new file: buf.gen.yaml
- new file: buf.work.yaml
- new file: protobuf/connect_plugins.proto
- modified: python/zingg_v2/client.py
- new file: python/zingg_v2/connect.py
- modified: python/zingg_v2/models.py
- new file: python/zingg_v2/proto/connect_plugins_pb2.py
- new file: python/zingg_v2/proto/connect_plugins_pb2.pyi
- new file: python/zingg_v2/proto/connect_plugins_pb2_grpc.py
Changes to be committed:
- modified: python/zingg_v2/client.py
Changes to be committed:
- modified: .gitignore
- modified: buf.gen.yaml
- modified: common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
- modified: protobuf/connect_plugins.proto
- modified: python/pyproject.toml
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/models.py
- modified: python/zingg_v2/proto/connect_plugins_pb2.py
- new file: scripts/get-spark-connect-local.sh
- new file: scripts/run-spark-connect-local.sh
- new file: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/ConnectPlugins.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJob.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJobOrBuilder.java
- modified: spark/pom.xml
Changes to be committed:
- modified: buf.gen.yaml
- new file: python/test_spark_connect.py
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/models.py
- modified: scripts/run-spark-connect-local.sh
- deleted: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- new file: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/pom.xml

Untracked files:
- spark-3.5.1-bin-hadoop3.tgz
- spark-3.5.1-bin-hadoop3/
Changes to be committed:
- modified: .gitignore
- modified: python/requirements.txt
- modified: python/test_spark_connect.py
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/errors.py
- modified: scripts/run-spark-connect-local.sh
- modified: spark/client/pom.xml
- modified: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/pom.xml
- drop Scala from spark-client
- rewrite the plugin in Java
- update to Scala 2.13 and corresponding fixes
- small changes

Changes to be committed:
- modified: pom.xml
- modified: spark/client/pom.xml
- new file: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- deleted: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/core/src/main/java/zingg/spark/core/block/SparkBlockFunction.java
- modified: spark/core/src/test/java/zingg/TestUDFDoubleWrappedArr.java
The main idea is to make the Python part lazy. All the classes, like Arguments, Pipe, etc., should be pure Python classes that do not contain any interactions with the JVM.
Interaction with the JVM should instead be encapsulated into 2-3 methods of the new Client and will consist of passing the full JSON generated from the Python Arguments, Pipes, FieldDefinitions, etc.
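The lazy design described above can be sketched as follows. All names here are illustrative stand-ins, not the actual Zingg API: configuration objects are plain dataclasses, and the only place where the whole configuration is flattened to JSON (the payload that would be handed to the JVM or the Spark Connect plugin) is a single method on the client.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Pipe:
    name: str
    format: str
    props: dict = field(default_factory=dict)


@dataclass
class Arguments:
    data: list
    output: list
    numPartitions: int = 10
    labelDataSampleSize: float = 0.01


class Client:
    """Holds pure-Python configuration; no JVM is touched until execute()."""

    def __init__(self, arguments: Arguments):
        self.arguments = arguments  # construction is JVM-free

    def to_json(self) -> str:
        # Single serialization point: the whole configuration becomes one
        # JSON document for the JVM (Spark Classic) or the Connect plugin.
        return json.dumps(asdict(self.arguments))

    def execute(self) -> str:
        payload = self.to_json()
        # ... here the real client would submit `payload`; the sketch
        # just returns it for inspection.
        return payload
```

Keeping serialization in one place means the same pure-Python objects can back both Spark Classic (via py4j) and Spark Connect (via the plugin), which is the stated goal of the PR.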