[WIP][DO NOT MERGE] Draft implementation of the new PySpark API for support of both Spark Classic and Spark Connect #814
base: sparkConnect
Conversation
A temporary solution, based on the following:
- introduce a new env variable ZINGG_DRY_RUN
- if the variable is set: mimic globally used JVM stuff; otherwise do nothing
- slightly update .gitignore and docs/Makefile
- apply formatting to client.py

On branch 762-fix_sphinx_build. Changes to be committed:
- modified: .gitignore
- modified: python/docs/Makefile
- new file: python/pyproject.toml
- modified: python/zingg/client.py
Make sphinx work
On branch main. Your branch is up to date with 'origin/main'. Changes to be committed:
- modified: python/requirements.txt
- new file: python/zingg_v2/__init__.py
- new file: python/zingg_v2/client.py
- new file: python/zingg_v2/errors.py
- new file: python/zingg_v2/structs.py
Changes to be committed:
- deleted: python/zingg_v2/client.py
- new file: python/zingg_v2/pipes.py
- modified: python/zingg_v2/structs.py
Changes to be committed:
- modified: python/pyproject.toml
- new file: python/zingg_v2/client.py
- modified: python/zingg_v2/errors.py
- new file: python/zingg_v2/models.py
- modified: python/zingg_v2/pipes.py
- deleted: python/zingg_v2/structs.py
python/zingg_v2/models.py
Outdated
format: FileFormat
preprocessors: Optional[FieldPreprocessor] = None
props: dict[str, Any] = {}
schema: Optional[str] = None
Only a CSV pipe has a schema.
Yes, that is why the default value is None. If I understand the Java part correctly, even if I pass a schema with the format equal to parquet, the schema will simply be ignored.
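The behaviour discussed above can be sketched with a small, hypothetical model (a simplification of the one under review, not the actual code): schema defaults to None and is only included in the serialized output for CSV pipes, mirroring how the Java side ignores it for other formats.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class FileFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"


@dataclass
class Pipe:
    name: str
    format: FileFormat
    props: dict[str, Any] = field(default_factory=dict)
    schema: Optional[str] = None

    def payload(self) -> dict:
        out = {"name": self.name, "format": self.format.value, "props": self.props}
        # Mirror the Java behaviour described above: a schema passed with
        # a non-CSV format is simply dropped.
        if self.format is FileFormat.CSV and self.schema is not None:
            out["schema"] = self.schema
        return out
```

Whether to silently drop the schema or raise a validation error for non-CSV formats is a design choice; the sketch follows the silent-ignore behaviour described in the comment.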
trainingSamples: Optional[list[Pipe]] = None
fieldDefinition: Optional[list[FieldDefinition]] = None
numPartitions: int = 10
labelDataSampleSize: float = 0.01
How do we ensure that the defaults we define in Java remain the same here?
It is just about default values. I could use Optional[T] = None in every place, but that creates a small overhead and makes the code less readable. So I just took the values from the Java code; the user can still change any field, since all these classes are mutable.
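A minimal sketch of this approach (the field names and defaults are taken from the snippet above; the class shape itself is illustrative): defaults are copied by hand from the Java side, and nothing in the code enforces that they stay in sync, which is exactly the concern raised in the question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arguments:
    # Defaults copied by hand from the Java code; mutable on purpose,
    # so users can override any field after construction.
    numPartitions: int = 10
    labelDataSampleSize: float = 0.01
    fieldDefinition: Optional[list] = None
```

One way to answer the sync question would be a dedicated test that compares these Python defaults against the JSON produced by the Java side with no arguments set.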
Changes to be committed:
- modified: python/zingg_v2/models.py
- modified: python/zingg_v2/pipes.py
Changes to be committed:
- new file: buf.gen.yaml
- new file: buf.work.yaml
- new file: protobuf/connect_plugins.proto
- modified: python/zingg_v2/client.py
- new file: python/zingg_v2/connect.py
- modified: python/zingg_v2/models.py
- new file: python/zingg_v2/proto/connect_plugins_pb2.py
- new file: python/zingg_v2/proto/connect_plugins_pb2.pyi
- new file: python/zingg_v2/proto/connect_plugins_pb2_grpc.py
Changes to be committed:
- modified: python/zingg_v2/client.py
Changes to be committed:
- modified: .gitignore
- modified: buf.gen.yaml
- modified: common/core/src/main/java/zingg/common/core/executor/LabelUpdater.java
- modified: protobuf/connect_plugins.proto
- modified: python/pyproject.toml
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/models.py
- modified: python/zingg_v2/proto/connect_plugins_pb2.py
- new file: scripts/get-spark-connect-local.sh
- new file: scripts/run-spark-connect-local.sh
- new file: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/ConnectPlugins.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJob.java
- new file: spark/client/src/main/java/zingg/spark/connect/proto/SubmitZinggJobOrBuilder.java
- modified: spark/pom.xml
Changes to be committed:
- modified: buf.gen.yaml
- new file: python/test_spark_connect.py
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/models.py
- modified: scripts/run-spark-connect-local.sh
- deleted: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- new file: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/pom.xml

Untracked files:
- spark-3.5.1-bin-hadoop3.tgz
- spark-3.5.1-bin-hadoop3/
Changes to be committed:
- modified: .gitignore
- modified: python/requirements.txt
- modified: python/test_spark_connect.py
- modified: python/zingg_v2/client.py
- modified: python/zingg_v2/errors.py
- modified: scripts/run-spark-connect-local.sh
- modified: spark/client/pom.xml
- modified: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/pom.xml
- drop Scala from spark-client
- rewrite the plugin in Java
- update to Scala 2.13 and corresponding fixes
- small changes

Changes to be committed:
- modified: pom.xml
- modified: spark/client/pom.xml
- new file: spark/client/src/main/java/zingg/spark/connect/ZinggConnectPlugin.java
- deleted: spark/client/src/main/scala/zingg/spark/connect/ZinggConnectPlugin.scala
- modified: spark/core/src/main/java/zingg/spark/core/block/SparkBlockFunction.java
- modified: spark/core/src/test/java/zingg/TestUDFDoubleWrappedArr.java
The main idea is to make the Python part lazy. All the classes, like Arguments, Pipe, etc., should be pure Python classes that do not contain any interactions with the JVM.
Interaction with the JVM should instead be encapsulated into 2-3 methods of the new Client and will consist of passing the full JSON generated from the Python Arguments, Pipes, FieldDefinitions, etc.
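The lazy design described above can be sketched as follows. All names here are illustrative stand-ins, not the actual Zingg API: configuration objects are plain dataclasses, and the only place where the whole configuration is flattened to JSON (the payload that would be handed to the JVM or the Spark Connect plugin) is a single method on the client.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Pipe:
    name: str
    format: str
    props: dict = field(default_factory=dict)


@dataclass
class Arguments:
    data: list
    output: list
    numPartitions: int = 10
    labelDataSampleSize: float = 0.01


class Client:
    """Holds pure-Python configuration; no JVM is touched until execute()."""

    def __init__(self, arguments: Arguments):
        self.arguments = arguments  # construction is JVM-free

    def to_json(self) -> str:
        # Single serialization point: the whole configuration becomes one
        # JSON document for the JVM (Spark Classic) or the Connect plugin.
        return json.dumps(asdict(self.arguments))

    def execute(self) -> str:
        payload = self.to_json()
        # ... here the real client would submit `payload`; the sketch
        # just returns it for inspection.
        return payload
```

Keeping serialization in one place means the same pure-Python objects can back both Spark Classic (via py4j) and Spark Connect (via the plugin), which is the stated goal of the PR.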