Spark Connect Support #513

Open
wchau opened this issue Jan 2, 2024 · 3 comments
Labels
Type: New Feature ➕ Introduction of a completely new addition to the codebase

Comments

@wchau

wchau commented Jan 2, 2024

Feature Description

Spark Connect Support

Is your feature request related to a problem?

In Spark Connect, RDD is not supported, so PipelineDP does not work. See https://github.com/apache/spark/blob/master/python/pyspark/sql/connect/dataframe.py
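To make the failure mode concrete, here is a minimal sketch of a duck-typed check for RDD access. It assumes that a Spark Connect DataFrame raises an error when its `rdd` attribute is accessed; the exact exception type may differ between Spark versions, so the sketch catches any exception. `supports_rdd` and the two stand-in classes are hypothetical names used only for illustration; they are not part of PySpark or PipelineDP.

```python
def supports_rdd(df) -> bool:
    """Return True if accessing `df.rdd` succeeds, False if it raises."""
    try:
        df.rdd  # assumption: Spark Connect DataFrames raise on this access
    except Exception:
        return False
    return True


class ClassicFrameStub:
    """Hypothetical stand-in for a classic DataFrame that exposes `.rdd`."""
    rdd = object()


class ConnectFrameStub:
    """Hypothetical stand-in for a Spark Connect DataFrame without RDD support."""

    @property
    def rdd(self):
        raise NotImplementedError("RDD is not supported in Spark Connect")


print(supports_rdd(ClassicFrameStub()))
print(supports_rdd(ConnectFrameStub()))
```

A check like this could let a library fall back to a DataFrame-only code path instead of failing when it tries to convert to RDD.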

What alternatives have you considered?

N/A

Additional Context

Add any other context or screenshots about the feature request here.

@wchau wchau added the Type: New Feature ➕ Introduction of a completely new addition to the codebase label Jan 2, 2024
@dvadym
Collaborator

dvadym commented Jan 2, 2024

Thanks for filing the issue!

There is already experimental Spark DataFrame support:

  1. End2End example, building DP query in this example.
  2. QueryBuilder class API (with an example in the docstring): the main API for supporting DataFrames.

We're planning to make it an official API in the next release in ~2 months (including adding documentation for this API).

Any feedback on this API is welcome.

@wchau
Author

wchau commented Jan 2, 2024

Hey Vadym,

Is there a reason why you use RDD instead of DataFrame directly? I think the QueryBuilder tries to convert to RDD, which Spark Connect does not support.

Thanks!

@dvadym
Collaborator

dvadym commented Jan 3, 2024

Ah, I see, Spark Connect doesn't have RDD at all.

The main reason RDD is used is that direct handling of Spark DataFrames is not yet implemented. PipelineDP's main logic is fairly agnostic to the input collection type, so it should be reasonably simple to extend it to DataFrames. I'll check how it can be done.
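As a rough illustration of what "agnostic to the input collection type" can look like, here is a minimal sketch in which the aggregation logic only talks to a small backend interface. All names here (`CollectionBackend`, `ListBackend`, `total_per_partition`) are hypothetical and are not PipelineDP's actual classes; the sketch also adds no DP noise. Under this design, a DataFrame-based backend would implement the same two operations with DataFrame calls such as `groupBy` and `agg`, without touching RDD.

```python
import abc
from collections import defaultdict


class CollectionBackend(abc.ABC):
    """Hypothetical interface: the only operations the core logic relies on."""

    @abc.abstractmethod
    def map(self, col, fn):
        """Apply fn to every element of the collection."""

    @abc.abstractmethod
    def sum_per_key(self, col):
        """Sum values of (key, value) pairs per key."""


class ListBackend(CollectionBackend):
    """Backend for plain Python lists, used here only for illustration."""

    def map(self, col, fn):
        return [fn(x) for x in col]

    def sum_per_key(self, col):
        totals = defaultdict(int)
        for key, value in col:
            totals[key] += value
        return sorted(totals.items())


def total_per_partition(backend, rows, partition_fn, value_fn):
    """Core logic written against the interface, not a concrete collection."""
    pairs = backend.map(rows, lambda r: (partition_fn(r), value_fn(r)))
    return backend.sum_per_key(pairs)


rows = [("a", 1), ("b", 2), ("a", 3)]
result = total_per_partition(ListBackend(), rows, lambda r: r[0], lambda r: r[1])
print(result)
```

Because `total_per_partition` never inspects the collection itself, swapping in another backend would not require changes to the core logic.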
