Added Modin for return of a block #4372

Open. Wants to merge 3 commits into base: master


@CLCG-Data-Engineering (Collaborator) commented Jan 16, 2024

Description

Added the Modin pandas DataFrame as an accepted return type for a block. Developers no longer need to convert it to a pandas DataFrame manually, and they can load and write data in a distributed way.

A downstream block can read the output of its parent block, and the data type stays Modin.

How Has This Been Tested?

  • Added unit tests in test_variable.py; they pass.
  • Tested through the UI

Checklist

  • The PR is tagged with proper labels (bug, enhancement, feature, documentation)
  • I have performed a self-review of my own code
  • I have added unit tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • If new documentation has been added, relative paths have been added to the appropriate section of docs/mint.json

@CLCG-Data-Engineering added the feature and dependencies labels Jan 16, 2024
@CLCG-Data-Engineering self-assigned this Jan 16, 2024
@CLCG-Data-Engineering
Collaborator Author

Some questions, since not everything is working as intended yet.

I tried to follow the same steps that are used for Polars and pandas when writing, reading, and deleting files.

  1. Output files are not being deleted, and I am unsure where I need to change my code. Running the pipeline twice results in: Is a directory: '/root/.mage_data/default_repo/pipelines/effortless_feather/.variables/load_titanic/output_0/data.parquet'
  2. Should I downgrade the requirement so that test_backend (3.8) can be retried, even if that may reduce the level of pandas API coverage in Modin?
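
On question 1, the "Is a directory" error suggests that the cleanup step treats data.parquet as a file while Modin wrote it as a directory. Below is a minimal sketch of a delete step that handles both cases with the standard library only. `remove_output` is a hypothetical helper for illustration, not an existing Mage function, and where such a step belongs in the codebase is not settled here.

```python
import os
import shutil
import tempfile


def remove_output(path: str) -> None:
    """Delete a stored output whether it is a single file or a directory.

    Hypothetical helper: the name and placement are assumptions.
    """
    if os.path.isdir(path) and not os.path.islink(path):
        # A partitioned writer may create a directory named data.parquet.
        shutil.rmtree(path)
    elif os.path.lexists(path):
        # Regular file (or symlink): the single-file case.
        os.remove(path)
    # If nothing exists at the path, there is nothing to clean up.


# Small demonstration with a directory that mimics a partitioned output.
workdir = tempfile.mkdtemp()
output_path = os.path.join(workdir, "data.parquet")
os.makedirs(output_path)
with open(os.path.join(output_path, "part-0.parquet"), "wb") as handle:
    handle.write(b"")

remove_output(output_path)
print(os.path.exists(output_path))
```

Calling a helper like this before each write would let a second pipeline run start from a clean path regardless of which layout the previous run produced.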

@wangxiaoyou1993
Member

Hi @CLCG-Data-Engineering, do you still have issues or questions on this PR?

@CLCG-Data-Engineering
Collaborator Author

Yes. Modin is creating a folder of parquet files rather than a single file. Beyond that, I still need to test whether a single parquet file can be produced in the same way with both Ray and Dask.
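
One option to try for the single-file case is to materialize the frame to pandas before writing. The sketch below is duck-typed so that it does not import Modin. It assumes the Modin DataFrame exposes a `_to_pandas()` method (a private method, so treat this as an assumption to verify against the installed Modin version), and `write_single_parquet` is a hypothetical name.

```python
def write_single_parquet(df, path: str) -> None:
    """Write a DataFrame to one parquet file at `path`.

    Assumption: a Modin DataFrame exposes `_to_pandas()`. Any object
    without that method is treated as an already in-memory frame.
    """
    to_pandas = getattr(df, "_to_pandas", None)
    # Collect the distributed partitions into a single in-memory frame.
    frame = to_pandas() if callable(to_pandas) else df
    # An in-memory pandas frame writes exactly one file at `path`.
    frame.to_parquet(path)
```

The trade-off is that materializing pulls all partitions into one process, so the write itself is no longer distributed. Because the conversion happens before the write, the result should not depend on whether Ray or Dask is the engine, but that still needs to be tested.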

So far I haven't had time to look at it again.

It's definitely still something I would love to have integrated inside of Mageai :)

Labels: dependencies, feature
2 participants