Optimize SubtaskGraph generation #3342

Open. Wants to merge 4 commits into base: master
zhongchun (Contributor) commented Apr 21, 2023

What do these changes do?

In gen_subtask_graph, Mars always creates new out chunks, even if the out chunk already exists. This costs a lot of time when there are many chunks.
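To illustrate the idea of reusing an out chunk instead of rebuilding it, here is a minimal sketch. It does not use Mars at all: `Chunk`, `OutChunkCache`, and `get_or_create` are hypothetical names invented for this example, and the real change in `gen_subtask_graph` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a Mars chunk; only the key matters here."""

    key: str


class OutChunkCache:
    """Hypothetical helper: build an out chunk once per key, then reuse it."""

    def __init__(self):
        self._chunks = {}
        self.created = 0  # counts how many chunks were actually constructed

    def get_or_create(self, key):
        chunk = self._chunks.get(key)
        if chunk is None:
            # Only pay the construction cost the first time a key is seen.
            chunk = Chunk(key)
            self._chunks[key] = chunk
            self.created += 1
        return chunk


cache = OutChunkCache()
first = cache.get_or_create("chunk-0")
second = cache.get_or_create("chunk-0")  # same key: the existing object is returned
other = cache.get_or_create("chunk-1")
```

With many chunks, the saving comes from skipping the repeated construction, at the price of keeping one dictionary entry per chunk key.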

Related issue number

Fixes #3341

I compared two versions: one creates new out chunks and the other does not. The test script is:

import mars.tensor as mt
import mars.dataframe as md

size = 50000
da1 = mt.random.random((size, 2), chunk_size=(1, 2))
df1 = md.DataFrame(da1, columns=list("AB"))
df2 = df1 + 10
df3 = df2.sum()
ret = df3.execute()

Subtask generation took 122.92s and 56.63s, respectively.

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

zhongchun force-pushed the optimize-subtask-graph-generation branch from 491de95 to 8fb4545 on May 12, 2023 02:12
# Note: `dtypes`, `index_value`, and `columns_value` are lazily
# initialized, so we should call property `params` to initialize
# these fields.
[o.params for o in out_chunks]
Contributor

It's weird. What would happen without this code?

Contributor Author

Without it, there would be no columns_value or index_value, which are used in MainPool.
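For readers unfamiliar with this pattern, the sketch below shows why touching `params` can matter. It is a toy model, not Mars code: `LazyChunk` and its private attributes are hypothetical, and it only assumes, as described above, that reading `params` fills in the lazily initialized fields.

```python
class LazyChunk:
    """Hypothetical chunk whose metadata is computed on first use."""

    def __init__(self, columns):
        self._columns = list(columns)
        self._columns_value = None  # not computed yet
        self._index_value = None  # not computed yet

    @property
    def params(self):
        # Reading params fills in every lazy field as a side effect.
        if self._columns_value is None:
            self._columns_value = tuple(self._columns)
        if self._index_value is None:
            self._index_value = range(len(self._columns))
        return {
            "columns_value": self._columns_value,
            "index_value": self._index_value,
        }


out_chunks = [LazyChunk("AB"), LazyChunk("CD")]
before = [c._columns_value for c in out_chunks]

# Same idea as the reviewed line: read params on every chunk to initialize it.
for c in out_chunks:
    c.params

after = [c._columns_value for c in out_chunks]
```

In this toy model, a consumer that reads the raw fields before anyone has read `params` would see `None`, which is the kind of missing value the reply describes.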

Contributor

These fields are lazily initialized, but fields a and b are both lazily initialized by params. Can you make a be initialized by accessing a, and b by accessing b? Then we can lazily initialize them in the Worker Main Pool.
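One possible reading of this suggestion, sketched outside of Mars: each field initializes itself the first time it is read, so no up-front `params` call is needed. `PerFieldLazyChunk` and the `calls` list are hypothetical, and `functools.cached_property` is just one standard-library way to get this behavior; the actual Mars fields may need a different mechanism.

```python
from functools import cached_property


class PerFieldLazyChunk:
    """Hypothetical chunk where each field is initialized by its own access."""

    def __init__(self, columns):
        self._columns = list(columns)
        self.calls = []  # records which initializers actually ran

    @cached_property
    def columns_value(self):
        # Runs once, on the first read of columns_value only.
        self.calls.append("columns_value")
        return tuple(self._columns)

    @cached_property
    def index_value(self):
        # Runs once, on the first read of index_value only.
        self.calls.append("index_value")
        return range(len(self._columns))


chunk = PerFieldLazyChunk("AB")
cols = chunk.columns_value
cols_again = chunk.columns_value  # served from the cache, no second call
calls_so_far = list(chunk.calls)  # index_value has not been touched yet
```

With this shape, whichever process first reads a field pays for it, which is what would let the initialization happen later in the worker rather than during graph generation.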

Successfully merging this pull request may close these issues.

[GRAPH][ENHANCEMENT] Optimize SubtakGraph generation
7 participants