Enable eager execution #830

Open

GeorgesLorre opened this issue Jan 31, 2024 · 3 comments

Comments

GeorgesLorre commented Jan 31, 2024

In order to further optimize the development cycle, eager execution will be a big feature. The idea is that you can run partial pipelines or single components easily and get instant feedback on how your data is moving through your pipeline.

This is an interactive feature which makes the most sense in a notebook-like environment that allows for partial code execution. I see a couple of blockers we need to solve:

Execution environment

Where will the code (eagerly) run? Can we use the existing runners for this, or is this too slow and will we need a virtual runner?

Interface

How will we design the interface to allow this feature? Ideally we do not disturb the current Fondant pipeline definition code, i.e. we should be able to run parts of a pipeline eagerly while still preserving a full pipeline definition.

Some ideas

Some pseudo pipeline code

pipeline = fondant.Pipeline(...)
dataset_1 = pipeline.read("some_read_component")
dataset_2 = dataset_1.apply("some_transform_component")
_ = dataset_2.write("some_write_component")

We could call execute on the intermediate datasets

res = dataset_2.execute()

We could have a way to pass dummy data to the execution to avoid having to run all previous steps, and/or we could handle the dependencies smartly by leveraging caching.
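A minimal sketch of what that could look like, assuming a hypothetical input_data argument on execute():

import pandas as pd

# dummy data standing in for the output of the previous steps
dummy = pd.DataFrame({"text": ["hello", "world"]})

# hypothetical: run only this component against the dummy data
# instead of executing all upstream components first
res = dataset_2.execute(input_data=dummy)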

I did some experimentation on this for the xmas project week (see here)

Note:

  • this was before the lightweight components
  • this worked but ran in the current environment (so no Docker)

Tasks

  1. Core (mrchtr, 7 of 7 tasks)
  2. Infrastructure (GeorgesLorre)

mrchtr commented Feb 8, 2024

I quickly went through your PoC code for the Christmas project. Here are a few thoughts from my end.

When it comes to notebooks, in my opinion, we should aim for super-fast execution of components. We should be able to execute a cell and see results immediately. I would execute components in the interactive environment directly; I don't think we need to develop a new runner. If we apply some limitations to notebooks, only allow one Python version, and install all dependencies within the environment, we can come up with a workable solution.

I considered constructing two additional classes, e.g. InteractivePipeline and InteractiveDataset. This would create a clear separation between the docker-based pipeline execution and eager execution in the notebook. We have already implemented some checks to determine whether we are executing code in an interactive environment or not. When we are in an interactive environment, I would use InteractivePipeline and InteractiveDataset as defaults. The interface does not change:

dataset = Pipeline(...).read("some_component")
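As a rough illustration of the environment check mentioned above (the helper names are hypothetical; fondant's actual implementation may differ), something like this could pick the interactive classes automatically:

def is_interactive_environment() -> bool:
    # heuristic: IPython/Jupyter (and Colab) expose get_ipython() when running interactively
    try:
        from IPython import get_ipython
    except ImportError:
        return False
    return get_ipython() is not None


def make_pipeline(*args, **kwargs):
    # hypothetical factory: fall back to the regular Pipeline outside notebooks
    cls = InteractivePipeline if is_interactive_environment() else Pipeline
    return cls(*args, **kwargs)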

In the case of lightweight components, we can directly execute the component code. We must ensure that all extra requirements are installed locally (if not, we can install them using subprocess calls).
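A minimal sketch of installing missing extra requirements before running a lightweight component locally (the helper is hypothetical, and the mapping from requirement name to module name is deliberately naive):

import importlib.util
import subprocess
import sys


def ensure_requirements(requirements: list) -> None:
    # naively treat the part before any version pin as the importable module name
    missing = [r for r in requirements if importlib.util.find_spec(r.split("==")[0]) is None]
    if missing:
        # install the missing packages into the current (notebook) environment
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])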

In the case of reusable components, we should be able to load the component code and execute it. Currently, the components are part of the source folder. The difference is that we would have to use the yaml specification to evaluate the schema.
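For illustration, reading the schema from a component's yaml specification could look roughly like this (the consumes/produces field names follow the fondant_component.yaml layout; the path is a placeholder):

import yaml


def load_component_schema(spec_path: str) -> dict:
    # read the reusable component's specification to evaluate its schema
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    # consumes/produces describe the fields the component reads and writes
    return {"consumes": spec.get("consumes", {}), "produces": spec.get("produces", {})}


schema = load_component_schema("components/some_component/fondant_component.yaml")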

I'm unsure if this is scalable when we don't have all available reusable components in our repository, e.g., a community member pushing components to a different Docker Hub namespace, etc. However, this isn't possible at the moment either.

In both cases, the distinction from the existing classes would be executing the component immediately instead of just generating a ComponentOp.

I propose adding an option to limit the read operation. This would load only a single partition or a restricted size of the dataset, facilitating faster development iterations.
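A sketch of such an option, assuming a hypothetical sample_size parameter and a dask dataframe under the hood:

from typing import Optional

import dask.dataframe as dd


def limit_read(dataframe: dd.DataFrame, sample_size: Optional[int] = None) -> dd.DataFrame:
    # only keep the first partition to avoid loading the full dataset
    sample = dataframe.partitions[0]
    if sample_size is not None:
        # optionally restrict to the first sample_size rows, kept lazy
        sample = sample.head(sample_size, compute=False)
    return sample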

Another reason to use InteractiveDataset is that it allows us to contain a pandas dataframe and provide methods to share the dataset with others.

We could even override some functions, such as _repr_html_, to display the dataframe within the notebook using Colab features:

import dask.dataframe as dd

# Dataset is the existing fondant dataset class
class InteractiveDataset(Dataset):
    def __init__(self, dataframe: dd.DataFrame, pipeline):
        self.dataframe = dataframe
        self.pipeline = pipeline

    def _repr_html_(self):
        # using compute() here for demo purposes
        # if we work with small dataframes in an interactive manner it should be fine
        # maybe we find another way to handle this
        return self.dataframe.compute()._repr_html_()

    def apply(self, ref):
        # build a ComponentOp and extend self.pipeline using the super methods, e.g.:
        # evolved_dataset = self._apply(...)
        # self.dataset = evolved_dataset
        component = ref()
        dataframe = component.transform(self.dataframe)
        return InteractiveDataset(dataframe=dataframe, pipeline=self.pipeline)

    def view(self):
        return self.dataframe.compute()

Using the view function could be a nice addition for data scientists to visualize and explore the data with the tools they are accustomed to. If we transform the data explorer into an explorer SDK, we might find a good overlap here as well; we could leverage it for visualization within the notebook.

I would apply the same strategy to the apply and write methods; we could implement the same behavior in the InteractiveDataset.

During the call of read, apply, and write in the InteractivePipeline, I would still add the component to a “real” pipeline. That way we can still retrieve the full pipeline and execute it normally by submitting it to the remote runners.
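A rough sketch of that idea, with execute_component standing in as a hypothetical helper that runs the component locally:

class InteractivePipeline(Pipeline):
    def read(self, ref, **kwargs):
        # keep building the "real" pipeline so it can still be submitted
        # to the remote runners later on
        super().read(ref, **kwargs)
        # additionally execute the component right away for instant feedback
        dataframe = execute_component(ref, **kwargs)
        return InteractiveDataset(dataframe=dataframe, pipeline=self)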

A big downside of this approach is that the implementation differs from the real pipeline execution: we wouldn't write files to the base_path, we couldn't use the data explorer to investigate data, and we couldn't leverage caching.

However, I would argue that this isn't so dramatic since it is a development tool for notebook users. I expect people could use this with small datasets to build fast pipelines, execute specific steps sequentially (eliminating the need for caching), and utilize pandas features within the notebook environment (like the data explorer feature).

We could recommend starting development in a notebook, testing on a small sample, and then scaling it out using Vertex or Sagemaker.


Here are some small code snippets in a Colab to show what it could look like:

https://colab.research.google.com/drive/1H3KbEkypUDyKyBx4zVuXjbpiCLEvjHkt?usp=sharing

GeorgesLorre commented Feb 14, 2024

Thx @mrchtr! You are on the right track, but I would be careful about how we make everything possible while keeping our code clean and single-responsibility:

I see 3 things we could tackle separately:

  1. Enable eager execution:
    I think we can make some choices here to make this feature manageable; we can still evolve it later. This is mostly for interactive development environments (mostly notebooks). I would make some assumptions here to make this possible:
  • We run the component the moment it is defined, which means we don't need to call a .execute() method on it. It also means we can assume that all previous components have run (since they are chained), so we don't need to check the graph to make sure all dependencies have run. This lets us limit eager execution to single-component pipelines.
  • It should work with every runner, since we build single-component pipelines under the hood (see the sketch at the end of this comment). Not all runners are equally fast, but that does not need to be a problem, and this way we keep our code clean (runner agnostic). (We would still need a direct runner to really support speed.)
  • We keep the interface as is (as much as possible). Either we specify a global eager flag or we set it on a component level, but ideally the interface is the same for eager execution.
  • We will need to rethink the usage of the Pipeline object, since now we start with it; maybe we can alter it to make eager execution more intuitive. We are moving to a more dataset-centered approach than a pipeline-centered one.
  • OPTIONAL: we could add support for feeding in dummy data samples for real quick iteration.
  2. Support direct runner (Colab support) #749

  3. Make Dataset the first class interface #853
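As a sketch of the runner-agnostic idea from the first point (names and signatures are illustrative, not the actual fondant API): eager execution would wrap the component into a throwaway single-component pipeline and hand it to whichever runner is configured.

def run_eagerly(ref, runner, *, base_path="/tmp/fondant-eager", **arguments):
    # build a single-component pipeline under the hood
    pipeline = Pipeline(name="eager-run", base_path=base_path)
    pipeline.read(ref, arguments=arguments)
    # delegate to the configured runner (Docker, Vertex, a future direct runner, ...)
    runner.run(pipeline)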

mrchtr commented Feb 15, 2024

OPTIONAL: we could add support for feeding in dummy data samples for real quick iteration.

I think this one is highly related to #806. When we pass dummy data to the component, we can infer the produces schema.
Maybe we could implement #752 as part of this epic.
