Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a docs page about adding code beyond starter files #3852

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

yury-fedotov
Copy link

@yury-fedotov yury-fedotov commented May 6, 2024

Description

Closes #3418

Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.

P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.

Development notes

  • Built docs locally to test that hyperlinks and formatting work as expected.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@yury-fedotov yury-fedotov marked this pull request as ready for review May 6, 2024 06:13
@yury-fedotov
Copy link
Author

@yury-fedotov
Copy link
Author

@noklam, @astrojuanlu, could you please have a look?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.

`__main__.py` file of your project, but such advanced customizations are not in scope of this article.
```

This being the only constraint means that you can, for example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if the points below are linked to the section that describe them in more detail. A bit like how we setup the configuration page with "how to.." https://kedro--3852.org.readthedocs.build/en/3852/configuration/configuration_basics.html#how-to-use-kedro-configuration

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Merel, great suggestion, just implemented this. Rendered build here. Let me know if that's what you had in mind.

yury-fedotov and others added 9 commits May 16, 2024 19:31
Comment on customizations

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
For example in splitting

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Make provided examples title not bold

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
And as where appropriate

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Organise spelling

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Proper wording for pip install note

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Implies to consists replacement

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>

# Conflicts:
#	RELEASE.md
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@yury-fedotov
Copy link
Author

Hey @merelcht - thanks for comments again, I just finished implementing them.

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

@merelcht
Copy link
Member

QQ also about the vale usage in CI. I noticed that it produces some warnings on this PR, but apparently on many other existing MD files. So it's like a soft CI check?

Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for the contribution!

It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

Comment on lines +8 to +10
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located only in those starter files, which is not true.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the clarification, but I found this sounds like coming from an user rather than the official docs.

#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?

Comment on lines +29 to +32
```{note}
You actually can make Kedro look for pipeline registry in a different place by modifying the
`__main__.py` file of your project, but such advanced customisations are not in scope for this section.
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually do this? My impressions is no because

def configure_project(package_name: str) -> None:
    """Configure a Kedro project by populating its settings with values
    defined in user's settings.py and pipeline_registry.py.
    """
    settings_module = f"{package_name}.settings"
    settings.configure(settings_module)

    pipelines_module = f"{package_name}.pipeline_registry"
    pipelines.configure(pipelines_module)

I think these are the only two files that you cannot change the location, the comment I link above has a demo project that show what a "minimal Kedro Project" looks like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed:

pipelines_module = f"{package_name}.pipeline_registry"
pipelines.configure(pipelines_module)

```

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually avoid utils.py as much as possible as they are the bin for everything. It's ironic because kedro do have utils module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.

This being the only constraint means that you can, for example:
* Add `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

* Add `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Those could be utility functionalities, or your standalone module responsible for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Those could be utility functionalities, or your standalone module responsible for
Those could be utility functionalities, or standalone module responsible for

functions in `nodes.py` from there.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Those could be utility functionalities, or your standalone module responsible for
the domain logic of the industry you work at.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the domain logic of the industry you work at.
the domain specific logic.

or the domain logic of the specific industry?

Comment on lines +43 to +44
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
into `historical_pipeline.py` and `inference_pipeline.py`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we usually go for the namespace/modular structure more:

  • pipelines
    • historical
    • inference
      • pipeline.py

@astrojuanlu thoughts?

* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example,
into `historical_pipeline.py` and `inference_pipeline.py`.
* Instead of registering many pipelines in `register_pipelines()` function one by one,
create a few `tp.Dict[str, Pipeline]` objects in different places of the project
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
create a few `tp.Dict[str, Pipeline]` objects in different places of the project
create a few `dict[str, Pipeline]` objects in different places of the project

into `historical_pipeline.py` and `inference_pipeline.py`.
* Instead of registering many pipelines in `register_pipelines()` function one by one,
create a few `tp.Dict[str, Pipeline]` objects in different places of the project
and then make `register_pipelines()` return a union of those.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about find_pipelines()?

@stichbury
Copy link
Contributor

I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:

I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu

LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it.

Just ping me again when you decide how to go forward!

@merelcht
Copy link
Member

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.

I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.

Comment on lines +29 to +32
```{note}
You actually can make Kedro look for pipeline registry in a different place by modifying the
`__main__.py` file of your project, but such advanced customisations are not in scope for this section.
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed:

pipelines_module = f"{package_name}.pipeline_registry"
pipelines.configure(pipelines_module)

│ ├── optimizer <-- Standalone package.
│ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
├── requirements.txt <-- Linters, code formatters... Not dependencies of packages.
├── pyproject.toml <-- Settings for those, like `[tool.isort]`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that having this astray pyproject.toml file here automatically communicates to pip that this is potentially a Python package. This has been an ongoing source of pain in the ecosystem, see my summary here #2280 (comment)

Now what's the best practice to manage tool configs in monorepos for Python given this issue... I have no idea, to be honest

@stichbury
Copy link
Contributor

Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏

For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready.

@yury-fedotov
Copy link
Author

Thanks for the comments team! I'll get to them next week and come back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Describe how to use custom code with Kedro
5 participants