Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a script for customizing pipelines #147

Closed
wants to merge 9 commits into from

Conversation

polm
Copy link
Contributor

@polm polm commented Nov 11, 2022

This project includes a script to merge pipelines, handling issues like whether to use replace_listeners and giving warnings when replace_listeners isn't suitable.

Currently this should work well if a secondary pipeline has only one listener, or if it's ok that a merged pipeline can't be trained. However, there are still many things that can be improved, and questions about how to do things like ensure compatibility between the pipelines to be merged.

I'm also not sure a project is the best structure for this - it might also make sense as a spaCy command. This seemed to be the best way to share it for discussion though.

This project includes a script to merge pipelines, handling issues like
whether to use replace_listeners and giving warnings when
replace_listeners isn't suitable.

Currently this should work well if a secondary pipeline has only one
listener, or if it's ok that a merged pipeline can't be trained.
However, there are still many things that can be improved, and questions
about how to do things like ensure compatibility between the pipelines
to be merged.

I'm also not sure a project is the best structure for this - it might
also make sense as a spaCy command. This seemed to be the best way to
share it for discussion though.
@polm polm changed the title Add a script for merging pipelines Add a script for customizing pipelines Nov 22, 2022
@polm
Copy link
Contributor Author

polm commented Nov 22, 2022

Still thinking about this and pretty sure it should be a spaCy command rather than a project, but it works as is and should be fine for demonstrating the feature and goals.

This now has the feature that it can take a pipeline and generate a config modified to use a transformer/CNN tok2vec, regardless of what the original pipeline used. This is pretty different from the pipeline merging feature, but because it needs to find details of the pipeline that aren't directly exposed - like a list of listener components - there ended up being overlap.

@adrianeboyd
Copy link
Collaborator

I can see that there are shared utility methods, but I think it's confusing to have a script that sometimes outputs a pipeline and sometimes outputs a config.

@polm
Copy link
Contributor Author

polm commented Nov 30, 2022

You're absolutely right that having them in the same script is weird. I have split them up.

I am still not sure that doing this as a project makes the most sense, and I am trying to come up with better names, especially for the tok2vec rewriting component.

It feels like these should be spaCy commands, but I'm not sure where they would fit in the hierarchy, and I don't want to give them each their own place.

@polm
Copy link
Contributor Author

polm commented Dec 2, 2022

I added a script to generate a config for resuming training, which ended up being really simple to generate.

I think I am getting a better idea of how to present these, I'll gather my thoughts about them.

polm added a commit to polm/spaCy that referenced this pull request Dec 23, 2022
This continues work started in
explosion/projects#147,
which provides features for automatically manipulating pipelines and
configs. The functions included are:

- merge: combine components from two pipelines and handle listeners
- use_transformer: use transformer as feature source
- use_tok2vec: use CNN tok2vec as feature source
- resume: make a version of a config for resuming training

Currently these are all grouped under a new `spacy configure` command.
That may not be the best place for them; in particular, `merge` may
belong elsewhere, since it outputs a pipeline rather than a config.

The current state of the PR is that the commands run, but there's only
one small test, and docs haven't been written yet. Docs can be started
but will depend somewhat on how the naming issues work out.
@polm
Copy link
Contributor Author

polm commented Dec 23, 2022

Closing this in favor of explosion/spaCy#12020, since in the end these do make more sense as commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants