Is there an example "project" for one of the SpaCy models? #12600

hpcpony · 2023-05-06T19:55:38Z

Is there an example project for one of the distributed models (e.g., en_core_web_*)? I'd like to see a full-size example and experiment with recreating a model from scratch. There's quite a bit of detail in spacy.info and nlp.config, but it doesn't look to me like that gives me everything I'd need to do this.

adrianeboyd · 2023-05-08T07:59:51Z

No, right now there's not an example project for this. There's nothing secret about the details; it's just currently only set up for internal use and isn't in a state that would make sense to release publicly. We would like to release it someday and it's on our longer-term to-do list. One complication is that for a number of the languages, our licenses for the training corpora allow us to distribute models, but not any part of the original training corpus.

The basics are that independent parts of the pipeline are trained separately and then combined, so typically for CNN models there are three separate pipelines trained for:

syntax (tok2vec..parser)
senter
ner

And then the final model is collated and rule-based components are added as described here:

#3056 (reply in thread)
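As a rough illustration of the "train separately, then combine" step, here is a minimal sketch that uses `nlp.add_pipe(..., source=...)` to copy components from several pipelines into one. The blank pipelines below are untrained stand-ins (an assumption for illustration); in practice each would be a trained pipeline loaded with `spacy.load`, and the internal build may well do this differently, for example through the config.

```python
import spacy

# Untrained stand-ins for the three separately trained pipelines.
# In a real setup these would be loaded from disk with spacy.load().
syntax_nlp = spacy.blank("en")
syntax_nlp.add_pipe("tagger")
syntax_nlp.add_pipe("parser")

senter_nlp = spacy.blank("en")
senter_nlp.add_pipe("senter")

ner_nlp = spacy.blank("en")
ner_nlp.add_pipe("ner")

# Collate the components into a single pipeline by sourcing each one.
final_nlp = spacy.blank("en")
sources = [
    (syntax_nlp, ["tagger", "parser"]),
    (senter_nlp, ["senter"]),
    (ner_nlp, ["ner"]),
]
for source_nlp, names in sources:
    for name in names:
        final_nlp.add_pipe(name, source=source_nlp)

# A rule-based component can then be added on top.
final_nlp.add_pipe("attribute_ruler")

print(final_nlp.pipe_names)
```

Sourcing copies the component as-is, so each part keeps the weights it was trained with. If a sourced component listens to a shared tok2vec, that tok2vec has to be sourced from the same pipeline as well.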

We also do some lowercasing and whitespace augmentation that is removed from the published configs so that people who are fine-tuning from their own data aren't surprised by unexpected augmentation.
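To give an idea of what this kind of augmentation does, here is a simplified sketch on raw strings. The function name, the probabilities, and the exact behaviour are all hypothetical and not the internal setup; spaCy's own augmenters (such as `spacy.lower_case.v1`) operate on training examples and keep the annotations aligned, which this toy version does not attempt.

```python
import random


def augment_text(text, rng, lower_prob=0.1, space_prob=0.1):
    """Hypothetical augmenter: sometimes lowercase the whole text and
    sometimes insert an extra space after a word."""
    if rng.random() < lower_prob:
        text = text.lower()
    pieces = []
    for word in text.split(" "):
        pieces.append(word)
        if rng.random() < space_prob:
            # An empty piece becomes an extra space once joined.
            pieces.append("")
    return " ".join(pieces)


rng = random.Random(0)
for _ in range(3):
    print(repr(augment_text("Apple is looking at buying a U.K. startup", rng, 0.5, 0.2)))
```

A model trained on data like this tends to be less sensitive to casing and stray whitespace, which is also why it would be surprising to inherit it silently when fine-tuning on your own data.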

If you're starting with a UD corpus, this project is similar for the syntax part, which is the most complicated part:

https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud
