Is there an example "project" for one of the spaCy models? #12600
Replies: 1 comment
-
No, right now there isn't an example project for this. There's nothing secret about the details; the setup is currently only meant for internal use and isn't in a state that would make sense to release publicly. We would like to release it someday, and it's on our longer-term to-do list. In addition, for a number of languages we hold licenses for the training corpora that allow us to distribute models, but not any part of the original training corpus.

The basics are that independent parts of the pipeline are trained separately and then combined, so typically for CNN models there are three separate pipelines trained for:
And then the final model is collated and rule-based components are added as described here:

We also do some lowercasing and whitespace augmentation during training. It is removed from the published configs so that people who are fine-tuning on their own data aren't surprised by unexpected augmentation.

If you're starting with a UD corpus, this project is similar for the syntax part, which is the most complicated part: https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud
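To give a rough idea of what "lowercasing and whitespace augmentation" means, here is a minimal plain-Python sketch. It is not the code used for the released models: the function name `augment_text` and the parameters `lower_prob` and `whitespace_prob` are hypothetical, and it works on raw strings only. Real training-time augmentation has to operate on annotated examples and keep the annotations aligned with the modified text.

```python
import random


def augment_text(text, rng, lower_prob=0.1, whitespace_prob=0.1):
    """Return a noisy copy of `text` (hypothetical helper for illustration).

    With probability `lower_prob` the whole text is lowercased. After each
    space-separated word, an extra space is inserted with probability
    `whitespace_prob`, so the model also sees irregular whitespace.
    """
    if rng.random() < lower_prob:
        text = text.lower()
    pieces = []
    for word in text.split(" "):
        pieces.append(word)
        if rng.random() < whitespace_prob:
            # An empty piece becomes one additional space when joined.
            pieces.append("")
    return " ".join(pieces)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(repr(augment_text("Apple is opening an office in Berlin", rng, 0.5, 0.2)))
```

Because the noise is applied randomly per call, the same sentence can appear in several variants over the course of training, which is the reason it would be surprising to inherit it silently when fine-tuning from a published config.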
-
Is there an example project for a distributed model (e.g., en_core_web_*)? I'd like to see a big example and experiment with recreating a model from "scratch". It looks like there's quite a bit of detail in spacy.info and nlp.conf, but it doesn't look to me like I have everything I'd need to do this.