Releases · explosion/spaCy

20 Oct 08:13

adrianeboyd

v3.4.2

3d0e895

v3.4.2: Latin and Luganda support, Python 3.11 wheels and more

✨ New features and improvements

NEW: Luganda language support (#10847).
NEW: Latin language support (#11349).
NEW: spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
NEW: New operators for the DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
Support CuPy v11 and add extras for cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
Support custom attributes for tokens and spans in Doc.to_json() and Doc.from_json() (#11125).
Make the enable and disable options for spacy.load() more consistent (#11459).
Allow a single string argument for disable/enable/exclude for spacy.load() (#11406).
New --url flag for spacy info to print the direct download URL for a pipeline (#11175).
Add a check for missing requirements in the spacy project CLI (#11226).
Add a Levenshtein distance function (#11418).
Improvements to the spacy debug data CLI for spancat data (#11504).
Allow overriding spacy_version in spacy package metadata (#11552).
Improve the error message when using the wrong command for spacy project assets (#11458).
Ensure parent directories are created when storing the results of the spacy pretrain command (#11210).
Extend support to newer versions of natto-py for the ko extra (#11222).
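
As a rough illustration of the new DependencyMatcher operators, the sketch below assumes spaCy v3.4.2 or later and uses a hand-built Doc, so no trained pipeline is needed. The words, heads and dependency labels are made up for the example, and the pattern name "SVO" and the node IDs are arbitrary. It takes `>--` to select a child to the left of the anchor token and `>++` a child to its right.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated toy parse: "ate" is the root, "She" is its left child and
# "pizza" its right child. Heads are absolute token indices.
doc = Doc(
    vocab,
    words=["She", "ate", "pizza"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

pattern = [
    # Anchor node: the verb.
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "ate"}},
    # A child of the verb that appears to its left.
    {"LEFT_ID": "verb", "REL_OP": ">--", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # A child of the verb that appears to its right.
    {"LEFT_ID": "verb", "REL_OP": ">++", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher = DependencyMatcher(vocab)
matcher.add("SVO", [pattern])

# Token indices are returned in the order of the pattern nodes.
matched = [[doc[i].text for i in token_ids] for _, token_ids in matcher(doc)]
print(matched)
```

Swapping the two operators should produce no match here, since the subject is not to the right of the verb and the object is not to its left.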

📦 Trained pipelines updates

This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).

Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:

NAME                     SPACY            VERSION
en_core_web_md           >=3.4.0,<3.5.0   3.4.1     ✔

🔴 Bug fixes

#11275: Fix Dutch noun chunks to skip overlapping spans.
#11276: Fix regex invalid escape sequences.
#11312: Better handling of unexpected types in SetPredicate.
#11460: Fix config validation failures caused by NVTX pipeline wrappers.
#11506: Avoid unwanted side effects in Doc.__init__.
#11540: Preserve missing entity annotation in augmenters.
#11592: Fix issues with DVC commands.
#11631: Fix initialization for pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.

⚠️ Backwards incompatibilities

If you're using a custom component that does not return a Doc type, an error will now be raised (#11424).
If you're using a dot in a factory name, an error is raised as this is not supported (#11336).
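
A minimal sketch of a custom component that satisfies both constraints: it returns the Doc it receives and is registered under a name without dots. The component name `token_counter` and the `n_tokens` key are hypothetical, chosen only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


# "token_counter" is a made-up component name; note that it contains no dot.
@Language.component("token_counter")
def token_counter(doc: Doc) -> Doc:
    # Store a simple statistic on the Doc.
    doc.user_data["n_tokens"] = len(doc)
    # Returning the Doc is required; omitting this line now raises an error.
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")
doc = nlp("Hello world")
print(doc.user_data["n_tokens"])
```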

📖 Documentation and examples

Added documentation for the new experimental coref component.
Added Ukrainian trained pipelines to the website.
Added documentation for the spacy.models_and_pipes_with_nvtx_range.v1 callback.
Fix English pipeline names in v3.4 release notes.
Various fixes to the Example API documentation.
Extensions and improvements to the displacy docs.
Fix the example command for spacy project dvc.
Update example code for spacy-wordnet.
Improve API documentation around the initialize() function for pipeline components.
Fix various typos and inconsistencies.
spaCy universe additions:
- concepCy: A spaCy wrapper for ConceptNet.
- spaCy partial tagger: build a CRF tagger with a partially annotated dataset.
- Zshot: Zero and Few shot named entity & relationships recognition.

👥 Contributors

@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy

19 Oct 08:08

adrianeboyd

v2.3.8

ca0cae2

v2.3.8: Updates for Python 3.10 and 3.11

✨ New features and improvements

Updates and binary wheels for Python 3.10 and 3.11.

👥 Contributors

@adrianeboyd, @honnibal, @ines

26 Jul 13:08

adrianeboyd

v3.4.1

5c2a00c

v3.4.1: Fix compatibility with CuPy v9.x

🔴 Bug fixes

Fix issue #11137: Fix compatibility with CuPy v9.x.

📖 Documentation and examples

spaCy universe additions:
- BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.
- English Interpretation Sentence Pattern: English interpretation for accurate translation from English to Japanese.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic

12 Jul 06:16

adrianeboyd

v3.4.0

d583626

v3.4.0: Updated types, speed improvements and pipelines for Croatian

✨ New features and improvements

Support for mypy 0.950+ and pydantic v1.9 (#10786).
Prebuilt Linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
Min/max {n,m} operator for Matcher patterns (#10981).
Language updates:
- Improve tokenization for Cyrillic combining diacritics (#10837).
- Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
Improved speed of vector lookups (#10992).
For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
Improved speed of StringStore lookups (#10938).
Updated spacy project clone to try both main and master branches by default (#10843).
Added confidence threshold for named entity linker (#11016).
Improved handling of Typer optional default values for init_config_cli (#10788).
Added cycle detection in parser projectivization methods (#10877).
Added counts for NER labels in debug data (#10960).
Support for adding NVTX ranges to TrainablePipe components (#10965).
Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).
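
The min/max operator for Matcher patterns can be sketched as follows, assuming spaCy v3.4.0 or later. The pattern name and the example sentence are made up; the `{2,3}` quantifier asks for two or three consecutive tokens matching the same token description.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match "good" preceded by two or three repetitions of "very".
pattern = [{"LOWER": "very", "OP": "{2,3}"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern])

doc = nlp("This is very very good")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

A single "very" before "good" would not match, because the lower bound of the quantifier is two.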

📦 Trained pipelines updates

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

| Package | UPOS | Parser LAS | NER F |
| --- | ---: | ---: | ---: |
| `hr_core_news_sm` | 96.6 | 77.5 | 76.1 |
| `hr_core_news_md` | 97.3 | 80.1 | 81.8 |
| `hr_core_news_lg` | 97.5 | 80.4 | 83.0 |

🙏 Special thanks to @gtoffoli for help with the new pipelines!

The English pipelines have new word vectors:

| Package | Model Version | TAG | Parser LAS | NER F |
| --- | --- | ---: | ---: | ---: |
| `en_core_web_md` | v3.3.0 | 97.3 | 90.1 | 84.6 |
| `en_core_web_md` | v3.4.0 | 97.2 | 90.3 | 85.5 |
| `en_core_web_lg` | v3.3.0 | 97.4 | 90.1 | 85.3 |
| `en_core_web_lg` | v3.4.0 | 97.3 | 90.2 | 85.6 |

All CNN pipelines have been extended to add whitespace augmentation.

🔴 Bug fixes

Fix issue #10960: Support hyphens in NER labels.
Fix issue #10994: Fix horizontal spacing for spans in displaCy.
Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
Fix issue #11056: Don't use get_array_module in textcat.
Fix issue #11092: Fix vertical alignment for spans in displaCy.

🚀 Notes about upgrading from v3.3

Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
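
The changed behavior can be seen with a toy vector table. This is only a sketch, assuming spaCy v3.4.0 or later: the single key "apple" and its vector values are invented, and the table is attached to a blank English pipeline rather than loaded from a trained one.

```python
import numpy
import spacy
from spacy.vectors import Vectors

nlp = spacy.blank("en")
# Toy vector table with one entry; the key and the values are made up.
nlp.vocab.vectors = Vectors(data=numpy.ones((1, 3), dtype="f"), keys=["apple"])

with_vector = nlp("apple pie")  # "apple" has a vector, "pie" does not
without_vector = nlp("pie")     # no token in this doc has a vector

print(with_vector.has_vector)
print(without_vector.has_vector)
```

Under v3.3 and earlier, the second doc would also have reported a vector, because the check only looked at whether the vocab contained any vectors.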

📖 Documentation and examples

spaCy universe additions:
- Aim-spacy: An Aim-based spaCy experiment tracker.
- Asent: Fast, flexible and transparent sentiment analysis.
- spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
- spacy-report: Generates interactive reports for spaCy models.

👥 Contributors

@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

07 Jun 17:23

danieldk

v3.3.1

5fb597f

v3.3.1: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more

✨ New features and improvements

Add the SpanRuler component. This component saves a list of matched spans to Doc.spans[spans_key].
Support for JSON serialization and deserialization of Doc objects.
Add span analysis to debug data.
Allow data assets to be made optional in a spaCy project.
Prebuilt macOS ARM64 wheels are now available for all spaCy dependencies distributed by @explosion.
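
A short sketch of the SpanRuler and the JSON round trip, assuming spaCy v3.3.1 or later. The patterns and the example sentence are illustrative, and `spans_key` is set explicitly to "ruler" so the example does not rely on a default.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Matched spans are stored in doc.spans under the configured key.
ruler = nlp.add_pipe("span_ruler", config={"spans_key": "ruler"})
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Apple is opening its first big office in San Francisco.")
spans = [(span.text, span.label_) for span in doc.spans["ruler"]]
print(spans)

# Round-trip the Doc through a JSON-serializable dict.
doc_json = doc.to_json()
restored = Doc(nlp.vocab).from_json(doc_json)
print(restored.text)
```

Unlike an entity ruler, the spans end up in `doc.spans` rather than `doc.ents`, so overlapping matches can be kept.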

🔴 Bug fixes

Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted Doc objects.
Fix issue #10685: Fix serialization of SpanGroup objects that share the same name within one SpanGroups container.
Fix issue #10718: Remove debug print statements in walk_head_nodes to avoid acquiring the GIL.
Fix issue #10741: Make the StringStore.__getitem__ return type dependent on its parameter type.
Fix issue #10734: Support removal of overlapping terms in PhraseMatcher.
Fix issue #10772: Override SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
Fix issue #10817: Ensure that the term ROOT is in the glossary.
Fix issue #10830: Better errors for Doc.has_annotation and Matcher.
Fix issue #10864: Avoid pickling Doc inputs passed to Language.pipe().
Fix issue #10898: Fix schemas import in Doc.

⚠️ Backward incompatibilities

Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the name attribute. For example, the following pipeline component:
```
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
```
would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.

👥 Contributors

@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg

29 Apr 07:49

adrianeboyd

v3.3.0

497a708

v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish

✨ New features and improvements

Improved speeds for many components, see speed benchmarks for trained pipelines:
- Speed up parser and NER by using constant-time head lookups (#10048).
- Support unnormalized softmax probabilities in spacy.Tagger.v2 to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
- Speed up parser projectivization functions (#10241).
- Replace Ragged with faster AlignmentArray in Example for training (#10319).
- Improve Matcher speed (#10659).
- Improve serialization speed for empty Doc.spans (#10250).
NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with spacy init config -p trainable_lemmatizer or using the quickstart.
Language updates:
- Initial support for Lower Sorbian and Upper Sorbian.
- New noun chunks for Finnish.
- Updated noun chunks for French, Italian and Spanish.
- Additional updates for English, French, Italian, Japanese, Korean, Norwegian, Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
Big endian support with thinc v8.0.14+ and thinc-bigendian-ops.
Config comparisons with spacy debug diff-config.
displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
SpanCategorizer.set_candidates for debugging span suggesters.
The quickstart now supports adding spancat and trainable_lemmatizer components.
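
Besides the config route, the trainable lemmatizer can also be added to a pipeline in code. The sketch below assumes spaCy v3.3.0 or later and only constructs the component; it still has to be trained on annotated data before it predicts useful lemmas. The component name "lemmatizer" is a free choice.

```python
import spacy

nlp = spacy.blank("en")
# The factory name mirrors the `-p trainable_lemmatizer` option of
# `spacy init config`. The component is untrained at this point.
lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
print(nlp.pipe_names)
```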

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

| Package | Language | UPOS | Parser LAS | NER F |
| --- | --- | ---: | ---: | ---: |
| `fi_core_news_sm` | Finnish | 92.5 | 71.9 | 75.9 |
| `fi_core_news_md` | Finnish | 95.9 | 78.6 | 80.6 |
| `fi_core_news_lg` | Finnish | 96.2 | 79.4 | 82.4 |
| `ko_core_news_sm` | Korean | 86.1 | 65.6 | 71.3 |
| `ko_core_news_md` | Korean | 94.7 | 80.9 | 83.1 |
| `ko_core_news_lg` | Korean | 94.7 | 81.3 | 85.3 |
| `sv_core_news_sm` | Swedish | 95.0 | 75.9 | 74.7 |
| `sv_core_news_md` | Swedish | 96.3 | 78.5 | 79.3 |
| `sv_core_news_lg` | Swedish | 96.3 | 79.1 | 81.1 |

🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
| --- | ---: | ---: |
| `da_core_news_md` | 84.9 | 94.8 |
| `de_core_news_md` | 73.4 | 97.7 |
| `el_core_news_md` | 56.5 | 88.9 |
| `fi_core_news_md` | - | 86.2 |
| `it_core_news_md` | 86.6 | 97.2 |
| `ko_core_news_md` | - | 90.0 |
| `lt_core_news_md` | 71.1 | 84.8 |
| `nb_core_news_md` | 76.7 | 97.1 |
| `nl_core_news_md` | 81.5 | 94.0 |
| `pl_core_news_md` | 87.1 | 93.7 |
| `pt_core_news_md` | 76.7 | 96.9 |
| `ro_core_news_md` | 81.8 | 95.5 |
| `sv_core_news_md` | - | 95.5 |

🔴 Bug fixes

Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
Fix issue #9443: Fix Scorer.score_cats for missing labels.
Fix issue #9669: Fix entity linker batching.
Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
Fix issue #9904: Fix textcat loss scaling.
Fix issue #9956: Compare all Span attributes consistently.
Fix issue #10073: Add "spans" to the output of doc.to_json.
Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
Fix issue #10189: Allow Example to align whitespace annotation.
Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
Fix issue #10324: Fix Tok2Vec for empty batches.
Fix issue #10347: Update basic functionality for rehearse.
Fix issue #10394: Fix Vectors.n_keys for floret vectors.
Fix issue #10400: Use meta in util.load_model_from_config.
Fix issue #10451: Fix Example.get_matching_ents.
Fix issue #10460: Fix initial special cases for Tokenizer.explain.
Fix issue #10521: Stream large assets on download in spaCy projects.
Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
Fix issue #10551: Add automatic vector deduplication for init vectors.

🚀 Notes about upgrading from v3.2

To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
Doc.from_docs now includes Doc.tensor by default and supports excluding fields through an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
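
As an illustration of the exclude argument on Doc.from_docs, the sketch below merges two small docs and skips the tensor. It assumes spaCy v3.3.0 or later; the example texts are made up.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
docs = [nlp("Hello world."), nlp("Second sentence.")]

# Concatenate the docs but leave out Doc.tensor, which is included by default.
merged = Doc.from_docs(docs, exclude=["tensor"])
print(merged.text)
print(len(merged))
```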

📖 Documentation and examples

spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.