Sync `docs/llm_main` with `master` #13286

rmitsch · 2024-01-29T14:31:53Z

Description

Sync docs/llm_main with master.

Types of change

Chore.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

* add comment that pipeline is a custom one * add link to NEL tutorial * prettier * revert prettier reformat * revert prettier reformat (2) * fix typo Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Update Tokenizer.explain for special cases with whitespace Update `Tokenizer.explain` to skip special case matches if the exact text has not been matched due to intervening whitespace. Enable fuzzy `Tokenizer.explain` tests with additional whitespace normalization. * Add unit test for special cases with whitespace, xfail fuzzy tests again

* Update the "Missing factory" error message
  This accounts for model installations that took place during the current Python session.
* Add a note about Jupyter notebooks
* Move error to `spacy.cli.download`
  Add extra message for Jupyter sessions
* Add additional note for interactive sessions
* Remove note about `spacy-transformers` from error message
* `isort`
* Improve checks for colab (also helps displacy)
* Update warning messages
* Improve flow for multiple checks

---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* add language extensions for norwegian nynorsk and faroese * update docstring for nn/examples.py * use relative imports * add fo and nn tokenizers to pytest fixtures * add unittests for fo and nn and fix bug in nn * remove module docstring from fo/__init__.py * add comments about example sentences' origin * add license information to faroese data credit * format unittests using black * add __init__ files to tests/lang/nn and tests/lang/fo * fix import order and use relative imports in fo/__init__.py and nn/__init__.py * Make the tests a bit more compact * Add fo and nn to website languages * Add note about jul. * Add "jul." as exception --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update `TextCatBOW` to use the fixed `SparseLinear` layer
  A while ago, we fixed the `SparseLinear` layer to use all available parameters: explosion/thinc#754
  This change updates `TextCatBOW` to `v3`, which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested.
  While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two.
* Replace TextCatBOW `length_exponent` parameter by `length`
  We now round up the length to the next power of two if it isn't a power of two.
* Remove some tests for TextCatBOW.v2
* Fix missing import
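
The rounding mentioned for `length` can be illustrated with a short sketch. This is not the code from the commit; `next_power_of_two` is a hypothetical helper name, and it only shows one common way to round a positive integer up to the next power of two.

```python
def next_power_of_two(length: int) -> int:
    """Round a positive integer up to the nearest power of two.

    Hypothetical helper for illustration only; the actual spaCy
    implementation may differ.
    """
    if length < 1:
        raise ValueError("length must be a positive integer")
    # (length - 1).bit_length() is the number of bits needed to represent
    # length - 1, so shifting 1 by that amount yields the smallest power
    # of two that is >= length.
    return 1 << (length - 1).bit_length()


# A value that is already a power of two is left unchanged,
# anything else is rounded up.
print(next_power_of_two(262144))
print(next_power_of_two(300000))
```

Under this sketch, a requested length of 300000 would become 524288, while 262144 stays as it is.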

…y blog. (#13197) * Update README.md to include links for GPU processing, LLM, and spaCy's blog. * Create ojo4f3.md * corrected README to most current version with links to GPU processing, LLM's, and the spaCy blog. * Delete .github/contributors/ojo4f3.md * changed LLM icon Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add TextCatReduce.v1
  This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer.
  This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers.
  This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer.
* Doc fixes
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence
* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy
* Add back a test for TextCatCNN.v2
* Replace TextCatCNN in pipe configurations and templates
* Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor
* Add last reduction (`use_reduce_last`)
* Remove non-working TextCatCNN Netlify redirect
* Revert layer changes for the quickstart
* Revert one more quickstart change
* Remove unused import
* Fix docstring
* Fix setting name in error message

---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
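
To make the pooling step concrete, here is a rough pure-Python sketch of concatenating several reductions over token vectors. It is not the `TextCatReduce.v1` implementation: `reduce_tokens` is a hypothetical function, the flag names are modeled on the `use_reduce_last` setting mentioned above, and the concatenation order is an assumption.

```python
from typing import List, Sequence


def reduce_tokens(
    vectors: Sequence[Sequence[float]],
    use_reduce_first: bool = False,
    use_reduce_last: bool = False,
    use_reduce_max: bool = False,
    use_reduce_mean: bool = True,
) -> List[float]:
    """Pool per-token vectors into one document vector.

    Illustrative sketch only: every enabled reduction produces a vector
    of the token width, and the results are concatenated (the order used
    here is an assumption).
    """
    if not vectors:
        raise ValueError("need at least one token vector")
    width = len(vectors[0])
    pooled: List[float] = []
    if use_reduce_first:
        pooled.extend(vectors[0])  # vector of the first token
    if use_reduce_last:
        pooled.extend(vectors[-1])  # vector of the last token
    if use_reduce_max:
        # element-wise maximum over all tokens
        pooled.extend(max(v[i] for v in vectors) for i in range(width))
    if use_reduce_mean:
        # element-wise mean over all tokens
        pooled.extend(sum(v[i] for v in vectors) / len(vectors) for i in range(width))
    return pooled


tokens = [[1.0, 4.0], [3.0, 2.0]]
# Mean pooling only, which is the reduction the text attributes to TextCatCNN.
print(reduce_tokens(tokens))
# Several reductions enabled: the output is wider because they are concatenated.
print(reduce_tokens(tokens, use_reduce_first=True, use_reduce_max=True))
```

With only the mean enabled, the pooled vector has the same width as a token vector; each additional reduction adds another block of that width to the classifier input.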

* Add spacy.TextCatParametricAttention.v1
  This layer is a simplification of the ensemble classifier that only uses parametric attention. We have found empirically that with a sufficient amount of training data, using the ensemble classifier with BoW does not provide significant improvement in classifier accuracy. However, plugging in a BoW classifier does reduce GPU training and inference performance substantially, since it uses a GPU-only kernel.
* Fix merge fallout

Before this change, the workers of pipe calls with `n_process != 1` were stopped by calling `terminate` on the processes. However, terminating a process can leave queues, pipes, and other concurrent data structures in an invalid state.

With this change, we stop using `terminate` and take the following approach instead:

* When all documents are processed, the parent process puts a sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to let them finish up gracefully.
* Worker processes break from the queue processing loop when the sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and the error handler is set to raise an exception. In this case, we cannot rely on the sentinel to finish all workers: the queue is a FIFO queue and there may be other work queued up before the sentinel. We use the following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write results to the channel. This will fail because the channel was closed by the parent and the worker will break from the processing loop.
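
The error-free shutdown path can be sketched roughly as follows. This is a minimal illustration, not spaCy's implementation: it uses threads and `queue.Queue` in place of worker processes and their channels so that it runs anywhere, it does not cover the error scenario, and `_SENTINEL`, `worker`, and `run_workers` are hypothetical names.

```python
import queue
import threading
from typing import List

# Hypothetical end-of-work marker; compared by identity in the worker loop.
_SENTINEL = object()


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Process items until the sentinel arrives, then exit the loop."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            break  # graceful exit: no terminate() needed
        outbox.put(item.upper())  # stand-in for the real per-batch work


def run_workers(texts: List[str], n_workers: int = 2) -> List[str]:
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(inbox, outbox)) for inbox in inboxes
    ]
    for w in workers:
        w.start()
    # Distribute the work round-robin over the worker queues.
    for i, text in enumerate(texts):
        inboxes[i % n_workers].put(text)
    # All work is queued: put one sentinel at the end of each worker's queue.
    for inbox in inboxes:
        inbox.put(_SENTINEL)
    # Wait for every worker to drain its queue and finish on its own.
    for w in workers:
        w.join()
    # Every worker has exited, so all results are already in the outbox.
    return sorted(outbox.get() for _ in range(len(texts)))


print(run_workers(["a", "b", "c"]))
```

Because each queue is FIFO, the sentinel is only seen after all earlier items have been handled, which is also why the error case needs the separate handling described above.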

The doc/token extension serialization tests add extensions that are not serializable with pickle. This didn't cause issues before due to the implicit run order of tests. However, test ordering has changed with pytest 8.0.0, leading to failed tests in test_language. Update the fixtures in the extension serialization tests to do proper teardown and remove the extensions.

ines and others added 30 commits October 6, 2023 14:22

Inline displaCy visualizations in docs (#13050) [ci skip]

b83f1e3

Update usage sidebar and nav alert [ci skip]

65e7bd5

Restore spacy.cli.project API (#13053)

77c568e

* Restore spacy.cli.project API * Fix typing errors, add simple import test

Add binary examples for Textcat task in spacy-llm (#13051)

d72029d

* Add examples for binary classification. * Fix example. * Remove binary textcat example. Format. * Rephrase.

Support Any comparisons for Token and Span (#13058)

ea1befa

* Support Any comparisons for Token and Span * Preserve previous behavior for None

Update __all__ fields (#13063)

699dd8b

* update all for pipeline.init * add all in training.init * add all in kb.init * alphabetically

Set version to v3.7.2 (#13066)

a89eae9

Update LICENSE (#13078)

d717123

Add note in docs on score_weight config if using a non-default `spa…

9deaac9

…ns_key` for SpanCat (#13093) * Add note on score_weight if using a non-default span_key for SpanCat. * Fix formatting. * Fix formatting. * Fix typo. * Use warning infobox. * Fix infobox formatting.

Fix spancat typo. (#13095)

0c15876

Update llm docs to clarify task-specific factories (#13082)

a804b83

* fix typo * add examples to specify custom model for task-specific factory

Fix displacy span stacking (#13068)

c4e2daf

* Fix displacy span stacking. * Format. Remove counter. * Remove test files. * Add unit test. Refactor to allow for unit test. * Fix off-by-one error in tests.

CI: Switch to stable python 3.12 and limit 3.11 runs (#13104)

92f1d0a

Update for numpy 2.0 deprecations (#13103)

c096c5c

- Replace `np.trapz` with vendored `trapezoid` from scipy - Replace `np.float_` with `np.float64`
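
For context, `np.trapz` integrates sampled values with the composite trapezoidal rule. The following pure-Python sketch shows that rule for 1-D input only; it is not the `trapezoid` function vendored from scipy, and the signature here is a simplified assumption.

```python
from typing import Optional, Sequence


def trapezoid(
    y: Sequence[float], x: Optional[Sequence[float]] = None, dx: float = 1.0
) -> float:
    """Composite trapezoidal rule for 1-D samples (illustrative sketch).

    If x is given, it provides the sample positions; otherwise the
    samples are assumed to be evenly spaced by dx.
    """
    if x is not None and len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0.0
    for i in range(len(y) - 1):
        # Width of the i-th interval times the mean of its two endpoints.
        width = dx if x is None else x[i + 1] - x[i]
        total += width * (y[i] + y[i + 1]) / 2.0
    return total


# Area under y = x sampled at 0, 1, 2 with unit spacing.
print(trapezoid([0.0, 1.0, 2.0]))
```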

Unskip python 3.12 remote tests (#13110)

ff9ddb6

feat: add extra lexical attributes (#13106)

2b8da84

Co-authored-by: Ridge Kimani <ridgekimani@gmail.com>

Add preferred use of build for package CLI (#13109)

513bbd5

Build with `build` if available. Warn and fall back to previous `setup.py`-based builds if `build` build fails.

Add Redfield NLP Nodes to the Spacy Universe (#13133)

9f2ce6b

Add swag [ci skip]

8f69e56

Add merch link [ci skip]

bf7c2ea

Docs: update trf_data examples and pipeline design info (#13164)

e467573

Update links [ci skip]

f78b91c

Update links [ci skip]

8cfccdd

Update README.md [ci skip]

7df328f

Type documentation fixes for Doc (#13187)

56fc3bc

* correct char_span output type - can be None * unify type of exclude parameter * black * further fixes to from_dict and to_dict * formatting

ojo4f3 and others added 16 commits December 18, 2023 09:49

Fix typo in method name

c608bae

Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main

256468c

# Conflicts: # website/docs/api/large-language-models.mdx

Fix LLM docs on task factories.

575c405

Merge pull request #13253 from explosion/chore/sync-master-with-llm_main

3b3b5cd

Sync `master` with `docs/llm_main`

Merge remote-tracking branch 'upstream/master' into patch-1

5a2ad4a

test_find_available_port: use port 5001 (#13255)

afac7fb

macOS now uses port 5000 for the AirPlay receiver functionality, so this test will always fail on a macOS desktop (unless AirPlay receiver functionality is disabled like in CI).

Merge pull request #13240 from mauricesvp/patch-1

a8894a8

Fix typo in method name

fix typo (#13254)

a493981

Clarify vocab docs (#13273)

7496e03

* add line to ensure that apple is in fact in the vocab * add that the vocab may be empty

Clarify data_path loading for apply CLI command (#13272)

68b85ea

* attempt to clarify additional annotations on .spacy file * suggestion by Daniël * pipeline instead of pipe

add custom code support to CLI speed benchmark (#13247)

00e938a

* add custom code support to CLI speed benchmark * sort imports * better copying for warmup docs

rmitsch added the docs Documentation and website label Jan 29, 2024

rmitsch self-assigned this Jan 29, 2024

rmitsch merged commit c38fdbe into docs/llm_main Jan 29, 2024
7 of 18 checks passed
