Update `TextCatBOW` to use the fixed `SparseLinear` layer #13149

danieldk · 2023-11-23T13:38:23Z

Description

A while ago, we fixed the `SparseLinear` layer to use all available parameters: explosion/thinc#754

This change updates `TextCatBOW` to `v3`, which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested.

While at it, `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two.
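To illustrate why a non-power-of-two length can leave parameters unevenly used, here is a minimal sketch. It assumes the hash is reduced to a table index by bit masking with `length - 1`; that mechanism is an assumption made for illustration and is not taken from the thinc source. The helper names `bucket_by_mask` and `used_buckets` are hypothetical.

```python
def bucket_by_mask(hash_value: int, length: int) -> int:
    """Map a hash to a table index by keeping the bits set in (length - 1).

    Assumption: masking is used for the index computation. This is a
    hypothetical stand-in, not the actual SparseLinear implementation.
    """
    return hash_value & (length - 1)


def used_buckets(length: int, n_hashes: int = 10_000) -> int:
    """Count how many distinct indices are ever hit for a given length.

    Consecutive integers serve as stand-in hash values.
    """
    return len({bucket_by_mask(h, length) for h in range(n_hashes)})


if __name__ == "__main__":
    for length in (8, 12, 16):
        print(f"length={length}: {used_buckets(length)} buckets used")
```

Under this assumption, a length of 12 gives the mask `0b1011`, so indices with bit 2 set (4 to 7) are never reached and a third of the table stays unused, while lengths of 8 and 16 use every slot.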

Types of change

Bugfix

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.


svlandeg

Looks great!

> Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two.

One alternative would be to take the highest power of two fitting within `length` if we don't want to expose these internals too much to users.

spacy/tests/pipeline/test_textcat.py

adrianeboyd · 2023-11-27T09:48:01Z

If you do keep `length_exponent` as a direct setting, I would vote strongly for a different name that's easier for users to understand. I don't immediately know what, though...

danieldk · 2023-11-27T13:08:55Z

> One alternative would be to take the highest power of two fitting within `length` if we don't want to expose these internals too much to users.

Rounding up seems like a good strategy. I'll update the PR to do that.

We now round up the length to the next power of two if it isn't a power of two.
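The two rounding strategies discussed in this thread can be sketched as follows. This is an illustrative sketch, not the code merged in this PR; the function names `next_power_of_two` and `previous_power_of_two` are hypothetical.

```python
def next_power_of_two(length: int) -> int:
    """Round up to the nearest power of two (powers of two are unchanged)."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    # (length - 1).bit_length() is the number of bits needed to index
    # `length` slots, so shifting 1 by it gives the smallest power of
    # two that is >= length.
    return 1 << (length - 1).bit_length()


def previous_power_of_two(length: int) -> int:
    """Round down to the highest power of two that fits within length."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    return 1 << (length.bit_length() - 1)


if __name__ == "__main__":
    for length in (1, 3, 262144, 300000):
        print(length, previous_power_of_two(length), next_power_of_two(length))
```

Rounding up never gives the model fewer parameters than the user asked for, whereas rounding down never exceeds the requested size. For a requested length of 300000, the two strategies yield 524288 and 262144 respectively.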

adrianeboyd

I feel like the power-of-2 bit could be emphasized a bit more for people coming back to this code in the future.

spacy/ml/models/textcat.py


svlandeg added the enhancement and feat / textcat labels Nov 23, 2023

svlandeg reviewed Nov 27, 2023


danieldk added 3 commits November 27, 2023 16:15

Replace TexCatBOW length_exponent parameter by length

d865f9b

We now round up the length to the next power of two if it isn't a power of two.

Remove some tests for TextCatBOW.v2

7d23caf

Fix missing import

4f18f31

svlandeg merged commit da7ad97 into explosion:master Nov 29, 2023
13 checks passed

adrianeboyd reviewed Nov 29, 2023

