Sure. The warning (`UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'`) does indeed show up; my bad for not checking it first. If you want, I can create a PR with some doc edits that state what is going on, but perhaps the warning is enough.
Description
The documentation for `CountVectorizer` and `TfidfVectorizer` is not clear about the interaction between `token_pattern` and passing a custom `tokenizer`. Currently, when a `tokenizer` is passed, the `token_pattern` is ignored. But the docstring entry for the `tokenizer` parameter only mentions "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me, it was not immediately clear that this meant that `token_pattern` was not used at all.

Here's a user who got thrown by this: Stackoverflow
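
To make the precedence concrete, here is a minimal, stdlib-only sketch of the behavior described above. It does not call scikit-learn: `build_tokenizer_sketch` is a hypothetical stand-in written for illustration, and it emits the warning when the tokenizer is built, which is a simplification of where the library actually warns. The only details taken from scikit-learn are the documented default `token_pattern` and the warning text quoted in this thread.

```python
import re
import warnings

# scikit-learn's documented default: tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def build_tokenizer_sketch(tokenizer=None, token_pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical model of the precedence: a custom tokenizer wins outright."""
    if tokenizer is not None:
        if token_pattern is not None:
            # Warning text as quoted in this thread (simplified: raised here, at build time).
            warnings.warn(
                "The parameter 'token_pattern' will not be used "
                "since 'tokenizer' is not None'",
                UserWarning,
            )
        # token_pattern is never consulted on this path.
        return tokenizer
    # No custom tokenizer: fall back to the regular expression.
    return re.compile(token_pattern).findall


text = "a b-c 42 xyz"
custom_pattern = r"\b\w+\b"  # illustrative pattern that also keeps one-character tokens

# Only token_pattern is given, so the regex decides the tokens.
by_pattern = build_tokenizer_sketch(token_pattern=custom_pattern)(text)

# Both are given: the callable decides, and the pattern is silently unused
# apart from the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    by_callable = build_tokenizer_sketch(
        tokenizer=str.split, token_pattern=custom_pattern
    )(text)

print(by_pattern)   # ['a', 'b', 'c', '42', 'xyz']
print(by_callable)  # ['a', 'b-c', '42', 'xyz']
print(len(caught))  # 1
```

In this model, the hyphenated `b-c` survives in the second result because `str.split` only splits on whitespace; the regex passed alongside it has no effect on the output.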
Some things I can think of: