CountVectorizer and TfidfVectorizer docs do not mention token_pattern gets ignored when passing a custom tokenizer #15740

Closed
stephantul opened this issue Nov 29, 2019 · 3 comments

Comments

@stephantul
Contributor

Description

The documentation for CountVectorizer and TfidfVectorizer is not clear about the interaction between token_pattern and a custom tokenizer. Currently, when a tokenizer is passed, token_pattern is ignored. But the docstring entry for the tokenizer parameter only says "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me, it was not immediately clear that this meant token_pattern is not used at all.
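A minimal sketch of the behavior described above, assuming a scikit-learn version where a custom tokenizer takes precedence over token_pattern. The documents and the whitespace_tokenizer helper are made up for illustration; recent versions may also print a UserWarning when both parameters are set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration.
docs = ["a-b c-d", "c-d e"]


def whitespace_tokenizer(text):
    """Hypothetical custom tokenizer: split on whitespace only."""
    return text.split()


# Both a custom tokenizer and a non-default token_pattern are passed.
# The pattern would match single letters, but it is never applied.
vec_custom = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=r"[a-z]")
vec_custom.fit(docs)
vocab_custom = sorted(vec_custom.vocabulary_)

# Same pattern without a tokenizer: now the pattern drives tokenization.
vec_pattern = CountVectorizer(token_pattern=r"[a-z]")
vec_pattern.fit(docs)
vocab_pattern = sorted(vec_pattern.vocabulary_)

print(vocab_custom)   # ['a-b', 'c-d', 'e']
print(vocab_pattern)  # ['a', 'b', 'c', 'd', 'e']
```

The first vocabulary contains the whitespace-separated tokens unchanged, which shows that the single-letter pattern had no effect once a tokenizer was supplied.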

Here's a user who got thrown by this: Stackoverflow

Some things I can think of:

  • raise a warning if the user passes a (non-standard) token pattern and a custom tokenizer
  • update the docstring to be explicit about the interaction
@jnothman
Member

jnothman commented Nov 30, 2019 via email

@stephantul
Contributor Author

stephantul commented Nov 30, 2019

Sure. The warning (UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None') does indeed show up; my bad for not checking first. If you want, I can create a PR with some doc edits that state what is going on, but perhaps the warning is enough.
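For anyone who wants to check whether their installed version emits this warning, here is a rough sketch using the standard warnings module. It assumes the warning is raised at fit time and that its message mentions token_pattern, as in the text quoted above; the split_on_spaces helper and the input string are hypothetical.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer


def split_on_spaces(text):
    """Hypothetical custom tokenizer used only to trigger the warning."""
    return text.split(" ")


with warnings.catch_warnings(record=True) as caught:
    # Record every warning, even ones that would normally be shown only once.
    warnings.simplefilter("always")
    vec = CountVectorizer(tokenizer=split_on_spaces, token_pattern=r"\w+")
    vec.fit(["some example text"])

# Keep only UserWarnings whose message refers to token_pattern.
messages = [
    str(w.message)
    for w in caught
    if issubclass(w.category, UserWarning) and "token_pattern" in str(w.message)
]
token_pattern_warned = len(messages) > 0

print(token_pattern_warned)
for message in messages:
    print(message)
```

On a version that predates the warning, token_pattern_warned would be False and nothing else is printed.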

@jnothman
Member

jnothman commented Dec 1, 2019

The warning is new. Let's see how it goes.

@jnothman jnothman closed this as completed Dec 1, 2019