
Tokenizer attribute .tokens_from_list deprecated #152

Open
fishcakebaker opened this issue Jun 18, 2021 · 3 comments


fishcakebaker commented Jun 18, 2021

The tokeniser attribute .tokens_from_list has been deprecated in spaCy.

This is used in Chapter 7, Section 7.8 "Advanced Tokenisation, Stemming and Lemmatization" in block In[39].

I'm using spaCy version 3.0.6, which I'm guessing is several versions newer than the one the book uses; I just can't find the version noted in my copy.

Any suggestions for working around this function? I'm a bit of a newbie, and my online searches have only led down rabbit holes so far.

@Tanvi09Garg

Instead of using old_tokenizer.tokens_from_list, you can substitute any custom tokenizer that does the correct input -> Doc conversion with the correct vocab for nlp.tokenizer:

from typing import List, Union

from spacy.tokens import Doc
from spacy.vocab import Vocab


class _PretokenizedTokenizer:
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str]) -> Doc:
        """Call the tokenizer on input inp.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            return Doc(self.vocab, words=inp)
        else:
            raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
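
For reference, here is one way to wire this in (a minimal sketch, assuming spaCy v3 and the en_core_web_sm model; note that in v3, nlp() only accepts a string or a Doc, so for a pretokenized list you call the tokenizer directly and then run the pipeline on the resulting Doc):

import spacy

en_nlp = spacy.load("en_core_web_sm")
en_nlp.tokenizer = _PretokenizedTokenizer(en_nlp.vocab)

# Plain strings still work: the custom tokenizer splits them on whitespace.
doc = en_nlp("The dogs were running quickly")

# For an already tokenized list, build the Doc first, then run the pipeline on it.
doc = en_nlp(en_nlp.tokenizer(["The", "dogs", "were", "running", "quickly"]))
print([token.lemma_ for token in doc])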

@Tanvi09Garg

To clarify the type hint above: List[str] describes pretokenized input, i.e. a list of token strings; the tokenizer accepts either such a list or a raw string to split on whitespace.


ypauchard commented Dec 13, 2022

Probably similar to @Tanvi09Garg's approach, here is what works for me:

import re
import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer
# (?u) sets unicode flag, i.e. patterns are unicode
# \\b word boundary: the end of a word is indicated by whitespace or a non-alphanumeric character
# \\w alphanumeric: [0-9a-zA-Z_]

class RegexTokenizer:
    """Spacy custom tokenizer
        Reference https://spacy.io/usage/linguistic-features#custom-tokenizer
    """
    def __init__(self, vocab, regex_pattern='(?u)\\b\\w\\w+\\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False  # no space after the last word

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)

def custom_tokenizer(document):
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer=custom_tokenizer)
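
To sanity-check the whole chain (a quick sketch; the two example sentences are just illustrative):

sample = ["Our meeting today was worse than yesterday",
          "I'm scared of meeting the clients tomorrow"]
X = vect.fit_transform(sample)
# get_feature_names_out() requires scikit-learn >= 1.0;
# on older versions use get_feature_names() instead.
print(vect.get_feature_names_out())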

It runs a bit slowly; any suggestions to speed it up?
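
One idea (a sketch, not benchmarked): CountVectorizer calls the tokenizer once per document, so every document goes through nlp() individually. Lemmatizing the whole corpus in one batched pass with nlp.pipe first, then vectorizing the pre-lemmatized strings with the default token pattern, usually helps. text_train below is a placeholder for your list of documents:

def lemmatize_all(documents, batch_size=50):
    # nlp.pipe processes documents in batches, which is typically much
    # faster than calling nlp() once per document.
    for doc in nlp.pipe(documents, batch_size=batch_size):
        yield " ".join(token.lemma_ for token in doc)

lemmatized = list(lemmatize_all(text_train))  # text_train: your corpus (placeholder)
vect = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b")
X = vect.fit_transform(lemmatized)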
