Adding already tokenized document to spaCy pipeline #2606

Answered by ines

Yes, you can always initialise a Doc object directly with the shared vocab, and pass in a list of words:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
doc = Doc(nlp.vocab, words=['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday'])

The spaces keyword argument lets you pass in a list of boolean values, one per token, indicating whether that token is followed by whitespace. Here's an example:

doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, True, True])
print(doc.text)
# 'hello world ! ' (the final True adds a trailing space)
doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, False, False])
print(doc.text)
# 'hello world!'

Of course, this only works if your data actually contains…
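Putting the idea together, here is a minimal end-to-end sketch of feeding a pre-tokenized document through the pipeline. It uses spacy.blank('en') instead of en_core_web_sm so no model download is needed; with a loaded model, the same loop over nlp.pipeline applies the model's components to the manually constructed Doc, bypassing spaCy's own tokenizer:

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline is assumed here for illustration; swap in
# spacy.load('en_core_web_sm') to get tagging, parsing, etc.
nlp = spacy.blank('en')

words = ['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday']
doc = Doc(nlp.vocab, words=words)

# Apply each remaining pipeline component to the pre-tokenized Doc.
# This skips the tokenizer entirely, so your tokenization is preserved.
for name, proc in nlp.pipeline:
    doc = proc(doc)

print([token.text for token in doc])
```

The tokens in the resulting doc are exactly the strings you passed in, so any downstream annotations align with your original tokenization.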

Labels: usage (General spaCy usage), feat / tokenizer (Feature: Tokenizer)
This discussion was converted from issue #2606 on December 10, 2020 13:28.