Update SpaCy support to cover new features #176

Open
frreiss opened this issue Feb 17, 2021 · 0 comments
Labels: good first issue, help wanted

frreiss commented Feb 17, 2021

SpaCy 3.0's language models now produce some additional features that we don't currently translate to DataFrames. The parse tree information now includes each token's children and ancestors. There is an `is_sent_start` flag that indicates whether a token is at the beginning of a sentence. Embeddings are available in the `vector` field of `Token`. There are probably a few more; see https://spacy.io/api/token for the full list.
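As a rough illustration of what translating these attributes could look like, here is a minimal sketch that flattens them into one DataFrame row per token. The function name `extra_token_features`, the helper `_stub`, and the output column names are hypothetical and not part of the existing code. So that the sketch runs without downloading a model, it uses small stand-in objects that expose the same attribute names as a SpaCy `Token` (`i`, `children`, `ancestors`, `is_sent_start`, `vector`); the tokens of a real `Doc` should be usable in the same way.

```python
from types import SimpleNamespace

import pandas as pd


def extra_token_features(tokens):
    """Flatten per-token attributes into one DataFrame row per token.

    `tokens` is any iterable of objects exposing the Token attributes
    `i`, `children`, `ancestors`, `is_sent_start` and `vector`.
    """
    rows = []
    for tok in tokens:
        rows.append({
            "id": tok.i,
            # Store tree links as token indices rather than Token objects.
            "children": [c.i for c in tok.children],
            "ancestors": [a.i for a in tok.ancestors],
            "is_sent_start": tok.is_sent_start,
            "vector": tok.vector,
        })
    return pd.DataFrame(
        rows, columns=["id", "children", "ancestors", "is_sent_start", "vector"]
    )


def _stub(i, sent_start):
    # Hypothetical stand-in for a SpaCy Token, for demonstration only.
    return SimpleNamespace(
        i=i, children=[], ancestors=[], is_sent_start=sent_start, vector=[0.0, 0.0]
    )


# Stand-in parse of "I like tea": "like" is the root with two children.
i_tok, like, tea = _stub(0, True), _stub(1, False), _stub(2, False)
like.children = [i_tok, tea]
i_tok.ancestors = [like]
tea.ancestors = [like]

features = extra_token_features([i_tok, like, tea])
```

Storing children and ancestors as lists of token indices is one possible design choice; it keeps the DataFrame free of references to SpaCy objects.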

We should extend the existing SpaCy support in https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/io/spacy.py to support these additional features if present.

With these additional features, the DataFrame representation of a SpaCy language model's full output is getting a bit large. It would be a good idea to also add a facility to produce only the DataFrame columns that your application needs: say, an additional argument to `make_tokens_and_features` that replaces and generalizes the existing `add_left_and_right` argument and controls which columns appear in the output.
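One way the column selection could behave is sketched below. The helper `select_feature_columns` and its `columns` parameter are assumptions made for illustration, not an existing API, and the column names in the sample DataFrame are placeholders. The sketch filters an already-built DataFrame; a real implementation inside `make_tokens_and_features` would presumably skip computing the unrequested columns altogether.

```python
import pandas as pd


def select_feature_columns(features, columns=None):
    """Return only the requested columns, in the requested order.

    `columns=None` keeps every column, so callers that want the
    full output do not have to list anything.
    """
    if columns is None:
        return features
    missing = [c for c in columns if c not in features.columns]
    if missing:
        raise ValueError(f"Unknown feature columns: {missing}")
    return features[list(columns)]


# Placeholder token features; the column names are illustrative only.
all_features = pd.DataFrame({
    "id": [0, 1, 2],
    "lemma": ["I", "like", "tea"],
    "pos": ["PRON", "VERB", "NOUN"],
    "is_sent_start": [True, False, False],
})

subset = select_feature_columns(all_features, columns=["id", "pos"])
```

Rejecting unknown column names with an error, rather than silently ignoring them, makes typos in the requested list visible to the caller.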

frreiss added the good first issue and help wanted labels on Feb 17, 2021