
Wrong span for compounds? #264

Open
ghost opened this issue Jul 30, 2019 · 9 comments

@ghost

ghost commented Jul 30, 2019

Hey there,
I was wondering whether get_span_for_compound_noun(noun) is implemented correctly. Iterating over the lefts is not going to work for triple (or longer) compounds, since their parts won't all be rooted at the same noun but attach in a cascading fashion (this would not capture the compound noun "pool table stick", for example). This is what I ended up doing:

def get_span_for_compound_word(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end_i = noun.i
    start_i = end_i
    lefts = list(noun.lefts)
    while lefts and lefts[-1].dep_ == "compound":
        start_i -= 1
        lefts = list(lefts[-1].lefts)
    return (start_i, end_i)

Technically, for it to be a proper span, it probably should return end_i + 1?
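For what it's worth, Python slicing (and spaCy's Doc slicing) uses an exclusive end index, which is why the head noun gets dropped without the + 1. A plain-list sketch (hypothetical indices, no model needed):

```python
# Toy illustration of the inclusive- vs. exclusive-end question,
# using a plain token list instead of a spaCy Doc (an assumption
# just to keep this runnable without a model).
tokens = ["the", "pool", "table", "stick"]

start_i, end_i = 1, 3  # indexes of "pool" and "stick"

# Slicing with the raw end index drops the head noun...
print(tokens[start_i:end_i])      # ['pool', 'table']
# ...so a proper span needs end_i + 1:
print(tokens[start_i:end_i + 1])  # ['pool', 'table', 'stick']
```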

@ghost ghost added the bug label Jul 30, 2019
@bdewilde
Collaborator

bdewilde commented Aug 1, 2019

Hey @ardeego, thanks for reporting! I'll fix this. Just to be sure I get this right, though, could you provide a few example sentences containing both 2- and 3+ word compound nouns? (I'm having a hard time coming up with good examples to test...)

@bdewilde
Collaborator

bdewilde commented Aug 1, 2019

Another reason for more examples is that I don't think I'm seeing the issue you mentioned in the first example I came up with:

>>> doc = nlp("The cherry blossom trees were in full bloom.")
>>> for tok in doc:
...     print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
cherry => []
blossom => []
trees => [(The, 'det'), (cherry, 'compound'), (blossom, 'compound')]
were => [(trees, 'nsubj')]
in => []
full => []
bloom => [(full, 'amod')]
. => []

You can see that both "cherry" and "blossom" are in the lefts of "trees".

@ghost
Author

ghost commented Aug 1, 2019

Hi, I think it gets it right for the most part, but I've seen the issue here:

doc = nlp(u"The microsoft browser program window shadow isn't looking very good on the screen")
for tok in doc:
    print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
microsoft => []
browser => [(microsoft, 'compound')]
program => []
window => [(browser, 'compound'), (program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I used this to better visualize the issue:

from spacy import displacy
displacy.render(doc, style="dep")

It happens most consistently when a proper noun is part of the compound. Does this parse the same way for you? Otherwise it might be an issue with spaCy?

@ghost
Author

ghost commented Aug 1, 2019

By the way, the code I posted earlier is not a fix either, because it misses the cases where your code correctly skipped over the lefts.

@bdewilde
Collaborator

bdewilde commented Aug 2, 2019

Mine actually parses differently. (I'm using the small English 2.1 model; which one are you using?)

The => []
microsoft => []
browser => [(microsoft, 'amod')]
program => [(browser, 'compound')]
window => [(program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I'm honestly not sure how to proceed. If the parser were 100% correct, would we expect all nouns in a single compound noun to be in the .lefts of the primary noun? If not, what's the expected form of the parse?

@ghost
Author

ghost commented Aug 2, 2019

I was using the large English model:

  • Model: en_core_web_lg
  • spaCy version: 2.1.4

I'm not sure how to proceed either, nor am I sure what 100% correct would mean for a compound. I thought going over the lefts of a noun chunk would be sufficient, until I found several examples where it was not. I have seen a whole lot of different compound dependency arrangements. I guess the full compound would be the longest "compound dependency path" in the parse tree. I'm sitting a bit on the fence here, as I wonder whether a longer compound noun is necessarily better than a partial one: if the compound gets too long, it becomes very specific and very rare over the corpus, so I wonder how much downstream value such a compound would serve. Maybe @honnibal can advise on parsing correctness?
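To make the "longest compound dependency path" idea concrete, here's a toy sketch over hand-written (head, dep) arcs mirroring the parse printed above. The dict representation is an assumption for illustration only, not spaCy API:

```python
# Each child token index maps to (head_index, dep_label), following
# the "microsoft browser program window shadow" parse printed above
# (token indices are hypothetical but consistent).
deps = {
    1: (2, "amod"),      # microsoft -> browser
    2: (3, "compound"),  # browser -> program
    3: (4, "compound"),  # program -> window
    4: (5, "compound"),  # window -> shadow
}

def compound_chain_length(i, deps):
    """Length of the chain of 'compound' arcs starting at token i."""
    n = 0
    while i in deps and deps[i][1] == "compound":
        i = deps[i][0]
        n += 1
    return n

# The longest compound path here runs browser -> program -> window -> shadow:
print(max(compound_chain_length(i, deps) for i in deps))  # 3
```

Under this view, "microsoft" would be excluded because its arc is amod, not compound, which matches the parse you got.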

@ghost
Author

ghost commented Aug 5, 2019

This is the non-recursive solution I ended up settling on (it handles all the parsing variations of compound nouns I came across). It's not as slick as your itertools solution, but performance should be pretty good, since no generators have to be expanded into lists.

def get_span_for_compound_noun(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                # descend one level: keep following the compound chain
                start = token.i
                lefts = token.lefts
                break
        else:
            # no compound among the lefts, so the chain ends here
            break
    return (start, end)
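Here's how the logic behaves on the two parse shapes discussed in this thread, using a minimal stand-in for spacy.tokens.Token (an assumption, not spaCy API) so it can be exercised without loading a model:

```python
from dataclasses import dataclass, field

# Minimal stand-in for spacy.tokens.Token, exposing only the .i,
# .dep_ and .lefts attributes the function uses -- an assumption so
# the logic runs without a model.
@dataclass
class FakeToken:
    i: int
    dep_: str
    lefts: list = field(default_factory=list)

def get_span_for_compound_noun(noun):
    # same logic as the function above, repeated so this
    # snippet runs standalone
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                start = token.i
                lefts = token.lefts
                break
        else:
            break
    return (start, end)

# Cascading parse: pool <-compound- table <-compound- stick
pool = FakeToken(0, "compound")
table = FakeToken(1, "compound", [pool])
stick = FakeToken(2, "dobj", [table])
print(get_span_for_compound_noun(stick))   # (0, 2)

# Flat parse: both compounds attach directly to the head noun
the = FakeToken(0, "det")
cherry = FakeToken(1, "compound")
blossom = FakeToken(2, "compound")
trees = FakeToken(3, "nsubj", [the, cherry, blossom])
print(get_span_for_compound_noun(trees))   # (1, 3)
```

Both shapes yield the full inclusive span, which is what the lefts-only iteration missed in the cascading case.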

@bdewilde
Collaborator

bdewilde commented Aug 9, 2019

I'm still a bit torn about this, but if it works, I'm happy to accept the empirical evidence. :) Could you provide me with a handful of example sentences and their expected outputs? I'll add a test to make sure everything works as expected. Thanks for keeping on this!

@ghost
Author

ghost commented Aug 14, 2019

Will make some time for it. Thanks.
