
Wrong span for compounds? #264

Open
ghost opened this issue Jul 30, 2019 · 9 comments

@ghost

ghost commented Jul 30, 2019

Hey there,
I was wondering whether get_span_for_compound_noun(noun) is implemented correctly. Iterating over the lefts is not going to work for triple (or longer) compounds, since their parts won't all be rooted at the same noun but attach in a cascading fashion (this would not capture the compound noun "pool table stick", for example). This is what I ended up doing:

def get_span_for_compound_word(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end_i = noun.i
    start_i = end_i
    lefts = list(noun.lefts)
    while lefts and lefts[-1].dep_ == "compound":
        start_i -= 1
        lefts = list(lefts[-1].lefts)
    return (start_i, end_i)

Technically, for it to be a proper span, it probably should return end_i + 1?
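For what it's worth, Python slicing (and spaCy's Doc slicing) uses an exclusive end index, which is why the head noun gets dropped without the + 1. A plain-list sketch (hypothetical indices, no model needed):

```python
# Toy illustration of the inclusive- vs. exclusive-end question,
# using a plain token list instead of a spaCy Doc (an assumption
# just to keep this runnable without a model).
tokens = ["the", "pool", "table", "stick"]

start_i, end_i = 1, 3  # indexes of "pool" and "stick"

# Slicing with the raw end index drops the head noun...
print(tokens[start_i:end_i])      # ['pool', 'table']
# ...so a proper span needs end_i + 1:
print(tokens[start_i:end_i + 1])  # ['pool', 'table', 'stick']
```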

@ghost ghost added the bug label Jul 30, 2019
@bdewilde
Collaborator

bdewilde commented Aug 1, 2019

Hey @ardeego, thanks for reporting! I'll fix this. Just to be sure I get this right, though, could you provide a few example sentences containing both 2- and 3+ word compound nouns? (I'm having a hard time coming up with good examples to test...)

@bdewilde
Collaborator

bdewilde commented Aug 1, 2019

Another reason for more examples is that I don't think I'm seeing the issue you mentioned in the first example I came up with:

>>> doc = nlp("The cherry blossom trees were in full bloom.")
>>> for tok in doc:
...     print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
cherry => []
blossom => []
trees => [(The, 'det'), (cherry, 'compound'), (blossom, 'compound')]
were => [(trees, 'nsubj')]
in => []
full => []
bloom => [(full, 'amod')]
. => []

You can see that both "cherry" and "blossom" are in the lefts of "trees".

@ghost
Author

ghost commented Aug 1, 2019

Hi, I think it gets it right for the most part, but I've seen the issue here:

doc = nlp(u"The microsoft browser program window shadow isn't looking very good on the screen")
for tok in doc:
    print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
microsoft => []
browser => [(microsoft, 'compound')]
program => []
window => [(browser, 'compound'), (program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I used this to better visualize the issue:

from spacy import displacy
displacy.render(doc, style="dep")

It happens most consistently when a proper noun is part of the compound. Does this parse the same way for you? Otherwise it might be an issue with spaCy?

@ghost
Author

ghost commented Aug 1, 2019

By the way, the code I posted earlier is not a fix either, because it misses the cases where your code correctly skipped over the lefts.

@bdewilde
Collaborator

bdewilde commented Aug 2, 2019

Mine actually parses differently. (I'm using the small English 2.1 model; which one are you using?)

The => []
microsoft => []
browser => [(microsoft, 'amod')]
program => [(browser, 'compound')]
window => [(program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I'm honestly not sure how to proceed. If the parser were 100% correct, would we expect all nouns in a single compound noun to be in the .lefts of the primary noun? If not, what's the expected form of the parse?

@ghost
Author

ghost commented Aug 2, 2019

I was using the large English model:

  • Model: en_core_web_lg
  • spaCy version: 2.1.4

I'm not sure how to proceed either, nor am I sure what 100% correct would mean for a compound. I thought going over the lefts of a noun chunk would be sufficient, until I found several examples where it was not. I have seen a whole lot of different compound dependency arrangements. I guess the full compound would be the longest "compound dependency path" in the parse tree. I'm sitting a bit on the fence here, as I wonder whether a longer compound noun is necessarily better than a partial one: if the compound gets too long, it becomes very specific and very rare over the corpus, so I wonder how much downstream value such a compound would serve. Maybe @honnibal can advise on parsing correctness?
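To make the "longest compound dependency path" idea concrete, here's a toy sketch over hand-written (head, dep) arcs mirroring the parse printed above. The dict representation is an assumption for illustration only, not spaCy API:

```python
# Each child token index maps to (head_index, dep_label), following
# the "microsoft browser program window shadow" parse printed above
# (token indices are hypothetical but consistent).
deps = {
    1: (2, "amod"),      # microsoft -> browser
    2: (3, "compound"),  # browser -> program
    3: (4, "compound"),  # program -> window
    4: (5, "compound"),  # window -> shadow
}

def compound_chain_length(i, deps):
    """Length of the chain of 'compound' arcs starting at token i."""
    n = 0
    while i in deps and deps[i][1] == "compound":
        i = deps[i][0]
        n += 1
    return n

# The longest compound path here runs browser -> program -> window -> shadow:
print(max(compound_chain_length(i, deps) for i in deps))  # 3
```

Under this view, "microsoft" would be excluded because its arc is amod, not compound, which matches the parse you got.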

@ghost
Author

ghost commented Aug 5, 2019

This is the non-recursive solution I ended up settling on (it handles all the parsing variations of compound nouns I came across). It's not as slick as your itertools solution, but performance should be pretty good, since no generators have to be expanded into lists.

def get_span_for_compound_noun(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                # descend one level: keep following the compound chain
                start = token.i
                lefts = token.lefts
                break
        else:
            # no compound among the lefts, so the chain ends here
            break
    return (start, end)
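Here's how the logic behaves on the two parse shapes discussed in this thread, using a minimal stand-in for spacy.tokens.Token (an assumption, not spaCy API) so it can be exercised without loading a model:

```python
from dataclasses import dataclass, field

# Minimal stand-in for spacy.tokens.Token, exposing only the .i,
# .dep_ and .lefts attributes the function uses -- an assumption so
# the logic runs without a model.
@dataclass
class FakeToken:
    i: int
    dep_: str
    lefts: list = field(default_factory=list)

def get_span_for_compound_noun(noun):
    # same logic as the function above, repeated so this
    # snippet runs standalone
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                start = token.i
                lefts = token.lefts
                break
        else:
            break
    return (start, end)

# Cascading parse: pool <-compound- table <-compound- stick
pool = FakeToken(0, "compound")
table = FakeToken(1, "compound", [pool])
stick = FakeToken(2, "dobj", [table])
print(get_span_for_compound_noun(stick))   # (0, 2)

# Flat parse: both compounds attach directly to the head noun
the = FakeToken(0, "det")
cherry = FakeToken(1, "compound")
blossom = FakeToken(2, "compound")
trees = FakeToken(3, "nsubj", [the, cherry, blossom])
print(get_span_for_compound_noun(trees))   # (1, 3)
```

Both shapes yield the full inclusive span, which is what the lefts-only iteration missed in the cascading case.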

@bdewilde
Collaborator

bdewilde commented Aug 9, 2019

I'm still a bit torn about this, but if it works, I'm happy to accept the empirical evidence. :) Could you provide me with a handful of example sentences and their expected outputs? I'll add a test to make sure everything works as expected. Thanks for keeping on this!

@ghost
Author

ghost commented Aug 14, 2019

Will make some time for it. Thanks.
