
Spacy has inconsistency when dividing sentences #13346

Open
DhruvSondhi opened this issue Feb 22, 2024 · 5 comments
Labels
feat / parser Feature: Dependency Parser

Comments

@DhruvSondhi

Hello,

I am using spaCy to divide text into sentences after joining a set of words with whitespace, but this process shows unpredictable, hard-to-explain behaviour. I have a custom segmentation component in which I try to set custom sentence boundaries (i.e. is_sent_start).

Custom Function:

import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    i = 0
    while i < len(doc[:-1]):
        if doc[i].text.lower() in ["eq", "fig", "al", "table", "fig."]:
            doc[i + 1].is_sent_start = False
            i += 1
        elif doc[i].text in ["(", "'s"]:
            doc[i].is_sent_start = False
            i += 1
        elif doc[i].text in [".", ")."]:
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False
        i += 1
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("segm", before="parser")
nlp.pipeline

This is my nlp.pipeline:

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x29f4c3ee0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x29f4c3f40>),
 ('segm', <function __main__.set_custom_segmentation(doc)>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x29f8380b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x29e3ee4c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x29dd1f100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x29f7f7f40>)]
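Incidentally, note how the loop in set_custom_segmentation advances: when the first two branches match, i is incremented both inside the branch and at the end of the loop, so the token right after an abbreviation is never itself inspected. A minimal stdlib sketch of the same control flow (plain strings standing in for spaCy tokens; the example data is hypothetical):

```python
# Sketch of the control flow in set_custom_segmentation, using plain
# strings instead of spaCy tokens. "visited" records which indices the
# loop actually inspects.
tokens = ["in", "Fig.", "2", ".", "We"]

visited = []
i = 0
while i < len(tokens[:-1]):
    visited.append(i)
    if tokens[i].lower() in ["eq", "fig", "al", "table", "fig."]:
        # Branch 1: i is advanced here AND at the end of the loop,
        # so index i + 1 ("2" above) is skipped entirely.
        i += 1
    i += 1

print(visited)  # → [0, 1, 3]; index 2 is never examined
```

Whether skipping that token is intended or not, it means the branch tests for "(" and "." never see a token that directly follows an abbreviation.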

How to reproduce the behaviour

doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 . ##(this is the sentence to consider)## We refer the reader to fig. 1 of Forbes et al. ( 2011 ) and fig. 10 of Faifer et al. ( 2011 ) for real-world examples of our schematic plot, which show not only the mean gradients but also the individual GC data points. Figure 2.")

for sent in doc.sents:
    print(sent)

This is the current output:
(screenshot of the printed sentences)

The form of the tokens Fig. 2 . produces different sentence splits. Please see the following examples.

  1. Massive ETGs are summarized in a schematic way in Fig. 21 . (changed 2 . to 21 .)
     (screenshot of output)
  2. Massive ETGs are summarized in a schematic way in Fig. 21. (removed the space between 21 and the period)
     (screenshot of output)
  3. Massive ETGs are summarized in a schematic way in Fig. 2. (removed the space between 2 and the period)
     (screenshot of output)
  4. Massive ETGs are summarized in a schematic way in Fig. 1 . (changed 2 to 1)
     (screenshot of output)
  5. Massive ETGs are summarized in a schematic way in Fig. 3 . (changed 2 to 3)
     (screenshot of output)
  6. Massive ETGs are summarized in a schematic way in Fig. 4 . (changed 2 to 4)
     (screenshot of output)
  7. Massive ETGs are summarized in a schematic way in Fig. 4. (changed 2 to 4 and removed the space)
     (screenshot of output)
  8. Massive ETGs are summarized in a schematic way in Fig. 200. (changed 2 to 200 and removed the space)
     (screenshot of output)
  9. Massive ETGs are summarized in a schematic way in Fig. 200 . (changed 2 to 200)
     (screenshot of output)

The way sentence boundaries are assigned here is inconsistent. I have other examples as well, so I can share them here if needed.

Any help in understanding this would be appreciated.

Your Environment

  • spaCy version: 3.6.0
  • Platform: macOS-14.3.1-arm64-arm-64bit
  • Python version: 3.10.12
  • Pipelines: en_core_web_lg (3.6.0), en_core_web_sm (3.6.0)
danieldk (Contributor) commented Feb 22, 2024

One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines:

https://spacy.io/api/dependencyparser#assigned-attributes

If you have your own pipe that sets boundaries, you may want to run it after the dependency parser for this reason. Could you check whether this improves things for you?

@danieldk danieldk added feat / parser Feature: Dependency Parser more-info-needed This issue needs more information labels Feb 22, 2024
DhruvSondhi (Author) commented

Hello @danieldk,

Thank you for your response. I tried your suggestion and moved the custom segmentation function after the parser in nlp.pipeline, but I am facing an error.

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

I don't think this will work as the parsing being done here interferes with the custom segmentation boundaries that I require to be set due to certain edge cases such as Fig., eg., etc.

I saw a similar issue here: #3569.
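For readers hitting the same abbreviation edge cases, one fully deterministic fallback is to segment with plain rules, independent of any statistical component. A minimal stdlib sketch (the split_sentences function and the abbreviation list are hypothetical illustrations, not spaCy API):

```python
import re

# Hypothetical abbreviation list; extend as needed for your corpus.
ABBREVIATIONS = {"fig.", "eq.", "al.", "eg.", "e.g.", "table"}

def split_sentences(text):
    """Split on '.' followed by whitespace, unless the word before
    the period is a known abbreviation (case-insensitive)."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        # Word immediately preceding this candidate boundary.
        words = text[start:match.end()].rstrip().split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # abbreviation: not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences(
    "Massive ETGs are shown in Fig. 2. We refer to fig. 1 of Forbes et al. (2011)."
))
# Splits into 2 sentences; "Fig. 2." and "et al." stay attached.
```

Because this never consults a parser, the output is identical regardless of figure numbers or surrounding whitespace, which is exactly the property the examples above lack.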


koder-ua commented Feb 24, 2024

I ran into a slightly different but similar issue.

spaCy == 3.7.4, macOS


In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1   <<<<<<<<<<<<<<<<<<<<<< WRONG

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3
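Since the two en_core_web_trf results above differ only in whether the input ends with a period, one pragmatic guard (a sketch of a workaround, not a fix for the model) is to normalize the text so it always ends with terminal punctuation before segmenting:

```python
def ensure_terminal_punct(text, punct=".!?"):
    """Append a period if the text does not already end with
    terminal punctuation, so segmenters see a closed final sentence."""
    stripped = text.rstrip()
    if stripped and stripped[-1] not in punct:
        return stripped + "."
    return stripped

print(ensure_terminal_punct("The first sentence. The last one"))
# → "The first sentence. The last one."
```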

danieldk (Contributor) commented

I ran into a slightly different but similar issue.

This is a different question. Could you open a topic on the discussion forum?

danieldk (Contributor) commented

Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser in the nlp.pipeline. But I am facing an error.

Ah right, sorry, I overlooked that. The issue with changing the boundaries after parsing is that it could result in dependency relations that cross sentence boundaries, which is one of the reasons why we disallow this. We'll have to look into this more deeply, because the parser should in principle respect boundaries that were set earlier. Also see

#11107
#7716

for more background.

@github-actions github-actions bot removed the more-info-needed This issue needs more information label Feb 29, 2024