
Dependency sentence segmenter handles newlines inconsistently between languages #13059

Open
freddyheppell opened this issue Oct 11, 2023 · 3 comments
Labels
feat / senter Feature: Sentence Recognizer lang / it Italian language data and models

Comments

@freddyheppell
Contributor

How to reproduce the behaviour

Colab notebook demonstrating problem

When parsing a sentence that contains newlines, the Italian parser sometimes assigns the newline to a sentence by itself, for example:

Ma regolamenta solo un settore, a differenza dell’azione a largo raggio dell’Inflation Act (dalla sanità all’industria pesante). \nI tentativi di legiferare per stimolare l’industria non hanno avuto molto successo.

Produces 3 sentences:

'Ma regolamenta solo un settore, a differenza dell’azione a largo raggio dell’Inflation Act (dalla sanità all’industria pesante).'
'\n'
'I tentativi di legiferare per stimolare l’industria non hanno avuto molto successo.'

There are various experiments with different combinations of punctuation in the notebook.

Looking at the tokens and their is_sent_start property, it seems that under some circumstances both the \n token and the following I token are assigned as sentence starts.

I have not been able to cause this problem with en_core_web_sm, which always correctly identifies 2 sentences.

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, there seems to be some inconsistency between languages here, and I don't think it would ever be correct for a whitespace token to be marked as the start of a sentence.
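As a stopgap, whitespace-only sentences can be merged away after segmentation. The helper below is hypothetical (not part of spaCy) and operates on plain sentence strings, e.g. the .text of each span in doc.sents:

```python
def merge_whitespace_sents(sents):
    """Merge sentences consisting only of whitespace into the following
    sentence (or into the previous one if they occur at the end)."""
    merged = []
    pending = ""  # accumulated whitespace-only "sentences"
    for sent in sents:
        if sent.strip() == "":
            pending += sent
        else:
            merged.append(pending + sent)
            pending = ""
    if pending:  # trailing whitespace: attach to the last real sentence
        if merged:
            merged[-1] += pending
        else:
            merged.append(pending)
    return merged


# The 3-sentence output above collapses back to 2 sentences:
print(merge_whitespace_sents(["Ma regolamenta solo un settore.", "\n", "I tentativi."]))
```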

Your Environment

  • spaCy version: 3.6.1
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Pipelines: it_core_news_sm (3.6.0), en_core_web_sm (3.6.0)
@rmitsch
Contributor

rmitsch commented Oct 12, 2023

Thanks for reporting this!

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here...

Can you elaborate on the inconsistency between languages?

...and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.

While that is a reasonable take, bear in mind that spaCy's pretrained models (such as it_core_news_xx) are trained on corpora of natural language. \n is a control character that does not appear in natural text (Italian or otherwise), so trained models cannot be expected to perform well on it.

I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).
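For reference, a minimal sketch of the sentencizer route on a blank Italian pipeline. The punct_chars list here, with \n added so a newline ends the current sentence rather than starting a new one, is an adjustment for this use case, not the component's default:

```python
import spacy

# Blank Italian pipeline with the rule-based sentencizer.
nlp = spacy.blank("it")
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", "\n"]})

doc = nlp("Ma regolamenta solo un settore.\nI tentativi non hanno avuto successo.")
# Two sentences; the \n token stays attached to the first one.
for sent in doc.sents:
    print(repr(sent.text))
```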

@rmitsch rmitsch added lang / it Italian language data and models feat / senter Feature: Sentence Recognizer labels Oct 12, 2023
@freddyheppell
Contributor Author

Can you elaborate on the inconsistency between languages?

I believe this behaviour occurs much more frequently in Italian than in other languages. In addition to the examples in the notebook, where English finds 2 sentences while Italian produces 3, I'm working on a partially parallel corpus in which Italian's mean sentences per document is noticeably higher than any other language's (21 vs 14-16), which makes me think this is an Italian-specific issue.

I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).

I was hoping to use the parser approach because the docs don't have ideal punctuation, but I tried the sentencizer with \n added to its punct_chars list and it actually works well. It has also narrowed the sents/doc gap between Italian and the other languages (now 19 vs 15-16), which further suggests a behaviour difference in the dependency-based sentence segmenter.

@rmitsch
Contributor

rmitsch commented Oct 13, 2023

It's possible something is going wrong with the whitespace augmentation, which is only supposed to attach whitespace to the preceding token and not create new sentences. We might look into this at a later point.

We're using the following augmentation with the corpus; feel free to take a closer look and/or train your own model with modified settings:

[corpora.train.augmenter]
@augmenters = "spacy.combined_augmenter.v1"
lower_level = 0.1
whitespace_level = 0.1
whitespace_per_token = 0.05
whitespace_variants = "[\" \",\"\\t\",\"\\n\",\"\\u000b\",\"\\f\",\"\\r\",\"\\u001c\",\"\\u001d\",\"\\u001e\",\"\\u001f\",\" \",\"\\u0085\",\"\\u00a0\",\"\\u1680\",\"\\u2000\",\"\\u2001\",\"\\u2002\",\"\\u2003\",\"\\u2004\",\"\\u2005\",\"\\u2006\",\"\\u2007\",\"\\u2008\",\"\\u2009\",\"\\u200a\",\"\\u2028\",\"\\u2029\",\"\\u202f\",\"\\u205f\",\"\\u3000\"]"
orth_level = 0.0
orth_variants = null
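If the whitespace augmentation is indeed the culprit, one modified-settings experiment (a hypothesis to test, not a recommended fix) would be to retrain with it disabled and compare segmentation behaviour:

```ini
[corpora.train.augmenter]
@augmenters = "spacy.combined_augmenter.v1"
lower_level = 0.1
# Disable whitespace augmentation entirely for the comparison run.
whitespace_level = 0.0
whitespace_per_token = 0.0
whitespace_variants = null
orth_level = 0.0
orth_variants = null
```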
