Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with sentence segmentation offsets #753

Open
kermitt2 opened this issue Apr 29, 2021 · 7 comments · May be fixed by #701
Open

Issue with sentence segmentation offsets #753

kermitt2 opened this issue Apr 29, 2021 · 7 comments · May be fixed by #701
Assignees
Labels
bug From Hemiptera and especially its suborder Heteroptera

Comments

@kermitt2
Copy link
Owner

In the following example

https://arxiv.org/pdf/2103.12028v1.pdf

there are cases of wrong sentence segmentations, with sentence offsets apparently shifted by a few characters, resulting in word cut. This happens whatever the selected sentence segmenter is, OpenNLP or Pragmatic Segmenter:

<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s> 

As it happens with both segmenters, which use different offset calculation methods, it might be due to issues with character encoding.

@kermitt2 kermitt2 self-assigned this Apr 29, 2021
@kermitt2 kermitt2 added the bug From Hemiptera and especially its suborder Heteroptera label Apr 29, 2021
@lfoppiano
Copy link
Collaborator

lfoppiano commented Apr 29, 2021 via email

@kermitt2
Copy link
Owner Author

Just tested... PR #701 does not fix it unfortunately, same error.
But this is the opportunity to check both this PR and fix the bug :)

@lfoppiano
Copy link
Collaborator

lfoppiano commented May 10, 2021

I've added some tests and the output of the method getSentenceOffsets is correct (PR 701).

I did some debugging, trying to understand where the issue is, and IMHO is at line ~244 of SentenceUtilities, where the upper bound is increased.

newPosition.end += pushedEnd+1;

In this specific case, the synchronisation between layout token and sentences seems containing the problem.
Also, there is a pos=0 after switching to a new sentence, and the synchronisation starts again from scratch... the newPosition.end is modified based on tokens that are not in the current sentence.

@lfoppiano
Copy link
Collaborator

I'm revising this issue and the somehow related PR #701.

The comment states that we need to compose the "text" without the forbidden elements (references), however IMHO we should keep these references in the text as well, run the segmentation and then remove them, isn't it?

I notice also that while the layout tokens contain all the token (including references), the text is a mixture:

CCAligned ) is a 119language 1 parallel dataset built off 68 snapshots of Common Crawl. Documents are aligned if they are in the same language according to FastText LangID (Joulin et al., 2016(Joulin et al., , 2017, and have the same URL but for a differing language code. These alignments are refined with cross-lingual LASER embeddings (Artetxe and Schwenk, 2019). For sentence-level data, they split on newlines and align with LASER, but perform no further filtering. Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. The quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European (da, cr, sl, sk, lt, et) languages of the TED corpus (Qi et al., 2018), where it compares favorably to systems built on crawled sentences from WikiMatrix and ParaCrawl   (Qi et al., 2018); WMT-5: cs, de, fi, lv, ro. POS/DEP-5: part-of-speech labeling and dependency parsing for bg, ca, da, fi, id.

@kermitt2
Copy link
Owner Author

kermitt2 commented Aug 13, 2021

(removing comment, it was more for #811 !)
#811 (comment)

but it's relevant to the fact that we don't remove references, just keep track of the positions. The text at this stage is not modified after segmentation. The rest of the method is re-injecting the tags in the segmented text, but don't touch the text.

@lfoppiano
Copy link
Collaborator

lfoppiano commented Aug 16, 2021

but it's relevant to the fact that we don't remove references, just keep track of the positions. The text at this stage is not modified after segmentation. The rest of the method is re-injecting the tags in the segmented text, but don't touch the text.

Ok, so the forbidden spans and the text are in sync.

I've pushed a test (currently ignored) that reproduce the issue. The problem seems to be generated at this point:

if (this.isValidSuperScriptNumericalReferenceMarker(nextToken)) {

The layout token inspection ends up not in sync with the sentences, as you mentioned before.
However, I'm a bit lost here :-)

The layout token at index 25 has "superscript" = true (I set it explicitly in the test, without it would work) and is causing the chain reaction. Although, it is inspected only when we are at the sentence with index = 4, the one that appears with incorrect positions.

This following is unrelated.
I notice that some references (e.g. the first reference in text (El-Kishky et al., 2020)) are present in the layout tokens but missing in the text:

CCAligned (El-Kishky et al., 2020) is a 119-
language 1 parallel dataset built off 68 snapshots 
of Common Crawl. Documents are aligned if they 
[...]
CCAligned ) is a 119language 1 parallel dataset built off 68 snapshots of Common Crawl.[...]

@lfoppiano
Copy link
Collaborator

Since I've did some swimming in this part of the code, I've checked again with a fresh mind.

It seems that the footnote 1 (superscript=True) trigger line 234 if, which increases the upperlimit of the sentence.
Maybe we should just check that such token is not in the reference list?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug From Hemiptera and especially its suborder Heteroptera
Projects
None yet
2 participants