Improvement of the recovery of Pragmatic Segmenter sentence segmentation text wrt to the original text offsets #701

Open
wants to merge 15 commits into
base: master
from

Conversation

lfoppiano
Collaborator

@lfoppiano lfoppiano commented Jan 27, 2021

The Pragmatic Segmenter seems to modify the output string, which makes extracting the sentence offsets more complicated.

This PR makes two changes:

Recovery of segmentation offset using Pragmatic Segmenter

This implementation uses an external library to compute the diff between the sentence and the original text.

For example:

original = "This is the original text. Some spaces are going to be removed."
sentence = "This is the original text."

The idea is to find the start and end offsets of the sentence within original, using sentence as the reference.
After translating the diff into a character-based list, the implementation computes the start and end of the sentence within the original text.

The diff is a list of strings where each element is structured as [operation, space, character], with operation being +, -, or a space (e.g. comparing the strings a and ab would result in the following):
diff = ['  a', '- b']
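
As an illustration of this diff format only: Python's standard `difflib.ndiff` happens to produce entries of the same [operation, space, character] shape when it is given two strings. This is a sketch, not the library used in this PR; the helper name `char_diff` is hypothetical, and the sign of each entry depends on the order in which the two strings are compared.

```python
import difflib
from typing import List


def char_diff(original: str, sentence: str) -> List[str]:
    """Return a character-level diff from `original` to `sentence`.

    Each entry is a 3-character string: operation, space, character.
    ' ' marks a character present in both strings, '-' a character only
    in `original`, '+' a character only in `sentence`.
    """
    # ndiff treats each string as a sequence of one-character items.
    # Hint lines starting with '?' are dropped defensively.
    return [d for d in difflib.ndiff(original, sentence) if d[0] in " -+"]


if __name__ == "__main__":
    # Direction matters: swapping the arguments flips '-' and '+'.
    print(char_diff("ab", "a"))
    print(char_diff("a", "ab"))
```

With the original text as the first argument, characters that the segmenter dropped show up as `-` entries, which is the case the offset recovery has to skip over.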

Once the diff is computed, the offsets are recovered using a two-pass heuristic:

  • starting from the left, we collect all the characters starting from the first character that is common to both strings
  • starting from the right, we remove all the characters that are not marked as equal in the diff.

The heuristic is also limited to a subset of the string when a previous sentence has already been identified.
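
The two passes described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the code in this PR: it uses `difflib.ndiff` as a stand-in for the external diff library, and the function name `find_sentence_offsets` and the `search_from` parameter (the end offset of the previously identified sentence) are hypothetical.

```python
import difflib
from typing import Optional, Tuple


def find_sentence_offsets(
    original: str, sentence: str, search_from: int = 0
) -> Optional[Tuple[int, int]]:
    """Recover (start, end) offsets of `sentence` inside `original`.

    `search_from` restricts the search to the part of `original` that
    follows a previously identified sentence (assumed parameter).
    """
    window = original[search_from:]
    diff = [d for d in difflib.ndiff(window, sentence) if d[0] in " -+"]

    # Attach a window index to every entry that consumes a window character.
    # '+' entries exist only in the sentence, so they do not advance the index.
    positions = []
    index = 0
    for entry in diff:
        op = entry[0]
        if op == "+":
            continue
        positions.append((op, index))
        index += 1

    # Pass 1 (left to right): the first character common to both strings.
    start = next((i for op, i in positions if op == " "), None)
    if start is None:
        return None  # nothing in common, no offsets to report

    # Pass 2 (right to left): skip trailing characters not marked as equal.
    end = next(i for op, i in reversed(positions) if op == " ") + 1

    return search_from + start, search_from + end


if __name__ == "__main__":
    text = "ab  cd. ef."  # the segmenter is assumed to collapse the double space
    first = find_sentence_offsets(text, "ab cd.")
    print(first, repr(text[first[0]:first[1]]))
    second = find_sentence_offsets(text, "ef.", search_from=first[1])
    print(second, repr(text[second[0]:second[1]]))
```

Passing the end offset of the previous sentence as `search_from` keeps a later sentence from matching characters that belong to an earlier one, which is one way to read the "subset of the string" restriction.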

@lfoppiano lfoppiano modified the milestone: 0.6.2 Jan 27, 2021
@lfoppiano lfoppiano added this to the 0.6.2 milestone Mar 15, 2021
@kermitt2 kermitt2 changed the title Sentence segmentation detection Improvement of the recovery of Pragmatic Segmenter sentence segmentation text wrt to the original text offsets Mar 19, 2021
@kermitt2 kermitt2 modified the milestones: 0.6.2, 0.7.0 Mar 19, 2021
@coveralls

coveralls commented May 11, 2021

Coverage Status

coverage: 39.498% (-0.4%) from 39.903%
when pulling 5aca6b8 on sentence-segmentation-detection
into 5b14536 on master.

@kermitt2 kermitt2 modified the milestones: 0.7.0, 0.7.1 Jul 18, 2021
@lfoppiano
Collaborator Author

The last commit a577523 should fix the issue #753

@lfoppiano lfoppiano linked an issue Jul 29, 2022 that may be closed by this pull request
@lfoppiano lfoppiano modified the milestones: 0.7.1, 0.8.0 Feb 8, 2023
Development

Successfully merging this pull request may close these issues.

Issue with sentence segmentation offsets
3 participants