Avoid splitting URLs between sentences #1097

Open
wants to merge 16 commits into
base: master
from

Conversation

lfoppiano
Collaborator

@lfoppiano lfoppiano commented Apr 12, 2024

This PR addresses an issue where the sentence segmenter may split a URL across two sentences.
Updating the regex urlPattern is hard to do without a high risk of introducing new bugs (some experiments/attempts here).

The original GROBID method that exploits the URI annotations of the PDF was extended to support cases where the text built from the layout tokens differs from the provided post-processed text, which previously led to an OutOfBoundException.

We have added/modified the following methods:

  • new method public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations), which returns the character offset positions with respect to the layout token string (which can be obtained with LayoutTokenUtil.toText(tokens)).
  • new method public static List<OffsetPosition> tokenPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations), which returns the token offset positions.
  • modified the original public static List<OffsetPosition> characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text), which returns the character offset positions with respect to the text string passed as input.
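
To illustrate how token offsets and character offsets over the layout token string relate, here is a minimal sketch. It is not the code from this PR: plain strings stand in for LayoutToken, and the Span class and the toCharacterSpans helper are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenToCharOffsets {
    /** Hypothetical stand-in for OffsetPosition: a half-open range [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Converts token-index spans (end exclusive) into character spans over the
     * string obtained by concatenating all token texts.
     */
    static List<Span> toCharacterSpans(List<String> tokens, List<Span> tokenSpans) {
        // charStart[i] is the character offset at which token i begins;
        // charStart[tokens.size()] is the total length of the concatenation.
        int[] charStart = new int[tokens.size() + 1];
        for (int i = 0; i < tokens.size(); i++) {
            charStart[i + 1] = charStart[i] + tokens.get(i).length();
        }
        List<Span> result = new ArrayList<>();
        for (Span s : tokenSpans) {
            result.add(new Span(charStart[s.start], charStart[s.end]));
        }
        return result;
    }
}
```

The character spans produced this way are only valid against the concatenated token text, which is why the third method needs extra care when it receives a different text string.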

There are often cases where the text string and the string aggregated from the layout tokens do not match (e.g. the text string is dehyphenised), and this causes an OutOfBoundException when applying substring with offsets computed on the layout tokens.
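
The mismatch can be reproduced with a small example. The sketch below shows one possible way to re-locate a span in the post-processed text instead of reusing the layout-token offsets directly. It is an assumption for illustration, not the strategy implemented in this PR: the OffsetRealigner class and its realign method are hypothetical, and the backward search assumes post-processing only removes characters.

```java
final class OffsetRealigner {
    /**
     * Re-locates the substring layoutText[start, end) inside text.
     * Returns {newStart, newEnd}, or null when the substring cannot be found
     * (for example when the URL itself was altered by post-processing).
     */
    static int[] realign(String layoutText, int start, int end, String text) {
        String target = layoutText.substring(start, end);
        // Assumption: post-processing such as dehyphenisation only removes
        // characters, so the target can only have moved left. Search backwards
        // from the original start first, then fall back to a plain search.
        int from = Math.min(start, text.length());
        int found = text.lastIndexOf(target, from);
        if (found < 0) {
            found = text.indexOf(target);
        }
        return found < 0 ? null : new int[] {found, found + target.length()};
    }
}
```

With a layout string such as "The segmen-\ntation code is at https://grobid.org" and its dehyphenised counterpart, the original URL offsets run past the end of the shorter text, while the re-located offsets stay within bounds.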

The last method (characterPositionsUrlPatternWithPdfAnnotations(List<LayoutToken> layoutTokens, List<PDFAnnotation> pdfAnnotations, String text)) is called when the sentence segmenter runs, so that a sentence is never split in the middle of a URL.
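
As a rough illustration of how URL positions can protect sentence boundaries, the sketch below merges adjacent sentence spans whenever the boundary between them falls strictly inside a URL span. It is a simplified assumption of the idea, not the segmenter code in this PR: spans are plain {start, end} int arrays with an exclusive end, and UrlAwareSentenceMerger is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class UrlAwareSentenceMerger {
    /**
     * Merges consecutive sentence spans when the end of the current sentence
     * lies strictly inside a URL span. Sentences must be sorted by start offset.
     */
    static List<int[]> merge(List<int[]> sentences, List<int[]> urls) {
        List<int[]> merged = new ArrayList<>();
        int[] current = null;
        for (int[] sentence : sentences) {
            if (current != null && boundaryInsideUrl(current[1], urls)) {
                // The previous sentence ended in the middle of a URL: extend it.
                current[1] = sentence[1];
            } else {
                if (current != null) {
                    merged.add(current);
                }
                current = new int[] {sentence[0], sentence[1]};
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    /** True when the boundary offset is strictly inside one of the URL spans. */
    static boolean boundaryInsideUrl(int boundary, List<int[]> urls) {
        for (int[] url : urls) {
            if (url[0] < boundary && boundary < url[1]) {
                return true;
            }
        }
        return false;
    }
}
```

A boundary that coincides with the start or the end of a URL is left untouched, so sentences that legitimately end with a URL are still separated from the following sentence.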

PR #1099 will improve the recognition because, with the fix from this PR applied in the sentenceSegmenter (which takes the text as a string), the process is applied to the layout tokens and not to the text, which might be dehyphenised and out of sync with the layout tokens.

@coveralls

coveralls commented Apr 12, 2024

Coverage Status

coverage: 40.116% (+0.2%) from 39.924%
when pulling 5bcb8b1 on feature/preserve-urls
into 83f2c81 on master.

@lfoppiano lfoppiano marked this pull request as ready for review April 12, 2024 01:18
@lfoppiano
Collaborator Author

This fix was tested by processing all PMC and bioRxiv documents, with no errors or failures during processing.
I then inspected the URLs with regexes and verified that no URL was split across sentences.

I also tested a bunch of problematic PDF documents.


2 participants