FoLiA-correct may deliver invalid FoLiA #66

Open
martinreynaert opened this issue Oct 5, 2022 · 1 comment

@martinreynaert
Contributor

Hi,

I have a batch of 51 valid FoLiA texts. After running them through FoLiA-correct (OCR post-correction), 15 of the output files fail folialint.

I provide a test kit with the necessary input files for FoLiA-correct and a single test text here:
https://ticclops.uvt.nl/TESTcorrect.20221005.tar.gz

The README also contains the command lines I used on this test set.

The kit contains the 'original' text, which is one continuous blob of text with no newlines demarcating paragraphs. It also contains a new version with (rough) paragraphs: a newline inserted after each dot ('.').

After FoLiA-correct, the new version validates, but the original one fails.
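
For anyone wanting to reproduce the "rough paragraphs" variant on other texts, the splitting described above might look roughly like the sketch below. This is only an illustration: it is not the script used to prepare the test kit, and the function name `split_after_dots` is hypothetical.

```python
import re


def split_after_dots(text: str) -> str:
    """Insert a newline after every '.', dropping spaces/tabs that followed it.

    Deliberately naive: abbreviations and decimal numbers are split too,
    which matches the "rough" paragraphing described in the issue.
    """
    return re.sub(r"\.[ \t]*", ".\n", text)


if __name__ == "__main__":
    print(split_after_dots("First sentence. Second sentence. Trailing text"))
```

The result is one "paragraph" per dot-terminated chunk, which is enough to turn a single continuous blob into many short paragraphs.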

I hope this can be resolved.

Thank you!

@kosloot
Contributor

kosloot commented Oct 6, 2022

Well, after a quick investigation, it was clear that the problem is NOT in FoLiA-correct. In fact, the FoLiA it produces is correct!
However, libfolia, and thus folialint, has problems extracting text from documents in which the last tag in a Sentence is a correction.
I will create a new issue in libfolia to address this problem.
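
To check which documents in a batch might be affected, one could look for sentences whose last child element is a correction. The sketch below does that with Python's standard library only. It is a rough aid, not libfolia's logic: the exact trigger inside libfolia is not described in this thread, the helper name `sentences_ending_in_correction` is hypothetical, and the sample is a heavily simplified FoLiA fragment (no metadata or annotation declarations), so it would not itself pass folialint.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree exposes xml:id


def sentences_ending_in_correction(xml_text: str) -> list[str]:
    """Return the xml:id of every <s> whose last child element is a <correction>."""
    root = ET.fromstring(xml_text)
    hits = []
    for sent in root.iter(f"{{{FOLIA_NS}}}s"):
        children = list(sent)  # direct child elements, in document order
        if children and children[-1].tag == f"{{{FOLIA_NS}}}correction":
            hits.append(sent.get(XML_ID, "(no id)"))
    return hits


# Simplified, illustrative fragment: s.1 ends in a correction, s.2 does not.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Hello</t></w>
      <correction xml:id="s.1.c.1">
        <new><w xml:id="s.1.w.2"><t>world</t></w></new>
        <original auth="no"><w xml:id="s.1.w.2o"><t>w0rld</t></w></original>
      </correction>
    </s>
    <s xml:id="s.2">
      <correction xml:id="s.2.c.1">
        <new><w xml:id="s.2.w.1"><t>Bye</t></w></new>
        <original auth="no"><w xml:id="s.2.w.1o"><t>Bve</t></w></original>
      </correction>
      <w xml:id="s.2.w.2"><t>now</t></w>
    </s>
  </text>
</FoLiA>"""

if __name__ == "__main__":
    print(sentences_ending_in_correction(SAMPLE))  # ['s.1']
```

A document for which this returns a non-empty list is only a candidate for the text-extraction problem; folialint remains the authority on whether a file validates.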
