Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed fulltext block start #714

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

de-code
Copy link
Collaborator

@de-code de-code commented Feb 15, 2021

resolves #712

@de-code de-code self-assigned this Feb 15, 2021
@de-code
Copy link
Collaborator Author

de-code commented Feb 15, 2021

I am currently running into the issue described in #712 (comment)
I guess the way I am creating the test data isn't quite correct (using createFromText and getDocumentPartsForLayoutTokens, the latter extracted from processShortNew).

@coveralls
Copy link

coveralls commented Feb 15, 2021

Coverage Status

Coverage increased (+0.03%) to 38.19% when pulling 6911ef2 on elifesciences:fix-fulltext-block-start into bfc10f7 on kermitt2:master.

@de-code
Copy link
Collaborator Author

de-code commented Feb 15, 2021

I am currently running into the issue described in #712 (comment)
I guess the way I am creating the test data isn't quite correct (using createFromText and getDocumentPartsForLayoutTokens, the latter extracted from processShortNew).

I got around the problem by manually creating DocumentPiece for the whole document. I therefore reverted getDocumentPartsForLayoutTokens as it wouldn't be used by the test and could be refactored separately.

@de-code de-code marked this pull request as ready for review February 15, 2021 20:00
@@ -1,30 +1,42 @@
package org.grobid.core.engines;

import org.apache.commons.lang3.tuple.Pair;
import org.grobid.core.analyzers.GrobidAnalyzer;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few unused imports. Might be good to tidy it up (and remove commented out code, perhaps mark test to be ignored instead).

@@ -43,6 +55,41 @@ public static void tearDown() {
GrobidFactory.reset();
}

public DocumentPiece getWholeDocumentPiece(Document doc) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe getWholeDocumentPiece and getWholeDocumentParts could be moved to a central place, if it doesn't exist already.

@de-code
Copy link
Collaborator Author

de-code commented Feb 15, 2021

Other parsers are potentially affected as well.

There appear to be a lot of duplication and and could probably be refactored.

@de-code
Copy link
Collaborator Author

de-code commented Feb 16, 2021

Raised suggestion for refactoring the features: #718

Copy link
Collaborator

@lfoppiano lfoppiano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble understanding the whole picture here, so I'm not sure I can really review this part.

@de-code
Copy link
Collaborator Author

de-code commented Feb 19, 2021

I'm having trouble understanding the whole picture here, so I'm not sure I can really review this part.

There is an example in the issue #712 - happy to add more information?

@kermitt2 kermitt2 added this to the 0.7.0 milestone Mar 20, 2021
@kermitt2 kermitt2 added the bug From Hemiptera and especially its suborder Heteroptera label Apr 18, 2021
@kermitt2
Copy link
Owner

I have regenerated the fulltext model feature files with the fix, but actually there is no difference with before with respect to starting block. I also retrained the model (before checking the difference in features), in branch https://github.com/kermitt2/grobid/tree/fix-712
So apparently the problem in #712 never applies in practice in the training data at least.

It might appear in some new files like in your example, but the benchmark on PMC set only changes at the second decimal, so it should be minor.

So we could simply merge the fix, but no need to update the training files and model for this.

@kermitt2 kermitt2 modified the milestones: 0.7.0, 0.7.1 Jul 9, 2021
@kermitt2 kermitt2 modified the milestones: 0.7.1, 0.7.2 Sep 25, 2022
@kermitt2 kermitt2 modified the milestones: 0.7.2, 0.7.3 Oct 28, 2022
@kermitt2 kermitt2 modified the milestones: 0.7.3, 0.8.0 May 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug From Hemiptera and especially its suborder Heteroptera
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Full text model layout features: BLOCKSTART missing, if very first block token is a new line
4 participants