Funding, acknowledgement statements are not split into sentences #1090

Open
lfoppiano opened this issue Mar 8, 2024 · 8 comments

@lfoppiano
Collaborator

I've noticed that while the data availability statement is split into sentences, the funding statement is not. Is this by design, or should it be implemented?

Example:

		</body>
		<back>

			<div type="funding">
<div xml:id="_ERHBmGS"><p xml:id="_WYZCd2J">Funding: This work was supported by the <rs type="funder">National Natural Science Foundation of China</rs> (<rs type="grantNumber">51561009</rs>), the <rs type="funder">Natural Science Foundation of Jiangxi Province</rs> (<rs type="grantNumber">20192BAB206004</rs> and <rs type="grantNumber">20202BAB214003</rs>), the <rs type="funder">Key Research and Development Program of Jiangxi Province</rs> (<rs type="grantNumber">20202BBE53014</rs>), the <rs type="funder">Open Foundation of Guo Rui Scientific Innovation Rare Earth Functional Materials Co</rs>., Ltd.(<rs type="grantNumber">KFJJ-2019-0004</rs>), the <rs type="funder">Doctoral Start-up Foundation of Jiangxi University of Science and Technology (205200100110)</rs>, and the <rs type="funder">Foundation of Jiangxi Educational Department</rs> (<rs type="grantNumber">GJJ200832</rs> and <rs type="grantNumber">GJJ190478</rs>).Institutional Review Board Statement: Not applicable.Informed Consent Statement: Not applicable.</p></div>
			</div>
			<listOrg type="funding">
				<org type="funding" xml:id="_3wHAxav">
					<idno type="grant-number">51561009</idno>
				</org>
				<org type="funding" xml:id="_NRNDwrU">
					<idno type="grant-number">20192BAB206004</idno>
				</org>
				<org type="funding" xml:id="_uNWJMnb">
					<idno type="grant-number">20202BAB214003</idno>
				</org>
				<org type="funding" xml:id="_2MPuZAy">
					<idno type="grant-number">20202BBE53014</idno>
				</org>
				<org type="funding" xml:id="_B7kBgef">
					<idno type="grant-number">KFJJ-2019-0004</idno>
				</org>
				<org type="funding" xml:id="_tk7RJ29">
					<idno type="grant-number">GJJ200832</idno>
				</org>
				<org type="funding" xml:id="_mCAyMcx">
					<idno type="grant-number">GJJ190478</idno>
				</org>
			</listOrg>

			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="_Y8sCy4Q"><p xml:id="_8VCfdSN"><s xml:id="_9cHCbev" coords="11,167.27,420.46,292.63,8.63">Data Availability Statement: Data sharing is not applicable to this article.</s></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head xml:id="_N3EAdDh">Conflicts of Interest:</head><p xml:id="_Gha9GTZ"><s xml:id="_uN9vzJZ" coords="11,252.96,438.15,165.99,8.63">The authors declare no conflict of interest.</s></p></div>
			</div>

energies-14-08509.pdf

@scottkerr-dataseer

Good question. Maybe we could ask Tim about sentence-level granularity in general in any section. It does come at some cost. Maybe we should support 2 modes.

I don't know what the significance of splitting into sentences is in the system. I know it doesn't play a role in deliverables to customers (unless it influences the rules - e.g. number of sentences with a specific value). It may be primarily for debugging purposes.

@lfoppiano changed the title from "Funding statement not splitted in sentences" to "Funding, acknowledgement statements are not split into sentences" on Apr 10, 2024
@lfoppiano
Collaborator Author

I also found that the acknowledgement is not split into sentences. I assume it is the same case.

@lfoppiano
Collaborator Author

Digging deeper, I noticed that the funding statement is correctly split into sentences; however, the sentences are lost when the statement is passed through the acknowledgment/funding parser:

// Extract the funding section as TEI (at this point it is split into sentences).
fundingStmt = getSectionAsTEI("funding",
    "\t\t\t",
    doc,
    SegmentationLabels.FUNDING,
    teiFormatter,
    resCitations,
    config);
if (fundingStmt.length() > 0) {
    // The funding-acknowledgement parser re-processes the fragment; the sentences are lost here.
    MutablePair<Element, MutableTriple<List<Funding>, List<Person>, List<Affiliation>>> localResult =
        parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config);

    if (localResult != null && localResult.getLeft() != null) {
        String local_tei = localResult.getLeft().toXML();
        local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", "");
        annexStatements.add(local_tei);
    } else {
        // Fall back to the statement as extracted, without parser annotations.
        annexStatements.add(fundingStmt.toString());
    }
}

@kermitt2
Owner

kermitt2 commented Apr 12, 2024

Hello, indeed: wherever the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).

As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation, which I developed to work on the final TEI XML directly and which, I think, supports any existing and future inline markup. It is available here:
https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

One idea would be to move to this simple generic sentence segmentation instead of extending and further complicating the existing one.

(As visible in Pub2TEI, the other advantage of the generic approach working on TEI XML directly is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML. This makes sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)
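To illustrate what working on the final TEI XML directly could look like, here is a minimal sketch that reads a paragraph with the JDK DOM API and collects the text offsets covered by each inline child element, whatever its name. A segmenter could then treat those offsets as spans that must not be cut. This is not the Pub2TEI implementation; the class and method names (`InlineSpanCollector`, `parse`, `inlineSpans`) are hypothetical, and only direct children of the paragraph are considered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, for illustration only (not GROBID or Pub2TEI code). */
class InlineSpanCollector {

    /** Parses a well-formed XML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XML fragment", e);
        }
    }

    /**
     * Returns [start, end) offsets, relative to the paragraph's text content,
     * of every direct child element (any inline markup, regardless of its name).
     */
    static List<int[]> inlineSpans(Element paragraph) {
        List<int[]> spans = new ArrayList<>();
        int offset = 0;
        NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                int length = child.getTextContent().length();
                spans.add(new int[] {offset, offset + length});
                offset += length;
            } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                offset += child.getTextContent().length();
            }
            // Comments and processing instructions carry no text and are skipped.
        }
        return spans;
    }
}
```

Because the spans are derived from the markup itself, new inline element types would be protected without changing the segmenter.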

@lfoppiano
Collaborator Author

lfoppiano commented Apr 15, 2024

> Hello, indeed: wherever the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).

Understood. It became clearer once I saw the TEIFormatter part related to funding and acknowledgments.

> As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation, which I developed to work on the final TEI XML directly and which, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

> One idea would be to move to this simple generic sentence segmentation instead of extending and further complicating the existing one.

Sure. At the moment the current segmentation has only been extended to avoid URLs being split between sentences (#1097): once the offset positions are collected, it is just a matter of extending the list of forbidden positions.
Since the work to output the URLs into the TEI might take some time and substantially more effort, I made two separate PRs (the changes in the segmenter might eventually be reverted in this PR).
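The idea of forbidden positions can be sketched as follows. This is a toy splitter, not GROBID's segmenter: it proposes a cut after `.`, `!` or `?` followed by whitespace, and drops any cut that falls strictly inside a protected span such as a URL. The class name `ForbiddenSpanSplitter` and the span representation (`int[] {start, end}`, end exclusive) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy splitter, for illustration only (not GROBID code). */
class ForbiddenSpanSplitter {

    /** Returns [start, end) sentence offsets, never cutting inside a forbidden span. */
    static List<int[]> split(String text, List<int[]> forbidden) {
        List<int[]> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atEnd = i + 1 == text.length();
            boolean beforeSpace = !atEnd && Character.isWhitespace(text.charAt(i + 1));
            if (!terminator || !(atEnd || beforeSpace)) {
                continue;
            }
            int boundary = i + 1; // candidate cut, right after the terminator
            if (isForbidden(boundary, forbidden)) {
                continue; // the cut would split a protected span, so ignore it
            }
            sentences.add(new int[] {start, boundary});
            start = boundary;
            // The next sentence starts at the first non-whitespace character.
            while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            i = start - 1;
        }
        if (start < text.length()) {
            sentences.add(new int[] {start, text.length()}); // trailing text without terminator
        }
        return sentences;
    }

    /** A cut is forbidden when it lies strictly inside one of the spans. */
    static boolean isForbidden(int boundary, List<int[]> forbidden) {
        for (int[] span : forbidden) {
            if (boundary > span[0] && boundary < span[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Supporting a new kind of protected content then only means adding its offsets to the `forbidden` list; the splitting logic itself stays unchanged.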

> (As visible in Pub2TEI, the other advantage of the generic approach working on TEI XML directly is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML. This makes sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)

@lfoppiano self-assigned this on Apr 24, 2024
@lfoppiano
Collaborator Author

lfoppiano commented Apr 26, 2024

I think that with this approach (segmenting after the "final" markup is built) we won't be able to generate coordinates for each sentence, because the layout token information is lost after the transformation to XML.

One solution that comes to mind would be to work on the layout tokens before the TEI transformation: collect all the items in a list and apply them in order, given that they do not overlap, the same way I did here:

if (CollectionUtils.isEmpty(matchedLabelPosition)){

This would require removing any TEI dependency from the funding/acknowledgment parser and handling the transformation to TEI outside the parser, instead of processing the Element/Node XML.
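As a rough sketch of "collect the items in a list and apply them in order", the example below sorts annotations by start offset, rejects overlapping ones, and only then serializes the text with `<rs>` markup. It works on character offsets of a plain string rather than on layout tokens, which is a simplification; `AnnotationApplier` and `Annotation` are hypothetical names, not GROBID classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, for illustration only (not GROBID code). */
class AnnotationApplier {

    /** An entity covering text[start, end) with a type such as "funder". */
    static final class Annotation {
        final int start;
        final int end;
        final String type;

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Applies the annotations in offset order; overlapping annotations are rejected. */
    static String apply(String text, List<Annotation> annotations) {
        List<Annotation> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator.comparingInt((Annotation a) -> a.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0; // end of the last annotation already written
        for (Annotation a : sorted) {
            if (a.start < cursor || a.end < a.start || a.end > text.length()) {
                throw new IllegalArgumentException(
                        "Overlapping or out-of-range annotation at offset " + a.start);
            }
            out.append(escape(text.substring(cursor, a.start)));
            out.append("<rs type=\"").append(a.type).append("\">")
               .append(escape(text.substring(a.start, a.end)))
               .append("</rs>");
            cursor = a.end;
        }
        out.append(escape(text.substring(cursor)));
        return out.toString();
    }

    /** Minimal XML escaping for text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Keeping the parser output as plain offsets and serializing afterwards is what would let the TEI transformation live outside the parser.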
I'm planning to cement it with a battery of tests. 😅

@kermitt2 please let me know if you have any comment.

@lfoppiano
Collaborator Author

lfoppiano commented May 1, 2024

After a few days of trying different solutions, I implemented it by modifying processingXmlFragment. This way the existing sentences are reused and the funding-acknowledgment entities are applied to them, rather than to the text stripped from the paragraph.

This approach also preserves the sentence coordinates and the reference markers, which were previously lost as well.
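Reusing the sentences means each entity found on the paragraph text has to be placed inside one of the existing sentences. A minimal sketch of that lookup, assuming both sentences and entities are expressed as `[start, end)` offsets over the same paragraph text, is shown below; `SentenceEntityMapper` and `locate` are hypothetical names and this is not the code of the actual change.

```java
import java.util.List;

/** Hypothetical helper, for illustration only (not GROBID code). */
class SentenceEntityMapper {

    /**
     * Finds the sentence that fully contains the entity [start, end).
     * Returns {sentenceIndex, startInSentence, endInSentence}, or
     * {-1, start, end} when no single sentence contains the entity.
     */
    static int[] locate(List<int[]> sentences, int start, int end) {
        for (int i = 0; i < sentences.size(); i++) {
            int[] sentence = sentences.get(i);
            if (start >= sentence[0] && end <= sentence[1]) {
                // Rebase the paragraph offsets onto the sentence.
                return new int[] {i, start - sentence[0], end - sentence[0]};
            }
        }
        return new int[] {-1, start, end};
    }
}
```

An index of `-1` signals an entity that is not contained in any single sentence, which needs separate handling.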

@lfoppiano added the bug label on May 1, 2024
@lfoppiano added this to the 0.8.1 milestone on May 5, 2024
@lfoppiano
Collaborator Author

I've started testing and noticed that in rare (but possible) cases the sentence segmentation, which is performed before the funding-acknowledgment model runs, produces sentence boundaries that fall inside funding-acknowledgment annotations.

For example, this is the original version, without sentence segmentation:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p coords="31,191.82,493.44,347.12,9.57;31,72.00,522.72,81.26,9.57">We thank
            <rs type="person">Drs. Carsten Korth</rs> and
            <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
        </p>
    </div>
</div>

Here the first sentence boundary falls inside the annotation "Drs. Carsten Korth":

<div type="acknowledgement">
    <div>
        <head>Acknowledgments:</head>
        <p>
            <s>We thank Drs.</s>
            <s>Carsten Korth and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>

I've then worked out a solution that merges and updates sentences in this situation, including their coordinates.

Here is the result:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p>
            <s coords="31,191.82,493.44,63.87,9.57;31,258.46,493.44,280.48,9.57;31,72.00,522.72,81.26,9.57">We thank
                <rs type="person">Drs.Carsten Korth</rs> and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>
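The merge step can be sketched as below, under stated assumptions: sentences are ordered `[start, end)` spans over the paragraph text, each carrying a coordinate string, and two neighbouring sentences are merged whenever an annotation starts in the first and ends in the second. Coordinates are concatenated with `;`, the separator visible in the `coords` attributes above. `SentenceMerger` and `Sentence` are hypothetical names, the `boxA`/`boxB`/`boxC` values in the usage note are placeholders, and the real fix may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only (not GROBID code). */
class SentenceMerger {

    /** A sentence covering text[start, end) with its coordinate string. */
    static final class Sentence {
        final int start;
        final int end;
        final String coords;

        Sentence(int start, int end, String coords) {
            this.start = start;
            this.end = end;
            this.coords = coords;
        }
    }

    /** Merges neighbouring sentences whose shared boundary lies inside an annotation. */
    static List<Sentence> merge(List<Sentence> sentences, List<int[]> annotations) {
        List<Sentence> merged = new ArrayList<>();
        for (Sentence next : sentences) {
            if (!merged.isEmpty()) {
                Sentence last = merged.get(merged.size() - 1);
                if (crosses(last.end, next.start, annotations)) {
                    // Extend the previous sentence and append the coordinates.
                    merged.set(merged.size() - 1,
                            new Sentence(last.start, next.end, last.coords + ";" + next.coords));
                    continue;
                }
            }
            merged.add(next);
        }
        return merged;
    }

    /** True when an annotation starts before prevEnd and ends after nextStart. */
    static boolean crosses(int prevEnd, int nextStart, List<int[]> annotations) {
        for (int[] a : annotations) {
            if (a[0] < prevEnd && a[1] > nextStart) {
                return true;
            }
        }
        return false;
    }
}
```

With the two sentences "We thank Drs." (`boxA`) and "Carsten Korth and Nick Brandon for generously providing anti- DISC1 antibodies." (`boxB;boxC`) and an annotation covering "Drs. Carsten Korth", `merge` returns a single sentence whose coordinates are `boxA;boxB;boxC`.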
