Funding, acknowledgement statements are not split into sentences #1090

lfoppiano · 2024-03-08T04:48:59Z

I've noticed that while the data availability statement is split into sentences, the funding statement is not. Is this by design, or should it be implemented?

Example:

		</body>
		<back>

			<div type="funding">
<div xml:id="_ERHBmGS"><p xml:id="_WYZCd2J">Funding: This work was supported by the <rs type="funder">National Natural Science Foundation of China</rs> (<rs type="grantNumber">51561009</rs>), the <rs type="funder">Natural Science Foundation of Jiangxi Province</rs> (<rs type="grantNumber">20192BAB206004</rs> and <rs type="grantNumber">20202BAB214003</rs>), the <rs type="funder">Key Research and Development Program of Jiangxi Province</rs> (<rs type="grantNumber">20202BBE53014</rs>), the <rs type="funder">Open Foundation of Guo Rui Scientific Innovation Rare Earth Functional Materials Co</rs>., Ltd.(<rs type="grantNumber">KFJJ-2019-0004</rs>), the <rs type="funder">Doctoral Start-up Foundation of Jiangxi University of Science and Technology (205200100110)</rs>, and the <rs type="funder">Foundation of Jiangxi Educational Department</rs> (<rs type="grantNumber">GJJ200832</rs> and <rs type="grantNumber">GJJ190478</rs>).Institutional Review Board Statement: Not applicable.Informed Consent Statement: Not applicable.</p></div>
			</div>
			<listOrg type="funding">
				<org type="funding" xml:id="_3wHAxav">
					<idno type="grant-number">51561009</idno>
				</org>
				<org type="funding" xml:id="_NRNDwrU">
					<idno type="grant-number">20192BAB206004</idno>
				</org>
				<org type="funding" xml:id="_uNWJMnb">
					<idno type="grant-number">20202BAB214003</idno>
				</org>
				<org type="funding" xml:id="_2MPuZAy">
					<idno type="grant-number">20202BBE53014</idno>
				</org>
				<org type="funding" xml:id="_B7kBgef">
					<idno type="grant-number">KFJJ-2019-0004</idno>
				</org>
				<org type="funding" xml:id="_tk7RJ29">
					<idno type="grant-number">GJJ200832</idno>
				</org>
				<org type="funding" xml:id="_mCAyMcx">
					<idno type="grant-number">GJJ190478</idno>
				</org>
			</listOrg>

			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="_Y8sCy4Q"><p xml:id="_8VCfdSN"><s xml:id="_9cHCbev" coords="11,167.27,420.46,292.63,8.63">Data Availability Statement: Data sharing is not applicable to this article.</s></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head xml:id="_N3EAdDh">Conflicts of Interest:</head><p xml:id="_Gha9GTZ"><s xml:id="_uN9vzJZ" coords="11,252.96,438.15,165.99,8.63">The authors declare no conflict of interest.</s></p></div>
			</div>

energies-14-08509.pdf

scottkerr-dataseer · 2024-03-14T13:52:35Z

Good question. Maybe we could ask Tim about sentence-level granularity in general in any section. It does come at some cost. Maybe we should support 2 modes.

I don't know what the significance of splitting into sentences is in the system. I know it doesn't play a role in deliverables to customers (unless it influences the rules - e.g. number of sentences with a specific value). It may be primarily for debugging purposes.

lfoppiano · 2024-04-10T05:27:52Z

I also found that the acknowledgement is not split into sentences. I assume it is the same case.

lfoppiano · 2024-04-10T05:37:10Z

Digging deeper, I noticed that the funding statement is correctly split into sentences; however, the sentences are lost when it is passed through the acknowledgment/funding parser:

// The funding section is serialized to TEI here; at this point it is already split into sentences.
fundingStmt = getSectionAsTEI("funding",
    "\t\t\t",
    doc,
    SegmentationLabels.FUNDING,
    teiFormatter,
    resCitations,
    config);
if (fundingStmt.length() > 0) {
    // The fragment is re-processed by the funding-acknowledgement parser;
    // this is the step where the sentence segmentation is lost.
    MutablePair<Element, MutableTriple<List<Funding>,List<Person>,List<Affiliation>>> localResult =
        parsers.getFundingAcknowledgementParser().processingXmlFragment(fundingStmt.toString(), config);

    if (localResult != null && localResult.getLeft() != null) {
        String local_tei = localResult.getLeft().toXML();
        local_tei = local_tei.replace(" xmlns=\"http://www.tei-c.org/ns/1.0\"", "");
        annexStatements.add(local_tei);
    } else {
        // Fall back to the original (sentence-segmented) section.
        annexStatements.add(fundingStmt.toString());
    }
}

kermitt2 · 2024-04-12T20:17:18Z

Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).

As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation that works directly on the final TEI XML and, I think, supports any existing and future inline markup. It is available here:
https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

One idea would be to move to this simple generic sentence segmentation, instead of extending and further complicating the existing one.

(As visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)

lfoppiano · 2024-04-15T03:48:11Z

Hello, indeed, everywhere the funding-acknowledgement parser is applied, the sentence segmentation is ignored. The reason is that re-segmenting into sentences would require taking into account the (numerous) annotations produced by this model, which the sentence segmentation does not support (it only supports reference marker annotations).

Understood. It became clearer once I saw the part of TEIFormatter related to funding and acknowledgments.

As the current sentence segmentation is already quite complex, I thought about another approach: a more generic sentence segmentation that works directly on the final TEI XML and, I think, supports any existing and future inline markup. It is available here: https://github.com/kermitt2/Pub2TEI/blob/master/src/main/java/org/pub2tei/document/XMLUtilities.java#L194

One idea would be to move to this simple generic sentence segmentation, instead of extending and further complicating the existing one.

Sure. For now, the current segmentation was only extended to avoid URLs being split across sentences (#1097), because once the offset positions are collected, it is just a matter of extending the list of forbidden positions.
Since the work to output the URLs into the TEI might take some time and substantially more effort, I made two separate PRs (the changes to the segmenter might eventually be reverted in this PR).
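To illustrate the "forbidden positions" idea mentioned above, here is a minimal sketch, not the actual GROBID segmenter. The class `ForbiddenSpanFilter` and the method `filterBoundaries` are hypothetical names introduced for illustration. It assumes candidate sentence boundaries and protected spans (URLs, annotations) are already available as character offsets, and simply drops any boundary that would cut through a protected span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of GROBID): drops sentence boundaries that cut a protected span. */
class ForbiddenSpanFilter {

    /**
     * Keeps only the candidate boundaries that do not fall strictly inside a forbidden span.
     * Each span is {start, end} in character offsets of the paragraph text, end exclusive.
     */
    static List<Integer> filterBoundaries(List<Integer> candidates, List<int[]> forbiddenSpans) {
        List<Integer> kept = new ArrayList<>();
        for (int boundary : candidates) {
            boolean inside = false;
            for (int[] span : forbiddenSpans) {
                // A boundary exactly at the start or end of a span does not cut it.
                if (boundary > span[0] && boundary < span[1]) {
                    inside = true;
                    break;
                }
            }
            if (!inside) {
                kept.add(boundary);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "We thank Drs. Carsten Korth for help. Thanks also to all.";
        // Protected span covering the person mention (assumed to come from an annotation model).
        int spanStart = text.indexOf("Drs.");
        int spanEnd = spanStart + "Drs. Carsten Korth".length();
        List<int[]> forbidden = new ArrayList<>();
        forbidden.add(new int[] {spanStart, spanEnd});
        // Candidate boundaries: one wrongly placed after "Drs.", one after "help.".
        List<Integer> candidates = Arrays.asList(text.indexOf("Carsten"), text.indexOf("Thanks"));
        System.out.println(filterBoundaries(candidates, forbidden));
    }
}
```

Extending the protection to a new kind of markup then only means adding its offsets to the list of forbidden spans; the filtering logic itself stays unchanged.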

(As visible in Pub2TEI, the other advantage of the generic approach working directly on TEI XML is that it can be applied to any TEI XML from Pub2TEI or from LaTeXML, making sentence segmentation consistent across all these sources, even if they introduce unexpected or new inline markup in the future.)

lfoppiano · 2024-04-26T23:51:28Z

I think that with this approach (segmenting after the "final" markup is built) we won't be able to generate coordinates for each sentence, because the layout token information is lost after the transformation to XML.

One solution that comes to mind would be to work on the layout tokens before the TEI transformation, collect all the items in a list, and apply them in order, given that they do not overlap, the same way I did here:

grobid/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java

Line 1570 in 0b5e232

if (CollectionUtils.isEmpty(matchedLabelPosition)){

This would require removing any TEI dependency from the funding/acknowledgment parser and handling the transformation to TEI outside the parser, instead of processing the XML Element/Node.
I'm planning to cement it with a battery of tests. 😅

@kermitt2 please let me know if you have any comments.

lfoppiano · 2024-05-01T20:01:42Z

After a few days of trying different solutions, I implemented it by modifying processingXmlFragment. This way, the existing sentences are reused and the funding-acknowledgment entities are applied to them, rather than to the text stripped from the paragraph.

This approach also preserves the sentence coordinates and the reference markers, which were previously lost as well.
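As a rough illustration of applying entities onto already-segmented sentences, here is a minimal sketch. It is not the code from the PR, which works on XML elements and also carries coordinates and reference markers. `SentenceAnnotator`, `Annotation`, and `annotate` are hypothetical names, and the sketch assumes annotations are sorted, non-overlapping, and each fully contained in one sentence.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not GROBID code): wraps entity spans inside existing sentences. */
class SentenceAnnotator {

    /** Entity span in paragraph offsets (end exclusive) with its rs type, e.g. "person" or "funder". */
    static final class Annotation {
        final int start;
        final int end;
        final String type;

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Serializes the paragraph, keeping the given sentence spans and adding rs markup inside them. */
    static String annotate(String paragraph, List<int[]> sentences, List<Annotation> annotations) {
        StringBuilder out = new StringBuilder("<p>");
        for (int[] s : sentences) {
            out.append("<s>");
            int cursor = s[0];
            for (Annotation a : annotations) {
                if (a.start < s[0] || a.end > s[1]) {
                    continue; // this annotation belongs to another sentence
                }
                out.append(escape(paragraph.substring(cursor, a.start)));
                out.append("<rs type=\"").append(a.type).append("\">")
                   .append(escape(paragraph.substring(a.start, a.end)))
                   .append("</rs>");
                cursor = a.end;
            }
            out.append(escape(paragraph.substring(cursor, s[1])));
            out.append("</s>");
        }
        return out.append("</p>").toString();
    }

    /** Minimal XML escaping for text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String p = "We thank Nick Brandon. Funded by NSF.";
        int split = p.indexOf(". ") + 1; // end of the first sentence
        List<int[]> sentences = new ArrayList<>();
        sentences.add(new int[] {0, split});
        sentences.add(new int[] {split + 1, p.length()});
        List<Annotation> annotations = new ArrayList<>();
        annotations.add(new Annotation(p.indexOf("Nick"), p.indexOf("Nick") + 12, "person"));
        annotations.add(new Annotation(p.indexOf("NSF"), p.indexOf("NSF") + 3, "funder"));
        System.out.println(annotate(p, sentences, annotations));
    }
}
```

An annotation that straddles two sentences is silently skipped by this sketch, so that case needs separate handling.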

lfoppiano · 2024-05-05T22:31:34Z

I've started testing and noticed that in rare (but possible) cases, the sentence segmentation, which is performed before the funding-acknowledgment model runs, produces sentence boundaries that fall inside funding-acknowledgment annotations.

For example, the first snippet is the original version, without sentence segmentation:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p coords="31,191.82,493.44,347.12,9.57;31,72.00,522.72,81.26,9.57">We thank
            <rs type="person">Drs. Carsten Korth</rs> and
            <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
        </p>
    </div>
</div>

Here the end of the first sentence falls inside the annotation "Drs. Carsten Korth":

<div type="acknowledgement">
    <div>
        <head>Acknowledgments:</head>
        <p>
            <s>We thank Drs.</s>
            <s>Carsten Korth and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>

I then worked out a solution that merges and updates the sentences in this situation, including their coordinates.

Here is the result:

<div type="acknowledgement">
    <div>
        <head coords="31,72.00,491.09,114.40,12.58">Acknowledgments:</head>
        <p>
            <s coords="31,191.82,493.44,63.87,9.57;31,258.46,493.44,280.48,9.57;31,72.00,522.72,81.26,9.57">We thank
                <rs type="person">Drs.Carsten Korth</rs> and
                <rs type="person">Nick Brandon</rs> for generously providing anti- DISC1 antibodies.
            </s>
        </p>
    </div>
</div>
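The merging step can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the PR: `SentenceMerger`, `Sentence`, and `mergeAcrossAnnotations` are hypothetical names, sentences are assumed sorted by offset, and coordinates are simply concatenated with `;`, which matches the shape of the `coords` attribute in the result above but may not be how the real implementation combines boxes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not GROBID code): merges sentences split in the middle of an annotation. */
class SentenceMerger {

    /** Sentence as paragraph offsets (end exclusive) plus a coords string of ';'-separated boxes. */
    static final class Sentence {
        final int start;
        final int end;
        final String coords;

        Sentence(int start, int end, String coords) {
            this.start = start;
            this.end = end;
            this.coords = coords;
        }
    }

    /** Merges a sentence into the previous one when an annotation spans the boundary between them. */
    static List<Sentence> mergeAcrossAnnotations(List<Sentence> sentences, List<int[]> annotations) {
        List<Sentence> merged = new ArrayList<>();
        for (Sentence next : sentences) {
            if (!merged.isEmpty()) {
                Sentence prev = merged.get(merged.size() - 1);
                if (cutsAnnotation(prev.end, annotations)) {
                    // Extend the previous sentence and append the coordinates of the next one.
                    merged.set(merged.size() - 1,
                        new Sentence(prev.start, next.end, prev.coords + ";" + next.coords));
                    continue;
                }
            }
            merged.add(next);
        }
        return merged;
    }

    /** True when the boundary offset lies strictly inside some annotation {start, end}. */
    static boolean cutsAnnotation(int boundary, List<int[]> annotations) {
        for (int[] a : annotations) {
            if (a[0] < boundary && boundary < a[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "We thank Drs. Carsten Korth and Nick Brandon.";
        int firstEnd = text.indexOf("Drs.") + 4; // wrong split right after "Drs."
        // Coordinates below are illustrative placeholders, not values from a real PDF.
        List<Sentence> sentences = Arrays.asList(
            new Sentence(0, firstEnd, "1,10.0,20.0,60.0,9.0"),
            new Sentence(firstEnd + 1, text.length(), "1,72.0,20.0,200.0,9.0"));
        int annStart = text.indexOf("Drs.");
        List<int[]> annotations = new ArrayList<>();
        annotations.add(new int[] {annStart, annStart + "Drs. Carsten Korth".length()});
        for (Sentence s : mergeAcrossAnnotations(sentences, annotations)) {
            System.out.println(text.substring(s.start, s.end) + " | " + s.coords);
        }
    }
}
```

Because the merge is checked against the last merged sentence rather than the last input sentence, an annotation that is cut more than once still ends up inside a single sentence.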

lfoppiano changed the title ~~Funding statement not splitted in sentences~~ Funding, acknowledgement statements are not split into sentences Apr 10, 2024

lfoppiano self-assigned this Apr 24, 2024

lfoppiano mentioned this issue May 1, 2024

Add missing sentence segmentation in funding and acknowledgement #1106

Open

lfoppiano added the bug label May 1, 2024

lfoppiano added this to the 0.8.1 milestone May 5, 2024
