Full text model layout features: BLOCKSTART missing, if very first block token is a new line #712

Open
de-code opened this issue Feb 12, 2021 · 5 comments · May be fixed by #714

@de-code
Collaborator

de-code commented Feb 12, 2021

At least for some documents, the first token of a block seems to be a line feed.

In that case the line feed is filtered out:

	if (TextUtilities.filterLine(text)) {
		n++;
		continue;
	}

But when the next "real" token is then processed, n is no longer 0 but 1, so execution does not enter the main BLOCKSTART branch:

	if (n == 0) {
		features.lineStatus = "LINESTART";
		// be sure that previous token is closing a line, except if it's a starting line
		if (previousFeatures != null) {
			if (!previousFeatures.lineStatus.equals("LINESTART"))
				previousFeatures.lineStatus = "LINEEND";
		}
		if (token != null)
			lineStartX = token.getX();
		features.blockStatus = "BLOCKSTART";
	} else if (n == tokens.size() - 1) {
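
To make the effect concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not GROBID's code: the class `BlockStatusSketch`, the helper `isFiltered` (a stand-in that assumes `TextUtilities.filterLine` drops blank tokens and bare line feeds) and both labelling methods are hypothetical, and the BLOCKEND branch is left out for brevity. The second method shows one possible way to avoid the problem by tracking the first non-filtered token instead of the raw index; it is not necessarily what the fix PR does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the feature loop; not GROBID's actual code.
class BlockStatusSketch {

    // Assumption: a token is filtered out when it is blank or a bare line feed.
    static boolean isFiltered(String text) {
        return text == null || text.trim().isEmpty();
    }

    // Mirrors the reported behaviour: the block status depends on the raw index n.
    static List<String> labelByIndex(List<String> tokens) {
        List<String> statuses = new ArrayList<>();
        for (int n = 0; n < tokens.size(); n++) {
            if (isFiltered(tokens.get(n))) {
                continue; // n has already moved past 0 when the next token is seen
            }
            statuses.add(n == 0 ? "BLOCKSTART" : "BLOCKIN");
        }
        return statuses;
    }

    // One possible alternative: remember whether a non-filtered token was seen yet.
    static List<String> labelByFirstRealToken(List<String> tokens) {
        List<String> statuses = new ArrayList<>();
        boolean seenRealToken = false;
        for (String text : tokens) {
            if (isFiltered(text)) {
                continue;
            }
            statuses.add(seenRealToken ? "BLOCKIN" : "BLOCKSTART");
            seenRealToken = true;
        }
        return statuses;
    }

    public static void main(String[] args) {
        // A block whose very first token is a line feed.
        List<String> tokens = Arrays.asList("\n", "We", " ", "also");
        System.out.println(labelByIndex(tokens));          // [BLOCKIN, BLOCKIN]
        System.out.println(labelByFirstRealToken(tokens)); // [BLOCKSTART, BLOCKIN]
    }
}
```

With the index-based check, no token in the block is ever labelled BLOCKSTART once the leading line feed has been skipped.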

Example document

475335v1 (DOI: 10.1101/475335)

PDF

image

bioRxiv XML
<sec id="s3c">
    <title>Epidemic synchrony and annual phase coherence</title>
    <p>
        We explored correlations between dengue time series in different regions. Both epidemic synchrony and phase coherence were higher for closer regions and declined with distance (
        <xref rid="fig3" ref-type="fig">Fig 3</xref>
        ). For the Urban-2 (n = 161) spatial level, epidemic synchrony reached the average countrywide correlation at approximately 1,260 kilometres (
        <xref rid="fig3" ref-type="fig">Fig 3A</xref>
        ). This synchrony length represents a substantial part of Brazil’s dimensions as the country extends 4,395 kilometres north to south and 4,319 kilometres west to east. The coherence length had a higher value of 1,590 kilometres (
        <xref rid="fig3" ref-type="fig">Fig 3B</xref>
        ), suggesting that agreement in dengue seasonality spreads further than correlations of epidemic curves.
    </p>
    <fig id="fig3" position="float" fig-type="figure">
        <label>Fig 3.</label>
        <caption>
            <title>
                Epidemic synchrony and annual phase coherence between Brazilian Urban-2 regions.
            </title>
            <p>
                Epidemic synchrony (A) and annual phase coherence (B) summarised using nonparametric spline covariance function. Solid blue line describes the mean pairwise correlation from the data and the dotted lines represent the 95% envelope for bootstrapped correlations of case and annual phase angle time series, respectively. Red line indicates global countrywide correlation.
            </p>
        </caption>
        <graphic xlink:href="475335_fig3.tif"/>
    </fig>
    <p>
        We also looked at epidemic synchrony and annual phase coherence at other spatial levels (S4 and S5 Figs, respectively) and found that both synchrony and coherence lengths tend to decrease for smaller spatial resolutions and stabilise at 1,240 km and 1,500 km.
    </p>
</sec>

The text "We also looked at epidemic synchrony..." (line 216) does not get the BLOCKSTART feature (it gets BLOCKIN instead), even though it is in its own block (with a line feed as the first token, as described above).


I could try to submit a fix PR for it.

/cc @kermitt2

@de-code de-code added the bug From Hemiptera and especially its suborder Heteroptera label Feb 12, 2021
@kermitt2
Owner

Thanks a lot @de-code for raising this error! A PR would of course be very welcome :)

@de-code de-code self-assigned this Feb 15, 2021
@de-code
Collaborator Author

de-code commented Feb 15, 2021

Hi @kermitt2, I am trying to add a unit test for the change.

But I am having a bit of trouble with the following block:

	int lastPos = tokens.size();
	// if it's a last block from a document piece, it may end earlier
	if (blockIndex == dp2.getBlockPtr()) {
		lastPos = dp2.getTokenBlockPos() + 1;
		if (lastPos > tokens.size()) {
			LOGGER.error("DocumentPointer for block " + blockIndex + " points to " +
				dp2.getTokenBlockPos() + " token, but block token size is " +
				tokens.size());
			lastPos = tokens.size();
		}
	}

With just one block, it is causing lastPos to end up with 1 rather than the number of tokens.

@kermitt2
Owner

With just one block, it is causing lastPos to end up with 1 rather than the number of tokens.

I think dp2.getTokenBlockPos() marks the end of the full text "zone", so dp2.getTokenBlockPos() is either the token position at the end of the block (then lastPos is at tokens.size() - the whole block content is full text) or somewhere in the block (the block is partially full text). If we have dp2.getTokenBlockPos() at 0, the block is normally excluded from the full text "zone".
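
The cases described here can be checked with a small sketch that isolates the quoted `lastPos` logic in a pure function. The class `LastPosSketch` and its parameters are hypothetical: `tokenCount` stands in for `tokens.size()`, `isLastBlock` for `blockIndex == dp2.getBlockPtr()`, and `tokenBlockPos` for `dp2.getTokenBlockPos()`. Logging is omitted.

```java
// Hypothetical helper isolating the quoted lastPos logic; not part of GROBID.
class LastPosSketch {

    static int lastPos(int tokenCount, boolean isLastBlock, int tokenBlockPos) {
        int lastPos = tokenCount;
        // If it is the last block of the document piece, it may end earlier.
        if (isLastBlock) {
            lastPos = tokenBlockPos + 1;
            if (lastPos > tokenCount) {
                // Pointer lies beyond the block: fall back to the whole block.
                lastPos = tokenCount;
            }
        }
        return lastPos;
    }

    public static void main(String[] args) {
        System.out.println(lastPos(10, false, 0)); // 10: not the last block, whole block used
        System.out.println(lastPos(10, true, 9));  // 10: pointer at the last token of the block
        System.out.println(lastPos(10, true, 4));  // 5: block is only partially full text
        System.out.println(lastPos(10, true, 0));  // 1: the single-block test case reported above
    }
}
```

Under these assumptions, a test that builds a single block probably needs the pointer's token position set to the index of the block's last token for the whole block to be processed.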

@de-code de-code linked a pull request Feb 15, 2021 that will close this issue
@de-code
Collaborator Author

de-code commented Feb 15, 2021

With just one block, it is causing lastPos to end up with 1 rather than the number of tokens.

I think dp2.getTokenBlockPos() marks the end of the full text "zone", so dp2.getTokenBlockPos() is either the token position at the end of the block (then lastPos is at tokens.size() - the whole block content is full text) or somewhere in the block (the block is partially full text). If we have dp2.getTokenBlockPos() at 0, the block is normally excluded from the full text "zone".

Okay, thank you. I think I may be creating the data incorrectly. I created a draft PR with what I have so far: #714

@de-code
Collaborator Author

de-code commented Feb 23, 2021

Hi @lfoppiano, is there more detail you are looking for here to get more context? (I am not quite sure what to add at the moment.)
