Removing irrelevant cruft from publisher HTML #32

Open
petermr opened this issue Apr 30, 2016 · 5 comments

@petermr
Member

petermr commented Apr 30, 2016

Much (perhaps most) HTML from publishers includes large amounts of material not relevant to the scholarly narrative. This includes:

  1. information about the journal
  2. metrics
  3. advertising
  4. links to other resources
  5. over-complex annotation

Much of this can be handled by XSLT stylesheets that "snip off" the cruft. I don't think there is a simple general way of tackling this; it has to be a per-publisher or per-journal solution. That means we need a way of locating and applying stylesheets from the command line.

Ideally we need:

  1. a means for detecting which publisher/journal has created the document
  2. a means for removing unwanted sections
  3. a means for restructuring, e.g. turning:
<h2>title</h2>
<p>p1</p>
<p>p2</p>

into

<div class="controlled_vocab" title="title">
<p>p1</p>
<p>p2</p>
</div>

I propose XSLT and XPath for the first two. XSLT may also be able to handle the restructuring in 3, but for that we'd need XSLT 2.0 with Saxon.
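
The restructuring in point 3 can be sketched without Saxon. This is only an illustration, not what Norma does: it swaps in an XSLT 1.0 key-based grouping (instead of XSLT 2.0 grouping) so that it runs on the transformer bundled with the JDK. The class name `SectionGrouper` and the key name `byHeading` are made up for the example, and it only handles `p` elements that are siblings of an `h2`.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical helper, not part of Norma.
public class SectionGrouper {

    // XSLT 1.0: every h2 becomes a div, and the paragraphs that follow it
    // (up to the next h2) are looked up through a key and copied inside.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " xmlns='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // index each paragraph by the id of its nearest preceding h2
      + "<xsl:key name='byHeading' match='h:p[preceding-sibling::h:h2]'"
      + " use='generate-id(preceding-sibling::h:h2[1])'/>"
      // identity template: copy everything not matched below
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // replace the heading by a div holding its paragraphs
      + "<xsl:template match='h:h2'>"
      + "<div class='controlled_vocab' title='{normalize-space(.)}'>"
      + "<xsl:copy-of select='key(\"byHeading\", generate-id())'/>"
      + "</div>"
      + "</xsl:template>"
      // paragraphs already copied into a div are dropped at their old position
      + "<xsl:template match='h:p[preceding-sibling::h:h2]'/>"
      + "</xsl:stylesheet>";

    static String group(String xhtml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<body xmlns='http://www.w3.org/1999/xhtml'>"
                  + "<h2>title</h2><p>p1</p><p>p2</p></body>";
        System.out.println(group(in));
    }
}
```

Real publisher pages nest headings inside other containers and mix in lists, tables and figures, so a production stylesheet would need more than this one key.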

@petermr
Member Author

petermr commented May 1, 2016

Made good progress yesterday on Taylor and Francis (whose HTML is full of cruft and repeated text). Here's a brief stylesheet:

    <xsl:output method="xhtml"/>

    <!-- Identity template: copies every node unless a more specific template below matches -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="comment()" priority="1.0"/>

<!-- delete these sections and snippets -->
    <!--  header -->
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>
    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 
    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/> 
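
To try the "identity plus empty templates" pattern outside Norma, a cut-down version can be run with the XSLT 1.0 transformer that ships with the JDK. This is a rough sketch under assumptions: the class name `CruftStripper` and the sample input are invented, only three of the delete rules are kept, `method='xml'` replaces `xhtml`, and the `ends-with` rule is left out because both need an XSLT 2.0 processor such as Saxon.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical helper, not part of Norma.
public class CruftStripper {

    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // identity template: copy everything not matched below
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // empty templates delete whatever they match
      + "<xsl:template match='comment()' priority='1.0'/>"
      + "<xsl:template match=\"h:div[@id='cookieBanner']\"/>"
      + "<xsl:template match=\"h:div[contains(@class,'social')]\"/>"
      + "<xsl:template match=\"h:div[normalize-space(.)='']\" priority='0.51'/>"
      + "</xsl:stylesheet>";

    static String strip(String xhtml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                  + "<!-- tracking -->"
                  + "<div id='cookieBanner'>Cookies</div>"
                  + "<div class='social share'>Tweet</div>"
                  + "<div> </div>"
                  + "<p>Real text</p>"
                  + "</body></html>";
        System.out.println(strip(in));
    }
}
```

The input must already be well-formed XHTML in the XHTML namespace, otherwise the `h:` patterns match nothing.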

@robintw

robintw commented May 1, 2016

That's great - thanks! How do I go about applying this template to the HTML? Is there a method already built into one of the ContentMine tools (e.g. norma), or do I need to do this separately?

@petermr
Member Author

petermr commented May 1, 2016

It's built into Norma, and I think the production version works. It needs two passes: the first creates XHTML and the second strips the cruft. (A third pass normalizes the XHTML to formal SHTML.)

@petermr
Member Author

petermr commented May 1, 2016

Here's my test:

        // Work on a fresh copy of the test project
        File targetDir = new File("target/tutorial/tf");
        CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/norma/pubstyle/tf/TandF_OA_Test"), targetDir);
        // Pass 1: convert the raw publisher HTML to XHTML with jsoup
        String args = "--project "+targetDir+" -i fulltext.html -o fulltext.xhtml --html jsoup";
        DefaultArgProcessor argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        CProject project = new CProject(targetDir);
        CTree ctree0 = project.getCTreeList().get(0);
        File xhtml = ctree0.getExistingFulltextXHTML();
        Assert.assertTrue("xhtml: ", xhtml.exists());
        // Pass 2: strip the cruft with the tf2html stylesheet
        args = "--project "+targetDir+" -i fulltext.xhtml -o scholarly.html --transform tf2html";
        argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        File shtml = ctree0.getExistingScholarlyHTML();
        Assert.assertTrue("shtml: ", shtml.exists());

That SHOULD translate into:

    norma --project . -i fulltext.html -o fulltext.xhtml --html jsoup
    norma --project . -i fulltext.xhtml -o scholarly.html --transform tf2html

You then need the updated symbol file stylesheetByName.xsl:

<stylesheetList>
  <stylesheet name="bmc2html">/org/xmlcml/norma/pubstyle/bmc/xml2html.xsl</stylesheet>
  <stylesheet name="ieee2html">/org/xmlcml/norma/pubstyle/ieee/toHtml.xsl</stylesheet>
  <stylesheet name="ncbi-jats2html">/org/xmlcml/norma/pubstyle/nlm/ncbi/jats-html.xsl</stylesheet>
  <stylesheet name="nlm2html">/org/xmlcml/norma/pubstyle/nlm/toHtml.xsl</stylesheet>
  <stylesheet name="jats2shtml">/org/xmlcml/norma/pubstyle/nlm/jats/jats2shtml.xsl</stylesheet>
  <stylesheet name="nature2html">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>
  <stylesheet name="hind2xml">/org/xmlcml/norma/pubstyle/hindawi/groupMajorSections.xsl</stylesheet>
  <stylesheet name="tf2html">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>
  <!--  patents  -->
  <stylesheet name="uspto2html">/org/xmlcml/norma/patents/uspto/toHtml.xsl</stylesheet>
</stylesheetList>

and the stylesheet toHtml.xsl itself:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:h="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml"/>

    <xsl:template match="/">
        <xsl:apply-templates />
    </xsl:template>

    <!-- Identity template: copies every node unless a more specific template below matches -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="comment()" priority="1.0"/>

    <!--  header -->
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>

    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 

    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:a[.='figureViewerArticleInfo']"/> 
    <xsl:template match="h:div[@class='hidden']"/> 
    <xsl:template match="h:span[contains(@class,'dropDownAlt')]"/> 
    <xsl:template match="h:a[contains(@onclick,'showFigures')]"/> 
    <xsl:template match="h:div[@class='figureDownloadOptions']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/> 

</xsl:stylesheet>
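
For readers wondering how a name such as `tf2html` could be mapped to a stylesheet path, here is a rough sketch that reads a `stylesheetList` like the one above with the JDK's XPath API. It is an assumption-laden illustration, not Norma's actual lookup code: the class name `StylesheetLookup` and the method `resolve` are invented, and the list is passed in as a string rather than loaded from the classpath.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Hypothetical helper, not part of Norma.
public class StylesheetLookup {

    // Returns the resource path registered under the given name, or null if absent.
    static String resolve(String listXml, String name) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $name as an XPath variable so the name is never spliced into the expression
        xpath.setXPathVariableResolver(q -> "name".equals(q.getLocalPart()) ? name : null);
        String path = xpath.evaluate(
            "/stylesheetList/stylesheet[@name=$name]",
            new InputSource(new StringReader(listXml)));
        path = path.trim();
        return path.isEmpty() ? null : path;
    }

    public static void main(String[] args) throws Exception {
        String list =
            "<stylesheetList>"
          + "<stylesheet name=\"nature2html\">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>"
          + "<stylesheet name=\"tf2html\">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>"
          + "</stylesheetList>";
        System.out.println(resolve(list, "tf2html"));
    }
}
```

The resolved path could then be opened as a classpath resource and handed to an XSLT processor.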

@petermr
Member Author

petermr commented May 1, 2016

See https://github.com/ContentMine/norma/blob/master/docs/TRANSFORM.md for details. Please check whether this works and comment. The same can be done for other publishers.
