Removing irrelevant cruft from publisher HTML #32

petermr · 2016-04-30T06:45:14Z

Much (perhaps most) publisher HTML includes large amounts of material that is not relevant to the scholarly narrative. This includes:

information about the journal
metrics
advertising
links to other resources
over-complex annotation

Much of this can be managed by XSLT stylesheets which "snip off" this cruft. I don't think there is a simple way of tackling this; it has to be a per-publisher or per-journal solution. That means we need a way of locating and using stylesheets from the command line.
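A per-publisher solution first has to decide which publisher produced a page. A minimal Java sketch of that step, using only the JDK's XPath support, might look like the following. It assumes the page is already well-formed XHTML and that the publisher emits a `citation_publisher` meta element, which not every site does. The class name, the publisher strings and the mapping are hypothetical, and the stylesheet names `tf2html` and `hind2xml` are used purely as example values.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PublisherDetector {

    // Hypothetical mapping from a publisher string to a stylesheet name.
    static final Map<String, String> STYLESHEET_BY_PUBLISHER = new HashMap<>();
    static {
        STYLESHEET_BY_PUBLISHER.put("Taylor & Francis", "tf2html");
        STYLESHEET_BY_PUBLISHER.put("Hindawi", "hind2xml");
    }

    // local-name() avoids having to bind the XHTML namespace to a prefix.
    // Assumption: the page carries <meta name="citation_publisher" content="..."/>.
    static final String PUBLISHER_XPATH =
        "string(//*[local-name()='meta'][@name='citation_publisher']/@content)";

    /** Returns the declared publisher, or "" if the page has no such meta element. */
    static String detectPublisher(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            return XPathFactory.newInstance().newXPath()
                .evaluate(PUBLISHER_XPATH, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not inspect document", e);
        }
    }

    /** Returns the stylesheet name for the page, or null if the publisher is unknown. */
    static String stylesheetNameFor(String xhtml) {
        return STYLESHEET_BY_PUBLISHER.get(detectPublisher(xhtml));
    }

    public static void main(String[] args) {
        String page = "<html xmlns='http://www.w3.org/1999/xhtml'><head>"
            + "<meta name='citation_publisher' content='Taylor &amp; Francis'/>"
            + "</head><body><p>text</p></body></html>";
        System.out.println(stylesheetNameFor(page));
    }
}
```

Returning `null` for an unknown publisher leaves it to the caller to decide whether to skip the document or fall back to a generic stylesheet.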

Ideally we need:

a means for detecting which publisher/journal has created the document
a means for removing unwanted sections
restructuring, e.g. turning:

<h2>title</h2>
<p>p1</p>
<p>p2</p>

into

<div class="controlled_vocab" title="title">
<p>p1</p>
<p>p2</p>
</div>

I propose XSLT and XPath for the first two. It's possible that XSLT can also tackle the third (restructuring); for that we'd need XSLT 2.0 with Saxon.
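To make the restructuring step concrete, here is a rough Java sketch that wraps each `h2` and the sibling paragraphs that follow it into a `div`. XSLT 2.0's `xsl:for-each-group` would be the more natural tool. This sketch instead uses XSLT 1.0 key-based grouping, purely so that it runs on the XSLT processor bundled with the JDK. The class name and the `class="section"` value are placeholders, not a proposed vocabulary.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SectionGrouper {

    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='h'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // Index each paragraph by the id of its nearest preceding sibling h2.
      + "<xsl:key name='paras' match='h:p[preceding-sibling::h:h2]'"
      + " use='generate-id(preceding-sibling::h:h2[1])'/>"
        // Identity template: copy anything not handled below.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
        // Each h2 becomes a div that pulls in its own paragraphs via the key.
      + "<xsl:template match='h:h2'>"
      + "<div xmlns='http://www.w3.org/1999/xhtml' class='section'"
      + " title='{normalize-space(.)}'>"
      + "<xsl:apply-templates select=\"key('paras', generate-id())\" mode='grouped'/>"
      + "</div>"
      + "</xsl:template>"
        // Paragraphs owned by a heading are suppressed in the normal pass ...
      + "<xsl:template match='h:p[preceding-sibling::h:h2]'/>"
        // ... and copied only when requested from inside the new div.
      + "<xsl:template match='h:p' mode='grouped'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Groups sibling h2/p runs in well-formed XHTML into div elements. */
    static String group(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("grouping failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(group("<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<h2>title</h2><p>p1</p><p>p2</p></body></html>"));
    }
}
```

The sketch only groups `p` elements that are siblings of the heading. Lists, figures or tables between headings would stay outside the new `div`, so a real stylesheet would need a broader match.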


petermr · 2016-05-01T08:31:58Z

Made good progress yesterday for Taylor and Francis (which is full of cruft and repeated text). Here's a brief stylesheet:

    <xsl:output method="xhtml"/>

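    <!-- Identity template: copies every node and attribute unless a more specific template matches -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <!-- strip comments (processing instructions are still copied by the identity template) -->
    <xsl:template match="comment()" priority="1.0"/>

    <!-- delete these sections and snippets -->
    <!--  header -->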
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>
    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 
    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/>

robintw · 2016-05-01T10:54:53Z

That's great - thanks! How do I go about applying this template to the HTML? Is there a method already built into one of the ContentMine tools (e.g. norma), or do I need to do this separately?

petermr · 2016-05-01T17:16:36Z

It's built into Norma. I think the production version works. It needs two passes: the first to create XHTML, the second to strip cruft (and a third to normalize the XHTML to formal SHTML).
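For anyone wanting to see what the cruft-stripping pass amounts to outside Norma, here is a rough Java sketch using the XSLT processor bundled with the JDK. It is not how Norma invokes its stylesheets. The class name and the two `div` ids are made up for illustration. Because the bundled processor only supports XSLT 1.0, the sketch uses `method="xml"` and avoids XPath 2.0 functions such as `ends-with()`. The input must already be well-formed XHTML, i.e. the output of the first pass.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CruftSnipper {

    // Hypothetical stylesheet: an identity copy plus empty templates
    // that swallow two made-up cruft sections.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"h:div[@id='adverts']\"/>"
      + "<xsl:template match=\"h:div[@id='metrics']\"/>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to well-formed XHTML and returns the result. */
    static String snip(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<div id='adverts'>Buy now</div>"
            + "<div id='metrics'>42 views</div>"
            + "<p>The scholarly narrative.</p>"
            + "</body></html>";
        System.out.println(snip(page));
    }
}
```

Running a stylesheet that does rely on XSLT 2.0 features would mean putting an XSLT 2.0 processor such as Saxon on the classpath instead of the JDK default.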

petermr · 2016-05-01T17:22:03Z

Here's my test:

        // Work on a fresh copy of the Taylor and Francis test project
        File targetDir = new File("target/tutorial/tf");
        CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/norma/pubstyle/tf/TandF_OA_Test"), targetDir);
        // Pass 1: convert the raw publisher HTML to well-formed XHTML using jsoup
        String args = "--project "+targetDir+" -i fulltext.html -o fulltext.xhtml --html jsoup";
        DefaultArgProcessor argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        CProject project = new CProject(targetDir);
        CTree ctree0 = project.getCTreeList().get(0);
        File xhtml = ctree0.getExistingFulltextXHTML();
        Assert.assertTrue("xhtml: ", xhtml.exists());
        // Pass 2: strip the cruft with the tf2html stylesheet to give scholarly.html
        args = "--project "+targetDir+" -i fulltext.xhtml -o scholarly.html --transform tf2html";
        argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        File shtml = ctree0.getExistingScholarlyHTML();
        Assert.assertTrue("shtml: ", shtml.exists());

That SHOULD translate into:

    norma --project . -i fulltext.html -o fulltext.xhtml --html jsoup
    norma --project . -i fulltext.xhtml -o scholarly.html --transform tf2html

You then need the updated symbol file stylesheetByName.xsl:

<stylesheetList>
  <stylesheet name="bmc2html">/org/xmlcml/norma/pubstyle/bmc/xml2html.xsl</stylesheet>
  <stylesheet name="ieee2html">/org/xmlcml/norma/pubstyle/ieee/toHtml.xsl</stylesheet>
  <stylesheet name="ncbi-jats2html">/org/xmlcml/norma/pubstyle/nlm/ncbi/jats-html.xsl</stylesheet>
  <stylesheet name="nlm2html">/org/xmlcml/norma/pubstyle/nlm/toHtml.xsl</stylesheet>
  <stylesheet name="jats2shtml">/org/xmlcml/norma/pubstyle/nlm/jats/jats2shtml.xsl</stylesheet>
  <stylesheet name="nature2html">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>
  <stylesheet name="hind2xml">/org/xmlcml/norma/pubstyle/hindawi/groupMajorSections.xsl</stylesheet>
  <stylesheet name="tf2html">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>
  <!--  patents  -->
  <stylesheet name="uspto2html">/org/xmlcml/norma/patents/uspto/toHtml.xsl</stylesheet>
</stylesheetList>
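A mapping like this is simple enough to read with the JDK's DOM parser. The sketch below shows one way a name such as `tf2html` could be resolved to a classpath resource path. It is an illustration only, not Norma's actual lookup code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class StylesheetLookup {

    /** Parses a stylesheetList document into a name -> resource path map. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("stylesheet");
            Map<String, String> pathByName = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element stylesheet = (Element) nodes.item(i);
                // The name attribute is the key, the element text is the resource path.
                pathByName.put(stylesheet.getAttribute("name"),
                    stylesheet.getTextContent().trim());
            }
            return pathByName;
        } catch (Exception e) {
            throw new IllegalStateException("could not read stylesheet list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<stylesheetList>"
            + "<stylesheet name=\"tf2html\">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>"
            + "<!--  patents  -->"
            + "<stylesheet name=\"uspto2html\">/org/xmlcml/norma/patents/uspto/toHtml.xsl</stylesheet>"
            + "</stylesheetList>";
        System.out.println(parse(xml).get("tf2html"));
    }
}
```

A path found this way could then be opened with `Class.getResourceAsStream` and handed to an XSLT processor.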

and the stylesheet toHtml.xsl itself:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:h="http://www.w3.org/1999/xhtml">

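    <!-- method="xhtml" here and ends-with() below are XSLT 2.0 / XPath 2.0 features, so an XSLT 2.0 processor such as Saxon is needed -->
    <xsl:output method="xhtml"/>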

    <xsl:template match="/">
        <xsl:apply-templates />
    </xsl:template>

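    <!-- Identity template: copies every node and attribute unless a more specific template matches -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <!-- strip comments (processing instructions are still copied by the identity template) -->
    <xsl:template match="comment()" priority="1.0"/>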

    <!--  header -->
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>

    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 

    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:a[.='figureViewerArticleInfo']"/> 
    <xsl:template match="h:div[@class='hidden']"/> 
    <xsl:template match="h:span[contains(@class,'dropDownAlt')]"/> 
    <xsl:template match="h:a[contains(@onclick,'showFigures')]"/> 
    <xsl:template match="h:div[@class='figureDownloadOptions']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/> 

</xsl:stylesheet>

petermr · 2016-05-01T18:28:42Z

See https://github.com/ContentMine/norma/blob/master/docs/TRANSFORM.md - please check whether this works and comment. The same can be done for other publishers.

petermr added the enhancement label Apr 30, 2016
