Conal Tuohy edited this page May 4, 2022 · 14 revisions

Example usage of APIHarvester

DigitalNZ API

(NB the DigitalNZ API requires the use of a developer API key, here shown as XXXXXXX)

To harvest metadata records containing the text "library" from the DigitalNZ content partner Kete Christchurch:

java -jar dist/apiharvester.jar directory=library retries=0 url="http://api.digitalnz.org/v3/records.xml?text=library&per_page=100&and[content_partner][]=Kete+Christchurch&api_key=XXXXXXX&page=1" records-xpath="/search/results/result" id-xpath="id" resume-when-xpath="number(/search/result-count) > number(/search/page) * number(/search/per-page)" discard-xpath="*[not(normalize-space())] | text()[not(normalize-space())]" resumption-xpath="concat(substring-before(/search/request-url, '&page='), '&page=', 1 + number(/search/page))" indent=yes delay=9

Notes

In this example, resume-when-xpath is used to calculate whether enough pages have been harvested to include all the records. If the number of search results is greater than the number of pages harvested so far multiplied by the number of results per page, then the harvest needs to resume; otherwise it is finished.
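
The comparison made by resume-when-xpath can be sketched in Python as follows. This is only an illustration of the arithmetic, not APIHarvester's own code; the function name and the sample figures (250 results, 100 per page) are made up for the example.

```python
def should_resume(result_count: int, page: int, per_page: int) -> bool:
    """Mirror of: number(/search/result-count) > number(/search/page) * number(/search/per-page)."""
    return result_count > page * per_page


# Hypothetical search with 250 results at 100 per page:
# after pages 1 and 2 there are still unharvested records, after page 3 there are none.
print([should_resume(250, page, 100) for page in (1, 2, 3)])  # [True, True, False]
```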

If the harvest does need to resume, then the resumption-xpath parameter says that the URL of the next page of data can be created by taking the current request URL, stripping off the page parameter, and adding a new page parameter whose value is the current page number + 1.
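
The string manipulation performed by this resumption-xpath can be sketched in Python. The helper names are hypothetical, and the sample URL is a shortened stand-in for whatever DigitalNZ echoes back in /search/request-url.

```python
def substring_before(text: str, marker: str) -> str:
    """Behaves like XPath substring-before(): empty string if the marker is absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""


def next_page_url(request_url: str, page: int) -> str:
    """Mirror of: concat(substring-before(url, '&page='), '&page=', 1 + page)."""
    return substring_before(request_url, "&page=") + "&page=" + str(page + 1)


# Hypothetical request URL, shortened for the example.
url = "http://api.digitalnz.org/v3/records.xml?text=library&per_page=100&page=1"
print(next_page_url(url, 1))
# http://api.digitalnz.org/v3/records.xml?text=library&per_page=100&page=2
```

Everything from "&page=" onwards is dropped, so the sketch (like the XPath expression) only preserves the other parameters if page is the last parameter in the URL, as it is in the example command.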

DigitalNZ is rate-limited to 10000 requests per day, or 1 request per 8.64s. In the example, a delay of 9 seconds between requests is inserted, ensuring that the harvester can continue harvesting for longer than 24 hours, without breaching the rate limit.
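
The figures in the previous paragraph can be checked with a few lines of Python. The variable names are chosen for the example only.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400
DAILY_REQUEST_LIMIT = 10000      # DigitalNZ limit quoted above

# Shortest average interval between requests that stays within the limit.
min_interval = SECONDS_PER_DAY / DAILY_REQUEST_LIMIT

# Upper bound on requests per day with delay=9
# (ignores the time each request itself takes, which only lowers the rate).
max_requests_at_delay_9 = SECONDS_PER_DAY // 9

print(min_interval)             # 8.64
print(max_requests_at_delay_9)  # 9600
```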

The discard-xpath parameter is here used to remove white space, and any elements which don't contain any text (apart from white space). In the case of DigitalNZ, this means that elements like <credit-creator nil="true"/> will be discarded from within each <result> record.
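
As a rough illustration of what this discard-xpath selects, the Python sketch below removes blank child elements and whitespace-only text from a single record using the standard library's ElementTree. It is not how APIHarvester is implemented; the function name and the sample record are invented, and the sketch only looks at the direct children of the record.

```python
import xml.etree.ElementTree as ET


def discard_blank_children(record: ET.Element) -> ET.Element:
    """Drop child elements with no non-whitespace text, and whitespace-only text nodes."""
    # Equivalent of *[not(normalize-space())]
    for child in list(record):
        if not "".join(child.itertext()).strip():
            record.remove(child)
    # Equivalent of text()[not(normalize-space())]
    if record.text is not None and not record.text.strip():
        record.text = None
    for child in record:
        if child.tail is not None and not child.tail.strip():
            child.tail = None
    return record


# Hypothetical record, loosely modelled on a DigitalNZ <result>.
sample = ET.fromstring(
    '<result>\n  <id>123</id>\n  <credit-creator nil="true"/>\n  <title>Library</title>\n</result>'
)
cleaned = ET.tostring(discard_blank_children(sample), encoding="unicode")
print(cleaned)  # <result><id>123</id><title>Library</title></result>
```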

Trove API

(NB the Trove API requires the use of a developer API key, here shown as XXXXXXX)

To harvest the full text of articles from the newspaper whose id is 1055 (which identifies the "Brisbane Telegraph"; see list of newspaper titles):

java -jar apiharvester.jar directory="1055" url="https://api.trove.nla.gov.au/v2/result?q=%20&zone=newspaper&include=articletext&n=100&reclevel=full&bulkHarvest=true&l-title=1055" records-xpath="/response/zone/records/article" id-xpath="@id" resumption-xpath="substring-after(/response/zone/records/@next, '/')" url-suffix="&key=XXXXXXX"

XML Sitemap from APO.org

To harvest individual url records from an XML Sitemap:

java -jar apiharvester.jar directory="/tmp/apo" url="http://apo.org.au/sitemap.xml" records-xpath="/s:urlset/s:url" id-xpath="s:loc" resumption-xpath="/s:sitemapindex/s:sitemap/s:loc/text()" xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9"

Notes

The original URL http://apo.org.au/sitemap.xml does not contain any elements matching the records-xpath expression (/s:urlset/s:url) so no records are harvested from it. However, it does contain 2 nodes matching the resumption-xpath expression (/s:sitemapindex/s:sitemap/s:loc/text()) so it causes two more URLs to be harvested, from which 47941 records are found to match the records-xpath expression, and are therefore saved into separate XML files.
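
The way the two expressions behave on a sitemap index versus an ordinary sitemap can be sketched in Python with ElementTree. The function names and the example.org documents are hypothetical; the real APO sitemap is not reproduced here.

```python
import xml.etree.ElementTree as ET

S = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"s": S}


def record_locs(root: ET.Element) -> list:
    """Rough equivalent of /s:urlset/s:url, returning each record's s:loc (the id)."""
    if root.tag != f"{{{S}}}urlset":
        return []
    return [url.findtext("s:loc", namespaces=NS) for url in root.findall("s:url", NS)]


def resumption_urls(root: ET.Element) -> list:
    """Rough equivalent of /s:sitemapindex/s:sitemap/s:loc/text()."""
    if root.tag != f"{{{S}}}sitemapindex":
        return []
    return [loc.text for loc in root.findall("s:sitemap/s:loc", NS)]


# Hypothetical sitemap index pointing at two sitemaps.
index = ET.fromstring(
    f'<sitemapindex xmlns="{S}">'
    "<sitemap><loc>http://example.org/sitemap1.xml</loc></sitemap>"
    "<sitemap><loc>http://example.org/sitemap2.xml</loc></sitemap>"
    "</sitemapindex>"
)
# Hypothetical sitemap containing two url records.
page = ET.fromstring(
    f'<urlset xmlns="{S}">'
    "<url><loc>http://example.org/a</loc></url>"
    "<url><loc>http://example.org/b</loc></url>"
    "</urlset>"
)

print(record_locs(index), resumption_urls(index))  # no records, two URLs to follow
print(record_locs(page), resumption_urls(page))    # two records, nothing to follow
```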

Library of Congress OAI-PMH provider

To harvest the "Sheet Music" set from the Library of Congress, in oai_dc format:

java -jar apiharvester.jar directory=tmp/oai-pmh xmlns:oai="http://www.openarchives.org/OAI/2.0/" url="https://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=musdibib" records-xpath="/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*" id-xpath="../../oai:record/oai:header/oai:identifier" resume-when-xpath="/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken/text()" resumption-xpath="concat('oai2_0?verb=ListRecords&resumptionToken=', /oai:OAI-PMH/oai:ListRecords/oai:resumptionToken)" indent=yes

Note

In this example the resume-when-xpath is used to decide whether the harvest needs to be resumed. If the oai:resumptionToken is empty then the harvest will stop. If the token is not empty, then the resumption-xpath expression will be evaluated to yield the resumption URL.
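
The decision described in this note can be sketched in Python. The function name, the token value abc123 and the cut-down responses are invented for the example. The sketch also assumes that the relative reference produced by resumption-xpath is resolved against the URL of the current response, which is an assumption about APIHarvester's behaviour rather than something stated on this page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def resumption_url(current_url: str, response_xml: str):
    """Return the next URL to harvest, or None when the resumptionToken is empty or absent."""
    root = ET.fromstring(response_xml)
    token = root.findtext("oai:ListRecords/oai:resumptionToken", default="", namespaces=OAI)
    if not token:
        return None  # resume-when-xpath selects no text node: the harvest stops
    relative = "oai2_0?verb=ListRecords&resumptionToken=" + token
    return urljoin(current_url, relative)  # assumed resolution step


base = "https://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=musdibib"
# Hypothetical, heavily cut-down responses.
more = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<resumptionToken>abc123</resumptionToken></ListRecords></OAI-PMH>"
)
last = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<resumptionToken/></ListRecords></OAI-PMH>"
)

print(resumption_url(base, more))
# https://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&resumptionToken=abc123
print(resumption_url(base, last))  # None
```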

XMLFiles.com example

Harvest 26 records from the single file http://www.xmlfiles.com/examples/cd_catalog.xml

java -jar apiharvester.jar directory="cd" url="http://www.xmlfiles.com/examples/cd_catalog.xml" records-xpath="/CATALOG/CD" id-xpath="concat(TITLE, '_', ARTIST)"

RSS feed

Harvest only the posts about "Linked Data" from the RSS feed of Conal Tuohy's blog at http://conaltuohy.com/feed

java -jar apiharvester.jar directory="/tmp/rss" url="http://conaltuohy.com/feed" records-xpath="/rss/channel/item[category='Linked Data']" id-xpath="link/text()" indent=yes
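
To see what the predicate in records-xpath selects, here is a small Python sketch using ElementTree, whose limited XPath support happens to cover this kind of child-text predicate. The feed content is invented; only the shape (items with link and category children) follows RSS.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with two posts; only the first is tagged "Linked Data".
feed = ET.fromstring(
    "<rss><channel>"
    "<item><link>http://example.org/post-1</link>"
    "<category>XML</category><category>Linked Data</category></item>"
    "<item><link>http://example.org/post-2</link>"
    "<category>Travel</category></item>"
    "</channel></rss>"
)

# Same selection as /rss/channel/item[category='Linked Data'], relative to the rss root.
matching = feed.findall("channel/item[category='Linked Data']")

# Same value as id-xpath link/text() for each selected item.
ids = [item.findtext("link") for item in matching]
print(ids)  # ['http://example.org/post-1']
```

An item is selected if any one of its category children has exactly the text "Linked Data".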

XML files from XHTML directory listing by Apache web server

Download all the TEI XML files from subfolders of http://vmcp.conaltuohy.com/tei/ and save only the documents which do not contain a tei:date element with a when attribute.

java -jar /usr/src/APIHarvester/dist/apiharvester.jar directory="undated" url="http://vmcp.conaltuohy.com/tei/" xmlns:tei="http://www.tei-c.org/ns/1.0" records-xpath="/tei:TEI[not(.//tei:date/@when)]" id-xpath="//tei:idno[@type='filename']" xmlns:html="http://www.w3.org/1999/xhtml" resumption-xpath="//html:a/@href[not(starts-with(., '/'))]"
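
The filter in records-xpath keeps a document only if no tei:date element anywhere inside it has a when attribute. A minimal Python sketch of the same test, using ElementTree and two invented TEI fragments, might look like this (the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"tei": TEI_NS}


def is_undated(xml_text: str) -> bool:
    """Rough equivalent of /tei:TEI[not(.//tei:date/@when)] matching the document."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{TEI_NS}}}TEI":
        return False
    # True when no descendant tei:date carries a when attribute.
    return root.find(".//tei:date[@when]", TEI) is None


# Hypothetical documents: one with a machine-readable date, one without.
dated = (
    f'<TEI xmlns="{TEI_NS}"><text><body><p>'
    '<date when="1860-01-01">1 Jan 1860</date></p></body></text></TEI>'
)
undated = (
    f'<TEI xmlns="{TEI_NS}"><text><body><p>'
    "<date>no date</date></p></body></text></TEI>"
)

print(is_undated(dated), is_undated(undated))  # False True
```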