Home
(NB the DigitalNZ API requires the use of a developer API key, here shown as `XXXXXXX`.)
To harvest metadata containing the text `library`, from the DigitalNZ content partner Kete Christchurch:
```
java -jar dist/apiharvester.jar directory=library retries=0 url="http://api.digitalnz.org/v3/records.xml?text=library&per_page=100&and[content_partner][]=Kete+Christchurch&api_key=XXXXXXX&page=1" records-xpath="/search/results/result" id-xpath="id" resume-when-xpath="number(/search/result-count) > number(/search/page) * number(/search/per-page)" discard-xpath="*[not(normalize-space())] | text()[not(normalize-space())]" resumption-xpath="concat(substring-before(/search/request-url, '&page='), '&page=', 1 + number(/search/page))" indent=yes delay=9
```
In this example `resume-when-xpath` is used to calculate whether we've harvested enough pages to include all the records: if the number of search results is greater than the number of pages harvested so far, multiplied by the number of results per page, then we need to resume; otherwise we are finished.
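The harvester evaluates this test as XPath, but the arithmetic can be sketched in plain Python (the function name and the sample counts below are illustrative, not part of the tool):

```python
def should_resume(result_count: int, page: int, per_page: int) -> bool:
    """Plain-Python equivalent of the resume-when-xpath expression:
    number(/search/result-count) > number(/search/page) * number(/search/per-page)
    """
    return result_count > page * per_page

# With 250 results at 100 per page:
assert should_resume(250, 1, 100)       # page 1 covers only 100 of 250
assert should_resume(250, 2, 100)       # page 2 covers only 200 of 250
assert not should_resume(250, 3, 100)   # 300 >= 250, so the harvest is done
```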
If the harvest does need to resume, then the `resumption-xpath` parameter says that the URL of the next page of data can be created by taking the current request URL, stripping off the `page` parameter, and appending a new `page` parameter whose value is the current page number + 1.
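That string surgery (XPath's `substring-before` and `concat`) can be mirrored in Python; the function name and example URL here are made up for illustration:

```python
def next_page_url(request_url: str, page: int) -> str:
    """Mimic resumption-xpath: concat(substring-before(url, '&page='),
    '&page=', 1 + page)."""
    base = request_url.split('&page=')[0]  # substring-before(url, '&page=')
    return base + '&page=' + str(page + 1)

url = 'http://api.example.org/v3/records.xml?text=library&page=3'
print(next_page_url(url, 3))
# -> http://api.example.org/v3/records.xml?text=library&page=4
```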
DigitalNZ is rate-limited to 10,000 requests per day, or 1 request per 8.64 s. In the example, a delay of 9 seconds between requests is inserted, ensuring that the harvester can continue harvesting for longer than 24 hours without breaching the rate limit.
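A quick sanity check of that arithmetic (variable names are ours, not the harvester's):

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86400
daily_limit = 10_000
min_interval = SECONDS_PER_DAY / daily_limit   # 8.64 s between requests
requests_at_delay_9 = SECONDS_PER_DAY // 9     # 9600 requests per day

assert min_interval == 8.64
assert requests_at_delay_9 < daily_limit       # delay=9 stays under the limit
```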
The `discard-xpath` parameter is used here to remove white space, and any elements which contain no text other than white space. In the case of DigitalNZ, this means that elements like `<credit-creator nil="true"/>` will be discarded from within each `<result>` record.
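The effect of that pruning can be approximated in plain Python with the standard library's `xml.etree` (the harvester itself applies the XPath expression; this sketch and its sample record are ours):

```python
import xml.etree.ElementTree as ET

def discard_empty(elem):
    """Roughly mimic discard-xpath: drop whitespace-only text nodes and
    childless elements that contain no non-whitespace text."""
    for child in list(elem):
        discard_empty(child)
        if len(child) == 0 and not ''.join(child.itertext()).strip():
            elem.remove(child)
    if elem.text and not elem.text.strip():
        elem.text = None
    for child in elem:
        if child.tail and not child.tail.strip():
            child.tail = None

result = ET.fromstring(
    '<result><id>123</id><credit-creator nil="true"/></result>'
)
discard_empty(result)
print(ET.tostring(result, encoding='unicode'))
# -> <result><id>123</id></result>
```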
(NB the Trove API requires the use of a developer API key, here shown as `XXXXXXX`.)
To harvest the full text of articles from the newspaper whose `id` is `1055` (which identifies the "Brisbane Telegraph"; see the list of newspaper titles):
```
java -jar apiharvester.jar directory="1055" url="https://api.trove.nla.gov.au/v2/result?q=%20&zone=newspaper&include=articletext&n=100&reclevel=full&bulkHarvest=true&l-title=1055" records-xpath="/response/zone/records/article" id-xpath="@id" resumption-xpath="substring-after(/response/zone/records/@next, '/')" url-suffix="&key=XXXXXXX"
```
To harvest individual `url` records from an XML Sitemap:
```
java -jar apiharvester.jar directory="/tmp/apo" url="http://apo.org.au/sitemap.xml" records-xpath="/s:urlset/s:url" id-xpath="s:loc" resumption-xpath="/s:sitemapindex/s:sitemap/s:loc/text()" xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9"
```
The original URL http://apo.org.au/sitemap.xml does not contain any elements matching the `records-xpath` expression (`/s:urlset/s:url`), so no records are harvested from it. However, it does contain 2 nodes matching the `resumption-xpath` expression (`/s:sitemapindex/s:sitemap/s:loc/text()`), so it causes two more URLs to be harvested, from which 47941 records are found to match the `records-xpath` expression, and are therefore saved into separate XML files.
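The same two-expression pattern can be demonstrated with the standard library's namespace-aware `xml.etree` (the sitemap-index content below is invented; the real index lists different child sitemap URLs):

```python
import xml.etree.ElementTree as ET

NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# A toy sitemap index: the document element is <sitemapindex>, not <urlset>.
sitemap_index = ET.fromstring(
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<sitemap><loc>http://apo.org.au/sitemap.xml?page=1</loc></sitemap>'
    '<sitemap><loc>http://apo.org.au/sitemap.xml?page=2</loc></sitemap>'
    '</sitemapindex>'
)

# records-xpath (/s:urlset/s:url) matches nothing in the index document...
records = sitemap_index.findall('s:url', NS)

# ...but resumption-xpath yields the two further URLs to harvest next.
resumption = [loc.text for loc in sitemap_index.findall('s:sitemap/s:loc', NS)]

print(len(records), resumption)
```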
To harvest the "Sheet Music" set from the Library of Congress, in oai_dc format:
```
java -jar apiharvester.jar directory=tmp/oai-pmh xmlns:oai="http://www.openarchives.org/OAI/2.0/" url="https://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=musdibib" records-xpath="/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*" id-xpath="../../oai:record/oai:header/oai:identifier" resume-when-xpath="/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken/text()" resumption-xpath="concat('oai2_0?verb=ListRecords&resumptionToken=', /oai:OAI-PMH/oai:ListRecords/oai:resumptionToken)" indent=yes
```
In this example the `resume-when-xpath` expression is used to decide whether the harvest needs to be resumed. If the `oai:resumptionToken` is empty then the harvest will stop. If the token is not empty, then the `resumption-xpath` expression will be evaluated to yield the resumption URL.
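The stop-or-continue logic of that `resume-when-xpath` / `resumption-xpath` pair looks like this in plain Python (function name and sample token are ours; the resulting relative URL is presumably resolved against the current request URL):

```python
def next_oai_url(resumption_token):
    """Mirror the OAI-PMH example: an empty token means the harvest is
    finished; otherwise build the ListRecords resumption request."""
    if not resumption_token:
        return None  # resume-when-xpath is false: stop harvesting
    return 'oai2_0?verb=ListRecords&resumptionToken=' + resumption_token

assert next_oai_url('') is None
print(next_oai_url('musdibib:100'))
# -> oai2_0?verb=ListRecords&resumptionToken=musdibib:100
```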
Harvest 26 records from the single file http://www.xmlfiles.com/examples/cd_catalog.xml:
```
java -jar apiharvester.jar directory="cd" url="http://www.xmlfiles.com/examples/cd_catalog.xml" records-xpath="/CATALOG/CD" id-xpath="concat(TITLE, '_', ARTIST)"
```
Harvest only the posts about "Linked Data" from Conal Tuohy's blog's RSS feed at http://conaltuohy.com/feed:
```
java -jar apiharvester.jar directory="/tmp/rss" url="http://conaltuohy.com/feed" records-xpath="/rss/channel/item[category='Linked Data']" id-xpath="link/text()" indent=yes
```
Download all the TEI XML files from subfolders of http://vmcp.conaltuohy.com/tei/ and save only the documents which do not contain a `tei:date` element with a `when` attribute:
```
java -jar /usr/src/APIHarvester/dist/apiharvester.jar directory="undated" url="http://vmcp.conaltuohy.com/tei/" xmlns:tei="http://www.tei-c.org/ns/1.0" records-xpath="/tei:TEI[not(.//tei:date/@when)]" id-xpath="//tei:idno[@type='filename']" xmlns:html="http://www.w3.org/1999/xhtml" resumption-xpath="//html:a/@href[not(starts-with(., '/'))]"
```