Notebook for bulk downloading of AJCP material #58

wragge · 2022-05-04T23:41:05Z

See: https://twitter.com/MichWatsonOz/status/1521725616735014912

dleetalb · 2022-05-06T00:50:08Z

I'd like to be able to download sections from the AJCP digitised collection.

For instance, material from the Miscellaneous Series, London Missionary Society Collection.

From here, it would be great to search by three categories: name, date, and geographical location.

The example below shows the general data in the finding aid:

Letters mainly from missionaries in the Society, Hervey and Samoan Islands and also the New Hebrides, Loyalty Islands and Savage Island (Niue), 1862 - 1863 (File Box 29)

But what I'd really like is to harvest files based on the descriptive section of the file, as seen below. For my research, I would target information about Lawes.

The correspondents include Charles Barff (Huahine), P.G. Bird (Savaii, Apia), Stephen M. Creagh (Uea, Lifu), George Drummond (Upolu), Samuel Ella (Aneiteum), John Geddie (Aneiteum), Henry Gee (Apia), W. Wyatt Gill (Mangaia), James L. Green (Taha'a), William Howe (Papeete), John Jones (Mare), Ernst R.W. Krause (Rarotonga), William G. Lawes (Savage Island), Samuel Macfarlane (Lifu), George Morris (Raiatea), Archibald W. Murray (Malua), Henry Nisbet (Malua), George Platt (Raiatea), Thomas Powell (Tutuila), George Pratt (Matautu, Savage Island), Carl Schmidt (Apia), James Sleigh (Lifu) and George Turner (Sydney).

wragge · 2022-05-06T12:12:09Z

So to break this down:

You'd provide the notebook with a finding aid URL and a search term.
The notebook would then search for the term within the finding aid, getting a list of matching boxes/item groups.
The notebook would then download all of the images in those boxes.

Is that what you'd like?

wragge · 2022-05-06T12:18:45Z

Notes to self:

Searching within a finding aid fires off a POST request that returns an HTML fragment.

The params are something like this:

params = {"faIdentifier":"nla.obj-1126174847","term":"lawes","nuc":"ANL:AJCP","facets":"all","zone":"collection","selectedFacets":[],"pageSize":10,"cursorMark":"AoErc3UyMzcxMDI4Nzk=","start":1,"previous":["*"]}

These are posted as JSON to https://nla.gov.au/tarkine/nla.obj-1126174847/findingaid/search. Results are paginated -- increment the start value, so the next page would be "start": 11. It looks like the number of results per page can be changed.
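A rough sketch of how the request could be built and paged in Python, using only the standard library. The function names (build_payload, next_start, fetch_page) are made up for illustration. It assumes that cursorMark can be left out of the first request and that the server expects a Content-Type of application/json; neither has been confirmed, so treat this as a starting point rather than a working client.

```python
import json
import urllib.request

# URL pattern taken from the notes above; the finding aid id is filled in per request.
SEARCH_URL = "https://nla.gov.au/tarkine/{fa_id}/findingaid/search"


def build_payload(fa_id, term, start=1, page_size=10, cursor_mark=None):
    """Build the JSON body for one page of finding aid search results."""
    payload = {
        "faIdentifier": fa_id,
        "term": term,
        "nuc": "ANL:AJCP",
        "facets": "all",
        "zone": "collection",
        "selectedFacets": [],
        "pageSize": page_size,
        "start": start,
        "previous": ["*"],
    }
    # Assumption: cursorMark is optional, so it is only sent when supplied.
    if cursor_mark is not None:
        payload["cursorMark"] = cursor_mark
    return payload


def next_start(start, page_size=10):
    """Return the 'start' value for the following page (1 -> 11 -> 21 ...)."""
    return start + page_size


def fetch_page(fa_id, term, start=1, page_size=10):
    """POST one search request and return the HTML fragment as text."""
    body = json.dumps(build_payload(fa_id, term, start, page_size)).encode("utf-8")
    request = urllib.request.Request(
        SEARCH_URL.format(fa_id=fa_id),
        data=body,
        # Assumption: the endpoint wants a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Something like fetch_page("nla.obj-1126174847", "lawes") would request the first page, and passing start=next_start(1) would request the second.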

Results are HTML so would need to scrape identifiers from the HTML for further processing.
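One way the scraping step might look, as a minimal sketch. The structure of the returned HTML isn't recorded here, so this assumes only that identifiers appear somewhere in the fragment in the same nla.obj-<digits> form as the finding aid id. The function name extract_identifiers is hypothetical, and a proper HTML parser may turn out to be a better fit once the markup is known.

```python
import re

# Assumption: item identifiers look like the finding aid id, e.g. nla.obj-1126174847.
IDENTIFIER_PATTERN = re.compile(r"nla\.obj-\d+")


def extract_identifiers(html, exclude=()):
    """Return unique nla.obj identifiers in order of first appearance.

    Anything listed in `exclude` (such as the finding aid's own id) is skipped.
    """
    found = []
    for identifier in IDENTIFIER_PATTERN.findall(html):
        if identifier not in found and identifier not in exclude:
            found.append(identifier)
    return found
```

Passing the finding aid's own identifier in exclude keeps it out of the list of boxes/item groups to download.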

wragge · 2022-05-30T06:23:42Z

Worth noting too that dezoomify (https://dezoomify.ophir.dev/) works a treat in downloading high-resolution versions of pages in the AJCP.

dleetalb · 2022-05-30T06:26:57Z

Thanks for the dezoomify link, Tim. Bart mentioned he spoke with you recently and just commented on how good the images are!

As for the query above, I think that sounds good!
