StabiHacks

Various utilities to deal with metadata and content provided by the Berlin State Library/Staatsbibliothek zu Berlin

The scripts work together as illustrated below:

SBBget

a Python script that is capable of downloading digitized media, the associated metadata, and its fulltext from Berlin State Library's digitized collections.
it also extracts images that have been detected by the OCR and stores them in the desired file format, e.g., JPEG
the extracted illustrations can also be stored as .tar files to facilitate distribution
its logic is based on the more or less unique PPN identifier used at the Berlin State Library.
some PPN lists are shipped for demonstration purposes. more can be obtained from the Berlin State Library or the creator of the script.
the script will create various folders below its current working directory, e.g.,
- downloads (fulltexts, original digitizations etc.) are stored at: sbbget_downloads/download_temp/
- extracted images are stored at: sbbget_downloads/extracted_images/
- METS/MODS files are stored at: sbbget_downloads/download_temp//__metsmods/
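The folder layout above can be sketched roughly as follows. This is only an illustration, not SBBget's actual code: the helper names `read_ppn_list` and `prepare_folders` are hypothetical, and the sketch assumes that a PPN list is a plain text file with one identifier per line.

```python
from pathlib import Path

# Folder names are taken from the list above; everything else is illustrative.
DOWNLOAD_DIR = Path("sbbget_downloads") / "download_temp"
IMAGE_DIR = Path("sbbget_downloads") / "extracted_images"


def read_ppn_list(path):
    """Read PPNs from a text file, assuming one identifier per line."""
    ppns = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:  # skip empty lines
            ppns.append(line)
    return ppns


def prepare_folders(base_dir="."):
    """Create the two output folders below base_dir and return their paths."""
    folders = [Path(base_dir) / DOWNLOAD_DIR, Path(base_dir) / IMAGE_DIR]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `prepare_folders()` without arguments creates both folders below the current working directory, which mirrors the behaviour described above.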

the script comes pre-configured and tries to download an existing book from the Berlin State Library if run directly (no parameters needed)
the script has been tested with Python 3.9 and 3.11 but should run with other versions as well

the script can be configured by modifying the accompanying YAML configuration file

a Python script that downloads METS/MODS files and DC metadata via OAI-PMH and analyzes them, e.g., to save ALTO XML URLs for certain records or to save metadata such as language codes or authorships
the results of the analyses are saved locally for further processing in various formats, e.g. Excel and CSV
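As a rough sketch of the kind of processing described above, the following builds an OAI-PMH `GetRecord` request URL and pulls Dublin Core fields such as language codes out of a response. It is not the script's own code: the function names are hypothetical, and the base URL and record identifier have to be supplied by the caller. Only the request parameters and the Dublin Core namespace come from the respective standards.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    })
    return f"{base_url}?{query}"


def extract_dc_fields(xml_text, field):
    """Return the text of all dc:<field> elements found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    tag = f"{{{DC_NS}}}{field}"
    return [el.text.strip() for el in root.iter(tag) if el.text]
```

For example, `extract_dc_fields(response_text, "language")` would return the language codes of a record and `extract_dc_fields(response_text, "creator")` its authorships. The resulting lists could then be written out with the `csv` module.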

a Python script that retrieves all fulltexts from a SBBget created download directory and converts all files to raw text files
additionally, the script runs named entity recognition (NER) on all created raw text files and saves the results; the NER is based on flair
for best (i.e. fast) results you should use a GPU but the script will also run on the CPU
alternatively, the script can operate on the result file created by OAI-Analyzer and download ALTO files directly; from this perspective it serves as a Stabi fulltext corpus builder
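The conversion from ALTO XML to raw text can be sketched as follows. This is a simplified illustration and not the script's implementation: the function names are hypothetical, and hyphenation markers are ignored. The sketch relies only on the ALTO convention that recognized words are stored in the `CONTENT` attribute of `String` elements inside `TextLine` elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so that any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Join the CONTENT of all String elements, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(w for w in words if w))
    return "\n".join(lines)
```

Matching on the local element name rather than a fixed namespace is a deliberate choice here, since ALTO files from different digitization runs may use different schema versions.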

the script is based on NLTK which needs additional installation steps, i.e.:
- install NLTK in your Python environment
- when running the script, Python will ask you to install additional NLTK packages; the easiest way is to open a Python interpreter and either launch NLTK's graphical installer or download the needed data via:
```
import nltk
nltk.download('punkt')
```
- usually, NLTK will inform you about missing data if you forgot this step
- further information can be found in an online book that also gives an introduction to natural language processing
the script comes pre-configured and can be launched after SBBget has run (no additional parameters are needed, the script looks for the SBBget download folder at the standard location)
the script has been tested with Python 3.9 and 3.11 but should run with other versions as well

ner_analysis is based on the results from fulltext_analysis and creates graph data etc.
this script is still under development

a Python script that parses files in the Pica+ format as provided by the GBV
the script lets you choose interesting fields (as stored in the fieldsOfInterest list) and will output the contained data
records will be separated by a NEW_RECORD string on the command line or by an empty line in the text format
output can be saved in text format, separated by the language of the record
standard fields are:
- title
- author (+ optional GND ID)
- country of publication (only the first entry in a specific extension of the DIN ISO 3166 format)
- publisher and place of publication
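The field selection can be illustrated with a small sketch. It makes several assumptions that should be checked against the real data and the field list: the input is a plain-text serialization with one field per line in the form `TAG $aValue$bValue`, records are separated by empty lines, and the tags `021A` and `028A` stand for title and author. Only the name `fieldsOfInterest` is taken from the script; the functions are hypothetical.

```python
# Hypothetical selection; the real script keeps its own fieldsOfInterest list.
fieldsOfInterest = ["021A", "028A"]


def parse_pica_line(line):
    """Split 'TAG $aValue$bValue' into (tag, {subfield_code: value})."""
    tag, _, rest = line.strip().partition(" ")
    subfields = {}
    for chunk in rest.split("$")[1:]:  # text before the first '$' is ignored
        if chunk:
            # keep the first occurrence of a repeated subfield code
            subfields.setdefault(chunk[0], chunk[1:])
    return tag, subfields


def iter_records(lines):
    """Yield one list of (tag, subfields) per record, keeping selected fields only.

    Records that contain none of the selected fields are skipped.
    """
    record = []
    for line in lines:
        if not line.strip():  # an empty line ends the current record
            if record:
                yield record
            record = []
            continue
        tag, subfields = parse_pica_line(line)
        if tag in fieldsOfInterest:
            record.append((tag, subfields))
    if record:
        yield record
```

Printing a `NEW_RECORD` line before each yielded record would reproduce the command-line separator described above.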
documentation of the Pica Plus format is only available in German here:
- general overview
- list of fields
