This repository has been archived by the owner on Jun 2, 2023. It is now read-only.
Kam Woods edited this page Apr 17, 2023 · 8 revisions

The BitCurator NLP project developed software for collecting institutions to extract, analyze, and produce reports on features of interest in text extracted from born-digital materials contained in collections. The project used open source natural language processing libraries to identify items likely to be relevant to preservation, information organization, and access activities. These included entities (e.g., persons, places, and organizations), potential relationships among entities (e.g., entities that appear together within a document or set of documents), and topic models that provide insight into how concepts cluster within the documents. Project staff developed software that allowed users to create customized reports from text discovered in disk images, providing both command-line executables and a public Python API to extend the capabilities of external tools.

Rationale and Technical Foundation

Born-digital collections often include a wide range of complex file formats (for example, Office documents, PDF files, email, and audiovisual materials) from which text may be extracted directly or by automated transcription. Text extraction from arbitrary collections of files is itself a non-trivial task; solving this problem was not a focus of the project. Instead, the BitCurator NLP projects used existing software platforms including textract, textacy, spaCy, scikit-learn, and GraphLab to perform text extraction from heterogeneous collections of files and to execute NLP tasks such as entity and entity relationship identification, topic modeling (and topic model visualization), and document summarization.

The projects developed as part of BitCurator NLP allowed users to perform text extraction from heterogeneous collections of files and execute NLP tasks such as entity and entity relationship identification, topic modeling, and topic model visualization. The software enabled users to select candidate files from a collection (or process complete disk images) and create human- and machine-readable reports that meaningfully characterized the contents of those files based on the raw (unannotated) text they contained.
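The idea of relating entities that appear together within documents can be illustrated with a minimal sketch. It assumes entity strings have already been extracted for each document (for example, by an NLP library); the function name `cooccurrence_counts` and the sample data are hypothetical and are not part of the BitCurator NLP code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count how many documents each unordered pair of entities shares."""
    counts = Counter()
    for entities in doc_entities.values():
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical input: entity strings already extracted per document.
docs = {
    "letter1.txt": ["Ada Lovelace", "London", "Charles Babbage"],
    "letter2.txt": ["Ada Lovelace", "Charles Babbage", "Ada Lovelace"],
    "memo.txt": ["London"],
}
pairs = cooccurrence_counts(docs)
for pair, n in pairs.most_common():
    print(pair, n)
```

Pairs that share many documents are candidates for a relationship worth reporting; a real report would likely also weight by document length or entity type.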

Generating Topic Models from Disk Image Contents

The bitcurator-nlp-gentm project simplifies the process of automatically analyzing disk image collections by text content, allowing archivists and other collecting institution professionals to identify potential topics of interest without requiring manual inspection of individual files or directories.

The software uses dfVFS (https://github.com/log2timeline/dfvfs) to automate parsing and extracting file system contents from a wide range of disk image formats and file types. This toolset provides access to disk images stored as raw, EWF (EWF-E01, EWF-Ex01, EWF-S01), QCOW, VHD, and VMDK, and to file systems including FAT, HFS, HFS+, NTFS, and ext2/3/4.

Once available file systems are exposed using dfVFS, text is extracted from candidate files using textract. The textract wrapper supports many different formats; the relevant formats can be targeted using a simple configuration file provided with bitcurator-nlp-gentm (for example, to reduce processing time by disregarding those files for which OCR would be required).
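As a rough illustration of targeting formats through a configuration file, the sketch below filters candidate paths by extension. The section name, option names, and helper functions are assumptions made for this example; they do not reflect the actual configuration format shipped with bitcurator-nlp-gentm.

```python
import configparser
from pathlib import PurePosixPath

# Hypothetical configuration text; section and option names are assumptions.
CONFIG_TEXT = """
[extraction]
include_extensions = .txt, .pdf, .docx, .eml
exclude_extensions = .jpg, .png, .tiff
"""


def load_extension_filter(config_text):
    """Return (include, exclude) sets of lowercase file extensions."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["extraction"]

    def parse(option):
        raw = section.get(option, "")
        return {ext.strip().lower() for ext in raw.split(",") if ext.strip()}

    return parse("include_extensions"), parse("exclude_extensions")


def is_candidate(path, include, exclude):
    """Keep a file only if its extension is included and not excluded."""
    ext = PurePosixPath(path).suffix.lower()
    return ext in include and ext not in exclude


include, exclude = load_extension_filter(CONFIG_TEXT)
paths = ["docs/report.PDF", "scans/page1.tiff", "mail/inbox.eml", "notes"]
candidates = [p for p in paths if is_candidate(p, include, exclude)]
print(candidates)
```

Excluding image formats here stands in for skipping files that would need OCR; files that pass the filter would then be handed to the text extraction step.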

Extracted text is cleaned and used to build topic models from English-language terms. The models are visualized with pyLDAvis, a Python port of the LDAvis tool described in (Sievert and Shirley, 2014).
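The page does not describe the cleaning step in detail. The following is a minimal sketch of the kind of preprocessing that commonly precedes topic modeling (lowercasing, keeping alphabetic tokens, removing stop words), using only the Python standard library. The stop word list and function names are illustrative assumptions, not the project's implementation.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for"}


def clean_tokens(text):
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def bag_of_words(text):
    """Count cleaned tokens; such counts are a typical topic model input."""
    return Counter(clean_tokens(text))


sample = "The archive of the library was moved to the new library in 2016."
bow = bag_of_words(sample)
print(bow.most_common())
```

Per-document term counts like these are what a topic modeling library would typically consume before the resulting model is passed to a visualization tool.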

BitCurator NLP Gentm

Licenses

This wiki, documentation, and other materials generated by the BitCurator team are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). See our GitHub repositories for licenses associated with specific projects.

Development, Funding, and Partners

Grants from the Andrew W. Mellon Foundation supported the BitCurator project (a partnership between the School of Information and Library Science at the University of North Carolina at Chapel Hill and the Maryland Institute for Technology in the Humanities) through September 2014, and the BitCurator Access project through September 2016. A grant from the Andrew W. Mellon Foundation supported the BitCurator NLP project (2016-2018).