Guide to the Distant Reader

The Distant Reader is a suite of repositories. This is the main repository, which provides all of the processing steps. The others handle auxiliary tasks such as downloading the CORD-19 dataset or creating an index of all the public carrels.

See also

If you are trying to understand the Reader, this is a good description of its goals and operation.

Guide to This Repository

This repository is organized in the following directories:

  • bin/ - all the processing scripts
  • cgi-bin/ - not used
  • css/ - CSS files used by the created HTML files
  • etc/ - template files used to generate the reports, databases, and other output files in each study carrel
  • js/ - JavaScript files used by the created HTML files
  • lib/ - Libraries used by the processing scripts

Processing steps

The Distant Reader takes a set of unstructured data as input and outputs a set of structured data. You could think of the result as a glorified back-of-the-book index. Each study carrel is a structured dataset, and all carrels share the same structure.

Most of the processing is done by scripts in the bin/ directory. All the data in a study carrel is contained in a structured directory (see Data Model, below). The processing follows the map/reduce paradigm. For the most part, the map step transforms each input file into an output file, and the reduce step loads all the output files into a SQLite database for further ad hoc queries. After the reduce step, the reports and HTML pages are created. The website and exported study carrels consist entirely of HTML files, and the exported carrels are completely self-contained: they do not require a server or a network connection.
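The map/reduce flow described above can be illustrated with a minimal Python sketch. This is not the code in bin/; it only shows the shape of the idea. The word-count transformation, the function names `map_step` and `reduce_step`, and the `word_counts` table are all hypothetical and chosen for illustration.

```python
import csv
import sqlite3
from collections import Counter
from pathlib import Path


def map_step(txt_file: Path, out_dir: Path) -> Path:
    """Map: turn one plain text file into one TSV file of word counts.

    The word-count transformation is a stand-in for a real extraction step.
    """
    counts = Counter(txt_file.read_text(encoding="utf-8").lower().split())
    out_file = out_dir / (txt_file.stem + ".tsv")
    with out_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "word", "count"])
        for word, count in sorted(counts.items()):
            writer.writerow([txt_file.stem, word, count])
    return out_file


def reduce_step(tsv_dir: Path, db_file: Path) -> None:
    """Reduce: load every TSV file in tsv_dir into one SQLite table."""
    connection = sqlite3.connect(db_file)
    try:
        with connection:  # commits on success, rolls back on error
            # Hypothetical table name and schema, not the Reader's own.
            connection.execute(
                "CREATE TABLE IF NOT EXISTS word_counts "
                "(id TEXT, word TEXT, count INTEGER)"
            )
            for tsv_file in sorted(tsv_dir.glob("*.tsv")):
                with tsv_file.open(newline="", encoding="utf-8") as handle:
                    reader = csv.DictReader(handle, delimiter="\t")
                    rows = [(r["id"], r["word"], int(r["count"])) for r in reader]
                connection.executemany(
                    "INSERT INTO word_counts VALUES (?, ?, ?)", rows
                )
    finally:
        connection.close()
```

Once the reduce step has run, ad hoc questions become plain SQL, for example `SELECT word, SUM(count) FROM word_counts GROUP BY word ORDER BY 2 DESC LIMIT 10`.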

Key steps in the processing:

Other files of interest:

Data Model

Each study carrel is independent. The files in the exported zip file of a study carrel are exactly the same as the files stored on disk, and this is the layout the scripts work with. The results of each processing step are stored in specific subdirectories of the carrel. Some of the subdirectories are listed here (see bin/initialize-carrel.sh for the complete list):

  • cache - the original source material, with one file for each document or section
  • txt - holds a plain text file derived from each source document
  • adr - one TSV file for each document giving a list of addresses found in that document
  • ent - one TSV file for each document giving a list of named entities extracted from that document
  • pos - one TSV file for each document giving a list of parts of speech found in that document
  • urls - one TSV file for each document giving a list of URLs found in that document
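Because each of these subdirectories holds one TSV file per document, a small reader can walk a subdirectory and yield its rows. The sketch below assumes each TSV file begins with a header row; the column names are taken from the files themselves, since this guide does not list them. The function name `read_tsv_dir` and the added `source_file` key are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator


def read_tsv_dir(carrel: Path, subdir: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per TSV row from every file in carrel/subdir.

    Assumes every file has a header row (an assumption, not confirmed here).
    """
    for tsv_file in sorted((carrel / subdir).iterdir()):
        if not tsv_file.is_file():
            continue
        with tsv_file.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Record which file the row came from (hypothetical extra key).
                row["source_file"] = tsv_file.name
                yield row
```

For example, `list(read_tsv_dir(Path("my-carrel"), "ent"))` would gather the named-entity rows for a carrel stored in a directory named `my-carrel`, a placeholder path.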

After the processing step, the following report directories are created:

  • etc - the generated reports
  • htm - HTML versions of the generated reports
  • figures - images generated to support the HTML files
  • css - CSS to support the generated reports
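Since every carrel shares this layout, a quick sanity check on an unzipped carrel is to look for the subdirectories named above. The following is a minimal sketch; `EXPECTED_DIRS` contains only the directories mentioned in this guide, which is a partial list, and `missing_dirs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Only the subdirectories named in this guide; the complete list
# lives in bin/initialize-carrel.sh.
EXPECTED_DIRS = (
    "cache", "txt", "adr", "ent", "pos", "urls",
    "etc", "htm", "figures", "css",
)


def missing_dirs(carrel: Path) -> List[str]:
    """Return the expected subdirectory names that are absent from carrel."""
    return [name for name in EXPECTED_DIRS if not (carrel / name).is_dir()]
```

An empty list means all the directories listed in this guide are present; it does not by itself prove that processing finished.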