nypl-spacetime/city-directory-entry-parser

city-directory-entry-parser

city-directory-entry-parser parses lines from OCR’d New York City directories into separate fields, such as names, occupations, and addresses.

city-directory-entry-parser is part of NYPL’s NYC Space/Time Directory project.

For more tools that are used to turn digitized city directories into datasets, see Space/Time’s City Directories repository.

This module relies on sklearn-crfsuite's implementation of conditional random fields (CRF).
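To make the CRF input concrete, the sketch below builds the kind of per-token feature dictionaries that sklearn-crfsuite accepts as training input (one list of dictionaries per entry). The tokenizer regex, the helper names (`tokenize`, `token_features`, `line_features`), and the feature names are illustrative assumptions, not the features this module actually uses.

```python
import re


def tokenize(line):
    # Assumed tokenization: runs of word characters, plus single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def token_features(tokens, i):
    # Describe one token and its immediate neighbours as a flat dict.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "is_punct": not tok[0].isalnum(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def line_features(line):
    # Return the tokens and one feature dict per token.
    tokens = tokenize(line)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


tokens, feats = line_features("Calder William W, clerk, 206 W. 24th")
```

A CRF trained on such sequences, paired with one label per token, predicts a label for every token of a new entry; grouping adjacent tokens that share a label would then yield fields such as names, occupations, and addresses.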

Example

Input:

"Calder William W, clerk, 206 W. 24th"

Output:

{
  "subjects": [
    "Calder William W"
  ],
  "occupations": [
    "clerk"
  ],
  "addresses": [
    [
      "206 W . 24th"
    ]
  ]
}
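In the output above the address reads `206 W . 24th`, with the period as a separate token, which suggests that field values are rebuilt by joining tokens with spaces. A minimal sketch of post-processing such output, assuming the JSON structure shown above (the helper name `tidy` is hypothetical):

```python
import json
import re

# The example output from above, as a JSON string.
raw = (
    '{"subjects": ["Calder William W"], '
    '"occupations": ["clerk"], '
    '"addresses": [["206 W . 24th"]]}'
)


def tidy(text):
    # Remove whitespace left in front of ".", "," or ";".
    return re.sub(r"\s+([.,;])", r"\1", text)


parsed = json.loads(raw)

# In the example, each address is itself a list of strings.
addresses = [[tidy(part) for part in address] for address in parsed["addresses"]]
```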

If the output contains an address field, nyc-street-normalizer can be used to turn this abbreviated address into a full address (e.g. 668 Sixth av. → 668 Sixth Avenue).

Prerequisites

city-directory-entry-parser depends on the following Python modules:

  • numpy
  • sklearn (distributed as scikit-learn)
  • nltk
  • scipy
  • sklearn_crfsuite
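Before training, it can help to confirm that these modules are importable in the current environment. The following is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the names checked are the import names listed above.

```python
import importlib.util

# Import names taken from the list above.
REQUIRED = ["numpy", "sklearn", "nltk", "scipy", "sklearn_crfsuite"]


def missing_modules(names):
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
```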

Installation & usage

From Python:

import json

from cdparser import Classifier, Features, LabeledEntry, Utils

## Create a classifier object and load some labeled data from a CSV
classifier = Classifier.Classifier()
classifier.load_training("/full/path/to/training/nypl-labeled-train.csv")

## Optionally, load validation dataset
classifier.load_validation("/full/path/to/validation/nypl-labeled-validate.csv")

## Train your classifier (with default settings)
classifier.train()

## Create an entry object from string
entry = LabeledEntry.LabeledEntry("Cappelmann Otto, grocer, 133 VVashxngton, & liquors, 170 Greenwich, h. 109 Cedar")

## Pass the entry to the classifier
classifier.label(entry)

## Export the labeled entry as JSON
print(json.dumps(entry.categories))
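To label many lines at once, the calls above can be wrapped in a loop. The sketch below relies only on `classifier.label(entry)` and `entry.categories` as they are used above; the helper `label_lines` and its `make_entry` parameter are hypothetical, and it assumes that `label` fills in `entry.categories` in place, as the example suggests.

```python
import json


def label_lines(classifier, make_entry, lines):
    """Yield one JSON string per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = make_entry(line)       # e.g. LabeledEntry.LabeledEntry
        classifier.label(entry)        # assumed to update the entry in place
        yield json.dumps(entry.categories)
```

With a trained classifier, this might be called as `label_lines(classifier, LabeledEntry.LabeledEntry, open(path))`, where `path` points to a text file containing one directory entry per line.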

From bash (using parse.py):

cat /path/to/nypl-1851-1852-entries-sample.txt | python3 parse.py --training /path/to/nypl-labeled-70-training.csv
