
pdfliberation/NYCEDCprosedatascraper


NYCEDC Newsletter Prose Data Scraper

This uses regular expressions (written in PHP, though any language would work) to get data from the NYC EDC newsletters.

See script run.

### Process:

First, we extracted the text from the PDF files using a Mac "Get Text" tool. This was for expediency; the process was originally intended to run in Ruby against text returned from Tabula that was not converted into charts.

Second, a set of regular expressions was written (and then converted to PHP) to convert the textual indicators in the monthly report into CSV output that can be useful to the EDC team and the larger community.
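As a rough illustration of this step, here is a minimal sketch in Python (not the PHP used in this repository). The sample sentence, the pattern, and the column names are all hypothetical and are not taken from the actual newsletters or scripts. The sketch only shows the general idea of capturing an indicator from prose with named groups and writing it out as CSV.

```python
import csv
import io
import re

# Hypothetical sample sentence, invented for illustration only.
SAMPLE_TEXT = (
    "The unemployment rate in New York City was 8.9 percent in March 2013. "
    "Private sector employment rose by 5,200 jobs."
)

# Hypothetical pattern: captures a percentage, a month name, and a year.
UNEMPLOYMENT_RE = re.compile(
    r"unemployment rate .*? was (?P<rate>\d+(?:\.\d+)?) percent "
    r"in (?P<month>[A-Z][a-z]+) (?P<year>\d{4})"
)


def extract_rows(text):
    """Return one dict of named groups per match of the pattern in text."""
    return [m.groupdict() for m in UNEMPLOYMENT_RE.finditer(text)]


def rows_to_csv(rows):
    """Serialize the extracted rows to a CSV string with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["year", "month", "rate"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_TEXT)), end="")
```

Named groups keep the mapping from captured text to CSV columns explicit, which may make it easier to port a pattern between languages.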

We analyzed the discrepancies in descriptions from year to year to account for changes in the descriptions/summaries. Coverage included 2005-2013.
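One way to cope with wording that shifts between years is to keep several patterns per indicator and try them in order. The sketch below is in Python rather than PHP, and both wording variants are hypothetical; the real newsletters may phrase things differently.

```python
import re

# Hypothetical wording variants for a single indicator across years.
VARIANT_PATTERNS = [
    re.compile(r"unemployment rate (?:was|stood at) (?P<rate>\d+(?:\.\d+)?) percent"),
    re.compile(r"(?P<rate>\d+(?:\.\d+)?)% of the labor force was unemployed"),
]


def extract_rate(text):
    """Try each variant in order; return the rate as a float, or None."""
    for pattern in VARIANT_PATTERNS:
        match = pattern.search(text)
        if match:
            return float(match.group("rate"))
    return None
```

Returning `None` when no variant matches makes it straightforward to flag reports whose wording is not yet covered.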

Thanks for the opportunity.
