Install the Python dependencies, preferably in a virtualenv:
$ pip install -r requirements.txt
Install the cmudict corpus and the Punkt tokenizer models:
>>> import nltk
>>> nltk.download()
Select cmudict and punkt.
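If you'd rather skip the interactive downloader, the same packages can be fetched by name:
>>> import nltk
>>> nltk.download('cmudict')
>>> nltk.download('punkt')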
Configure the AWS CLI:
$ aws configure
Enter your AWS credentials and accept the default of None for the region.
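To sanity-check that the credentials are picked up, a quick boto3 call works (assuming boto3 is installed; the region below is only an assumption for this check, since the CLI config leaves it as None):
>>> import boto3
>>> boto3.client('sts', region_name='us-east-1').get_caller_identity()  # prints your account ID and ARN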
Assuming you are starting with a parsed library JSONL file (e.g., from Elsevier):
- Move the file onto BigGuns, into ~/nlp/raw_inputs
- Extract the relevant sections (e.g., for pval: abstract, summary, methods); see the sketch after this list. Set the outfile to be in ~/nlp/modeling and name it accordingly
- Run the NLP markup script located in ~/nlp, with output to ~/nlp/models
- Import the new article and section tables into DD
- Run the relevant shell scripts (make sure the models still look good)
- Extract information via queries
- Move the CSV extraction files into ~/modeling/csv_outputs
- Run the full extractor script in ~/modeling
- Done with the pipeline; now on to data science :)
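As a rough illustration of the section-extraction step above, here is a hypothetical sketch; the field names (sections, heading, text, id) and file names are assumptions about the parsed JSONL, not the actual schema or scripts:

import json
from pathlib import Path

RELEVANT = {"abstract", "summary", "methods"}  # sections of interest, e.g., for pval

def extract_sections(infile, outfile):
    # Keep only the relevant sections of each article, one article per output line.
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            article = json.loads(line)
            kept = [s for s in article.get("sections", [])
                    if s.get("heading", "").lower() in RELEVANT]
            if kept:
                dst.write(json.dumps({"id": article.get("id"), "sections": kept}) + "\n")

extract_sections(Path.home() / "nlp/raw_inputs/library.jsonl",
                 Path.home() / "nlp/modeling/pval_sections.jsonl")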
Data is stored in AWS S3, specifically in the s3://deepmed-data bucket. The data is fetched into the (not version-controlled) data/raw folder so you can work with it locally.
To push a new data file (note that regardless of its path, the file will be placed directly into the deepmed-data bucket):
$ ./bin/s3push /path/to/data/file.jsonl
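Under the hood this amounts to an S3 upload keyed by the file's basename. A minimal boto3 sketch of the same operation (not necessarily how bin/s3push is implemented; the region is an assumption):

import os
import boto3

def s3push(local_path, bucket="deepmed-data"):
    # The object key is just the basename, so the local directory layout is ignored.
    key = os.path.basename(local_path)
    boto3.client("s3", region_name="us-east-1").upload_file(local_path, bucket, key)

s3push("/path/to/data/file.jsonl")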
To fetch that file into the local data/raw/ folder, run:
$ make data/raw/file.jsonl
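The make target presumably does the equivalent of a boto3 download; a sketch (the actual recipe may differ, and the region is an assumption):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.download_file("deepmed-data", "file.jsonl", "data/raw/file.jsonl")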
With raw data in hand, you're ready to transform it. Try keeping the derivative data under data/build and add new make targets to the Makefile to automate building the data.
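For instance, a new target could invoke a small build script along these lines (the paths and the transformation itself are placeholders):

import json
import os

os.makedirs("data/build", exist_ok=True)
with open("data/raw/file.jsonl") as src, open("data/build/file.clean.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # ... derive whatever the models need from the raw record here ...
        dst.write(json.dumps(record) + "\n")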