Install the Python dependencies, preferably in a virtualenv:
$ pip install -r requirements.txt
Install the cmudict corpus and the Punkt tokenizer models:
>>> import nltk
>>> nltk.download()
Select cmudict and punkt.
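If you'd rather skip the interactive downloader, the same packages can be fetched by name:
>>> import nltk
>>> nltk.download('cmudict')
>>> nltk.download('punkt')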
Configure the AWS CLI:
$ aws configure
Enter your AWS credentials and accept the default of None for the region.
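To sanity-check that the credentials are picked up, a quick boto3 call works (assuming boto3 is installed; the region below is only an assumption for this check, since the CLI config leaves it as None):
>>> import boto3
>>> boto3.client('sts', region_name='us-east-1').get_caller_identity()  # prints your account ID and ARN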
Assuming you are starting with a parsed library JSONL file (e.g., from Elsevier):
- Move the file onto BigGuns, into ~/nlp/raw_inputs
- Extract the relevant sections (e.g., for pval: abstract, summary, methods); see the sketch after this list. Set the outfile to be in ~/nlp/modeling and name it accordingly
- Run the NLP markup script located in ~/nlp, with output to ~/nlp/models
- Import the new article and section tables into DD
- Run the relevant shell scripts (make sure the models still look good)
- Extract information via queries
- Move the CSV extraction files into ~/modeling/csv_outputs
- Run the full extractor script in ~/modeling
- Done with the pipeline; now on to data science :)
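As a rough illustration of the section-extraction step above, here is a hypothetical sketch; the field names (sections, heading, text, id) and file names are assumptions about the parsed JSONL, not the actual schema or scripts:

import json
from pathlib import Path

RELEVANT = {"abstract", "summary", "methods"}  # sections of interest, e.g., for pval

def extract_sections(infile, outfile):
    # Keep only the relevant sections of each article, one article per output line.
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            article = json.loads(line)
            kept = [s for s in article.get("sections", [])
                    if s.get("heading", "").lower() in RELEVANT]
            if kept:
                dst.write(json.dumps({"id": article.get("id"), "sections": kept}) + "\n")

extract_sections(Path.home() / "nlp/raw_inputs/library.jsonl",
                 Path.home() / "nlp/modeling/pval_sections.jsonl")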
Data is stored in AWS S3, specifically in the s3://deepmed-data bucket. The data is fetched into the (not version-controlled) data/raw folder so you can work with it locally.
To push a new data file (note that regardless of its path, the file will be placed directly into the deepmed-data bucket):
$ ./bin/s3push /path/to/data/file.jsonl
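Under the hood this amounts to an S3 upload keyed by the file's basename. A minimal boto3 sketch of the same operation (not necessarily how bin/s3push is implemented; the region is an assumption):

import os
import boto3

def s3push(local_path, bucket="deepmed-data"):
    # The object key is just the basename, so the local directory layout is ignored.
    key = os.path.basename(local_path)
    boto3.client("s3", region_name="us-east-1").upload_file(local_path, bucket, key)

s3push("/path/to/data/file.jsonl")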
To fetch that file into the local data/raw/ folder, run:
$ make data/raw/file.jsonl
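The make target presumably does the equivalent of a boto3 download; a sketch (the actual recipe may differ, and the region is an assumption):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.download_file("deepmed-data", "file.jsonl", "data/raw/file.jsonl")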
With raw data in hand, you're ready to transform it. Try keeping the derivative data under data/build and add new make targets to the Makefile to automate building the data.
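For instance, a new target could invoke a small build script along these lines (the paths and the transformation itself are placeholders):

import json
import os

os.makedirs("data/build", exist_ok=True)
with open("data/raw/file.jsonl") as src, open("data/build/file.clean.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # ... derive whatever the models need from the raw record here ...
        dst.write(json.dumps(record) + "\n")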