predictMEE - Predicting Missing Metadata with Entity Extraction

Requirements

Conda or miniconda installation

The predictMEE model and analysis workflow require a variety of packages to be installed before running the code. The easiest way to install all the necessary packages is through the Anaconda3 or Miniconda3 Python package manager.

Data

The script that automates the download and preprocessing of the SRA attribute-value pairs is still a work in progress. For now, you can find the data that was used here.

Word Embedding Model

The downloadable word2vec model can be found here.

Installation

Substituting your GitHub username below, you can clone this repo to the current directory with

git clone https://username@github.com/aklie/predictMEE.git

Configuring environment

Then, install the required packages with the following commands

cd predictMEE/config
conda env create -f deep_nlp_cpu.yml  # Create the environment from the YAML file
conda activate deep_nlp_cpu  # Activate the environment
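
Before opening the notebooks, it can help to confirm that the environment is actually active. The snippet below is a minimal sketch, not part of the repository: it assumes the environment is named deep_nlp_cpu, as in the activate command above, and relies on the CONDA_DEFAULT_ENV variable that conda sets on activation. The function names are hypothetical.

```python
import os

# Environment name taken from the `conda activate` command above.
EXPECTED_ENV = "deep_nlp_cpu"


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def environment_is_active(environ=None):
    """Return True when the expected environment is the active one."""
    return active_conda_env(environ) == EXPECTED_ENV
```

Calling environment_is_active() from a notebook cell or a Python prompt returns False if the notebook server was started from a different environment.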

If you are planning on recapitulating the full analysis, you will need to mimic the file structure shown below.

├── bin
│   ├── dataLandscapeSRA.ipynb
│   ├── downloadData.ipynb
│   ├── evaluateModel.ipynb
│   ├── evaluatePrediction.ipynb
│   ├── generateTestSet.ipynb
│   ├── mergeAttributes.ipynb
│   ├── predictMetadata.ipynb
│   └── trainModels.ipynb
├── config
│   └── deep_nlp_cpu.yml
├── data
│   ├── allSRS_05_15_2018.pickle
│   ├── BioSampleAttributes.pickle
│   ├── BioSampleAttributes.xml
│   ├── sra_dump.pickle
│   └── wikipedia-pubmed-and-PMC-w2v
├── doc
│   ├── figures
│   ├── submission
│   └── tables
├── models
├── README.md
└── results
    ├── embedding
    ├── prediction
    ├── training
    └── validation
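
The directories in the tree above can be created by hand, or with a short script. The following is a sketch under stated assumptions rather than a script shipped with the repository: the directory and data entry names are copied from the tree, while the function names create_layout and missing_data_files are hypothetical. It only creates directories and reports which data entries are absent; it does not download anything.

```python
from pathlib import Path

# Directories copied from the tree above.
DIRECTORIES = [
    "bin",
    "config",
    "data",
    "doc/figures",
    "doc/submission",
    "doc/tables",
    "models",
    "results/embedding",
    "results/prediction",
    "results/training",
    "results/validation",
]

# Data entries listed in the tree above; this script does not create them.
DATA_FILES = [
    "data/allSRS_05_15_2018.pickle",
    "data/BioSampleAttributes.pickle",
    "data/BioSampleAttributes.xml",
    "data/sra_dump.pickle",
    "data/wikipedia-pubmed-and-PMC-w2v",
]


def create_layout(root):
    """Create any missing directories under root; existing ones are left alone."""
    root = Path(root)
    for relative in DIRECTORIES:
        (root / relative).mkdir(parents=True, exist_ok=True)


def missing_data_files(root):
    """Return the expected data entries that are not yet present under root."""
    root = Path(root)
    return [name for name in DATA_FILES if not (root / name).exists()]
```

For example, running create_layout("predictMEE") from the directory that contains the clone fills in the missing folders, and missing_data_files("predictMEE") then lists what still has to be placed in data.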

Running notebooks

Certain notebooks require data and output from other notebooks. To run the analysis as it was completed for the paper cited below, run the notebooks in the following order.

downloadData.ipynb
dataLandscapeSRA.ipynb
mergeAttributes.ipynb
generateTestSet.ipynb
trainModels.ipynb
evaluateModel.ipynb
predictMetadata.ipynb
evaluatePrediction.ipynb
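
The order above can also be followed non-interactively. The sketch below is one possible way to do that and is not part of the repository: it assumes that jupyter and nbconvert are available in the active environment (this README does not confirm that they are included in deep_nlp_cpu.yml) and that every notebook runs unattended. The names build_command and run_all are hypothetical.

```python
import subprocess
from pathlib import Path

# Execution order as listed above.
NOTEBOOK_ORDER = [
    "downloadData.ipynb",
    "dataLandscapeSRA.ipynb",
    "mergeAttributes.ipynb",
    "generateTestSet.ipynb",
    "trainModels.ipynb",
    "evaluateModel.ipynb",
    "predictMetadata.ipynb",
    "evaluatePrediction.ipynb",
]


def build_command(notebook):
    """Return the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_all(bin_dir="bin"):
    """Execute each notebook in order, stopping at the first failure."""
    for name in NOTEBOOK_ORDER:
        # check=True raises CalledProcessError if a notebook fails,
        # so later notebooks never run on incomplete output.
        subprocess.run(build_command(Path(bin_dir) / name), check=True)
```

Calling run_all() from the repository root executes the eight notebooks in sequence. Stopping at the first failure is deliberate, since later notebooks depend on the output of earlier ones.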

Citation

Klie A, Tsui BY, Mollah S, Skola D, Dow M, Hsu C-N, et al. Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition. Database. 2021;2021. doi:10.1093/database/baab021
