NIMIWAE_Paper

David Lim 30 December, 2020

This form documents the artifacts associated with the article (i.e., the data and code supporting the computational findings) and describes how to reproduce the findings.

Part 1: Data

  • This paper does not involve analysis of external data (i.e., no data are used or the only data are generated by the authors via simulation in their code).
  • I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.

Abstract

The Physionet 2012 Challenge dataset is an EHR dataset containing records for 12,000 ICU patients, with measurements of 37 variables across 48 time points spaced one hour apart. Pre-processing was done using the code in this repository, which uses prior domain knowledge to compress the repeated measurements into one observation per patient. The patients are divided into training, validation, and test sets of equal size, and an entry is missing if the corresponding feature was not measured at any time point for that patient.
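As a minimal sketch of this kind of compression (not the authors' exact pre-processing, which relies on domain knowledge; the hypothetical long-format data frame `long`, its column names, and the mean summary are assumptions for illustration):

```r
# Sketch: collapse repeated hourly measurements into one summary value
# per patient, then record an entry as missing if the variable was
# never measured for that patient.
# `long` is a hypothetical long-format data frame (one row per measurement).
long <- data.frame(
  patient  = c(1, 1, 1, 2),
  variable = c("HR", "HR", "SysABP", "HR"),
  value    = c(80, 84, 120, 72)
)

# One observation per patient: here, the mean over measured time points.
wide <- reshape(
  aggregate(value ~ patient + variable, data = long, FUN = mean),
  idvar = "patient", timevar = "variable", direction = "wide"
)

# Missingness mask: TRUE where a variable was never measured.
miss_mask <- is.na(wide[, -1])
wide
miss_mask
```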

Availability

  • Data are publicly available.
  • Data cannot be made publicly available.

If the data are publicly available, see the Publicly available data section. Otherwise, see the Non-publicly available data section, below.

Publicly available data

  • Data are available online at: Various locations. See the supplementary materials of the NIMIWAE manuscript (NIMIWAE_Appendix.pdf) for details on these datasets. Missingness is simulated on top of the UCI datasets, while missingness is inherent in the Physionet dataset. The missingness-simulated datasets can be reproduced using the code in the reproducibility repository (with the default seed provided), and they can also be provided on request. The dataset links are copied below, after a brief illustrative sketch of the simulation step.
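For illustration only (the actual mechanism and default seed are defined in the reproducibility repository), a self-masked, non-ignorable missingness simulation on a complete data matrix might look like the following sketch; the stand-in matrix `X` and the logistic self-masking are assumptions, not the authors' exact code:

```r
# Illustrative MNAR simulation (not necessarily the authors' mechanism).
set.seed(1)                              # placeholder for the repository's default seed
X <- matrix(rnorm(1000 * 5), ncol = 5)   # stand-in for a complete UCI dataset

# Self-masking: the probability an entry is missing depends on its own
# (standardized) value, which makes the missingness non-ignorable.
p_miss <- plogis(-1 + scale(X))
M <- matrix(rbinom(length(X), 1, p_miss), nrow = nrow(X))  # 1 = missing

X_obs <- X
X_obs[M == 1] <- NA
mean(M)  # overall missingness rate
```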

banknote

Documentation of the data can be found here. The data can be found here.

concrete

Documentation of the data can be found here. The data can be found here.

hepmass

Documentation of the data can be found here. The data can be found here.

power

Documentation of the data can be found here. The data can be found here.

red

Documentation of the data can be found here. The data can be found here.

white

Documentation of the data can be found here. The data can be found here.

Physionet 2012 Challenge data

The raw data can be downloaded here. The pre-processed version of the data can be found here.

  • Data are available as part of the paper’s supplementary material.

  • Data are publicly available by request, following the process described here:

  • Data are or will be made available through some other mechanism, described here:

Description

File format(s)

  • CSV or other plain text.
  • Software-specific binary format (.Rda, Python pickle, etc.):
  • Standardized binary format (e.g., netCDF, HDF5, etc.):
  • Other (please specify):

Data dictionary

  • Provided by authors in the following file(s): See the Appendix of the NIMIWAE manuscript (NIMIWAE_Appendix.pdf)

  • Data file(s) is(are) self-describing (e.g., netCDF files)

  • Available at the following URL:

Additional Information (optional)

Part 2: Code

Abstract

This code reproduces the model training and results summarization in the paper “Handling Non-ignorably Missing Features in Electronic Health Records Data Using Importance-Weighted Autoencoders.” We show the superiority of NIMIWAE in imputing simulated missing values, and we perform comparative analyses on the Physionet 2012 Challenge EHR dataset, showing improved imputation of missing entries.
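As an illustration of how imputation quality on simulated-missing entries can be scored (a sketch only; the `imputation_rmse` helper and the mean-imputation baseline are hypothetical, and the paper's actual metrics are computed by SummarizeResults.R):

```r
# Sketch: evaluate imputation by RMSE over the entries that were held
# out as missing. X_true is complete data, X_imp an imputed version,
# M the missingness indicator (1 = entry was masked).
imputation_rmse <- function(X_true, X_imp, M) {
  sqrt(mean((X_true[M == 1] - X_imp[M == 1])^2))
}

set.seed(1)
X_true <- matrix(rnorm(200), ncol = 4)
M      <- matrix(rbinom(200, 1, 0.3), ncol = 4)

# Trivial baseline: impute every masked entry with its column mean.
X_imp <- X_true
X_imp[M == 1] <- NA
for (j in seq_len(ncol(X_imp))) {
  X_imp[is.na(X_imp[, j]), j] <- mean(X_imp[, j], na.rm = TRUE)
}
imputation_rmse(X_true, X_imp, M)
```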

Description

Code format(s)

  • Script files
    • R
    • Python
    • Matlab
    • Other:
  • Package
    • R
    • Python
    • MATLAB toolbox
    • Other:
  • Reproducible report
    • R Markdown
    • Jupyter notebook
    • Other:
  • Shell script
  • Other (please specify):

Supporting software requirements

Version of primary software used

  • R version 3.6.1
  • Python version 3.6.3

Libraries and dependencies used by the code

R packages:

  • reticulate (1.13)
  • NIMIWAE (0.1.0)

Python modules:

  • numpy (1.18.1)
  • pandas (1.5.0)
  • scipy (1.4.1)
  • torch (1.5.0)
  • tensorflow (1.14.0)
  • sklearn (0.22.1)
  • argparse (1.1)
  • tqdm (4.42.1)
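A quick sanity check that these pinned modules are visible from R (a sketch; assumes reticulate is already pointed at the intended Python environment):

```r
# Sketch: verify the Python dependencies can be imported via reticulate
# and print their versions before running the workflow.
library(reticulate)

for (mod in c("numpy", "pandas", "scipy", "torch",
              "tensorflow", "sklearn", "argparse", "tqdm")) {
  if (py_module_available(mod)) {
    m <- import(mod)
    cat(mod, m$`__version__`, "\n")
  } else {
    cat(mod, "NOT FOUND\n")
  }
}
```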

Supporting system/hardware requirements (optional)

This code requires access to a CUDA-enabled GPU.
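A quick way to confirm this before launching training (a sketch using reticulate and the torch module listed above):

```r
# Sketch: confirm a CUDA-enabled GPU is visible before training.
library(reticulate)
torch <- import("torch")

if (torch$cuda$is_available()) {
  cat("GPU:", torch$cuda$get_device_name(0L), "\n")
} else {
  stop("No CUDA-enabled GPU found; training will be infeasible.")
}
```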

Parallelization used

  • No parallel code used
  • Multi-core parallelization on a single machine/node
    • Number of cores used:
  • Multi-machine/multi-node parallelization
    • Number of nodes and cores used: 3 nodes, 243 cores
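The actual runs were distributed over 3 nodes and 243 cores on a cluster; as a single-node illustration only (the `run_one()` wrapper and the small parameter grid are hypothetical), independent training runs could be spread across cores like this:

```r
# Single-node illustration only; the real experiments ran on a cluster.
# run_one() stands in for one (dataset, seed) training run.
library(parallel)

grid <- expand.grid(dataset = c("banknote", "concrete"),
                    seed    = 1:3,
                    stringsAsFactors = FALSE)

run_one <- function(i) {
  with(grid[i, ], sprintf("trained on %s with seed %d", dataset, seed))
}

# mclapply forks workers (Unix-like systems only).
results <- mclapply(seq_len(nrow(grid)), run_one,
                    mc.cores = min(4L, detectCores()))
unlist(results)
```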

License

  • MIT License (default)
  • BSD
  • GPL v3.0
  • Creative Commons
  • Other: (please specify below)

Additional information (optional)

The blinded NIMIWAE package can be found here. It has also been attached in the original submission for convenience.

Scope

The provided workflow reproduces:

  • Any numbers provided in text in the paper
  • All tables and figures in the paper
  • Selected tables and figures in the paper, as explained and justified below:

Workflow

Format(s)

  • Single master code file
  • Wrapper (shell) script(s)
  • Self-contained R Markdown file, Jupyter notebook, or other literate programming approach
  • Text file (e.g., a readme-style file) that documents workflow
  • Makefile
  • Other (more detail in Instructions below)

Instructions

Run the runComparisons.R script to train all models (time-consuming). Then run the SummarizeResults.R script to reproduce Figures 2 and 3 and the results in Table 2.
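For example, from the root of the reproducibility repository (assuming both scripts resolve their paths relative to that directory):

```r
# Step 1: train all models (time-consuming; best done on a GPU cluster).
source("runComparisons.R")

# Step 2: reproduce Figures 2 and 3 and Table 2 from the saved results.
source("SummarizeResults.R")
```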

Expected run-time

Approximate time needed to reproduce the analyses on a standard desktop machine:

  • < 1 minute
  • 1-10 minutes
  • 10-60 minutes
  • 1-8 hours
  • > 8 hours
  • Not feasible to run on a desktop machine, as described here

Training the deep learning architectures can be very time-consuming; we recommend running on a computing cluster with access to a GPU with large memory (at least 16 GB recommended). Summarizing the results should take no longer than 10 minutes.

Additional information (optional)

The blinded reproducibility repository can be found here; it is also attached to the original submission for convenience.

Notes (optional)