
This repository is meant to help people who would like to code in R and work with the National Health and Nutrition Examination Survey (NHANES). Topics covered include SQL, logistic regression, and more.

john-m-burleson/NHANES-R-Programming


National Health and Nutrition Examination Survey (NHANES) R-programming

nhanes

Here is a link to the NHANES website so you can first learn about this data repository: NHANES Homepage

An understanding of their metadata repository is highly encouraged.

Who can benefit from this code repository?

This repository is for anybody who would like to conduct research on the National Health and Nutrition Examination Survey (NHANES) data repository using R. The project goes through the following steps: downloading data directly from the NHANES data repository using an R package called RNHANES, cleaning the data, imputing missing values using an R package called mi, producing descriptive statistics with graphics, and using logistic regression to try to predict the outcome of interest.

SQL R nhanes 1999-2018 walking impairment

This file downloads raw data files from 1999-2018 and merges them into a "master" file that is the starting point for any research project (after a proper literature review!). NHANES breaks its data down by survey years, and within each survey period it further splits the datasets into groups of related questions. This layout is useful once you can navigate the site and have a thorough understanding of the metadata repository. We join the different datasets using basic SQL commands and output the result as a CSV file.
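As a rough illustration of the join step, here is a minimal sketch using base R's `merge()` on toy data rather than the SQL commands used in the repository. It assumes the tables are keyed on `SEQN`, the respondent sequence number NHANES uses as its per-person identifier. The other columns (`RIAGENDR`, `RIDAGEYR`, `DIQ010`) and the output file name are only examples and may not match the variables this project actually selects.

```r
# Toy stand-ins for two NHANES component files from the same survey cycle.
# Values are made up; SEQN is the per-respondent key shared across files.
demo <- data.frame(
  SEQN     = c(1, 2, 3, 4),
  RIAGENDR = c(1, 2, 2, 1),       # gender code (example column)
  RIDAGEYR = c(45, 62, 38, 71)    # age in years (example column)
)
diabetes <- data.frame(
  SEQN   = c(1, 2, 4),
  DIQ010 = c(2, 1, 1)             # diabetes questionnaire item (example column)
)

# Left join: keep every respondent in `demo` and attach their answers.
# Rough SQL equivalent: SELECT * FROM demo LEFT JOIN diabetes USING (SEQN)
master <- merge(demo, diabetes, by = "SEQN", all.x = TRUE)

# Respondents without a matching questionnaire row get NA, which is what
# a later imputation step would have to deal with.
print(master)

# Write the merged table to a CSV file (hypothetical file name).
out_file <- file.path(tempdir(), "nhanes_master.csv")
write.csv(master, out_file, row.names = FALSE)
```

Joining within each survey cycle first and then stacking the cycles with `rbind()` is one way to build a multi-year file, provided the column names line up across cycles.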

Note: The RNHANES package loses support after the year 2014, so for later survey years we have to manually download the SAS files and work with them within the R environment.
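One possible way to handle the manually downloaded files is sketched below. It assumes the files are in SAS transport (.XPT) format and uses `read.xport()` from the `foreign` package, which is only one option (the repository may use a different reader). The helper names `read_xpt_file` and `fetch_xpt` are hypothetical, and the download URL is deliberately left to the caller because it has to be copied from the NHANES site.

```r
# foreign is a recommended package that ships with standard R installations.
library(foreign)

# Read one SAS transport (.XPT) file from disk into a data frame.
read_xpt_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  foreign::read.xport(path)
}

# Download a file once, then reuse the local copy on later calls.
# `url` must be supplied by the caller (copied from the NHANES data pages).
fetch_xpt <- function(url, dest_dir = tempdir()) {
  dest <- file.path(dest_dir, basename(url))
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # "wb" keeps the binary file intact
  }
  read_xpt_file(dest)
}
```

A call would look like `demo <- fetch_xpt(url)`, where `url` holds the address of one .XPT file; the resulting data frame can then be joined with the others on the respondent key.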

Descriptive statistics / Inferential statistics / Graphics using ggplot2

In this file we conduct statistical inference in the form of a chi-square test of equal proportions and a chi-square test of independence, and we create a correlation matrix of all variables in our dataset. We also summarize the dataset with descriptive statistics, visualized as waffle plots (square pie charts) and box plots using ggplot2.
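The three inferential steps can be sketched with base R on simulated data. The data frame and its column names below are hypothetical stand-ins, not the project's variables, and using `chisq.test()` on a one-way table is just one common way to run a test of equal proportions.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the cleaned dataset (hypothetical columns).
df <- data.frame(
  gender   = sample(c("male", "female"), n, replace = TRUE),
  diabetes = rbinom(n, 1, 0.15),
  age      = round(runif(n, 40, 80))
)
# Make impairment more likely with higher age and with diabetes.
df$impaired <- rbinom(n, 1, plogis(-6 + 0.07 * df$age + 0.8 * df$diabetes))

# Chi-square test of equal proportions: are the two gender groups the
# same size? With a one-way table, chisq.test() tests equal probabilities.
equal_prop <- chisq.test(table(df$gender))

# Chi-square test of independence between diabetes and impairment.
indep <- chisq.test(table(df$diabetes, df$impaired))

# Correlation matrix of the numeric variables.
cors <- cor(df[, c("age", "diabetes", "impaired")])

print(equal_prop$p.value)
print(indep$p.value)
print(round(cors, 2))
```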

Here are some examples of the data visualizations that we create using this code:

figure1h_walking impairment

waffle plot

figure2_correlation_matrix

figure1a_age
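Figures like the ones listed above could be produced along the following lines. This is only a sketch: the counts and ages are invented, and the waffle plot is built from `geom_tile()` on a 10 x 10 grid, which may differ from how the repository draws it.

```r
library(ggplot2)

# Hypothetical counts per 100 respondents (not real NHANES results).
counts <- c(impaired = 12, not_impaired = 88)

# Waffle plot: one tile per respondent on a 10 x 10 grid.
waffle_df <- expand.grid(x = 1:10, y = 1:10)
waffle_df$status <- rep(names(counts), times = counts)

waffle_plot <- ggplot(waffle_df, aes(x = x, y = y, fill = status)) +
  geom_tile(colour = "white") +   # white borders separate the squares
  coord_equal() +                 # keep the tiles square
  theme_void() +
  labs(fill = "Walking impairment")

# Box plot of age by impairment status, using simulated ages.
set.seed(2)
age_df <- data.frame(
  status = rep(names(counts), times = counts),
  age    = c(rnorm(12, mean = 68, sd = 8), rnorm(88, mean = 58, sd = 10))
)

box_plot <- ggplot(age_df, aes(x = status, y = age)) +
  geom_boxplot() +
  labs(x = "Walking impairment", y = "Age (years)")
```

Printing `waffle_plot` or `box_plot` renders the figure, and `ggsave()` can write it to a file.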

Logistic regression

In this file we conduct binary logistic regression to try to predict the odds of a person experiencing walking impairment in their lifetime. The model expresses those odds as a function of independent (predictor) variables that describe our population of interest, such as diabetes and gender. These variables matter because they are associated with walking impairment: in the fitted model, having one or more of these characteristics corresponds to higher odds of walking impairment. We arrive at a statistically significant model and report model performance metrics, such as a ROC curve (as seen below), to gauge its predictive ability.
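A minimal sketch of this workflow is shown below, using `glm()` with a binomial family on simulated data. The predictors, their effect sizes, and the helper `auc()` are assumptions made for the example; the repository's model and variables may differ. The ROC quantities are computed by hand in base R so that no extra package is required.

```r
set.seed(42)
n <- 1000

# Simulated predictors and outcome (hypothetical, not NHANES values).
df <- data.frame(
  age      = round(runif(n, 40, 80)),
  diabetes = rbinom(n, 1, 0.15),
  female   = rbinom(n, 1, 0.5)
)
df$impaired <- rbinom(
  n, 1, plogis(-6 + 0.07 * df$age + 0.8 * df$diabetes + 0.2 * df$female)
)

# Binary logistic regression.
fit <- glm(impaired ~ age + diabetes + female, data = df, family = binomial)

# Exponentiated coefficients are odds ratios per unit change in a predictor.
odds_ratios <- exp(coef(fit))

# Predicted probabilities on the same data the model was fitted to.
pred <- predict(fit, type = "response")

# ROC curve points: true and false positive rates at every threshold.
thresholds <- sort(unique(pred), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(pred[df$impaired == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(pred[df$impaired == 0] >= t))
# plot(fpr, tpr, type = "l") would draw the curve.

# Area under the ROC curve via the rank (Mann-Whitney) formula.
auc <- function(score, label) {
  r     <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
model_auc <- auc(pred, df$impaired)

print(round(odds_ratios, 3))
print(model_auc)
```

Because the AUC here is computed on the training data, it is optimistic; holding out a test set or cross-validating gives a fairer estimate of predictive performance.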

figure3_roc_curve

Machine learning models / Generalized Linear Model

In these files we put together some machine learning models and one generalized linear model (GLM) to try to predict walking impairment in an individual. These models include:

1) Decision Trees

2) Neural Networks

3) Random Forest (partykit R package)

4) Random Forest

5) Generalized Linear Model: Ordinal Logistic Regression
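Two of the listed model types can be sketched with packages that ship with standard R installations: a decision tree via `rpart` and an ordinal logistic regression via `MASS::polr()`. These package choices, the simulated data, and the three-level `severity` outcome are assumptions for illustration only; the repository may use other packages and a different outcome coding.

```r
library(rpart)  # decision trees (recommended package)
library(MASS)   # polr() for ordinal logistic regression (recommended package)

set.seed(7)
n <- 600

# Simulated predictors (hypothetical); age is expressed in decades.
df <- data.frame(
  age_decades = round(runif(n, 40, 80)) / 10,
  diabetes    = rbinom(n, 1, 0.2)
)

# Latent severity score cut into three ordered categories.
latent <- 0.6 * df$age_decades + 0.9 * df$diabetes + rlogis(n)
df$severity <- cut(
  latent,
  breaks = quantile(latent, c(0, 0.6, 0.85, 1)),
  labels = c("none", "some", "severe"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Binary outcome for the classification tree.
df$impaired <- factor(df$severity != "none", labels = c("no", "yes"))

# 1) Decision tree classifier.
tree      <- rpart(impaired ~ age_decades + diabetes, data = df, method = "class")
tree_pred <- predict(tree, df, type = "class")
tree_acc  <- mean(tree_pred == df$impaired)   # in-sample accuracy

# 5) Ordinal logistic regression (proportional odds model).
ord_fit  <- polr(severity ~ age_decades + diabetes, data = df, Hess = TRUE)
ord_pred <- predict(ord_fit, df, type = "class")

print(tree_acc)
print(table(predicted = ord_pred, observed = df$severity))
```

Both accuracy figures are in-sample; comparing models fairly would need a held-out test set or cross-validation.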
