The purpose of this study is to predict sleeping hours and classify sleep trouble using study, demographic, physical measurement, health, and lifestyle variables from 5,000 American participants in the National Health and Nutrition Examination Survey (NHANES), which has been conducted since the early 1960s.
In the multiple regression model, each increase in age decade is associated with a decrease in the number of sleeping hours, with the 50-59 decade showing the largest decrease. Among the Hispanic, Mexican, White, and Other race categories, Mexican participants are associated with an increase in sleeping hours, whereas the Other category has the fewest hours. For education, participants in the high school group are associated with a decrease in sleeping hours compared with the other education groups. Those with lower household income are associated with an increase in sleeping hours compared with those with higher household income. Those with poor general health are associated with a decrease in sleeping hours compared with those with very good general health. Those who use the computer more than four hours a day are associated with an increase in sleeping hours compared with those who use it less than four hours a day. Those who have smoked at least 100 cigarettes are associated with a decrease in sleeping hours compared with those who have not.
In the regression tree model, the two main predictors of the number of sleeping hours on a weekday/workday night are the number of days of poor mental health and age decade. Those with 5.5 or more days of poor mental health are predicted to have 6.2 hours of sleep. Those with fewer than 5.5 days of poor mental health who are in the age range 20-69 are predicted to have 6.9 hours of sleep. Those with fewer than 5.5 days of poor mental health who are older than 69 are predicted to have 7.3 hours of sleep.
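The three leaves described above can be written out as a small function. This is only an illustrative sketch of the reported splits, not the fitted tree itself: the function name and the use of age in years (rather than an age-decade category) are assumptions, and participants are assumed to be at least 20 years old.

```python
def predict_sleep_hours(days_poor_mental_health: float, age: int) -> float:
    """Return the predicted weekday/workday sleep hours from the three reported leaves.

    Hypothetical helper: encodes only the splits stated in the text.
    Assumes age is given in years and is 20 or older.
    """
    # First split: number of days of poor mental health.
    if days_poor_mental_health >= 5.5:
        return 6.2
    # Second split (fewer than 5.5 days of poor mental health): age.
    if age > 69:
        return 7.3
    return 6.9


print(predict_sleep_hours(10, 45))
print(predict_sleep_hours(2, 45))
print(predict_sleep_hours(2, 75))
```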
When comparing the models, the multiple regression model explains 7% of the variability observed in the number of hours of sleep on a weekday or workday night, whereas the regression tree model explains only 3%. Since the linear model has the lowest RMSE (1.29) and MAE (1.01), the multiple linear regression model is the best at predicting the number of sleeping hours per weekday/workday night in the NHANES data set.
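For reference, the three metrics used in this comparison can be computed as in the sketch below. The function names and the toy values are hypothetical and are not taken from the study. The R² shown here is the "1 minus residual over total sum of squares" form; the study may instead have used the squared correlation between observed and predicted values, which can differ slightly.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average size of the error in hours."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variability in the outcome explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy sleep hours, purely illustrative (not NHANES data).
actual = [6, 7, 8, 5]
predicted = [7, 7, 7, 6]
print(rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```

Lower RMSE and MAE and a higher R² all point toward the better-fitting model, which is the basis for preferring the linear model here.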
The logistic model has 76% accuracy and a ROC AUC of 70.8%. The k-NN model has 89.3% accuracy and a ROC AUC of 94.0%. The C5.0 model has 88.0% accuracy and a ROC AUC of 91.6%. The random forest model has 86.3% accuracy and a ROC AUC of 91.3%. The Naive Bayes model has 74.6% accuracy and a ROC AUC of 68.9%. When comparing the models, k-NN performed the best of the five, with both the highest ROC AUC and the highest accuracy.
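The comparison above can be tabulated and ranked in a few lines. The sketch below simply restates the reported accuracy and ROC AUC values as fractions; the dictionary layout and variable names are illustrative choices.

```python
# Reported metrics for each classifier, expressed as fractions.
results = {
    "logistic": {"accuracy": 0.760, "roc_auc": 0.708},
    "k-NN": {"accuracy": 0.893, "roc_auc": 0.940},
    "C5.0": {"accuracy": 0.880, "roc_auc": 0.916},
    "random forest": {"accuracy": 0.863, "roc_auc": 0.913},
    "naive Bayes": {"accuracy": 0.746, "roc_auc": 0.689},
}

# Pick the best model under each metric separately.
best_by_auc = max(results, key=lambda m: results[m]["roc_auc"])
best_by_accuracy = max(results, key=lambda m: results[m]["accuracy"])

# Rank all models from highest to lowest ROC AUC.
ranking = sorted(results, key=lambda m: results[m]["roc_auc"], reverse=True)

print(best_by_auc, best_by_accuracy)
print(ranking)
```

Both metrics select the same model, so the choice does not depend on which metric is weighted more heavily.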
For predicting sleeping hours, the best model for determining the number of hours a participant sleeps is the multiple linear regression model. The most significant predictors are age decade, race, education, household income, general health, computer/gaming usage, and having smoked at least 100 cigarettes. For classifying sleep trouble, the best model for determining whether a participant has sleep trouble is the k-NN model.
- C5.0 by boosting
- Decision tree by cost or trials
- Logistic and Linear Regression by regularization
- Random Forests by tuning
- Naive Bayes by Laplace smoothing
- Cross-Validation
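
The last item in the list, cross-validation, can be sketched as follows. This is a minimal illustration of k-fold splitting using only the standard library; the function name, the choice of k = 5, and the fixed seed are assumptions, not details from the study.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles row indices once, deals them into k folds,
    and uses each fold in turn as the held-out test set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(10, k=5):
    # Fit the model on train_idx rows, evaluate on test_idx rows.
    print(len(train_idx), len(test_idx))
```

Averaging a metric such as RMSE or ROC AUC over the folds gives a more stable estimate than a single train/test split, which is useful when tuning the settings named in the other list items.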