GitHub - jmftrindade/6.830-project: Class project for 6.830 database systems

An auto-cleaning layer for tabular data management systems

Class project for Fall 2016's edition of MIT's 6.830/6.814: Database Systems.

In it, we proposed and evaluated different supervised learning algorithms for the task of predicting missing values in tabular data, measuring both prediction accuracy and training time. We deliberately took an AutoML-like approach: we performed no explicit feature engineering and instead used SFS (Sequential Feature Selection) to prune the set of candidate features.
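To make the task framing concrete, here is a minimal sketch of missing-value prediction as supervised learning: rows where the target column is present act as training data, and rows where it is missing are predicted from the other columns. This is not the project's code. The function names, the toy table, and the k-nearest-neighbour classifier (a dependency-free stand-in for the learners evaluated in the report) are all assumptions made for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def impute_column(rows, target, features, k=3):
    """Fill None entries of `target` by k-nearest-neighbour majority vote.

    rows: list of dicts, one per table row (hypothetical in-memory format).
    features: names of the columns used as model inputs.
    Returns a new list of rows; the input is left unmodified.
    """
    # Rows with a known target value serve as the training set.
    train = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        key = tuple(r[f] for f in features)
        nearest = sorted(
            train,
            key=lambda t: hamming(key, tuple(t[f] for f in features)),
        )[:k]
        vote = Counter(t[target] for t in nearest).most_common(1)[0][0]
        filled.append({**r, target: vote})
    return filled

# Toy table (made up for this example) with one missing "state" value.
table = [
    {"city": "Boston", "state": "MA", "zip3": "021"},
    {"city": "Boston", "state": "MA", "zip3": "021"},
    {"city": "Austin", "state": "TX", "zip3": "787"},
    {"city": "Austin", "state": "TX", "zip3": "787"},
    {"city": "Boston", "state": None, "zip3": "021"},
]
completed = impute_column(table, target="state", features=["city", "zip3"], k=1)
```

Here every non-target column is passed as a feature, which corresponds to the "use all available columns" baseline mentioned in this README.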

The tl;dr: random forests provided the best trade-off between prediction accuracy and training time. We also found that SFS incurred a significant performance penalty at training time while offering only marginal gains in prediction accuracy compared to using all available table columns as features.
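One way to see where the SFS training-time penalty comes from: greedy forward selection re-evaluates a model for every remaining candidate column in every round, whereas the all-columns baseline fits once. The sketch below counts those evaluations on a toy table. It is an illustration under stated assumptions, not the project's implementation: the README does not say which SFS variant or library was used, and the 1-nearest-neighbour scorer, the function names, and the data are all hypothetical.

```python
def loo_accuracy(rows, target, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (Hamming distance)."""
    hits = 0
    for i, r in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        nearest = min(others, key=lambda t: sum(t[f] != r[f] for f in features))
        hits += nearest[target] == r[target]
    return hits / len(rows)

def forward_sfs(rows, target, candidates, max_features):
    """Greedy forward selection.

    Returns (selected features, number of model evaluations performed).
    """
    selected, evaluations = [], 0
    best_score = -1.0
    while len(selected) < max_features:
        scored = []
        for f in candidates:
            if f in selected:
                continue
            # One model evaluation per remaining candidate, per round.
            scored.append((loo_accuracy(rows, target, selected + [f]), f))
            evaluations += 1
        if not scored:
            break
        score, feature = max(scored, key=lambda s: s[0])
        if score <= best_score:  # stop when no candidate improves the score
            break
        selected.append(feature)
        best_score = score
    return selected, evaluations

# Toy table (made up): column "a" determines "y"; "b" and "c" are noise.
rows = [
    {"a": 0, "b": 0, "c": 1, "y": "n"},
    {"a": 0, "b": 1, "c": 0, "y": "n"},
    {"a": 1, "b": 0, "c": 0, "y": "p"},
    {"a": 1, "b": 1, "c": 1, "y": "p"},
    {"a": 0, "b": 1, "c": 1, "y": "n"},
    {"a": 1, "b": 0, "c": 1, "y": "p"},
]
selected, evaluations = forward_sfs(rows, "y", ["a", "b", "c"], max_features=3)
```

On this table the search keeps only `a` but spends five evaluations finding it (three in the first round, two in the second), compared with a single fit for the all-columns baseline. With d candidate columns, a full forward pass can require up to d + (d - 1) + ... + 1 evaluations.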

The final report is available in report.pdf.

TODO

Add at least some documentation on how the repo is organized, e.g., how to run the training scripts.

Repository contents (102 commits):

FD_CFD_extraction-master
datasets
experiment_logs
figs
plots
willies_keras_example
.gitignore
LICENSE
README.md
generate_missing_value_datasets.py
plot_experiment_results.py
plots_for_report.py
report.pdf
requirements.txt
run_experiments.sh
run_ml_algos.py
statistics_plots.py
