Exploratory Data Analysis (EDA) Tool

Team members: Claude Hu, Caitlyn Nguyen, Luenna Wu

Overview

This tool produces graphical displays without requiring the end user to write any code. It gives researchers and scientists a quick preliminary analysis to help them understand their results.

Description

A graphical user interface (GUI) was coded in Python. The GUI instructs the user to upload a .csv file containing a dataset with an outcome (y) and predictor(s) (x). The back-end code then reads in the data. The next screen of the GUI asks the user to indicate which variable is the outcome and which are the predictors. Next, the user selects the type of data (binary, categorical, discrete, or continuous) for the outcome and for each predictor. The GUI then lists the displays the tool can generate, which the user selects through checkboxes, including:

Boxplots:
Shows the five-number summary of each set of data. It is helpful for comparing distributions across groups and revealing any potential outliers.
Scatterplot matrix:
Displays scatterplots of the outcome on each continuous predictor. It shows the relationship between the outcome and predictors.
Correlation matrix:
Displays a heatmap of the correlation coefficients between continuous predictors. It is useful for investigating the dependence between multiple variables at the same time and for checking for multicollinearity.
Histograms:
Displays the distribution of each continuous or discrete variable in the dataset. It is useful for showing where the peaks of the distribution are, whether the distribution is skewed or symmetric, and whether there are any potential outliers.
Pairplots:
Plots pairwise relationships of continuous variables in the dataset. A grid of axes is created in which each numeric variable is shared across the y-axes of a single row and the x-axes of a single column.

Once the user submits their selection, a new folder is created within the local directory. The graphical displays are saved as .png files in that folder.
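To make the "five-number summary" behind a boxplot concrete, here is a minimal sketch that computes it with the Python standard library. The function name five_number_summary is hypothetical and is not part of this tool, and plotting libraries may use a slightly different quartile convention than the one chosen here.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population, so the
    # quartiles always fall between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```

A boxplot draws the box from Q1 to Q3 with a line at the median, so comparing these five values across groups is the numeric counterpart of comparing the boxes side by side.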

Setup and Installation

Python Setup

To install Python and all necessary packages listed in Requirements.txt, please refer to Python Packaging Installation Instructions.

To install Git, please refer to Git Guides.

To install Pytest, please refer to Pytest Documentation.

Data Setup

Data should be recorded in a .csv file with one column per variable. Each column should have a header (the variable name) in the first row.

The accepted variable types are:

Binary
Categorical
Discrete
Continuous
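As an illustration of the expected layout, the sketch below reads a small .csv with a header row into one list of values per column. It uses only the standard csv module. The sample data and the helper name read_columns are made up for this example, and the tool's own loading code may work differently.

```python
import csv
import io

# Hypothetical file contents; a real run would open the user's .csv instead.
sample_csv = io.StringIO(
    "outcome,age,group\n"
    "1,34,a\n"
    "0,51,b\n"
    "1,47,a\n"
)


def read_columns(handle):
    """Return {header: [values, ...]} with one entry per column."""
    reader = csv.DictReader(handle)  # the first row supplies the headers
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(row[name])  # values are read as strings
    return columns


columns = read_columns(sample_csv)
print(list(columns))
print(columns["age"])
```

In this sample, outcome would be a Binary variable, age a Continuous (or Discrete) variable, and group a Categorical variable.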

Scripts Setup

The following .py scripts should be downloaded and saved in the same folder within your local drive:

Variable_Class.py
final_project_main.py
plots.py

Class Description

Variable Class

The Variable class is initialized as: Variable(name: str, values: list).

Each instance of the Variable class stores attributes for a single variable listed as a column in the uploaded dataset.

The Variable class has the following instance attributes:

name
values

The Variable class has the following properties:

get_type
get_x_or_y

The method set_type(self, var_type: str) is used to set the variable type. The variable type can be set as "Binary", "Categorical", "Discrete", or "Continuous". A ValueError is raised if the given var_type is not one of these four types. Accessing get_type returns the variable type, as a str, that was set by set_type.

The method set_x_or_y(self, x_or_y_type: str) is used to set whether the variable is an x or a y variable. The value can be set as "x" or "y". A ValueError is raised if the given x_or_y_type is not "x" or "y". Accessing get_x_or_y returns "x" or "y", as a str, as set by set_x_or_y.
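The interface described in this section could be sketched as follows. This is a reconstruction from the description, not the contents of Variable_Class.py: the private attribute names, the error messages, and the initial value of None are assumptions.

```python
class Variable:
    """Sketch of a per-column container, following the description above."""

    _VALID_TYPES = ("Binary", "Categorical", "Discrete", "Continuous")
    _VALID_ROLES = ("x", "y")

    def __init__(self, name: str, values: list):
        self.name = name
        self.values = values
        # Assumption: both settings stay unset until the setters are called.
        self._type = None
        self._x_or_y = None

    def set_type(self, var_type: str) -> None:
        if var_type not in self._VALID_TYPES:
            raise ValueError(f"var_type must be one of {self._VALID_TYPES}")
        self._type = var_type

    def set_x_or_y(self, x_or_y_type: str) -> None:
        if x_or_y_type not in self._VALID_ROLES:
            raise ValueError('x_or_y_type must be "x" or "y"')
        self._x_or_y = x_or_y_type

    @property
    def get_type(self):
        return self._type

    @property
    def get_x_or_y(self):
        return self._x_or_y


age = Variable("age", [34, 51, 47])
age.set_type("Continuous")
age.set_x_or_y("x")
print(age.name, age.get_type, age.get_x_or_y)
```

Because get_type and get_x_or_y are properties, they are read without parentheses, as in the last line.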

Instructions

The .csv file should be saved in your local drive in a location which can be easily accessed again.

Set the folder containing all of your scripts as the working directory within VS Code.
In the terminal (cmd), run:

python final_project_main.py

A GUI window will appear with a button labeled "Select Input Table". Click the "Select Input Table" button.

A file browser will open. Find and select your .csv file, then click "Open".

The file path should now populate within the GUI. Click on "Next".

A list of all of the variable names will appear. Select your outcome variable and click on "Next".

Select the type of variable for the outcome from the list of variable types and click on "Next".

Select which predictors you would like to include and the variable type for each predictor. Click on "Next" when done.

Select the visualizations you would like to be produced. Click on "Run" to generate the plots.

A new folder named EDA_[year]_[month]_[day]_[hour]_[minute]_[second] will be created in your local directory. All plot figures are saved to this folder as .png files.
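A folder name in this pattern can be built from the current time with datetime.strftime. The sketch below is one way to do it and is not taken from final_project_main.py. The helper name make_output_folder is hypothetical, and zero-padding of the month, day, and time fields is an assumption.

```python
import os
import tempfile
from datetime import datetime


def make_output_folder(parent_dir, now=None):
    """Create and return a timestamped EDA_... folder inside parent_dir."""
    now = now or datetime.now()
    # Assumption: fields are zero-padded, e.g. EDA_2024_01_02_03_04_05.
    folder_name = now.strftime("EDA_%Y_%m_%d_%H_%M_%S")
    path = os.path.join(parent_dir, folder_name)
    os.makedirs(path, exist_ok=True)
    return path


# A temporary directory and a fixed timestamp keep the example repeatable.
output_dir = make_output_folder(tempfile.mkdtemp(), datetime(2024, 1, 2, 3, 4, 5))
print(os.path.basename(output_dir))
```

Including the seconds in the name means that two runs started at different moments write to different folders, so earlier plots are not overwritten.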

Example Outputs

An example output of each plot type is given below.

Boxplot

Correlation Matrix

Histogram

Pairplot

Scatterplot

Testing

The tests are located in test_Variable_Class.py. pytest must be installed to run them. The files test_data.csv and test_data_2.csv are included for use in testing.

To test the Variable_Class.py module, run the test_Variable_Class.py module by typing in the console:

pytest test_Variable_Class.py

If all tests pass, the output should look similar to:

test_Variable_Class.py ....                                                                    [100%]

======================================== 1 passed in 0.12s ========================================
