Cumulative Distribution Function Fitter

Python script based on the scipy.stats library that fits theoretical statistical distributions to an empirical distribution.

It receives a CSV dataset with continuous or discrete data and finds the parameters for more than 90 distributions. For each distribution, the script also computes:

Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit tests
RMSE for the entire curve, plus a weighted RMSE that balances the curve body against the tail
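
The fit-then-test idea described above can be sketched with public scipy.stats calls. This is a minimal illustration, not the code in fitter.py: the helper name `fit_and_test`, the synthetic gamma sample, and the three candidate distributions are assumptions made for the example, and only the Kolmogorov-Smirnov test is shown.

```python
import numpy as np
from scipy import stats


def fit_and_test(data, dist_name):
    """Fit one scipy.stats continuous distribution and run a KS test on it.

    Hypothetical helper for illustration; fitter.py may be organized differently.
    """
    dist = getattr(stats, dist_name)
    # Maximum-likelihood fit: returns (shape params..., loc, scale)
    params = dist.fit(data)
    # One-sample KS test of the data against the fitted distribution
    ks_stat, p_value = stats.kstest(data, dist_name, args=params)
    return params, ks_stat, p_value


# Synthetic data standing in for one CSV column (assumption)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=500)

for name in ("gamma", "lognorm", "expon"):
    params, ks_stat, p_value = fit_and_test(sample, name)
    print(f"{name:8s} KS={ks_stat:.4f} p={p_value:.4f} params={np.round(params, 3)}")
```

Looping over a longer list of distribution names extends the same pattern to the full set of candidates.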

How to call

python -W ignore fitter.py [csv_data] [output_folder] [column_header];

For example, assuming you have a dataset.csv whose column header is data and want to store the output in a folder called results:

python -W ignore fitter.py dataset.csv results data;

The script outputs

A PNG plot for each distribution
CDF percentiles (synthdata_*) for plotting in other programming languages
The distribution parameters (parameters_*). The last two parameters are the location and scale of the curve. The remaining parameters are the shape parameters of the distribution
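
The parameter ordering described above matches what scipy.stats returns from `fit`: shape parameters first, then location, then scale. A minimal sketch of splitting such a tuple and reusing it follows. The helper `split_params` and the Weibull test data are assumptions for illustration, and the layout of the parameters_* files themselves is not assumed here.

```python
import numpy as np
from scipy import stats


def split_params(params):
    """Split a scipy-style parameter tuple into (shape params, loc, scale)."""
    *shape, loc, scale = params
    return tuple(shape), loc, scale


# Synthetic data (assumption): Weibull with shape 1.5, stretched by 2
rng = np.random.default_rng(1)
data = rng.weibull(1.5, size=1000) * 2.0

params = stats.weibull_min.fit(data)
shape, loc, scale = split_params(params)
print("shape:", shape, "loc:", loc, "scale:", scale)

# Evaluate the fitted CDF at the empirical median and 95th percentile
x = np.percentile(data, [50, 95])
cdf = stats.weibull_min.cdf(x, *shape, loc=loc, scale=scale)
print("fitted CDF at empirical P50 and P95:", cdf)
```

Distributions without shape parameters, such as the normal, yield an empty shape tuple, so the same helper works for them.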

Goodness of fit

The script outputs RMSE and the Kolmogorov-Smirnov (KS) statistic. If you use RMSE, choose the distribution with the smallest value. However, there are cases where the curve has a good visual fit but a high RMSE. This happens because some distributions overestimate the value of the last percentile. In those cases you may use the RMSE computed up to the 95th percentile (curve body) or the weighted RMSE, which reduces the weight of the curve tail in the RMSE calculation.
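
To make the three RMSE variants concrete, here is a small sketch over percentile arrays. The function names, the cut at the 95th percentile, and the tail weight of 0.1 are assumptions chosen for illustration; the exact weighting used by the script is not documented here.

```python
import numpy as np


def rmse(emp, theo):
    """Plain RMSE between two equally sized arrays of percentile values."""
    emp, theo = np.asarray(emp, float), np.asarray(theo, float)
    return float(np.sqrt(np.mean((emp - theo) ** 2)))


def body_rmse(emp, theo, cut=95):
    """RMSE over the first `cut` percentiles only (curve body)."""
    return rmse(emp[:cut], theo[:cut])


def weighted_rmse(emp, theo, cut=95, tail_weight=0.1):
    """RMSE with tail points down-weighted (hypothetical weighting scheme)."""
    emp, theo = np.asarray(emp, float), np.asarray(theo, float)
    w = np.ones(len(emp))
    w[cut:] = tail_weight
    return float(np.sqrt(np.sum(w * (emp - theo) ** 2) / np.sum(w)))


# Percentiles 1..100 of a toy empirical curve, and a fit that matches
# everywhere except for an overestimated last percentile.
emp = np.arange(1, 101, dtype=float)
theo = emp.copy()
theo[-1] = 200.0

print("full RMSE:    ", rmse(emp, theo))
print("body RMSE:    ", body_rmse(emp, theo))
print("weighted RMSE:", weighted_rmse(emp, theo))
```

In this toy case a single bad tail point dominates the full RMSE, while the body RMSE ignores it and the weighted RMSE falls in between.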

You can use KS instead. In this case you should also choose the distribution with the smallest statistic, but its p-value must be above 0.05. A p-value above 0.05 means the test cannot reject, at the 5% level, the hypothesis that the empirical and test samples come from the same distribution. If you encounter two distributions with similar KS values, look at RMSE to decide which one to pick. What counts as a good RMSE depends on your application: in some cases the curve tail does not matter, and in others errors in the tail have a high impact on your system.

If you have problems with p-values even for distributions with a good visual fit, you can try changing the test sample size in lines 52 and 53 of fitter.py (second parameter):
smp_emp = np.random.choice(emp_d,50,replace = False)
smp_theo = np.random.choice(theo_d,50,replace = False)
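
The effect of the sample size can be explored with a small experiment. The sketch below assumes the two subsamples are compared with a two-sample KS test (`scipy.stats.ks_2samp`); that pairing, the helper `subsampled_ks`, and the synthetic normal data are assumptions, since only the two sampling lines are quoted above.

```python
import numpy as np
from scipy import stats


def subsampled_ks(emp_d, theo_d, size, seed=0):
    """Draw `size` points from each array and run a two-sample KS test.

    Hypothetical helper; mirrors the sampling lines quoted above.
    """
    rng = np.random.default_rng(seed)
    smp_emp = rng.choice(emp_d, size, replace=False)
    smp_theo = rng.choice(theo_d, size, replace=False)
    return stats.ks_2samp(smp_emp, smp_theo)


# Synthetic empirical data and draws from a normal fitted to it (assumptions)
rng = np.random.default_rng(42)
emp_d = rng.normal(loc=10.0, scale=2.0, size=5000)
loc, scale = stats.norm.fit(emp_d)
theo_d = stats.norm.rvs(loc=loc, scale=scale, size=5000, random_state=7)

for size in (50, 200, 1000):
    result = subsampled_ks(emp_d, theo_d, size)
    print(f"size={size:5d} KS={result.statistic:.4f} p={result.pvalue:.4f}")
```

Larger subsamples make the test more sensitive, so small deviations that are invisible in a plot can push the p-value below 0.05; smaller subsamples are more forgiving but also noisier from run to run.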

thiagoguarnieri/cdf-fitter

Folders and files

Latest commit

History

Repository files navigation

Cummulative Distribuition Function Fitter

How to call

Goodness

About

Topics

Resources

Stars

Watchers

Forks

Languages