Python script to fit an empirical distribution to a theoretical statistical distribution

thiagoguarnieri/cdf-fitter

Cumulative Distribution Function Fitter

Python script, based on the scipy.stats library, that fits an empirical distribution to a theoretical statistical distribution.

It receives a CSV dataset with continuous or discrete data and finds the parameters for more than 90 distributions. For each distribution, the script also computes:

  • Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit tests
  • RMSE for the entire curve, plus a weighted RMSE that balances the curve body against the tail
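The fit-then-test idea above can be sketched for a single distribution as follows. This is a minimal illustration, not the code in fitter.py: the synthetic data, the choice of the gamma distribution, and the 1% to 99% percentile grid are all assumptions made for the example, and the Anderson-Darling test is left out.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one CSV column (assumption: real data comes from the CSV).
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=500)

# Fit one candidate distribution by maximum likelihood.
dist = stats.gamma
params = dist.fit(data)  # (shape, loc, scale) for gamma

# One-sample Kolmogorov-Smirnov test of the data against the fitted CDF.
ks_stat, p_value = stats.kstest(data, dist.cdf, args=params)

# RMSE between empirical and fitted percentiles (hypothetical 1%..99% grid).
probs = np.linspace(0.01, 0.99, 99)
emp_q = np.quantile(data, probs)
theo_q = dist.ppf(probs, *params)
rmse = float(np.sqrt(np.mean((emp_q - theo_q) ** 2)))

print(f"KS={ks_stat:.4f} p={p_value:.4f} RMSE={rmse:.4f}")
```

Repeating this over every candidate distribution and sorting by KS or RMSE gives a ranking similar in spirit to what the script reports.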

How to call

python -W ignore fitter.py [csv_data] [output_folder] [column_header];

E.g., assuming you have a dataset.csv with a column named data and want to store the output in a folder called results:

python -W ignore fitter.py dataset.csv results data;
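To try the command without real data, a small test file can be generated first. The sketch below assumes the script expects a CSV whose first row is the column header passed as the third argument (data in the example); the helper name write_sample_csv and the exponential test values are hypothetical.

```python
import csv
import random
import tempfile
from pathlib import Path

def write_sample_csv(path, column="data", n=200, seed=0):
    """Write n exponentially distributed values under a single header row."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([column])  # header row, e.g. "data"
        for _ in range(n):
            writer.writerow([rng.expovariate(0.5)])
    return path

# Write into a temporary folder and read the file back to check its layout.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_sample_csv(Path(tmp) / "dataset.csv")
    with open(csv_path, newline="") as fh:
        rows = list(csv.reader(fh))

header, values = rows[0], rows[1:]
print(header, len(values))
```

Pointing write_sample_csv at dataset.csv in the working directory produces a file that matches the example command.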

The script outputs

  • A PNG for each distribution
  • CDF percentiles (synthdata_*) for plotting in other programming languages
  • The distribution parameters (parameters_*). The last two parameters are the location and scale of the curve; the remaining ones are the shape parameters of the distribution
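The parameter ordering described above matches what scipy.stats returns from fit(): shape parameters first, then loc and scale. The sketch below shows how such a tuple can be split and reused; the synthetic data and the gamma distribution are assumptions for illustration, and reading the values back from a parameters_* file is not shown.

```python
import numpy as np
from scipy import stats

# Synthetic sample (assumption: stands in for the empirical data).
rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=2.0, size=1000)

# scipy returns (shape parameters..., loc, scale).
params = stats.gamma.fit(data)
*shape_params, loc, scale = params

# Rebuild a frozen distribution from the split parameters.
fitted = stats.gamma(*shape_params, loc=loc, scale=scale)
cdf_at_median = float(fitted.cdf(np.median(data)))

print("shape:", shape_params, "loc:", loc, "scale:", scale)
print("fitted CDF at the empirical median:", cdf_at_median)
```

A distribution with no shape parameters, such as the normal, yields an empty shape_params list and only loc and scale.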

Goodness of fit

The script outputs RMSE and Kolmogorov-Smirnov (KS) values. If you use RMSE, choose the distribution with the smallest value. However, a curve can have a good visual fit but a high RMSE. This happens because some distributions overestimate the value of the last percentile. In that case, you can use the RMSE computed up to the 95th percentile (curve body) or the weighted RMSE, which reduces the weight of the curve tail in the calculation.
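The effect of a single overestimated last percentile can be seen in a small worked example. The weighting scheme here is an assumption (body up to 95%, with hypothetical weights of 0.8 for the body and 0.2 for the tail); the weights actually used by the script may differ.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two equally sized arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def body_tail_rmse(emp_q, theo_q, probs, split=0.95, tail_weight=0.2):
    """Return (body RMSE, tail RMSE, weighted RMSE); the weights are hypothetical."""
    body = probs <= split
    rmse_body = rmse(emp_q[body], theo_q[body])
    rmse_tail = rmse(emp_q[~body], theo_q[~body])
    weighted = (1.0 - tail_weight) * rmse_body + tail_weight * rmse_tail
    return rmse_body, rmse_tail, weighted

# Percentiles 1%..100%; the fitted curve matches everywhere except the last one.
probs = np.arange(1, 101) / 100
emp_q = np.arange(1, 101, dtype=float)
theo_q = emp_q.copy()
theo_q[-1] += 100.0  # overestimated last percentile

full = rmse(emp_q, theo_q)
body, tail, weighted = body_tail_rmse(emp_q, theo_q, probs)
print(f"full={full:.2f} body={body:.2f} tail={tail:.2f} weighted={weighted:.2f}")
```

In this example the body RMSE is zero even though the full-curve RMSE is large, which is the situation the paragraph above describes.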

You can use KS instead. In this case, also choose the distribution with the smallest statistic, provided that its p-value is above 0.05. A p-value above 0.05 means the test cannot reject the hypothesis that the empirical and test samples come from the same distribution. If two distributions have similar KS values, use RMSE to decide between them. What counts as a good RMSE depends on your application: in some cases the curve tail does not matter, while in others errors in the tail have a high impact on your system.

If you get low p-values even for distributions with a good visual fit, try changing the test sample size (the second argument) in lines 52 and 53 of fitter.py:
smp_emp = np.random.choice(emp_d,50,replace = False)
smp_theo = np.random.choice(theo_d,50,replace = False)
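The sample size matters because a KS test on large samples flags even tiny differences between two curves. The sketch below assumes the script compares the two subsamples with a two-sample KS test (scipy.stats.ks_2samp); the synthetic arrays and the helper ks_on_subsample are hypothetical, and only the names emp_d, theo_d, smp_emp and smp_theo come from the lines above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two nearly identical populations (assumption: stand-ins for the script's arrays).
emp_d = rng.normal(0.0, 1.0, 5000)
theo_d = rng.normal(0.05, 1.0, 5000)

def ks_on_subsample(emp, theo, n, rng):
    """Draw n points from each array without replacement and run a two-sample KS test."""
    smp_emp = rng.choice(emp, n, replace=False)
    smp_theo = rng.choice(theo, n, replace=False)
    return stats.ks_2samp(smp_emp, smp_theo)

results = {}
for n in (50, 500, 5000):
    stat, p = ks_on_subsample(emp_d, theo_d, n, rng)
    results[n] = (float(stat), float(p))
    print(f"n={n:5d} KS={stat:.4f} p={p:.4f}")
```

Comparing the p-values across sizes on your own data shows how sensitive the 0.05 threshold is to the subsample size you pick.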