Bayesian and frequentist statistics in Python with data sampled from a distribution in Scala

mdh266/BayesMLE

Frequentist & Bayesian Statistics With Py4J & PyMC3


In this post I want to go back to the basics of statistics, but with an advanced spin on things. By "advanced spin" I mean advanced in terms of both the mathematics and the computational techniques. The topic I will dive into is:

  • Estimating a single parameter value from a distribution and then quantifying the uncertainty in the estimate.

In general I will take two approaches to quantifying the uncertainty in the estimate: the first is frequentist and the second is Bayesian. I was originally inspired by Jake VanderPlas' post and admit I am not very seasoned in using Bayesian methods. That's why I'll be sticking to a simple example of estimating the mean rate, or 𝜆, in a Poisson distribution from sampled data. An image of the Poisson distribution for various values of 𝜆, the parameter we wish to estimate, is shown below:

Poisson

From the computational perspective, I wanted to do something different and decided to write the probability distribution for generating the data in Scala, but then use it with Python. Why did I do this? Well, I like Scala and enjoyed the challenge of writing a Poisson distribution using a functional approach. I also wanted to learn more about Py4J which can be used to work with functions and objects in the JVM from Python. Apache Spark actually uses Py4J in PySpark to write Python wrappers for their Scala API. I've used both PySpark and Spark in Scala extensively in the past and doing this project gave me an opportunity to understand how PySpark works much better.
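To give a feel for what sampling from a Poisson distribution involves, here is a minimal Python sketch using Knuth's multiplication method. This is an assumption on my part for illustration: the Scala implementation in this repository may use a different algorithm, and the function names `sample_poisson` and `sample_many` are hypothetical.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    threshold = math.exp(-lam)
    # Multiply uniform draws together until the product falls to or below
    # exp(-lam); the number of extra multiplications is the sample.
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_many(lam: float, n: int, seed: int = 0) -> list:
    """Draw n independent Poisson(lam) variates from a seeded generator."""
    rng = random.Random(seed)
    return [sample_poisson(lam, rng) for _ in range(n)]
```

The method is simple but its running time grows linearly with 𝜆, so it is only a reasonable choice for small rates.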

In this post I cover maximum likelihood estimators (MLE) and Bayesian point estimators. The MLE in this case is simple, and I show how to quantify the uncertainty in the estimate using confidence intervals from the Fisher information. I use PyMC3 to calculate two Bayesian estimators and the credible interval. PyMC3 makes it easy to use MCMC methods to calculate and visualize posterior distributions for the parameter of interest, as shown below:

posterior.png
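For the frequentist side, the Poisson MLE is the sample mean, and the Fisher information for n observations is n/𝜆, which gives an approximate standard error of sqrt(𝜆̂/n). The sketch below is a minimal illustration under those standard results; the function name `poisson_mle_with_ci` and the hard-coded normal quantile 1.96 (roughly a 95% interval) are my own choices, not taken from the notebook.

```python
import math


def poisson_mle_with_ci(data, z: float = 1.96):
    """Return the Poisson MLE and a Wald confidence interval.

    z is the standard normal quantile; 1.96 gives roughly 95% coverage.
    """
    n = len(data)
    lam_hat = sum(data) / n              # MLE of the rate is the sample mean
    std_err = math.sqrt(lam_hat / n)     # from Fisher information I(lam) = n / lam
    return lam_hat, (lam_hat - z * std_err, lam_hat + z * std_err)


lam_hat, (lower, upper) = poisson_mle_with_ci([2, 3, 4, 3, 3])
print(f"MLE = {lam_hat:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

This interval relies on the asymptotic normality of the MLE, so it can be poor for very small samples or rates close to zero.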

One can also show that in the limit of large data Bayesian estimators and maximum likelihood estimators converge to the same thing! This is called the Bernstein-von Mises theorem.
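This convergence can be seen without MCMC in a special case. If we assume a Gamma(alpha, beta) prior on 𝜆 (shape and rate), the posterior is available in closed form as Gamma(alpha + sum of the data, beta + n). The sketch below swaps this conjugate shortcut in for PyMC3 purely for illustration; the prior values and function names are hypothetical and are not necessarily what the notebook uses.

```python
def gamma_poisson_posterior(data, alpha: float = 2.0, beta: float = 1.0):
    """Return (shape, rate) of the Gamma posterior for a Poisson rate."""
    return alpha + sum(data), beta + len(data)


def posterior_mean(shape: float, rate: float) -> float:
    """Posterior mean of a Gamma(shape, rate) distribution."""
    return shape / rate


def posterior_mode(shape: float, rate: float) -> float:
    """Posterior mode (MAP estimate); the mode is 0 when shape < 1."""
    return (shape - 1.0) / rate if shape >= 1.0 else 0.0


def gap_to_mle(data, alpha: float = 2.0, beta: float = 1.0) -> float:
    """Absolute difference between the posterior mean and the MLE."""
    shape, rate = gamma_poisson_posterior(data, alpha, beta)
    mle = sum(data) / len(data)
    return abs(posterior_mean(shape, rate) - mle)


# With more data the prior matters less and the gap shrinks.
print(gap_to_mle([3] * 10), gap_to_mle([3] * 1000))
```

As n grows, the prior terms alpha and beta become negligible next to the data terms, which is the intuition behind the two estimators agreeing in the large-data limit.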

Building The Jar


You first need to compile the Scala code and build the uber jar using Maven:

mvn package

Starting Java Server & Jupyter Lab With Docker


First build the Docker images:

docker compose build

Then start up the containers with:

docker compose up

You can shut down the containers using:

docker compose down

Libraries:


  1. Docker 20.10.5
  2. Apache Maven 3.6.0
  3. Scala 2.12.6
  4. Py4J 0.10.9.2
  5. Python 3.7
  6. PyMC3 3.11.2
  7. Seaborn 0.11.0
  8. ArviZ 0.11.2
