For this book to be successful, the code examples must be solid. Statistics is a complicated topic, and for readers with no prior programming experience, the complexity of statistics is compounded by the complexity of learning Python syntax.
I want the Python code examples to be simple and ideally self-explanatory. Each code block should make the reader think "yes, this makes sense" and support the text narrative, rather than become an additional stumbling block.
Goals:
- Keep it Pythonic: follow standard conventions for naming and source formatting.
- Make it general enough, but not too general: helper functions should handle all the cases needed for the book, but not generalize beyond what is necessary.
- Make it simple: don't use fancy Python features that beginners might not be familiar with.
- Do not reuse variable names in the same notebook (e.g. if you define `rvX` at the beginning, don't redefine it to be a different thing later on).
- `eprices` (df): the dataframe loaded from `datasets/eprices.csv`
- `eprices{W}` (df): the subset of rows from the data frame for group `{W}`
- `pricesW` and `pricesE` (array or series): the values of the `price` data column for the two groups
- `ksample`: kombucha volumes sample; also `ksample02`, `ksample03`, from different batches
- `asample`: apple weights
- `scoresR` and `scoresU`: the values of the two groups of sleep scores from the `doctors` dataset
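A minimal sketch of these conventions (with made-up values standing in for `datasets/eprices.csv`, which would normally be loaded with `pd.read_csv`; the real dataset also has `end` and `price` columns):

```python
import pandas as pd

# stand-in for eprices = pd.read_csv("datasets/eprices.csv");
# values are made up, but the column names match the real dataset
eprices = pd.DataFrame({
    "end":   ["East", "East", "West", "West"],
    "price": [10.0, 11.0, 12.0, 13.0],
})

epricesW = eprices[eprices["end"] == "West"]   # rows for the West group
epricesE = eprices[eprices["end"] == "East"]   # rows for the East group

pricesW = epricesW["price"]   # just the `price` column (a Series)
pricesE = epricesE["price"]
```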
Sampling distributions:
- `xbars20`: sampling distribution of the mean for samples of size `n=20` from the random variable `rvX`.
- `xvars20`: sampling distribution of the variance for samples of size `n=20` from `rvX`.
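A quick simulation sketch of these two conventions. Here I assume, purely for illustration, that `rvX` is normal with mean 100 and standard deviation 5; in the book `rvX` would be a `scipy.stats` frozen distribution, but the draws are done directly with NumPy to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)

# assume rvX ~ Normal(mu=100, sigma=5), chosen arbitrarily for this sketch
xbars20 = []   # sampling distribution of the mean, samples of size n=20
xvars20 = []   # sampling distribution of the variance, samples of size n=20
for i in range(1000):
    xsample20 = rng.normal(loc=100, scale=5, size=20)
    xbars20.append(np.mean(xsample20))
    xvars20.append(np.var(xsample20, ddof=1))
```

The mean of `xbars20` should be close to 100, and the mean of `xvars20` close to 25.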
Sampling and resampling conventions:
- `{x}sample`: generated sample from `rvX` by simulation. Optionally include the sample size, e.g. `{x}sample20` for a sample of size $n=20$.
- `rsample`: sample generated using resampling (e.g. permutation test)
- `bsample`: bootstrap sample generated from `sample`
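For example, the `bsample` convention, sketched with a hypothetical `ksample` of made-up kombucha volumes:

```python
import numpy as np

rng = np.random.default_rng(3)
ksample = np.array([1001.2, 999.3, 1002.1, 998.7, 1000.5])  # made-up volumes

# bsample: bootstrap sample drawn with replacement from ksample
bsample = rng.choice(ksample, size=len(ksample), replace=True)
```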
Data frames full of samples or resamples (not currently used in user-facing code):
- `{x}samples_df`: dataframe that contains multiple `{x}samples` (identified by the value in the `sample` column)
- `rsamples_df`: a DataFrame that contains data from `R` `rsample`s and has an extra column `rep=1:n` (to imitate the `replicate` column used by `infer`)
- `bsamples_df`: a DataFrame that contains data from `B` `bsample`s
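A sketch of the `bsamples_df` layout (made-up sample values; the `rep` column identifies which bootstrap sample each row belongs to, as described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sample = np.array([4.2, 5.1, 4.8, 5.0])   # made-up observed sample
B = 3                                      # number of bootstrap samples

rows = []
for rep in range(1, B + 1):
    bsample = rng.choice(sample, size=len(sample), replace=True)
    for value in bsample:
        rows.append({"rep": rep, "value": value})

bsamples_df = pd.DataFrame(rows)   # B*len(sample) rows, identified by `rep`
```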
See the function `gen_sampling_dist` in `ministats/sampling.py`.
See the function `gen_boot_dist` in `ministats/sampling.py`.
See the function `simulation_test_mean` in `ministats/hypothesis_tests.py`.
See the functions `resample_under_H0` and `permutation_test` in `ministats/hypothesis_tests.py`.
Below we preserve the old implementations (which were not as good).
```python
import numpy as np

def dmeans(sample):
    """
    Compute the difference between the group means.
    """
    xS = sample[sample["group"]=="S"]["ELV"]
    xNS = sample[sample["group"]=="NS"]["ELV"]
    d = np.mean(xS) - np.mean(xNS)
    return d

def resample_under_H0(sample, groupcol="group"):
    """
    Return a copy of the dataframe `sample` with the labels in the column
    `groupcol` modified based on a random permutation of the values
    in the original sample.
    """
    resample = sample.copy()
    labels = sample[groupcol].values
    newlabels = np.random.permutation(labels)
    resample[groupcol] = newlabels
    return resample

# # resample
# resample = resample_under_H0(data)
# # compute the difference in means for the new labels
# dmeans(resample)

def permutation_test(sample, statfunc, groupcol="group", permutations=10000):
    """
    Compute the p-value of the observed `statfunc(sample)` under the null
    hypothesis where the labels in the column `groupcol` are randomized.
    """
    # 1. compute the observed value of the statistic for the sample
    obsstat = statfunc(sample)
    # 2. generate the sampling distribution of the statistic under H0
    restats = []
    for i in range(permutations):
        resample = resample_under_H0(sample, groupcol=groupcol)
        restat = statfunc(resample)
        restats.append(restat)
    # 3. compute the p-value: how many `restat`s are equal-or-more-extreme
    #    than `obsstat`
    tailstats = [restat for restat in restats
                 if restat <= -abs(obsstat) or restat >= abs(obsstat)]
    pvalue = len(tailstats) / len(restats)
    return restats, pvalue

# sampling_dist, pvalue = permutation_test(data, statfunc=dmeans)
```
Comments:
- The function `dmeans` is not the same as the `dmeans` used earlier in the book.
- The `for`-loop in `permutation_test` could be abstracted into a function.
```python
# (relies on a two-argument `dmeans(xW, xE)` defined earlier in the book)

def resample_under_H0(data, groupcol="group"):
    """
    Return a copy of the dataframe `data` with
    the labels in the column `groupcol` mixed up.
    """
    redata = data.copy()
    labels = data[groupcol].values
    newlabels = np.random.permutation(labels)
    redata[groupcol] = newlabels
    return redata

def gen_sampling_dist_under_H0(data, permutations=10000):
    """
    Obtain the sampling distribution of `dmeans` under H0
    by repeated permutations of the group labels.
    """
    pstats = []
    for i in range(permutations):
        data_perm = resample_under_H0(data, groupcol="end")
        xW_perm = data_perm[data_perm["end"]=="West"]["price"]
        xE_perm = data_perm[data_perm["end"]=="East"]["price"]
        dhat_perm = dmeans(xW_perm, xE_perm)
        pstats.append(dhat_perm)
    return pstats

def permutation_test(data):
    """
    Compute the p-value of the observed `dmeans(xW,xE)` under H0.
    """
    # Obtain the sampling distribution of `dmeans` under H0
    pstats = gen_sampling_dist_under_H0(data)
    # Compute the value of `dmeans` for the observed data
    xW = data[data["end"]=="West"]["price"]
    xE = data[data["end"]=="East"]["price"]
    dhat = dmeans(xW, xE)
    # Compute the p-value of `dhat` under the distribution `pstats`
    # (how many of the `pstats` are equal-or-more-extreme than `dhat`)
    tailstats = [pstat for pstat in pstats
                 if pstat <= -abs(dhat) or pstat >= abs(dhat)]
    pvalue = len(tailstats) / len(pstats)
    return pvalue
```
Comments:
- It seems awkward for `gen_sampling_dist_under_H0` and `permutation_test` to know about the specifics of the problem.
It would be nice to imitate the `infer` API and process so that we don't have to make design choices:
- `specify()` or `assume()`
- `hypothesize()`
- `generate()`
- `get_p_value()`
- `get_confidence_interval()`
I don't really see how to do this yet, but I'm watching the authors of UBC-DSCI/introduction-to-datascience-python to see what Python code they will use to imitate infer functionality.
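For reference, one conceivable shape for such a chained interface. This is purely hypothetical: it is not the actual `infer` API, not a settled design, and every name here (`PermutationTest`, `specify`, etc.) is invented for this sketch:

```python
import numpy as np

class PermutationTest:
    """Hypothetical sketch of an infer-like chained interface."""

    def __init__(self, data):
        self.data = data

    def specify(self, response, explanatory):
        self.response = response        # name of the values column
        self.explanatory = explanatory  # name of the group-labels column
        return self

    def hypothesize(self, null="independence"):
        self.null = null
        return self

    def generate(self, reps=1000):
        self.reps = reps
        return self

    def get_p_value(self, statfunc):
        # split the observed data into the two groups
        groups = sorted(self.data[self.explanatory].unique())
        x1 = self.data[self.data[self.explanatory] == groups[0]][self.response].values
        x2 = self.data[self.data[self.explanatory] == groups[1]][self.response].values
        obsstat = statfunc(x1, x2)
        # permutation-based sampling distribution under H0
        values = np.concatenate((x1, x2))
        pstats = []
        for i in range(self.reps):
            shuffled = np.random.permutation(values)
            pstats.append(statfunc(shuffled[:len(x1)], shuffled[len(x1):]))
        tailstats = [p for p in pstats if p <= -abs(obsstat) or p >= abs(obsstat)]
        return len(tailstats) / len(pstats)

# hypothetical usage:
# PermutationTest(eprices).specify("price", "end") \
#     .hypothesize().generate(10000).get_p_value(dmeans)
```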
- Extract the data from the dataframe before passing it to the function.
- Combine the two samples into one long array; permute it; return the first `n` values and the remaining `m` values.
- We can use ordinary functions like `dmeans` as previously shown.
```python
def resample_under_H0(sample1, sample2):
    """
    Generate new samples from a random permutation of the values in two samples.
    """
    values = np.concatenate((sample1, sample2))
    shuffled_values = np.random.permutation(values)
    resample1 = shuffled_values[0:len(sample1)]
    resample2 = shuffled_values[len(sample1):]
    return resample1, resample2

# usage:
# >>> resample_under_H0([1,1,1], [2,2,2,2])
# (array([2, 2, 1]), array([2, 1, 1, 2]))

def permutation_test(sample1, sample2, statfunc, permutations=10000):
    """
    Compute the p-value of the observed `statfunc(sample1, sample2)` under
    the null hypothesis where the group membership is randomized.
    """
    # 1. compute the observed value of the statistic for the sample
    obsstat = statfunc(sample1, sample2)
    # 2. obtain the sampling distribution of `statfunc` under H0
    pstats = []
    for i in range(permutations):
        resample1, resample2 = resample_under_H0(sample1, sample2)
        pstat = statfunc(resample1, resample2)
        pstats.append(pstat)
    # 3. compute the p-value: how many `pstat`s are equal-or-more-extreme
    #    than `obsstat`
    tailstats = [pstat for pstat in pstats
                 if pstat <= -abs(obsstat) or pstat >= abs(obsstat)]
    pvalue = len(tailstats) / len(pstats)
    return pstats, pvalue

# usage:
# >>> permutation_test(pricesW, pricesE, statfunc=dmeans)[1]
# 0.0003
```
Comments:
- Pretty tight...
- The only issue is the `[0:len(sample1)]` selection, which is a bit weird, but we can live with that.
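If the `[0:len(sample1)]` slicing bothers us, one alternative sketch uses `np.split` instead (functionally equivalent, just a matter of taste):

```python
import numpy as np

def resample_under_H0(sample1, sample2):
    """
    Same behavior as the version above, but splitting the shuffled
    values with `np.split` instead of manual index slicing.
    """
    values = np.concatenate((sample1, sample2))
    shuffled_values = np.random.permutation(values)
    # np.split at index len(sample1) returns the two pieces directly
    resample1, resample2 = np.split(shuffled_values, [len(sample1)])
    return resample1, resample2

# usage:
# >>> resample_under_H0([1,1,1], [2,2,2,2])  # two arrays of lengths 3 and 4
```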