
Creating Bayesian power analysis method #292

Status: Open. Wants to merge 14 commits into base: main.

Conversation

@cetagostini commented Feb 15, 2024

I'm organizing the changes into a cleaner and more up-to-date PR, since it's been a while since I last updated the previous one.

Previous Context #277

Hey everyone!

As we discussed in the issue here, I've added a new function to CausalPy that performs power analysis. This is based on my internal work on estimating effects and conducting sensitivity analysis within Bayesian frameworks.

What will you find?

  • Notebook
  • Code example

I hope this explanation makes sense. Ideally, I'd like to have a few rounds of feedback and answer any questions you may have before moving on to unit testing.

Looking forward to hearing your thoughts!

cc: @drbenvincent

ReviewNB bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

codecov bot commented Feb 15, 2024

Codecov Report

Attention: Patch coverage is 29.16667%, with 85 lines in your changes missing coverage. Please review.

Project coverage is 73.34%. Comparing base (00dc744) to head (30eef2d).

File: causalpy/pymc_experiments.py - patch coverage 7.60%, 85 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #292      +/-   ##
==========================================
- Coverage   77.10%   73.34%   -3.76%     
==========================================
  Files          21       21              
  Lines        1380     1493     +113     
==========================================
+ Hits         1064     1095      +31     
- Misses        316      398      +82     

☔ View full report in Codecov by Sentry.

@cetagostini (Author) commented Feb 15, 2024

Hey @drbenvincent

According to what we discussed in the previous PR, I have made the following changes:

  1. Delete rounding from functions.
  2. Adjust the docstrings to be rendered correctly.
  3. Remove mentions of internal methods.
  4. Change the name of the methods.

The power analysis method should be universal across the different methods we use. The idea is to validate that our regression (regardless of type) does not capture any effect during a period where no effect is present, and then use the posterior to estimate how large an effect would have to be for us to detect it. We can also use this to check the model's natural bias in its estimates, by checking how far reality diverges from the region of highest density.
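To make that idea concrete, here is a minimal sketch (synthetic numbers only, not the CausalPy API) of how a placebo window with no intervention can be used to read off roughly how large an effect would need to be before the model could detect it:

import numpy as np

rng = np.random.default_rng(42)

# Stand-in for posterior predictive draws of mu over a placebo window with no
# intervention, shape (n_draws, n_obs); in practice these come from the fitted model.
posterior_mu = rng.normal(loc=100.0, scale=2.0, size=(4000, 30))
observed = rng.normal(loc=100.0, scale=2.0, size=30)  # reality, no effect present

# Posterior of the average discrepancy between reality and the model during the
# no-effect period; it should be centred near zero if the model is unbiased.
mean_impact = (observed - posterior_mu).mean(axis=1)

alpha = 0.05
lower, upper = np.percentile(mean_impact, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(f"95% no-effect interval: [{lower:.2f}, {upper:.2f}]")

# A rough minimum detectable effect: an intervention would need to shift the average
# outcome outside this band before the model could flag it as an effect.
print(f"approximate MDE: {(abs(lower) + abs(upper)) / 2:.2f}")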

@drbenvincent (Collaborator) commented Feb 23, 2024

Cool. Sorry for the slow reply, only just getting a chance to look at this now.

We lost a few changes I made with this new PR. So for example examples.rst needs to be updated in order for the new notebook to be included in the docs.

Will try to do a more thorough review today

@drbenvincent (Collaborator) left a comment:

Thanks for the call today. Here are some thoughts:

Because this could be applied to multiple situations, not just synthetic control, I think it could be a good idea to have a new section of the docs. It would look something like this:

  • Bayesian Power Analysis
    • Explanation: This would be a new notebook which is mostly just text and explanation of the core logic of what happens in Bayesian Power Analysis. If there is a schematic image, then that would be great, otherwise I could come up with one in a later PR after this is merged. This could also include a couple of references for people who want to find out more. It could also be good to have a brief explanation of how this is similar or different to frequentist power analysis.
    • Synthetic control example: This would be the current notebook you've put together. Could add a bit more explanation of what the mu and cumulative mu values correspond to
    • Other examples: At a later date, you, me, or some other contributor could put together examples using power analysis with other quasi experiment types.

Could potentially add in a warning that the results may be biased when using the default flag of correction=False. I'm happy for that to be the default still.

If we think this is actually very general, and could be applied to any of the quasi-experimental methods, then we'd want to remove the core methods out of the PrePostFit class. It might take a bit of thinking about how best to implement it. For example, if we are lucky then we could just create a single BayesianPowerAnalysis class which gets fed data and the experiment object. But it could be that there are experiment-specific nuances, in which case we'd want to just have methods available under the specific experiment classes. Happy to chat about this on a call - also happy to just get it merged and add a warning that this is experimental and that the API may change in the future.
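For what it's worth, a rough sketch of what that "single class fed the data and the experiment object" option might look like (the name, attributes, and methods here are hypothetical, not an agreed API):

from dataclasses import dataclass
from typing import Any

import pandas as pd


@dataclass
class BayesianPowerAnalysis:
    """Hypothetical wrapper around a fitted CausalPy experiment object."""

    experiment: Any             # e.g. a fitted PrePostFit / SyntheticControl instance
    placebo_data: pd.DataFrame  # pre-intervention window used as the no-effect period
    alpha: float = 0.05
    correction: bool = False

    def summary(self) -> pd.DataFrame:
        # Would delegate to the experiment's stored posterior samples, along the
        # lines of the PR's _power_estimation logic.
        raise NotImplementedError

    def plot(self) -> None:
        # Would show the posterior of mu over the placebo window vs. observed values.
        raise NotImplementedError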

@cetagostini (Author) commented:
@drbenvincent

  1. I added a few extra paragraphs to explain everything a bit more.
  2. I added a few notes about what mu means.
  3. I added alerts mentioning that the API can change in the future.
  4. I added a note in the notebook about correction=False.
  5. I'll try to create the schematic image; still in progress.

Let me know if you think something else is missing. Once it's merged, I'll dedicate some hours to creating/adding other notebooks.

@drbenvincent (Collaborator) left a comment:

Can you run make uml to rebuild the UML diagrams, see instructions here https://github.com/pymc-labs/CausalPy/blob/main/CONTRIBUTING.md#overview-of-code-structure

I think the following line might be too strong - at the moment I think it's clear that we can apply to quasi-experimental designs with a pre and a post intervention period (i.e. synthetic controls and interrupted time series). It's not immediately obvious if it can be applied to other methods like ANCOVA or diff in diff or regression discontinuity. So we could maybe just say that this notebook will illustrate its use in the context of synthetic control.

Our proposed power analysis method is designed to be universally applicable across different regression models in causalpy

Can you change the notebook title to "Bayesian Power Analysis for Synthetic Control"?

In the notebook can you be more explicit what you mean by "this" where you write "we should see this difference approach zero"

Where you say: "It’s evident that the model fails to accurately represent reality. Even during periods without direct actions affecting the target, the actual values significantly deviate from the model’s predicted distribution."... can you just expand on this a bit. What is it about the image that shows this? Are we always in a position where we know the real mean, and is this only useful when we do?

Where you say: "In this scenario, the likelihood of our observed value falling within the model’s posterior distribution is extremely low"... can you say what about the summary table tells you this?

The Bayesian Tail Probability seems like a really important thing when interpreting the summary table. But at the moment there's no explanation about what this means. Could you add an explanation either in the introduction or when Bayesian Tail Probability first comes up?

[Resolved review threads on causalpy/pymc_experiments.py and docs/source/notebooks/power_analysis.ipynb]
Requested by @drbenvincent

Co-Authored-By: Carlos Trujillo <59846724+cetagostini@users.noreply.github.com>
@cetagostini linked an issue Mar 10, 2024 that may be closed by this pull request
@cetagostini added the enhancement (New feature or request) label Mar 10, 2024
@cetagostini (Author) commented:
Hey @drbenvincent

I'm going over your comments:

Can you run make uml to rebuild the UML diagrams, see instructions here https://github.com/pymc-labs/CausalPy/blob/main/CONTRIBUTING.md#overview-of-code-structure

Done!

I think the following line might be too strong - at the moment I think it's clear that we can apply to quasi-experimental designs with a pre and a post intervention period (i.e. synthetic controls and interrupted time series). It's not immediately obvious if it can be applied to other methods like ANCOVA or diff in diff or regression discontinuity. So we could maybe just say that this notebook will illustrate its use in the context of synthetic control.

Done!

Can you change the notebook title to "Bayesian Power Analysis for Synthetic Control"?

Done

In the notebook can you be more explicit what you mean by "this" where you write "we should see this difference approach zero"

Sure, now it reads: we should see the difference between the real data and the estimated inference approach zero, indicating that our method has accurately captured the pre-intervention scenario.

Where you say: "It’s evident that the model fails to accurately represent reality. Even during periods without direct actions affecting the target, the actual values significantly deviate from the model’s predicted distribution."... can you just expand on this a bit. What is it about the image that shows this? Are we always in a position where we know the real mean, and is this only useful when we do?

Done, now looks like:

:::{note}
It is important to remember that in this methodology, the **power analysis must be executed in a period of the same duration before the intervention** and we have the knowledge that no other action is affecting our target variable (`mu`).

If we do not have pre-intervention data, we cannot make this comparison of the model performance prior to the intervention, losing insight into its power.

Additionally, if during the period prior to the intervention our assumption about other activities affecting the target variable is not maintained, we will not be able to trust the results.
:::

Where you say: "In this scenario, the likelihood of our observed value falling within the model’s posterior distribution is extremely low"... can you say what about the summary table tells you this?

The Bayesian Tail Probability seems like a really important thing when interpreting the summary table. But at the moment there's no explanation about what this means. Could you add an explanation either in the introduction or when Bayesian Tail Probability first comes up?

Done, now I added the following:
We can observe this from the low value of the Bayesian tail probability. How does the Bayesian tail probability work? Think of our posterior distribution as a range of possible values that we might see, with the mean value representing the most probable outcome. We can then evaluate the probability that a new value belongs to this distribution by measuring how far it deviates from the mean.

If a value sits precisely at the mean, it has a probability of 1 of falling within our posterior. As the value moves away from the mean towards either tail of the distribution, the probability decreases and approaches zero. This lets us determine how 'typical' or 'atypical' a new observation is, based on the model's estimated posterior, and it is an effective tool for interpreting and understanding the model's inference.
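As a rough illustration of that idea (a simplified, sample-based stand-in, not the PR's compute_bayesian_tail_probability implementation):

import numpy as np

def tail_probability(posterior_samples: np.ndarray, value: float) -> float:
    """Two-sided tail probability of `value` under the empirical posterior:
    about 1 when the value sits at the posterior mean, approaching 0 in the tails."""
    distance = abs(value - posterior_samples.mean())
    return float(np.mean(np.abs(posterior_samples - posterior_samples.mean()) >= distance))

posterior = np.random.default_rng(0).normal(100, 2, size=10_000)
print(tail_probability(posterior, 100.0))  # ~1.0: value at the mean
print(tail_probability(posterior, 108.0))  # ~0.0: value four standard deviations away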

cetagostini-wise and others added 2 commits March 10, 2024 21:39
Co-Authored-By: Carlos Trujillo <59846724+cetagostini@users.noreply.github.com>
Co-Authored-By: Carlos Trujillo <59846724+cetagostini@users.noreply.github.com>
@drbenvincent (Collaborator) left a comment:

Thanks for the changes so far. I think it would be good to get some others' thoughts on this, so I've requested that @NathanielF take a look too. But here are some thoughts, mostly about clarity...

  • change treatment_time to actual_treatment_time throughout
  • change test_time to fake_treatment_time throughout

Under "Evaluating the model" section, I find the wording confusing.

  • Can you clarify what model intervals we are talking about? The HDI intervals of the causal impact?
  • When you talk about the errors, clarify again that these are between the model retrodiction and the hold-out data. And these residuals are shown in the causal impact plot, to the right of the red line? Or also the cumulative being credibly non-zero?
  • "we can see that the model based on these regressors is systematically failing" Explain why. What is it about the plot that shows us this?

Under the first "Power Analysis" section

  • Would be a good point to introduce what the alpha kwarg is here. E.g. are we talking about the same alpha as here: https://en.wikipedia.org/wiki/Power_of_a_test ? Could be good to link to that if so, but again to clarify whether the interpretation is the same as or different from the frequentist setting.

Under "Evaluating the model"

  • I'm still a bit confused about the logic of what's happening in _power_estimation. The mean operates on the outcome variable itself, so what seems to be going on is that you're calculating the mean outcome in the hold-out period, like what's shown in the image below (see the magenta bar I added). You could easily have a situation where the model predictions are completely off from the data but they have the same mean values; for example, the predictions could look like a sine wave while the data look like a cosine wave (a tiny illustration of this follows the screenshot). I don't know if this is a major problem or not, but I think this is an area of the approach where there is some uncertainty about why it's done this way. Would it make more sense to instead calculate a different statistic that is more sensitive to this, like the sum of squared residuals (i.e. operate on the residuals / causal impact) rather than on the raw data space? Maybe this is a good idea, maybe this is a silly idea. Either way, I think there needs to be some more explanation and justification of why this works the way it does. Has this kind of approach been done before?

[Screenshot, 2024-03-15: causal impact plot with a magenta bar added to mark the mean outcome in the hold-out period]
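A tiny self-contained illustration of that concern (synthetic data; the mean-based statistic is blind to the mismatch, while a residual-based one is not):

import numpy as np

t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
data = np.cos(t)        # "reality" in the hold-out window
prediction = np.sin(t)  # model prediction that is completely out of phase

# Both series have (essentially) the same mean, so a mean-based check passes...
print(round(data.mean(), 6), round(prediction.mean(), 6))  # ~0.0 and ~0.0

# ...while the sum of squared residuals clearly flags the mismatch.
print(round(np.sum((data - prediction) ** 2), 2))  # 100.0 here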

Unfortunately, time-wise I have to leave the review at this point. Looking forward to coming back to it, but feel free to respond to the current points in the meantime if you have time.

If you also update from main it should also help remove some of the warnings that you're getting in the notebook as we've merged some bug fixes recently.

@NathanielF (Contributor) left a comment:

I'm conscious that I'm late to the party in reviewing this PR, but my broad feeling is that we need a good bit more clarity about what we are doing in this setting, for two reasons: (i) the docs should ideally be a sufficient resource for people who want to dig into the methods, and (ii) we need to work very hard to avoid conflation with "classical" power analysis and its terminological jargon.

For instance:

We have introduced a power analysis method in a Bayesian framework that provides a systematic approach to determine the sensitivity of a model in detecting expected effects.

This sounds good, but is too vague. Sensitivity to what? For what?

The method involves creating a null model that does not capture any effect during a period where no effect is present.

Reference to the null model here has to be more explicit in the docstrings and notebook as applied to particular experiment designs i.e. the Pre-Post design.

This validation process not only aids in determining the required effect size for significance

Is this how we're defining power? It mixes effect-size language and significance terminology, which is confusing. If we have a null model, are we assessing an alternative model? Is the goal to reject the null or to determine superiority after treatment? Bayesian power analysis is well defined in the literature - see e.g. Spiegelhalter, Bayesian Approaches to Clinical Trials and Health-Care Evaluation. How does this methodology relate to that one, which aims to gauge power?

In contrast, Bayesian approaches use the posterior distribution to estimate the probability of various outcomes

This is the heart of it, I think. We're doing too much. Better to focus on one simple method of evaluating power using the posterior. Dig in deeply on this one example and then gesture towards other metrics we can calculate using Bayesian posterior-based power analysis. Ideally state the goal here: "In this notebook we intend to evaluate the sensitivity of the Bayesian pre-post model to interventions... we will show XYZ for a particular metric M. This will inform us about the power of our experimental design to precisely estimate conclusions regarding the treatment effect."
Maybe even contrast the richness of this perspective with a narrower frequentist phrasing of the power analysis.

The whole "getting started" section was lost from the notebook.

Essentially, if the model accurately represents our reality, and we'll base our decision on the model intervals. By grasping these intervals, we can ascertain how significant an effect must be

... must be for what?

Furthermore, we can view the quantiles and determine the minimum detectable effect (MDE)

I think you need to be clearer on what MDE means in this context. MDE relative to what hypothesis?

Something I don't understand: about the first plot, you're using filtered data, i.e. before the real intervention, so you're just assessing out-of-sample model fit for the 10 observations preceding the treatment? I think the fit looks pretty good on the outcome scale? Not sure why we should think it's systematically failing.

The table output is unclear to me. Not sure how it relates to power analysis exactly. I think you're doing something more like a coverage test for a prediction interval?

Kruschke has a good paper on Bayesian power analysis here: https://link.springer.com/article/10.3758/s13423-016-1221-4

I think we need to be clearer about what kind of analysis we're doing here and be more explicit in the doc-strings and the notebook.


def _power_estimation(self, alpha: float = 0.05, correction: bool = False) -> Dict:
"""
Estimate the statistical power of an intervention based on cumulative and mean results.
Review comment (Contributor):

Ok, this is defined as a sub-function under the Pre-Post design class, which (I understand) is intended to apply to multiple experiment designs that avail of this structure. As such, the docstring should be very clear on the nature of the power calculation being done.

As it stands the docstring tries to cover too much and does so too quickly, mixing technical concepts like confidence intervals and MDE which are more naturally stated in the frequentist paradigm. Additionally, the function itself is overloaded, which makes it hard to parse what is going on.

Not saying you should introduce the background material on Bayesian power analysis here, but the docstring should say something about how the structure of the pre-post design informs the kind of power analysis we are conducting, i.e. we're aiming to gauge mean changes in the posterior over time...

It can apply corrections to account for systematic differences in the data.

This is too vague. What corrections?

It's not even really clear what we mean by "statistical power" in this context.

[Resolved review thread on causalpy/pymc_experiments.py]
"cumulative": (cumulative_upper_mde + cumulative_lower_mde) / 2,
"mean": (mean_upper_mde + mean_lower_mde) / 2,
}
return results
Review comment (Contributor):

Consider splitting these steps into smaller functions with individual docstrings describing the metric you're calculating in each case.

correction (bool, optional): If True, applies corrections to account for systematic differences in
cumulative and mean calculations. Default is False.

This really needs clarifying with respect to the particular type of power analysis you're running on the pre-post design structure.

Reply:

Agreed, but I think this is better described in the document than in the docstring; otherwise it can be too much.

"""
Calculate and summarize the intervention analysis results in a DataFrame format.

This function performs cumulative and mean calculations on the posterior predictive distributions,
Review comment (Contributor):

Again, the docstring is overloaded and unclear. It's not clear why the Pre-Post design is relevant. Is this a summary of the intervention, or just of the intervention with respect to the power analysis we're running?

But now we also have reference to Bayesian tail probabilities, posterior estimations and causal effects. Too much going on.

If we're using particular Bayesian tail probabilities because we have assumed a particular likelihood distribution, we should clarify this here.

[Resolved review thread on causalpy/pymc_experiments.py]

Returns
-------
- pd.DataFrame: A DataFrame representing the power estimation results, including posterior estimations,
Review comment (Contributor):

Again too vague, since we haven't clarified beforehand what the goal of Bayesian power analysis is in this Pre-Post design context.

Reply:

Here, I believe the description of the power analysis can be a reference to the document rather than added directly. What do you think?

@NathanielF (Contributor) commented Mar 17, 2024

Hi @cetagostini

Sorry, I wrote a load of comments on the classes and methods before getting through the notebook fully. I think I realised at the end of the notebook that what you're doing is neither "classical" power analysis nor Bayesian power analysis, but a kind of model-adequacy check for out-of-sample prediction coverage? I think this is useful as a means of evaluating models, but the reference to power analysis confused me throughout.

I've added some references to Bayesian power analysis - but I think the heart of it comes down to a statement about the sensitivity of a Bayesian model to either priors or sample data in the precision of estimating a quantity using the posterior.
You could potentially treat your out-of-sample predictions under different model settings as a kind of sensitivity check, but I think this is more a model-adequacy check than a power analysis. Is that the right reading?

[screenshot omitted]

https://solomonkurz.netlify.app/blog/bayesian-power-analysis-part-i/

@cetagostini-wise commented Mar 18, 2024

Hey team, thanks a lot for all the feedback 👍

@drbenvincent I'll go over your feedback, probably next week 🤞🏻 Looks pretty straightforward.

@NathanielF I just did a quick check and read the article here, and I feel we are pretty aligned. I agree we have a mix of concepts and ramifications about how to use it, which can create confusion. We can jump on a call next week; let me know if you have time in your agenda!

If I understand the proposal from John K. Kruschke & Torrin M. Liddell correctly, they establish a model in which, from the posterior, a region is defined that they consider significant or insignificant.

The procedure requires establishing a region of practical equivalence (ROPE) around the null value that expresses a range of parameter values that are equivalent to the null value for current practical purposes. For example, if measuring the IQ of people in a treatment group, we might say that any group mean in the interval from 98.5 to 101.5 is practically equivalent to the general population mean of 100 (recall that the standard deviation of IQ scores in the general population is 15, hence the ROPE in this case extends plus or minus 0.1 standard deviations from the general population mean)

Our objective is to establish a model utilizing a methodology, such as synthetic controls, to ascertain the feasible values of our target variable. Then we define a ROPE (in this case we use the alpha parameter for that) and, based on the real value of our target variable, we evaluate whether to consider it significant or not.

For example, in our ongoing example about the probability of agreement with a policy statement, suppose we are interested in whether or not the population can be said to be ambivalent (i.e., 𝜃=0.50), and for practical purposes of decision making we define a ROPE from 𝜃=0.45 to 𝜃=0.55. From the posterior distribution in Figs. 3 and 4, we can make statements as follows: “The 95 % most credible values are all not practically equivalent to the null value (i.e., the 95 % HDI excludes the ROPE)”, and, “there is only 2.4 % probability that 𝜃 is practically equivalent to the null value (i.e., 2.4 % of the posterior distribution falls within the ROPE)”. The area of the posterior distribution inside a ROPE is easily computed but is not displayed in Figs. 3 and 4 (but is displayed later in Fig. 11). A different Bayesian decision rule for equivalence examines only the posterior probability mass inside the ROPE, without regard to the HDI (e.g., Wellek, 2010) but we prefer to treat the probability density as a meaningful quantity that better handles ske…

Following the statement from the paper, we could recreate that scenario:
[screenshot omitted]

From my perspective, the approach appears to be performing identical actions. However, instead of executing it on a particular parameter, we use the entire target variable (mu).
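A small sketch of that reading of the approach, using synthetic numbers (the ROPE bounds are the ones from the quoted Kruschke & Liddell example, and the "posterior" here is just a stand-in for the posterior of mu):

import numpy as np

rng = np.random.default_rng(1)
posterior_mu = rng.normal(loc=0.51, scale=0.03, size=10_000)  # stand-in posterior for mu

# Region of practical equivalence around the null value theta = 0.50.
rope = (0.45, 0.55)

prob_in_rope = np.mean((posterior_mu > rope[0]) & (posterior_mu < rope[1]))
lower, upper = np.percentile(posterior_mu, [2.5, 97.5])  # central 95% interval

print(f"P(mu practically equivalent to the null) = {prob_in_rope:.1%}")
print(f"central 95% interval: [{lower:.3f}, {upper:.3f}] vs ROPE {rope}")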

Now, I do this process for the average and cumulative values because it is my way of working, inherited from Causal Impact (we can probably do it in a better way), but I think it is the way most marketers who are used to those tools see it. Therefore, the table of results that I generate is basically a mirror of the one generated by Google's Causal Impact.

## Posterior inference {CausalImpact}
## 
##                          Average        Cumulative  
## Actual                   117            3511        
## Prediction (s.d.)        107 (0.37)     3196 (11.03)
## 95% CI                   [106, 107]     [3174, 3217]
##                                                     
## Absolute effect (s.d.)   11 (0.37)      316 (11.03) 
## 95% CI                   [9.8, 11]      [294.9, 337]
##                                                     
## Relative effect (s.d.)   9.9% (0.35%)   9.9% (0.35%)
## 95% CI                   [9.2%, 11%]    [9.2%, 11%] 
## 
## Posterior tail-area probability p:   0.001
## Posterior prob. of a causal effect:  99.9%

Now, this is the main and only objective. The primary aim of your modeling exercise is to gauge the power of your models based on a minimum detectable effect (MDE). Once you have evaluated the MDE of your models, you can focus on selecting the models and their regressors to increase their power and identify any systematic biases. However, these activities are secondary goals, not the main focus of the analysis.

Ultimately, the key goal is to determine the MDE of your regression model.

Let me know what you think, I'd be happy to make time during the week to talk, I think it's easier to deal with doubts that way 💪🏻

@cetagostini-wise commented Mar 18, 2024

Final comment @NathanielF: as I see it, we are perhaps aligned on the method (I will double-read your resources to be extremely sure), but perhaps we should align better on the definitions and words used in the notebook!

@drbenvincent Any extra comments would be great to hear; happy to discuss the idea and see if there is value. I particularly think we could use this addition to help users synergize between pymc-marketing and CausalPy. Not to mention that it is a feature that, until now, I have not found in the most widely used tools (e.g. Google's); the only one that has it now is Meta's.

Possibly within a week or two, I might be ready again 😅

@NathanielF (Contributor) commented:
Thanks @cetagostini, yeah I think we need to be quite clear about the language. Happy to chat soon.

I had a quick read of the Causal Impact docs you sent on, and of your focus on the mu variable.

If our focus is on mu, it makes sense to phrase our hypothesis in terms of movements in mu, the expected value of our posterior predictive distribution. But in the Causal Impact docs they focus on a hypothesis test to validate that the impact of the intervention is greater than chance, i.e. that their CI for the mean difference in mu pre and post intervention doesn't contain 0.

This is totally a fine procedure to implement, but it is just a hypothesis test, not a power analysis by my understanding.

A power test about movements in mu would have to say something about how the change in mu is sensitive to sample size or prior strength, i.e. how many post-treatment measurements we consider in the evaluation of the mean-difference hypothesis test. This could be useful, as it would help pin down how many observations (draws from the post-intervention model) it would take for us to be able to quickly claim an established mean difference in mu greater than the minimum effect of interest.
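A quick Monte Carlo sketch of that kind of question, with made-up numbers for the true lift, noise level, and decision rule (a crude flat-prior posterior for the mean difference, not the PR's method):

import numpy as np

rng = np.random.default_rng(7)
true_lift, noise_sd, min_effect = 3.0, 5.0, 2.0  # all assumed values

for n_obs in (5, 10, 20, 40):  # number of post-treatment measurements
    detections = 0
    for _ in range(500):
        post_obs = rng.normal(true_lift, noise_sd, size=n_obs)
        # crude posterior for the mean difference: Normal(sample mean, noise_sd / sqrt(n))
        draws = rng.normal(post_obs.mean(), noise_sd / np.sqrt(n_obs), size=2_000)
        detections += np.mean(draws > min_effect) > 0.95  # decision rule
    print(f"n_obs={n_obs:>3}: proportion of runs declaring lift > MDE = {detections / 500:.2f}")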

But yeah, happy to jump on a call when you have time. I just think we need to be really careful with language, because hypothesis-testing language is already such a mess with frequentist concepts in the mix.

@cetagostini (Author) commented:
You could easily have a situation where the model predictions are completely off from the data but they have the same mean values; for example, the predictions could look like a sine wave while the data look like a cosine wave. I don't know if this is a major problem or not, but I think this is an area of the approach where there is some uncertainty about why it's done this way.

@drbenvincent This could happen, but for the model to make a prediction of that type and end up in this case, the user would have to choose regressors that follow that pattern. Which would be very strange, because they would not correlate with the target variable, so why would I use them?

I think it is a fairly isolated case that, in reality, has never presented itself to me.

@cetagostini (Author) commented Apr 2, 2024

Would it make more sense to instead calculate a different statistic that is more sensitive to this, like the sum of squared residuals (i.e. operate on the residuals / causal impact) rather than on the raw data space? Maybe this is a good idea, maybe this is a silly idea. Either way, I think there needs to be some more explanation and justification of why this works the way it does. Has this kind of approach been done before?

@drbenvincent I'm not sure if I've seen this approach anywhere else. The reason for working in raw space rather than causal space is that it is easier for me to explain 😅, but we can add an option for the user to decide how they want to do it. What do you think?

Plus, Causal Impact usually talks in raw space in its output, and I got used to it (you can see the previous output).

@cetagostini (Author) commented Apr 2, 2024

A power test about movements in mu would have to say something about how the change in mu is sensitive to sample size or prior strength

@NathanielF Exactly, this is what we can determine with the current process. After running the first model and seeing how it responds during the period prior to the intervention, where no effect should be present, we can determine:

  1. Should we generate more samples?
  2. Should we increase the number of chains?
  3. What is the best set of regressors?
  4. What is the best number of observations to train?

Iterating will allow us to answer those questions, finding the best answers to decrease the MDE of our model.

@cetagostini (Author) commented:
@NathanielF

As we discussed, I added descriptions in the example document where I try to reframe the idea as a sensitivity analysis that becomes a power analysis if it is performed before the intervention and is used to determine what actions should be taken to increase the power of the method.

Even so, I'm trying to organize this new framing because I have always imagined it as a step prior to the intervention, and consequently it was always a power analysis to me!

@cetagostini (Author) commented:
@NathanielF @drbenvincent

I made some adjustments and responded to some of your comments; let me know if there's anything extra to add!

@NathanielF (Contributor) commented Apr 19, 2024

Feedback as I read through the notebook

"...for it to be considered significant." Unpack "it" a bit. The treatment effect?

Typo: "methon" should be "method" in "Using this methon we can answer".

In this section: "Similarities and Differences to Frequentist Methods" would it be worth highlighting the relation of this idea to the one you showed me in the other causal impact package?

In the plot can we change "real mean" to actual post-intervention mean. Or something similar to denote that it reflects the realisations of the observed data after the pseudo intervention. Then also label the predicted distribution appropriately to convey it is the predicted distribution for the "future" period.

Similarly in the table the results column needs to be clearer. Are these the observed mean values?

Naming of the functions power_summary and summary needs to be better. The output is too similar, so the names should distinguish them a bit better, I feel. Not sure what to call them though.

In the "how it works" section I think you need to be clearer that you are using a time-series null model and introducing a pseudo treatment date. The idea is that the model should be able to fit the pre-period and predict well in the post-period. You're conducting this exercise with pseudo treatment dates as a preparatory robustness check on the model specification that you want to use in the actual experiment involving a genuine treatment or intervention. Maybe also make a note about why the choice of pseudo treatment date is not arbitrary, or how you can bootstrap this style of analysis with random pseudo treatment dates to provide the sensitivity check (a small sketch of that idea follows). Might have to look into block-bootstrap methods for this.
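For the random pseudo-treatment-date idea, a toy sketch of the loop (the naive mean forecast here is only a placeholder for refitting the real CausalPy model at each pseudo date):

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy pre-intervention series; in practice this is the real pre-period data.
dates = pd.date_range("2023-01-01", periods=200, freq="D")
y = pd.Series(50 + np.cumsum(rng.normal(0, 1, size=200)), index=dates)

def placebo_impact(pseudo_date: pd.Timestamp, horizon: int = 14) -> float:
    """Placeholder for: fit on data before pseudo_date, predict the next `horizon`
    days, and return the average 'impact' (here just error vs a naive forecast)."""
    train = y[y.index < pseudo_date]
    test = y[y.index >= pseudo_date][:horizon]
    return float((test - train.tail(30).mean()).mean())

# Repeat the placebo exercise over several randomly chosen pseudo treatment dates.
candidates = dates[60:-14]
chosen = rng.choice(len(candidates), size=10, replace=False)
impacts = [placebo_impact(candidates[i]) for i in chosen]
print(np.round(impacts, 2))  # should scatter around zero if the model spec is adequate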

Then outline the intended structure of the notebook i.e. we will first fit a basic model with a fixed pseudo treatment date showing the inadequacy of this model spec in pre-period data. Then we will improve the model and again evaluate using pseudo treatment date in the pre-period data. Once we're happy with this model we will apply the modelling exercise to the real treatment date.

I would consider cutting the bias estimation part altogether. There is quite a bit going on that is hard to follow. Additionally, I'd consider whether you can use simulated data to help better signpost.

Generally I think I understood the process better this round, but the naming conventions (e.g. Bayesian tail probability) are just too opaque and the notebook needs more signposting: "we're doing this step now because...", etc. Much improved though.

Will try and look at the code soon too

@drbenvincent (Collaborator) left a comment:

General or code comments

  • Running make check_lint shows a number of style violations. Running make lint should auto fix all this with ruff.
  • We need the new code in pymc_experiments.py to be covered by tests. Perhaps a good way of doing that is with an integration test (see test_integration_pymc_examples.py for examples); a rough sketch of such a test follows this list. You could maybe base it around the same example you use for the illustrative example notebook.
  • As far as I can tell, we've got the two public methods plot_power and power_summary which call _power_estimation. So I'm just double checking that the outputs of _power_estimation are deterministic (given the existing stored mcmc samples) right? Can you just confirm that this is the case? Otherwise we might get slightly different results coming out of plot_power and power_summary.
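Rough sketch of what such an integration test could look like. The dataset name, formula, treatment time, and sampler settings below are modelled on the existing synthetic-control examples and are assumptions that may need adjusting; power_summary and plot_power are the new methods from this PR:

import causalpy as cp


def test_synthetic_control_power_analysis():
    df = cp.load_data("sc")
    result = cp.pymc_experiments.SyntheticControl(
        df,
        treatment_time=70,
        formula="actual ~ 0 + a + b + c + d + e + f + g",
        model=cp.pymc_models.WeightedSumFitter(
            sample_kwargs={"tune": 100, "draws": 100, "chains": 2, "progressbar": False}
        ),
    )
    power_df = result.power_summary(alpha=0.05, correction=False)  # new method under test
    assert not power_df.empty
    result.plot_power(alpha=0.05)  # should run without error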

Comments about the example notebook

  • Introduction paragraph is nice. When you say "We have introduced a..." do you want any attribution there to yourself, or to a team who developed these ideas? Otherwise the wording could change to something like "We outline an approach..." and again, I think it could benefit from adding a reference to similar kinds of ideas or approaches that are out there.
  • Clarity change: "The method involves creating a null model that does not capture any effect during a period where no effect is present" -> "The method involves creating a null model of the data before any intervention takes place."
  • "By analyzing the posterior distribution derived from this null model" Can you be more specific... it's not the whole posterior right, so you could clarify if it's a summary statistic (if so, what).
  • The "How it works" section is a great addition. But in the list, the samples and chains seems kind of boring and technical rather than consequential to doing an experiment. Maybe this section can be slightly tweaked to focus on the big important issues. As in, what can you as an experimenter gain from running a Bayesian power analysis before you run your experiment? Does it allow you to have more confidence in your conclusion? Do we need to talk about different kinds of errors (missing an actual effect, false alarm)?
  • The "Similarities and Differences to Frequentist Methods" section is nice. But I think it would be really strong if a number of links to wikipedia (or references) are added.
  • Not quite sure why you're still getting the "Model.model property is deprecated" warnings. Having updated from main, this shouldn't happen any more. Maybe the notebook wasn't re-run after doing that?
  • Where you say "the power analysis must be executed in a period of the same duration before the intervention", what do you mean by "same duration"? Are you just trying to say that we need to run the power analysis before any intervention?
  • I still think there's some more clarity to be had about the mean vs the cumulative. In both cases we're referring to the causal impact (difference between model prediction and reality) right? Spelling that out could make it clearer for readers.
  • At the end of the notebook, can you do the regular results.summary before you do the modified intervention summary (see note on the code about having this as a separate method).
  • Toward the end of the notebook I think the numbers you have in the text get a bit out of sync with the actual result values. The outcomes should be reproducible given the seed is provided, so can we update the values in the text?

Summary

So can we just summarise the core logic here and go through a few questions. What we did was basically:

  1. Fit a model on pre-intervention data. This model had poor predictive ability on hold-out data. So we...
  2. Fitted a better model (more predictors) on the same pre-intervention data. The power analysis shows a distribution of possible causal impact values that we'd expect to see in the absence of an intervention effect. Correct?
  3. We then ran the intervention, fitting a model on this full pre and post-intervention data.

I was expecting more of a comparison of the posterior observed causal effect with the predicted distribution of no effect from step (2). The idea being that we can maybe say that the observed effect falls way outside of what we'd expect from the control / no effect model from step (2). Is that right, or is this not how this works? Adding some clarification around this would help me and likely a lot of other readers :)
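To check my own understanding of that comparison, something along these lines (purely synthetic numbers standing in for the two distributions):

import numpy as np

rng = np.random.default_rng(5)

# Step (2): distribution of average "impact" values the model produces when no
# intervention is present (from the placebo / power-analysis exercise).
null_impacts = rng.normal(0.0, 1.5, size=4_000)

# Step (3): posterior of the average causal impact estimated after the real intervention.
observed_impacts = rng.normal(6.0, 1.2, size=4_000)

# How much of the observed-impact posterior lies beyond what the no-effect model
# would plausibly produce?
upper_null = np.percentile(null_impacts, 97.5)
print(f"P(observed impact > 97.5th percentile of the null) = {np.mean(observed_impacts > upper_null):.2f}")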

@@ -17,6 +17,7 @@ Synthetic Control
notebooks/sc_skl.ipynb
notebooks/sc_pymc_brexit.ipynb
notebooks/geolift1.ipynb
notebooks/power_analysis.ipynb
Review comment (Collaborator):

This doesn't match up with the actual notebook name, so the docs won't render this page currently.

@@ -44,3 +50,45 @@ def test_round_num():
assert round_num(123.456, 5) == "123.46"
assert round_num(123.456, 6) == "123.456"
assert round_num(123.456, 7) == "123.456"


def test_compute_bayesian_tail_probability():
Review comment (Collaborator):

We've got multiple different scenarios here, each of which should be a different test. But for each test, we can have multiple asserts, like the output is a scalar float, between 0 and 1, etc
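For example, something like the following split (the import path is a guess based on where the helper sits in the diff, next to _format_sig_figs; the exact expected values would depend on the final implementation):

import numpy as np

from causalpy.utils import compute_bayesian_tail_probability  # assumed location


def test_tail_probability_at_mean():
    posterior = np.random.default_rng(0).normal(0, 1, 1_000)
    prob = compute_bayesian_tail_probability(posterior, 0.0)
    assert isinstance(prob, float)
    assert 0 <= prob <= 1
    assert prob > 0.9  # value at the centre of the posterior


def test_tail_probability_far_from_mean():
    posterior = np.random.default_rng(0).normal(0, 1, 1_000)
    prob = compute_bayesian_tail_probability(posterior, 10.0)
    assert isinstance(prob, float)
    assert 0 <= prob <= 1
    assert prob < 0.1  # value deep in the tail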

else:
probability = 2 * (1 - cdf_x + cdf_lower) / (1 - cdf_lower - cdf_upper)

return abs(probability)
Review comment (Collaborator):

I'm getting some type hint issues with abs(probability).

@@ -48,3 +49,29 @@ def _format_sig_figs(value, default=None):
if value == 0:
return 1
return max(int(np.log10(np.abs(value))) + 1, default)


def compute_bayesian_tail_probability(posterior, x) -> float:
Review comment (Collaborator):

Can you add type hints for the function inputs.
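For illustration only, a type-hinted signature with a simplified body (a normal approximation to the posterior; the PR's actual formula with cdf_lower/cdf_upper is not reproduced here):

import numpy as np
from scipy import stats


def compute_bayesian_tail_probability(posterior: np.ndarray, x: float) -> float:
    """Two-sided tail probability of x under a normal approximation to the posterior:
    close to 1 near the posterior mean, approaching 0 far out in the tails."""
    mean, std = float(np.mean(posterior)), float(np.std(posterior))
    z = abs(x - mean) / std
    return float(2 * (1 - stats.norm.cdf(z)))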


results = {}
ci = (alpha * 100) / 2
# Cumulative calculations
Review comment (Collaborator):

Can we change this comment to be more descriptive in terms of what it is doing?

.mu.stack(sample=("chain", "draw"))
.sum("obs_ind")
)
# Mean calculations
Review comment (Collaborator):

Can we change this comment to be more descriptive in terms of what it is doing?


Returns
-------
- Dict: A dictionary containing key statistical measures such as posterior estimation,
Review comment (Collaborator):

Not sure how pedantic I should be, but are we actually getting a nested dictionary out here?

np.percentile(_mu_samples_mean, 100 - ci),
],
}
cumulative_upper_mde = (
Review comment (Collaborator):

Could add a comment here just saying that we are calculating the Minimum Detectable Effect (bounds/intervals?).

"cumulative": cumulative_results,
"mean": mean_results,
}
results["_systematic_differences"] = {
Review comment (Collaborator):

Add a quick comment to explain what this is doing

print(f"Formula: {self.formula}")
# TODO: extra experiment specific outputs here
self.print_coefficients()
elif version == "intervention":
Review comment (Collaborator):

So I'm not 100% sure about this. At the moment we have a summary method for all classes which provides an estimate of the model coefficients. The approach used here is to also call the summary method but change what it does completely by changing a kwarg.

I'm thinking that it might be cleaner to completely separate this and just have a new method called power_analysis_summary, or intervention_summary, whatever.
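Sketch of that suggested split (names follow the reviewer's suggestion and the attributes/methods visible in the diff; this is not the final API):

class PrePostFit:  # illustrative fragment only
    def summary(self) -> None:
        """Print the experiment description and the model coefficients."""
        print(f"Formula: {self.formula}")
        self.print_coefficients()

    def power_analysis_summary(self, alpha: float = 0.05, correction: bool = False):
        """Return the power-analysis table that summary(version="intervention")
        currently produces."""
        return self._power_estimation(alpha=alpha, correction=correction)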

Labels: enhancement (New feature or request)
Projects: none yet
Development: successfully merging this pull request may close these issues: Bayesian Power Analysis
4 participants