
Bimodal Hazard #52

Open
adamhaber-atidot opened this issue Nov 4, 2018 · 4 comments

@adamhaber-atidot commented Nov 4, 2018

The process I'm trying to model using WTTE-RNN has two "typical" churn times, though customers can churn at any other time as well (the typical times are simply more likely a priori). This means that, at least when evaluated at T=0, the hazard rate for future Ts should be bimodal (right?).

Is it somehow possible to tweak/hack the Weibull log-likelihood (discrete version) to represent such a bimodal (and no longer Weibull) hazard rate? Perhaps something like a mixture of Weibulls? Is it possible to compute the loss in that case?

EDIT: For computing the loss, we need both the PMF and the SF. Calculating the PMF of the mixture is easy, but I'm not sure about the SF of the mixture. Is it simply the mixture of the SFs? Or perhaps some sort of convolution between them? Any help would be appreciated...

@ragulpr (Owner) commented Nov 5, 2018

Hi there,
There are actually good ways to model bimodal hazards: the sum of cumulative hazard functions is itself a valid cumulative hazard function; see for example this paper. So if S(y) = exp[-L(y)] is the survival function and L the cumulative hazard (CHF), you could make a new distribution using L(y) = L_1(y) + L_2(y), with each L_k a Weibull CHF with its own parameters. For all the loss functions etc. you thus only need to change the cumulative hazard function.
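To make that concrete, here's a minimal sketch of the construction in PyTorch (the function and parameter names are just illustrative, this isn't in WTTE-RNN):

```python
import torch

def weibull_chf(t, alpha, beta):
    # Weibull cumulative hazard: L(t) = (t / alpha)^beta
    return (t / alpha) ** beta

def summed_sf(t, a1, b1, a2, b2):
    # A sum of CHFs is again a valid CHF, so
    # S(t) = exp(-(L_1(t) + L_2(t))) is a valid survival function
    return torch.exp(-(weibull_chf(t, a1, b1) + weibull_chf(t, a2, b2)))

def summed_pmf(t, a1, b1, a2, b2):
    # Discrete PMF from the SF: P(T = t) = S(t) - S(t + 1)
    return summed_sf(t, a1, b1, a2, b2) - summed_sf(t + 1, a1, b1, a2, b2)
```

If I remember right this is sometimes called a poly-Weibull model; with one beta below 1 and one above, the summed hazard even gets bathtub-shaped.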

I haven't implemented this in WTTE-RNN yet because Keras wants the output dimension to be the same as the "target" dimension, which holds by chance in the case with 2 predicted parameters and 2 "targets" (time to event + censoring indicator). It should also be noted that this may be pretty numerically unstable.

Now to your problem: do you need a bimodal distribution? I would be very, very surprised if you do. While we like to believe that we can predict a Weibull distribution that peaks around the actual TTE, in practice reality is just much too noisy, and the predicted distribution usually has Beta<1 (making the hazard rate decreasing). I get a sense that many people are chasing peakedness of distributions like fool's gold.
So since I'm sceptical about being accurate about one peak, predicting two peaks seems even harder. Anyway, your typical model query is "what's the probability of an event within 30 days", i.e. Pr(Y<30), and that doesn't care so much about peakedness.

I would start by training a minimal, simple model, checking whether it predicts Beta>1 and whether the errors are systematic such that a bimodal distribution could help, and only then try to implement the bimodal version. But if it's for the sake of research and fun, I'd go for it, because it sounds interesting.

As a final note: can you really model churn as an event? I'd argue that what's often observable is the non-event (not buying, not logging in, etc.).

@adamhaber-atidot (Author)

> There are actually good ways to model bimodal hazards: the sum of cumulative hazard functions is itself a valid cumulative hazard function; see for example this paper.

Thanks! Just to make sure I understand:

  1. The log-likelihood of the mixture is log(p_mix(t)^u * s_mix(t+1)^(1-u))
  2. The PMF of the mixture becomes p_mix(t) = r*p_1(t) + (1-r)*p_2(t)
  3. The SF of the mixture becomes the product of the (powered) SFs, since the SF is the exponential of minus the mixed CHF, which is the weighted sum of the individual CHFs: s_mix(t+1) = exp(-r*chf_1(t+1) - (1-r)*chf_2(t+1))

Is that correct? The "independence" assumption of the CHFs seems strange to me...
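As a quick sanity check, I compared the two SFs on toy numbers (arbitrary parameters, just to see the difference):

```python
import torch

t, r = torch.tensor(2.0), 0.3
# Two illustrative Weibull survival functions S_k(t) = exp(-(t/a_k)^b_k)
sf1 = torch.exp(-(t / 1.0) ** 0.8)
sf2 = torch.exp(-(t / 5.0) ** 2.0)

# Classical mixture: the SF is the weighted sum of the SFs
sf_mixture = r * sf1 + (1 - r) * sf2
# Weighted CHF sum: the SF is the product of the powered SFs
sf_chf_sum = sf1 ** r * sf2 ** (1 - r)

print(sf_mixture.item(), sf_chf_sum.item())  # ~0.65 vs ~0.53 here
```

They disagree, so I guess points 2 and 3 can't both hold at once; the PMF and SF have to be derived from the same construction for the likelihood to make sense.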

> I haven't implemented this in WTTE-RNN yet because Keras wants the output dimension to be the same as the "target" dimension, which holds by chance in the case with 2 predicted parameters and 2 "targets" (time to event + censoring indicator). It should also be noted that this may be pretty numerically unstable.

I'm trying to re-implement everything in PyTorch, which (I hope) will make this easier and avoid the various "tf-hacks" in wtte.py.

> Now to your problem: do you need a bimodal distribution? I would be very, very surprised if you do.

Since in reality I have two different peaks, my understanding is that in order to minimize the loss (at T=0), the RNN would "smear" the PMF between the two peaks. I have a feeling this is tossing away a lot of information from the data, and underestimating the hazard at T=0. Does that make sense?

> I would start by training a minimal, simple model, checking whether it predicts Beta>1 and whether the errors are systematic such that a bimodal distribution could help, and only then try to implement the bimodal version. But if it's for the sake of research and fun, I'd go for it, because it sounds interesting.

Definitely. 😄

> As a final note: can you really model churn as an event? I'd argue that what's often observable is the non-event (not buying, not logging in, etc.).

Yes, in my case I think it makes more sense. I get your point in the general case, though (I think...).

@ragulpr (Owner) commented Nov 5, 2018

> 1. The log-likelihood of the mixture is log(p_mix(t)^u * s_mix(t+1)^(1-u))
> 2. The PMF of the mixture becomes p_mix(t) = r*p_1(t) + (1-r)*p_2(t)
> 3. The SF of the mixture becomes the product of the (powered) SFs, since the SF is the exponential of minus the mixed CHF, which is the weighted sum of the individual CHFs: s_mix(t+1) = exp(-r*chf_1(t+1) - (1-r)*chf_2(t+1))

Looks fairly right but this is new territory. I actually don't know 😄.

I haven't looked into classical mixtures like you suggest (i.e. pdf = sum_k p_k f_k with sum_k p_k = 1), since I expect that the log-integral of this, needed for the loss in the discrete or censored case, may be pretty complex. If the point is to make multi-modal distributions, I just think it might be less hassle to do as suggested in the cited paper, i.e. use all the regular formulas for the loss function and everything else, but with a new CHF which is a sum of individual CHFs. As the scale parameters of the individual CHFs will do what the p_k's would be doing, you don't need to think of the weights as separate parameters.
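In the discrete case the whole censored log-likelihood can in fact be written using nothing but the CHF, so the summed CHF really is the only thing to swap in. Roughly (an untested sketch, names are my own):

```python
import torch

def discrete_loglik(chf, t, u):
    # Censored discrete log-likelihood written purely via the CHF:
    #   uncensored (u=1): log P(T = t) = log(exp(L(t+1) - L(t)) - 1) - L(t+1)
    #   censored   (u=0): log S(t+1)   = -L(t+1)
    # chf is any callable returning the cumulative hazard L(t).
    return u * torch.log(torch.expm1(chf(t + 1) - chf(t))) - chf(t + 1)
```

Plugging in chf = lambda t: chf_1(t) + chf_2(t) gives the multi-modal version without touching anything else.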

> Since in reality I have two different peaks, my understanding is that in order to minimize the loss (at T=0), the RNN would "smear" the PMF between the two peaks. I have a feeling this is tossing away a lot of information from the data

Yes, I believe that would be the case. You are tossing away information, but I think that's usually something you want. Remember, if the event density (= hazard) is in two peaks looking at it from T=0, the time to event at T=0 just concerns the next event.
If the next event is predictable by the RNN, then you still want a unimodal prediction. Anyway, I think 99% of the information from your RNN will be injected into the scale parameter. Regardless of that, the pdf is exponentially decreasing with the CHF, meaning basically that the shape of the future hazard is of exponentially decreasing importance to you when querying from T=0 😄. I like the saying that it smears: the distribution of Time To Event is kind of a smeared distribution of the exact Time Of Event.

> and underestimating the hazard at T=0. Does that make sense?

I assume that regardless you'll predict a Weibull with Beta<1, meaning that a lot of the hazard mass will be around T=0.

Cool with PyTorch! Looking forward to seeing your implementation. I have a big PyTorch release of this coming up too, just waiting for the formalities with the sponsoring company to come through so I can release it.

Edit: I'll say it again; I think multimodal prediction is a really cool idea!

ragulpr closed this as completed Nov 5, 2018
ragulpr reopened this Nov 5, 2018
@adamhaber-atidot (Author)

> If the point is to make multi-modal distributions, I just think it might be less hassle to do as suggested in the cited paper, i.e. use all the regular formulas for the loss function and everything else, but with a new CHF which is a sum of individual CHFs.

Very interesting read! Can you explain how one would implement a mixed-hazard model? AFAIK, the workflow would be:

  1. Define the desired CHF parametric form (for example, something like (x/a)^b + (x/c)^d, or perhaps with some sort of weighting as an extra parameter).
  2. Derive the desired PMF using exp(-CHF(T)) - exp(-CHF(T+1)).
  3. Derive the desired SF using exp(-CHF(T)).
  4. Compose the likelihood, train, hope for the best.

Is that correct?
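Concretely, I imagine something like the following (a rough, untested sketch of steps 1-4 on made-up toy data; the parameter names are mine):

```python
import torch

def chf(t, a, b, c, d):
    # Step 1: parametric CHF, e.g. (t/a)^b + (t/c)^d
    return (t / a) ** b + (t / c) ** d

def neg_loglik(params, t, u):
    a, b, c, d = params
    sf_next = torch.exp(-chf(t + 1, a, b, c, d))    # Step 3: SF = exp(-CHF)
    pmf = torch.exp(-chf(t, a, b, c, d)) - sf_next  # Step 2: PMF = SF(T) - SF(T+1)
    # Step 4: censored log-likelihood from PMF (observed) and SF (censored)
    return -(u * torch.log(pmf) + (1 - u) * torch.log(sf_next)).mean()

# Toy fit: keep parameters positive via exp of unconstrained values
raw = torch.tensor([0.0, 0.1, 1.0, 0.1], requires_grad=True)
opt = torch.optim.Adam([raw], lr=0.05)
t_obs = torch.tensor([3., 4., 10., 11.])  # made-up event times
u_obs = torch.tensor([1., 1., 1., 0.])    # last observation censored
for _ in range(200):
    opt.zero_grad()
    loss = neg_loglik(torch.exp(raw), t_obs, u_obs)
    loss.backward()
    opt.step()
```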

> Yes, I believe that would be the case. You are tossing away information, but I think that's usually something you want. Remember, if the event density (= hazard) is in two peaks looking at it from T=0, the time to event at T=0 just concerns the next event.

This is probably the crux of the matter: in my case, the event happens only once (more like the engine-failure example and less like the repeated-commits example, if you will). So there isn't really a "next event", there's only "the event". And if it tends to happen around T=50 and T=100, I don't want the mode to be at T=75 when I predict at T=0; I want two modes, at 50 and 100.

> The distribution of Time To Event is kind of a smeared distribution of the exact Time Of Event.

That's exactly the reason why I really like this approach, as well.

> Cool with PyTorch! Looking forward to seeing your implementation. I have a big PyTorch release of this coming up too, just waiting for the formalities with the sponsoring company to come through so I can release it.

Interesting. Can we continue this discussion somewhere else? I'm struggling with the fact that PyTorch has no immediate replacement for Keras's Masking layer (none that I'm aware of, at least), and I'd be happy to hear how you solve this.
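The best workaround I've found so far is to skip a masking layer entirely and mask the loss instead (a sketch under my own assumptions about zero-padded batches):

```python
import torch

def masked_nll(loglik, mask):
    # loglik: (batch, time) per-timestep log-likelihoods from the model
    # mask:   (batch, time) with 1.0 at real steps and 0.0 at padding
    # Padded steps contribute nothing, mimicking what Keras's Masking does
    return -(loglik * mask).sum() / mask.sum()
```

For the recurrent part itself, torch.nn.utils.rnn.pack_padded_sequence seems to play a similar role, but maybe you've found something better.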
