
Bimodal Hazard #52

Open
adamhaber-atidot opened this issue Nov 4, 2018 · 4 comments

@adamhaber-atidot commented Nov 4, 2018

The process I'm trying to model using WTTE-RNN has two "typical" churn times, though customers can churn at any other time as well (the typical times are simply more likely a priori). This means that, at least when evaluated at T=0, the hazard rate for future Ts should be bimodal (right?).

Is it somehow possible to tweak/hack the Weibull log-likelihood (discrete version) to represent such a bimodal (and no longer Weibull) hazard rate? Perhaps something like a mixture of Weibulls? Is it possible to compute the loss in that case?

EDIT: For computing the loss, we need both the PMF and the SF. Calculating the PMF of the mixture is easy, but I'm not sure about the SF of the mixture. Is it simply the mixture of the SFs? Or perhaps some sort of convolution between them? Any help would be appreciated...

@ragulpr (Owner) commented Nov 5, 2018

Hi there,
There are actually good ways to model bimodal hazards: the sum of cumulative hazard functions is itself a valid cumulative hazard function; see for example this paper. So if S(y) = exp[-L(y)] is the survival function and L the cumulative hazard (CHF), you could make a new distribution using L(y) = L_1(y) + L_2(y), with each L_k a Weibull CHF with its own parameters. For all the loss functions etc. you thus only need to change the cumulative hazard function.
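To make that concrete, here's a minimal sketch of the construction in PyTorch (the function and parameter names are just illustrative, this isn't in WTTE-RNN):

```python
import torch

def weibull_chf(t, alpha, beta):
    # Weibull cumulative hazard: L(t) = (t / alpha)^beta
    return (t / alpha) ** beta

def summed_sf(t, a1, b1, a2, b2):
    # A sum of CHFs is again a valid CHF, so
    # S(t) = exp(-(L_1(t) + L_2(t))) is a valid survival function
    return torch.exp(-(weibull_chf(t, a1, b1) + weibull_chf(t, a2, b2)))

def summed_pmf(t, a1, b1, a2, b2):
    # Discrete PMF from the SF: P(T = t) = S(t) - S(t + 1)
    return summed_sf(t, a1, b1, a2, b2) - summed_sf(t + 1, a1, b1, a2, b2)
```

If I remember right this is sometimes called a poly-Weibull model; with one beta below 1 and one above, the summed hazard even gets bathtub-shaped.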

I haven't implemented this in WTTE-RNN yet because Keras wants the output dimension to be the same as the "target" dimension, which holds by chance in the case with 2 predicted parameters and 2 "targets" (time to event + censoring indicator). It should also be noted that this may be pretty numerically unstable.

Now to your problem: do you need a bimodal distribution? I would be very, very surprised if you do. While we like to believe that we can predict a Weibull distribution that peaks around the actual TTE, in practice reality is just much too noisy, and the predicted distribution usually has Beta<1 (making the hazard rate decreasing). I get a sense that many people are chasing peakedness of distributions like fool's gold.
So since I'm sceptical about being accurate about one peak, predicting two peaks seems even harder. Anyway, your typical model query is "what's the probability of an event within 30 days", i.e. Pr(Y<30), and that doesn't care so much about peakedness.

I would start by training a minimal, simple model, checking whether it predicts Beta>1 and whether the errors are systematic such that a bimodal distribution could help, and only then try to implement the bimodal version. But if it's for the sake of research and fun, I'd go for it, because it sounds interesting.

As a final note: can you really model churn as an event? I'd argue that what's often observable is the non-event (not buying, not logging in, etc.).

@adamhaber-atidot (Author)

> There are actually good ways to model bimodal hazards: the sum of cumulative hazard functions is itself a valid cumulative hazard function; see for example this paper.

Thanks! Just to make sure I understand:

  1. The log-likelihood of the mixture is log(p_mix(t)^u * s_mix(t+1)^(1-u))
  2. The PMF of the mixture becomes p_mix(t) = r*p_1(t) + (1-r)*p_2(t)
  3. The SF of the mixture becomes the product of the (powered) SFs, since the SF is the exponential of minus the mixed CHF, which is the weighted sum of the individual CHFs: s_mix(t+1) = exp(-r*chf_1(t+1) - (1-r)*chf_2(t+1))

Is that correct? The "independence" assumption of the CHFs seems strange to me...
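As a quick sanity check, I compared the two SFs on toy numbers (arbitrary parameters, just to see the difference):

```python
import torch

t, r = torch.tensor(2.0), 0.3
# Two illustrative Weibull survival functions S_k(t) = exp(-(t/a_k)^b_k)
sf1 = torch.exp(-(t / 1.0) ** 0.8)
sf2 = torch.exp(-(t / 5.0) ** 2.0)

# Classical mixture: the SF is the weighted sum of the SFs
sf_mixture = r * sf1 + (1 - r) * sf2
# Weighted CHF sum: the SF is the product of the powered SFs
sf_chf_sum = sf1 ** r * sf2 ** (1 - r)

print(sf_mixture.item(), sf_chf_sum.item())  # ~0.65 vs ~0.53 here
```

They disagree, so I guess points 2 and 3 can't both hold at once; the PMF and SF have to be derived from the same construction for the likelihood to make sense.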

> I haven't implemented this in WTTE-RNN yet because Keras wants the output dimension to be the same as the "target" dimension, which holds by chance in the case with 2 predicted parameters and 2 "targets" (time to event + censoring indicator). It should also be noted that this may be pretty numerically unstable.

I'm trying to re-implement everything in PyTorch, which (I hope) will make this easier and avoid the various "tf-hacks" in wtte.py.

> Now to your problem: do you need a bimodal distribution? I would be very, very surprised if you do.

Since in reality I have two different peaks, my understanding is that in order to minimize the loss (at T=0), the RNN would "smear" the PMF between the two peaks. I have a feeling this is tossing away a lot of information from the data, and underestimating the hazard at T=0. Does that make sense?

> I would start by training a minimal, simple model, checking whether it predicts Beta>1 and whether the errors are systematic such that a bimodal distribution could help, and only then try to implement the bimodal version. But if it's for the sake of research and fun, I'd go for it, because it sounds interesting.

Definitely. 😄

> As a final note: can you really model churn as an event? I'd argue that what's often observable is the non-event (not buying, not logging in, etc.).

Yes, in my case I think it makes more sense. I get your point in the general case, though (I think...).

@ragulpr (Owner) commented Nov 5, 2018

> 1. The log-likelihood of the mixture is log(p_mix(t)^u * s_mix(t+1)^(1-u))
> 2. The PMF of the mixture becomes p_mix(t) = r*p_1(t) + (1-r)*p_2(t)
> 3. The SF of the mixture becomes the product of the (powered) SFs, since the SF is the exponential of minus the mixed CHF, which is the weighted sum of the individual CHFs: s_mix(t+1) = exp(-r*chf_1(t+1) - (1-r)*chf_2(t+1))

Looks fairly right but this is new territory. I actually don't know 😄.

I haven't looked into classical mixtures like you suggest (i.e. pdf = sum_k p_k f_k with sum_k p_k = 1), since I expect that the log-integral of this, needed for the loss in the discrete or censored case, may be pretty complex. If the point is to make multi-modal distributions, I just think it might be less hassle to do as suggested in the cited paper, i.e. use all the regular formulas for the loss function and everything else, but with a new CHF which is a sum of individual CHFs. As the scale parameters of the individual CHFs will do what the p_k's would be doing, you don't need to think of the weights as separate parameters.
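In the discrete case the whole censored log-likelihood can in fact be written using nothing but the CHF, so the summed CHF really is the only thing to swap in. Roughly (an untested sketch, names are my own):

```python
import torch

def discrete_loglik(chf, t, u):
    # Censored discrete log-likelihood written purely via the CHF:
    #   uncensored (u=1): log P(T = t) = log(exp(L(t+1) - L(t)) - 1) - L(t+1)
    #   censored   (u=0): log S(t+1)   = -L(t+1)
    # chf is any callable returning the cumulative hazard L(t).
    return u * torch.log(torch.expm1(chf(t + 1) - chf(t))) - chf(t + 1)
```

Plugging in chf = lambda t: chf_1(t) + chf_2(t) gives the multi-modal version without touching anything else.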

> Since in reality I have two different peaks, my understanding is that in order to minimize the loss (at T=0), the RNN would "smear" the PMF between the two peaks. I have a feeling this is tossing away a lot of information from the data

Yes, I believe that would be the case. You are tossing away information, but I think that's usually something you want. Remember, if the event density (= hazard) is in two peaks looking at it from T=0, the time to event at T=0 just concerns the next event.
If the next event is predictable by the RNN, then you still want a unimodal prediction. Anyway, I think 99% of the information from your RNN will be injected into the scale parameter. Regardless of that, the pdf is exponentially decreasing with the CHF, meaning basically that the shape of the future hazard is of exponentially decreasing importance to you when querying from T=0 😄. I like the saying that it smears: the distribution of Time To Event is kind of a smeared distribution of the exact Time Of Event.

> and underestimating the hazard at T=0. Does that make sense?

I assume that regardless you'll predict a Weibull with Beta<1, meaning that a lot of the hazard mass will be around T=0.

Cool with PyTorch! Looking forward to seeing your implementation. I have a big PyTorch release of this coming up too, just waiting for the formalities with the sponsoring company to come through so I can release it.

Edit: I'll say it again; I think multimodal prediction is a really cool idea!

ragulpr closed this as completed Nov 5, 2018
ragulpr reopened this Nov 5, 2018
@adamhaber-atidot (Author)

> If the point is to make multi-modal distributions, I just think it might be less hassle to do as suggested in the cited paper, i.e. use all the regular formulas for the loss function and everything else, but with a new CHF which is a sum of individual CHFs.

Very interesting read! Can you explain how one would implement a mixed-hazard model? AFAIK, the workflow would be:

  1. Define the desired CHF parametric form (for example, something like (x/a)^b + (x/c)^d, or perhaps with some sort of weighting as an extra parameter).
  2. Derive the desired PMF using exp(-CHF(T)) - exp(-CHF(T+1)).
  3. Derive the desired SF using exp(-CHF(T)).
  4. Compose the likelihood, train, hope for the best.

Is that correct?
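Concretely, I imagine something like the following (a rough, untested sketch of steps 1-4 on made-up toy data; the parameter names are mine):

```python
import torch

def chf(t, a, b, c, d):
    # Step 1: parametric CHF, e.g. (t/a)^b + (t/c)^d
    return (t / a) ** b + (t / c) ** d

def neg_loglik(params, t, u):
    a, b, c, d = params
    sf_next = torch.exp(-chf(t + 1, a, b, c, d))    # Step 3: SF = exp(-CHF)
    pmf = torch.exp(-chf(t, a, b, c, d)) - sf_next  # Step 2: PMF = SF(T) - SF(T+1)
    # Step 4: censored log-likelihood from PMF (observed) and SF (censored)
    return -(u * torch.log(pmf) + (1 - u) * torch.log(sf_next)).mean()

# Toy fit: keep parameters positive via exp of unconstrained values
raw = torch.tensor([0.0, 0.1, 1.0, 0.1], requires_grad=True)
opt = torch.optim.Adam([raw], lr=0.05)
t_obs = torch.tensor([3., 4., 10., 11.])  # made-up event times
u_obs = torch.tensor([1., 1., 1., 0.])    # last observation censored
for _ in range(200):
    opt.zero_grad()
    loss = neg_loglik(torch.exp(raw), t_obs, u_obs)
    loss.backward()
    opt.step()
```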

> Yes, I believe that would be the case. You are tossing away information, but I think that's usually something you want. Remember, if the event density (= hazard) is in two peaks looking at it from T=0, the time to event at T=0 just concerns the next event.

This is probably the crux of the matter: in my case, the event happens only once (more like the engine-failure example and less like the repeated-commits example, if you will). So there isn't really a "next event", there's only "the event". And if it tends to happen around T=50 and T=100, I don't want the mode to be at T=75 when I predict at T=0; I want two modes, at 50 and 100.

> The distribution of Time To Event is kind of a smeared distribution of the exact Time Of Event.

That's exactly the reason why I really like this approach, as well.

> Cool with PyTorch! Looking forward to seeing your implementation. I have a big PyTorch release of this coming up too, just waiting for the formalities with the sponsoring company to come through so I can release it.

Interesting. Can we continue this discussion somewhere else? I'm struggling with the fact that PyTorch has no immediate replacement for Keras's Masking layer (none that I'm aware of, at least), and I'd be happy to hear how you solve this.
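The best workaround I've found so far is to skip a masking layer entirely and mask the loss instead (a sketch under my own assumptions about zero-padded batches):

```python
import torch

def masked_nll(loglik, mask):
    # loglik: (batch, time) per-timestep log-likelihoods from the model
    # mask:   (batch, time) with 1.0 at real steps and 0.0 at padding
    # Padded steps contribute nothing, mimicking what Keras's Masking does
    return -(loglik * mask).sum() / mask.sum()
```

For the recurrent part itself, torch.nn.utils.rnn.pack_padded_sequence seems to play a similar role, but maybe you've found something better.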
