
Different "knobs" to improve accuracy #36

Open
adam-haber opened this issue Dec 13, 2017 · 6 comments

Comments

@adam-haber

Similar to #32, I'm also trying to use wtte-rnn for prediction on real data; I'm not getting very good performance, and I'm trying to understand which different "knobs" I can play with to improve prediction.

Some general info:

  1. Data is class-imbalanced: the event I'm trying to predict is rather rare during the time window I'm predicting over (3-10%). I tried oversampling the training data to counteract this, but it didn't help...
  2. Eyeballing plots of the alpha and beta sequences, it seems like the network learns which features are "good" (pushing alpha up) and which are "bad" (pulling alpha down).
  3. There's one "dominant" feature (time in the study) which is by far the most informative; I'm trying to do better than a simple model that uses just this feature, i.e., to capture the "variance" left over when we remove it. Even though the network seems to learn the features, I'm not doing any better than the "vanilla" rate-per-time model.

I've tried using more GRUs, changing them to LSTMs, adding an initial dense layer, etc., but the whole thing feels too random. Any ideas on what to tweak and how would be appreciated.

@ragulpr
Owner

ragulpr commented Dec 13, 2017

The knob space is huge. It should not be possible to do worse than a simple exponential/Cox regression model (since that's basically a WTTE with beta=1 and only a single linear layer to alpha). What's the baseline that you're comparing to?
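The beta=1 reduction is easy to verify numerically: the Weibull survival function with beta=1 is exactly an exponential survival function. A minimal numpy check:

```python
import numpy as np

# Weibull survival: S(t) = exp(-(t/alpha)^beta)
def weibull_survival(t, alpha, beta):
    return np.exp(-(t / alpha) ** beta)

t = np.linspace(0.1, 10, 50)
alpha = 4.0
# With beta=1 the Weibull reduces to an exponential with rate 1/alpha
s_weibull = weibull_survival(t, alpha, beta=1.0)
s_exp = np.exp(-t / alpha)
assert np.allclose(s_weibull, s_exp)
```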

By class-imbalanced, do you mean that many sequences don't contain observed events (i.e., all timesteps are censored)?

Unfortunately you have every knob there is in the neural network world 😄 Typically feature engineering is what'll give you the extra mile (seasonality features? helping out with a countdown since events? categorical features? etc.) and I'm not the best to answer about what's the latest cool thing. But do try out increasing depth, trying different activation functions for the dense layers at the top, tweaking parameters in batch normalization, and if you wanna go wild, consider the Keras Phased LSTM layer, which has a countdown built into it. It's exceptionally good at learning temporal features like your dominant one. But before all these things, remove the RNN layers altogether and only use dense layers with your favorite features, to get a sense of what the RNN really learns and what the improvements are.
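The dense-only baseline suggested above can be sketched without any RNN machinery. The shapes, activations, and the exp/softplus mapping to (alpha, beta) below are illustrative assumptions, not the library's actual output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense-only "model": two hidden layers, then a 2-output head
# mapped to (alpha, beta) with positivity-preserving activations.
def dense_only_forward(x, params):
    h = np.tanh(x @ params["W1"] + params["b1"])
    h = np.tanh(h @ params["W2"] + params["b2"])
    out = h @ params["W3"] + params["b3"]
    alpha = np.exp(out[:, 0])           # alpha > 0
    beta = np.log1p(np.exp(out[:, 1]))  # softplus keeps beta > 0
    return alpha, beta

params = {
    "W1": rng.normal(size=(5, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 8)), "b2": np.zeros(8),
    "W3": rng.normal(size=(8, 2)), "b3": np.zeros(2),
}
x = rng.normal(size=(10, 5))  # 10 subjects, 5 hand-picked features
alpha, beta = dense_only_forward(x, params)
assert np.all(alpha > 0) and np.all(beta > 0)
```

Comparing this against the full RNN isolates how much the recurrence itself contributes beyond the features.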

For wtte-specific things: if you're thinking about evaluation, consider #9; if you're getting right-shifted (exploding) distributions or NaNs, consider #30 and #33; for ideas about incorporating seasonal features, check #31.

Also remember that performance always comes with the risk of overfitting, so be wary, and good luck! Let me know if you have any more questions; happy to hear your results.

@adam-haber
Author

The knob space is huge. It should not be possible to do worse than a simple exponential/Cox regression model (since that's basically a WTTE with beta=1 and only a single linear layer to alpha). What's the baseline that you're comparing to?

My baseline is the "vanilla" model: for each time period t in the study, I compute the probability of "churning" in the next timestep (t+1). This pools together all the different cases, ignoring (presumably informative) covariates and events. The average precision of this model is what I'm trying to beat.
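This kind of pooled rate-per-timestep baseline can be sketched as an empirical churn rate among the subjects still at risk at each step (the event encoding below is a toy assumption for illustration):

```python
import numpy as np

# Toy event log: churn_step[i] is the timestep at which subject i churns
# (n_steps here doubles as "never churned within the window").
rng = np.random.default_rng(1)
n_subjects, n_steps = 200, 12
churn_step = rng.integers(1, n_steps + 1, size=n_subjects)

# "Vanilla" baseline: pooled empirical churn rate at each timestep,
# among subjects still at risk (not yet churned) at that step.
at_risk = np.array([(churn_step >= t).sum() for t in range(n_steps)])
churned = np.array([(churn_step == t).sum() for t in range(n_steps)])
rate_per_step = churned / np.maximum(at_risk, 1)

assert np.all((rate_per_step >= 0) & (rate_per_step <= 1))
```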

By class-imbalanced, do you mean that many sequences don't contain observed events (i.e., all timesteps are censored)?

Each of my sequences either has 1 event (at the sequence end) or 0 events (censored). By class imbalance I mean that at each "snapshot" in time, most subjects will "survive" that timestep (or the next 2-3 timesteps); therefore predicting who's going to "churn" is an imbalanced prediction problem.
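This structure (one event at the end, or fully censored) maps onto wtte-style (tte, u) targets, where tte counts steps until the event or until censoring and u flags whether the event was observed. The encoding below is a hedged sketch of the usual scheme, not the package's implementation:

```python
import numpy as np

def make_targets(seq_len, event_step=None):
    """Per-timestep (tte, u) targets for one sequence."""
    if event_step is None:
        # No event: tte counts down to the censoring point, u = 0 throughout
        tte = np.arange(seq_len - 1, -1, -1)
        u = np.zeros(seq_len)
    else:
        # Event ends the sequence: tte counts down to it, u = 1 throughout
        tte = np.arange(event_step, event_step - seq_len, -1)
        u = np.ones(seq_len)
    return tte, u

tte_obs, u_obs = make_targets(4, event_step=3)  # event at last step
tte_cen, u_cen = make_targets(4)                # fully censored
assert list(tte_obs) == [3, 2, 1, 0] and u_obs.sum() == 4
assert list(tte_cen) == [3, 2, 1, 0] and u_cen.sum() == 0
```

The tte columns are identical here; only the censoring flag u tells the loss how to treat them.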

Typically feature engineering is what'll give you the extra mile (seasonality features? helping out with a countdown since events? categorical features? etc.)

At least for me, the biggest motivation for using an RNN was to automate (at least to some degree) the feature engineering. My initial model was a vanilla logistic regression, which was OK, but since I have more than a few features and I assume they interact, I decided to try the RNN. Does that make sense?

But do try out increasing depth, trying different activation functions for the dense layers at the top, tweaking parameters in batch normalization, and if you wanna go wild, consider the Keras Phased LSTM layer, which has a countdown built into it.

I'll keep trying and will update once I find something that works. :-) I'm having some trouble with NaNs when trying out different layers; so far, using K.set_epsilon(1e-8) works best for me.
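Epsilon-clipping inside the Weibull log-likelihood is the kind of guard that K.set_epsilon(1e-8) provides. A minimal numpy sketch of the continuous censored log-likelihood with such guards (the clipping thresholds are illustrative assumptions):

```python
import numpy as np

EPS = 1e-8  # analogous to K.set_epsilon(1e-8)

# Continuous Weibull log-likelihood with censoring:
#   loglik = u * log(hazard) - cumulative_hazard
# Clipping guards against log(0) and overflow, both of which yield NaN losses.
def weibull_loglik(tte, u, alpha, beta):
    t = np.maximum(tte, EPS)  # guard tte = 0
    log_hazard = np.log(beta) - np.log(alpha) + (beta - 1) * (np.log(t) - np.log(alpha))
    cum_hazard = np.clip((t / alpha) ** beta, EPS, 1e30)  # guard over/underflow
    return u * log_hazard - cum_hazard

ll = weibull_loglik(np.array([1.0, 5.0, 0.0]), np.array([1, 0, 1]),
                    alpha=4.0, beta=0.8)
assert np.all(np.isfinite(ll))
```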

@ragulpr
Owner

ragulpr commented Dec 13, 2017

I'm not sure I'm following the testing setup, but I'm sure it's specific to your domain, so maybe it's not relevant. The question of who's going to churn depends on how you align the comparison per timestep: e.g., "who's going to die next?" asked on a specific calendar day vs. asked with alignment by age are completely different questions. I haven't really tried evaluating on similar metrics (e.g., concordance index). You can also choose from a bunch of things as the prediction to evaluate, e.g., predicted expected value, predicted median, probability of churning in t days, etc.

Each of my sequences either has 1 event (at the sequence end) or 0 events (censored). By class imbalance I mean that at each "snapshot" in time, most subjects will "survive" that timestep (or the next 2-3 timesteps); therefore predicting who's going to "churn" is an imbalanced prediction problem.

But I guess you're trying to predict when they will churn and then comparing the prediction (probability of an event within 1 timestep?) from the wtte-rnn to that of your vanilla model (a binary classification problem?)

RNNs/ANNs will usually help out with the first few miles of feature engineering for sure, but for the extra miles, great architecture + great feature engineering will help 😄

@adam-haber
Author

But I guess you're trying to predict when they will churn and then comparing the prediction (probability of an event within 1 timestep?) from the wtte-rnn to that of your vanilla model (a binary classification problem?)

My general workflow is as follows:

  1. I split the data into train and test sets by id: 80% of the subjects are in the train set and 20% in the test set (in contrast to more complicated splitting schemes such as splitting by time, etc.). Both sets have similar feature distributions.
  2. I train. :-)
  3. I rearrange the test set so that it only contains data on the test patients up to a certain point in time (say, the beginning of the year). I then run model.predict(x_test), and for each patient i I take the last predictions (a[i]=pred_test[i,-1,0], b[i]=pred_test[i,-1,1]). This is OK, to the best of my understanding, since the model pads the tensor with the last predicted (a, b) pair along its second (time) dimension.
  4. I then compute p[i]=weibull.cmf(3,a[i],b[i]), which gives me the probability of churning in the 3 timesteps after the beginning of the year.
  5. Based on these 3-month churning probabilities and the ground truth (which I stored before rearranging the test set for prediction), I compute average precision.
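Assuming weibull.cmf here is the continuous Weibull CDF, F(t) = 1 - exp(-(t/alpha)^beta), step 4 can be sketched as:

```python
import numpy as np

# Continuous Weibull CDF (an assumption about what weibull.cmf computes)
def weibull_cmf(t, alpha, beta):
    return 1.0 - np.exp(-(t / alpha) ** beta)

a = np.array([2.0, 10.0, 30.0])  # per-patient predicted alpha
b = np.array([1.2, 0.8, 1.5])    # per-patient predicted beta
p = weibull_cmf(3.0, a, b)       # P(churn within next 3 timesteps)

# Lower alpha (shorter characteristic time-to-event) -> higher churn probability
assert p[0] > p[1] > p[2]
assert np.all((0 <= p) & (p <= 1))
```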

Does this sound like a reasonable way of using the model and evaluating its performance?

At the moment I'm trying different architectures and hyperparameters to improve this precision, but I keep getting NaNs with pretty much any architecture that isn't the one from the example notebook...

@ragulpr
Owner

ragulpr commented Dec 14, 2017

That sounds very reasonable, thanks for the description!
I personally don't like precision/classification-based metrics, as they are fully dependent on the calibration of the probabilities and the threshold you choose for classification. I think AUC is a more interesting and more robust metric (it'd be the probability that the ordering of churned/non-churned customers in the 3 coming steps is correct).
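That pairwise interpretation of AUC (the probability that a randomly chosen churner is ranked above a randomly chosen non-churner) can be computed directly; a minimal O(n_pos * n_neg) sketch:

```python
import numpy as np

def auc(scores, labels):
    """AUC via pairwise comparisons; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])  # e.g., 3-step churn probabilities
labels = np.array([1, 0, 1, 0, 0])            # 1 = churned within 3 steps
assert abs(auc(scores, labels) - 5 / 6) < 1e-12  # 5 of 6 pairs correctly ordered
```

Unlike average precision at a threshold, this only depends on the ordering of the scores, not on their calibration.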

The best trick to alleviate NaNs is to give the final layer before the linear dense 2-output-activation layer Tanh activations. That gives you more freedom. But NaNs and overfitting are always a risk of moving fast. Also, if you post the plots from the wtte callback, I could get a better picture.
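A quick numerical illustration of why bounding the pre-activation with tanh tames NaNs, assuming alpha is produced via an exp(.)-style output activation (an assumption about the head, for illustration):

```python
import numpy as np

pre = np.array([-50.0, 0.0, 50.0])    # hypothetical unbounded activations
alpha_unbounded = np.exp(pre)          # exp(50) ~ 5e21: gradients/losses can blow up
alpha_bounded = np.exp(np.tanh(pre))   # tanh squashes to (-1, 1), so alpha stays in [1/e, e]

assert alpha_unbounded.max() > 1e20
assert np.all((alpha_bounded >= np.exp(-1)) & (alpha_bounded <= np.exp(1)))
```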

Would really appreciate hearing feedback on what seems to work and what doesn't, and we can discuss it here.

@adam-haber
Author

Would really appreciate hearing feedback on what seems to work and what doesn't, and we can discuss it here.

Small update: I decided to try a much bigger network. Instead of the 1-layer, 3-GRU network from the example, I tried a 3-layer, 128-unit LSTM architecture, and it seems to work much better. I also started using Keras' fit_generator function, which allows training on much larger datasets; the current wtte implementation is quite memory-consuming (since even patients who died at t=1 are "mask-padded" to the final duration), which makes it hard to fit on large datasets.
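A batch generator of the kind usable with fit_generator can pad only to the longest sequence within each batch, rather than to the global maximum duration. This is a hedged sketch of that idea, not the wtte pipeline itself:

```python
import numpy as np

def batch_generator(sequences, targets, batch_size):
    """Yield (x, y) batches forever, padding only to each batch's max length."""
    while True:
        for i in range(0, len(sequences), batch_size):
            xs = sequences[i:i + batch_size]
            ys = targets[i:i + batch_size]
            max_len = max(len(x) for x in xs)  # per-batch, not global, max
            x_pad = np.zeros((len(xs), max_len, xs[0].shape[-1]))
            y_pad = np.zeros((len(ys), max_len, 2))  # (tte, u) targets
            for j, (x, y) in enumerate(zip(xs, ys)):
                x_pad[j, :len(x)] = x
                y_pad[j, :len(y)] = y
            yield x_pad, y_pad

# Toy data: 3 subjects with 4 features and varying sequence lengths
seqs = [np.ones((3, 4)), np.ones((5, 4)), np.ones((2, 4))]
tgts = [np.ones((3, 2)), np.ones((5, 2)), np.ones((2, 2))]
gen = batch_generator(seqs, tgts, batch_size=2)
xb, yb = next(gen)
assert xb.shape == (2, 5, 4) and yb.shape == (2, 5, 2)
```

Sorting sequences by length before batching would shrink the padding further, at the cost of less-shuffled batches.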
