
Question about preprocessing functions #37

Open
adam-haber opened this issue Dec 28, 2017 · 1 comment

Hi,

I have two questions regarding the preprocessing functions:

  • Regarding prep_tensors: the lines
y  = y[:,1:]
x  = np.roll(x, shift=1, axis=1)[:,1:,]

simply throw away the first event, right? Is this a necessity? In my data, a significant portion of the churners churn at the beginning, and I'd be happy to try to predict these as well.

  • Regarding the nanmask_to_keras_mask function: as far as I understand, the y variable returned by this function has dimensions (n_subjects, t_timesteps, 2), such that y[i] is the matrix whose rows are the different timesteps and whose columns are the time-to-event and the censoring indicator, respectively, for subject i. In my data, each subject either churns or doesn't (no recurrent events). This means that for each subject, the second column is either all ones (if it's a churned subject) or all zeros (if it's a censored subject); this is, of course, without taking the 0.95 mask into account. Is this the correct input format for training the model?
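
For concreteness, here is a minimal sketch of the target layout described above (the values and variable names are made up for illustration, not taken from the library):

```python
import numpy as np

# Hypothetical targets of shape (n_subjects, n_timesteps, 2):
# y[..., 0] is time-to-event, y[..., 1] is the event/censoring indicator.
n_timesteps = 4
tte = np.arange(n_timesteps - 1, -1, -1).astype(float)  # 3, 2, 1, 0

# Churned subject: the event is observed, so the indicator column is all ones.
y_churned = np.stack([tte, np.ones(n_timesteps)], axis=-1)

# Censored subject: no event observed, so time-to-event is really
# time-to-censoring and the indicator column is all zeros.
y_censored = np.stack([tte, np.zeros(n_timesteps)], axis=-1)

y = np.stack([y_churned, y_censored])
print(y.shape)  # (2, 4, 2)
```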

ragulpr commented Dec 29, 2017

Hi, great questions.
You understood it right: we throw away the first timestep. There are alternatives, but I think this was the most generally safe choice.

From the data pipeline template:

    # 1. Disalign features and targets, otherwise truth is leaked.
    # 2. Drop the first timestep (which we now don't have features for).
    # 3. nan-mask the last timestep of features (which we now don't have targets for).
    events = events[:,1:,]
    y  = y[:,1:]
    x  = np.roll(x, shift=1, axis=1)[:,1:,]
    x  = x + 0*np.expand_dims(events,-1)
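
A toy run of the roll-and-slice above (one sequence, four timesteps, one feature; the numbers are my own example, not from the library):

```python
import numpy as np

x = np.array([[[10.], [11.], [12.], [13.]]])  # features, shape (1, 4, 1)
y = np.array([[3., 2., 1., 0.]])              # tte targets, shape (1, 4)

y = y[:, 1:]                               # drop first target
x = np.roll(x, shift=1, axis=1)[:, 1:, ]   # features now lag targets by one step

# At step t the model sees the features from step t-1, so a feature that
# coincides with an event can never be used to predict that same event.
print(x[0, :, 0])  # [10. 11. 12.]
print(y[0])        # [2. 1. 0.]
```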

The most thorough explanation can be found here

  • If a customer purchases something (an "event") at 13.30, I can use this as feature input for the 23.59 batch job of predicting when customers purchase again (i.e. tomorrow, the day after tomorrow, ...), so we always need to disalign, i.e. roll, the features.
  • If we leave an empty feature at the first step, we have a target value and can train; but in cases where event <-> datapoint, i.e. sequence birth comes from an event, it's always TTE=0, so the model will overfit.
  • If we also track clicks, logins, language etc., then event -> datapoint but datapoint -/-> event, so now there's uncertainty about the TTE and you could probably use the first timestep.

So, TL;DR: in your case (non-recurrent events) it might be safe, but does it make sense for inference? I.e., when does your data arrive?

I guess you want to predict: will there be an event today? But if at signup at 13.30 we get language, region, signup method etc., this query is going to be tainted by the time of arrival of the data (things like a lower likelihood of an event the later the data arrives that day). I'm not saying it doesn't make sense, I'm saying it adds things to think about 😄

About question 2: yes, this sounds correct!
