How to create a Label that indicates the Future Type (or state) of something #214

Open
S-UP opened this issue Apr 16, 2021 · 7 comments

@S-UP

S-UP commented Apr 16, 2021

I wonder about the best approach to create a Label that generates forward-looking classes.

Example: A customer might purchase 12 times (on different dates).

I want to assign a label that says whether the customer will make another purchase (within X months) after a given event is observed. That is, after observing the first transaction, will this customer come back and register another transaction? If so, that transaction should receive the label 'will purchase again'; otherwise, 'will NOT purchase again'.

From what I've seen Compose always constructs labels using all events up until (but excluding) another event (for which the label is then set). So I wonder how to generate a label for the last transaction observed in the above example. The 12th transaction is the last recorded and thus we would label a 'will NOT purchase again' here as we know the customer will not transact again.

The overall goal is to identify customers who are most likely to re-engage. Maybe there is also a more suitable modeling approach to this.

@jeff-hernandez
Collaborator

Thanks for the question! Would a row-based window size be a good modeling approach? The row-based window size can get you the current purchase and the next purchase. Then, you can compare the times for labeling. I'll go through an example using this data.

import composeml as cp
import pandas as pd

df = pd.read_csv(
    'data.csv',
    parse_dates=['transaction_time'],
    index_col='transaction_id',
)

df
                transaction_time     amount  department   customer_id
transaction_id
351             2021-01-02 14:24:55  18.64   computers    1
101             2021-01-04 11:44:13  12.15   automotive   1
1               2021-01-12 03:44:33  78.91   grocery      1
501             2021-01-15 11:54:25  50.91   garden       1
651             2021-01-21 06:55:16  11.62   books        1
51              2021-01-21 22:06:39  94.62   electronics  1
801             2021-01-25 19:20:22  53.26   shoes        1
901             2021-02-07 16:57:13  58.74   movies       1
401             2021-02-08 14:50:14  42.83   kids         1
851             2021-02-10 08:38:04  69.11   baby         1
151             2021-02-21 01:53:37  55.02   computers    1
251             2021-02-21 13:01:35  55.99   jewelery     1

This labeling function receives a data slice with two rows: the current purchase and the next purchase. It also takes a within parameter that determines whether the next purchase happened within a given time.

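def next_purchase(df, within):
    # The last slice for a customer has a single row, so there is no next purchase.
    if len(df) < 2:
        return False
    within = pd.Timedelta(within)
    # The slice is indexed by transaction_time, so this difference is a Timedelta.
    next_time = df.index[1] - df.index[0]
    return within >= next_time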

lm = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=next_purchase,
    window_size=2, # two rows to get current and next purchase
)

When running the search, the gap is set to one so that each data slice starts on the next purchase.

lt = lm.search(
    df=df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1, # one row to start on next purchase
    within='3 days',
    verbose=False,
)

lt
    customer_id  time                 next_purchase
0   1            2021-01-02 14:24:55  True
1   1            2021-01-04 11:44:13  False
2   1            2021-01-12 03:44:33  False
3   1            2021-01-15 11:54:25  False
4   1            2021-01-21 06:55:16  True
5   1            2021-01-21 22:06:39  False
6   1            2021-01-25 19:20:22  False
7   1            2021-02-07 16:57:13  True
8   1            2021-02-08 14:50:14  True
9   1            2021-02-10 08:38:04  False
10  1            2021-02-21 01:53:37  True
11  1            2021-02-21 13:01:35  False
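
For readers who want to cross-check these labels without Compose, here is a minimal sketch in plain pandas. It assumes only the first four transactions from the data above, and the helper name label_next_purchase is hypothetical (it is not part of composeml). It compares each purchase time with the customer's next purchase time, which is the same comparison the labeling function makes on each two-row slice.

```python
import pandas as pd

# First four transactions of customer 1, copied from the data above.
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 1],
    'transaction_time': pd.to_datetime([
        '2021-01-02 14:24:55',
        '2021-01-04 11:44:13',
        '2021-01-12 03:44:33',
        '2021-01-15 11:54:25',
    ]),
})


def label_next_purchase(df, within):
    """Hypothetical helper: flag purchases followed by another one within `within`."""
    df = df.sort_values(['customer_id', 'transaction_time'])
    # Time of the same customer's next purchase (NaT for the last one).
    next_time = df.groupby('customer_id')['transaction_time'].shift(-1)
    elapsed = next_time - df['transaction_time']
    # Comparisons against NaT evaluate to False, so the last purchase is labeled False.
    return df.assign(next_purchase=elapsed <= pd.Timedelta(within))


labels = label_next_purchase(df, within='3 days')
print(labels)
```

On this subset, the first three labels agree with rows 0 to 2 of the label times above. The fourth row is False here only because the subset ends there.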

Let me know if this approach can work.

@S-UP
Author

S-UP commented Apr 17, 2021

Interesting approach. Thanks for sharing!

Question: Why can you use next_time = df.index[1] - df.index[0], given that the index of the data frame is not a time index?

@jeff-hernandez
Collaborator

jeff-hernandez commented Apr 19, 2021

Why can you use next_time = df.index[1] - df.index[0] given the index of the data frame is not a time index?

The data frame slices that are given to the labeling function do have the time index set as the index. During the search, the label maker sets the time index as the data frame index.
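
A small sketch of that effect in plain pandas, under the assumption that set_index approximates what the label maker does internally (the variable names data_slice and elapsed are illustrative only):

```python
import pandas as pd

# Two rows from the data above, indexed by transaction_id as in the original frame.
df = pd.DataFrame(
    {
        'transaction_time': pd.to_datetime(
            ['2021-01-02 14:24:55', '2021-01-04 11:44:13']
        ),
        'amount': [18.64, 12.15],
    },
    index=pd.Index([351, 101], name='transaction_id'),
)

# Assumption: the slice passed to the labeling function looks roughly like this,
# with the time index column promoted to the data frame index.
data_slice = df.set_index('transaction_time')

# Subtracting two timestamps from a DatetimeIndex yields a Timedelta.
elapsed = data_slice.index[1] - data_slice.index[0]
print(type(data_slice.index).__name__, elapsed)
```

Because the index holds timestamps at that point, the subtraction in the labeling function produces a Timedelta that can be compared with within.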

Does a row-based window size work for your use case?

@S-UP
Author

S-UP commented Jul 22, 2021

May I ask how you would extend your approach to situations where there is only one product type to consider?

Or, to stick with the above example: assume we are only interested in Department==Computer transactions. The row-based approach takes two neighboring rows, whereas what is needed is a check of whether a Computer transaction happens at any time within the specified time window.

Would be interested to hear your thoughts on this.

@jeff-hernandez
Collaborator

@S-UP thanks for the question! In that case, I think it'd make sense to isolate the computer department before generating labels. We can group by the department and select computers.

computers = df.groupby('department').get_group('computers')

lt = lm.search(
    df=computers.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1, # one row to start on next purchase
    within='3 days',
    verbose=False,
)

@S-UP
Author

S-UP commented Jul 23, 2021

Thanks. I realize I should have been more explicit.

I still want to create labels per Transaction ID or, potentially, Transaction Date (i.e. aggregating all transactions into a single transaction date). So, if a customer purchases from Garden and does not purchase from Computer within the specified window, then there shall be a Next Purchase == False flag for the Garden transaction.

@jeff-hernandez
Collaborator

jeff-hernandez commented Jul 23, 2021

Ah okay, in that case, you can use the window_size to specify the time window and check whether the department of the first transaction occurs more than once within that window.

def next_purchase(df):
    department = df.iloc[0].department
    return df.department.eq(department).sum() > 1

lm = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=next_purchase,
    window_size='3d',  # time window
)

lt = lm.search(
    df=df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,  # one to iterate over each transaction
    verbose=False,
)

If you only want labels for a single department, you can also make it a parameter to the labeling function.

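def next_purchase(df, department):
    # Skip the first row (the current transaction) and look for a later
    # purchase from the given department inside the window.
    return df.department.iloc[1:].eq(department).any()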

lt = lm.search(
    ...,
    department='computers',
)
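
To make the time-window logic concrete outside of Compose, here is a rough sketch in plain pandas. The data is a hypothetical toy example (not taken from this thread), the helper name label_windows is made up, and treating the window as the half-open interval [start, start + window) is an assumption about how the slices are cut. Each transaction is labeled True if a later transaction in its window comes from the target department: either a fixed department, or the department of the transaction itself when none is given.

```python
import pandas as pd

# Hypothetical toy data for a single customer (not from the thread).
df = pd.DataFrame({
    'transaction_time': pd.to_datetime([
        '2021-01-01 10:00:00',
        '2021-01-02 10:00:00',
        '2021-01-06 10:00:00',
        '2021-01-07 10:00:00',
    ]),
    'department': ['garden', 'computers', 'garden', 'garden'],
})


def label_windows(df, window, department=None):
    """Hypothetical helper: label each transaction by what follows it within `window`."""
    df = df.sort_values('transaction_time').reset_index(drop=True)
    window = pd.Timedelta(window)
    times = df['transaction_time']
    labels = []
    for i, start in enumerate(times):
        # Rows after the current transaction that fall before start + window.
        later = df[(df.index > i) & (times < start + window)]
        # Fixed department if given, else the current transaction's department.
        target = department or df.loc[i, 'department']
        labels.append(bool(later['department'].eq(target).any()))
    return df.assign(next_purchase=labels)


same_department = label_windows(df, '3d')
computers_only = label_windows(df, '3d', department='computers')
print(computers_only)
```

With department='computers', the first garden transaction is labeled True because a computers purchase follows within three days, while the later garden transactions are labeled False.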


2 participants