Why I got different lift result when using get_cumlift() and calculating line by line? #706

AmyLin0515 · 2023-11-14T22:35:59Z

Describe the bug
Hi Team!
I used get_cumlift() and got the following lift for the S-Learner:

When I tried to reproduce the result by calculating it manually, I got a different result from the one returned by get_cumlift().

sorted_df = df_try.sort_values(col, ascending=False).reset_index(drop=True)
sorted_df.index = sorted_df.index + 1
sorted_df["cumsum_tr"] = sorted_df['w'].cumsum()
sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
sorted_df["cumsum_y_tr"] = (sorted_df['y'] * sorted_df['w']).cumsum()
sorted_df["cumsum_y_ct"] = (sorted_df['y'] * (1 - sorted_df['w'])).cumsum()

This is what the table looks like:

And then I calculate the lift:

lift=[]
lift.append(sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"] - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"])
lift = pd.concat(lift, join="inner", axis=1)
lift.loc[0] = np.zeros((lift.shape[1],))
lift = lift.sort_index().interpolate()

This is what the final result looks like:

I plotted the difference between the result from get_cumlift() and the manual calculation.

Does anyone know why they are different?

Environment (please complete the following information):

OS: Windows
Python Version: 3.8
Versions of Major Dependencies (pandas, scikit-learn, cython): pandas==1.3.5, scikit-learn==1.0.2, cython==0.29.34


ras44 · 2023-11-16T17:57:50Z

Hi @AmyLin0515

A couple of ideas:

See the code for get_cumlift here:

causalml/causalml/metrics/visualize.py

Lines 54 to 135 in c154afe

    
    def get_cumlift(
        df, outcome_col="y", treatment_col="w", treatment_effect_col="tau", random_seed=42
    ):
        """Get average uplifts of model estimates in cumulative population.

        If the true treatment effect is provided (e.g. in synthetic data), it's calculated
        as the mean of the true treatment effect in each of cumulative population.
        Otherwise, it's calculated as the difference between the mean outcomes of the
        treatment and control groups in each of cumulative population.

        For details, see Section 4.1 of Gutierrez and G{\'e}rardy (2016), `Causal Inference
        and Uplift Modeling: A review of the literature`.

        For the former, `treatment_effect_col` should be provided. For the latter, both
        `outcome_col` and `treatment_col` should be provided.

        Args:
            df (pandas.DataFrame): a data frame with model estimates and actual data as columns
            outcome_col (str, optional): the column name for the actual outcome
            treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
            treatment_effect_col (str, optional): the column name for the true treatment effect
            random_seed (int, optional): random seed for numpy.random.rand()

        Returns:
            (pandas.DataFrame): average uplifts of model estimates in cumulative population
        """

        assert (
            (outcome_col in df.columns)
            and (treatment_col in df.columns)
            or treatment_effect_col in df.columns
        )

        df = df.copy()
        np.random.seed(random_seed)
        random_cols = []
        for i in range(10):
            random_col = "__random_{}__".format(i)
            df[random_col] = np.random.rand(df.shape[0])
            random_cols.append(random_col)

        model_names = [
            x
            for x in df.columns
            if x not in [outcome_col, treatment_col, treatment_effect_col]
        ]

        lift = []
        for i, col in enumerate(model_names):
            sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
            sorted_df.index = sorted_df.index + 1

            if treatment_effect_col in sorted_df.columns:
                # When treatment_effect_col is given, use it to calculate the average treatment effects
                # of cumulative population.
                lift.append(sorted_df[treatment_effect_col].cumsum() / sorted_df.index)
            else:
                # When treatment_effect_col is not given, use outcome_col and treatment_col
                # to calculate the average treatment_effects of cumulative population.
                sorted_df["cumsum_tr"] = sorted_df[treatment_col].cumsum()
                sorted_df["cumsum_ct"] = sorted_df.index.values - sorted_df["cumsum_tr"]
                sorted_df["cumsum_y_tr"] = (
                    sorted_df[outcome_col] * sorted_df[treatment_col]
                ).cumsum()
                sorted_df["cumsum_y_ct"] = (
                    sorted_df[outcome_col] * (1 - sorted_df[treatment_col])
                ).cumsum()

                lift.append(
                    sorted_df["cumsum_y_tr"] / sorted_df["cumsum_tr"]
                    - sorted_df["cumsum_y_ct"] / sorted_df["cumsum_ct"]
                )

        lift = pd.concat(lift, join="inner", axis=1)
        lift.loc[0] = np.zeros((lift.shape[1],))
        lift = lift.sort_index().interpolate()

        lift.columns = model_names
        lift[RANDOM_COL] = lift[random_cols].mean(axis=1)
        lift.drop(random_cols, axis=1, inplace=True)

        return lift

Note that get_cumlift iterates over at least 10 random orderings, plus one more ordering for every other column in your input df besides outcome_col, treatment_col, and treatment_effect_col:

causalml/causalml/metrics/visualize.py

Lines 90 to 93 in c154afe

    
    for i in range(10):
        random_col = "__random_{}__".format(i)
        df[random_col] = np.random.rand(df.shape[0])
        random_cols.append(random_col)

causalml/causalml/metrics/visualize.py

Lines 102 to 104 in c154afe

    
    for i, col in enumerate(model_names):
        sorted_df = df.sort_values(col, ascending=False).reset_index(drop=True)
        sorted_df.index = sorted_df.index + 1

Also, if treatment_effect_col is provided, it is used to calculate the average treatment effect (ATE) of the cumulative population, instead of outcome_col and treatment_col:

causalml/causalml/metrics/visualize.py

Lines 106 to 108 in c154afe

    
    if treatment_effect_col in sorted_df.columns:
        # When treatment_effect_col is given, use it to calculate the average treatment effects
        # of cumulative population.

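I'm not sure whether you are providing treatment_effect_col (e.g. with synthetic data), but if you are, the second point applies: the lift is computed from the true treatment effect rather than from the outcome and treatment columns.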
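To make the treatment_effect_col branch concrete, here is a minimal sketch of the same cumulative-mean calculation on a toy frame. The helper name `cumulative_mean_effect` and the toy data are made up for illustration and are not part of causalml; only the sort, 1-based reindex, and `cumsum() / index` steps mirror the snippet quoted above.

```python
import pandas as pd


def cumulative_mean_effect(df, score_col, tau_col="tau"):
    """Mean true effect among the top-k rows ranked by score_col, for k = 1..n.

    Hypothetical helper mirroring the treatment_effect_col branch of get_cumlift.
    """
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    ranked.index = ranked.index + 1  # 1-based, so index k is the cumulative population size
    return ranked[tau_col].cumsum() / ranked.index


# Toy data (assumed): four rows with a model score and a known true effect.
toy = pd.DataFrame({"score": [0.9, 0.1, 0.5, 0.7], "tau": [2.0, 0.0, 1.0, 3.0]})
result = cumulative_mean_effect(toy, "score")
print(result.tolist())  # [2.0, 2.5, 2.0, 1.5]
```

Note that in this branch the outcome and treatment columns are never read, so a manual calculation based on y and w would not be expected to match it.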

If you're not providing treatment_effect_col, the first point still applies: the random ordering is repeated 10 times, and the lift results are then interpolated.
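As a rough illustration of that first point, the sketch below rebuilds the outcome-based calculation on synthetic data. The helper `cumlift_for_column`, the column name `model_score`, and the data are assumptions made for this example; the steps themselves follow the get_cumlift code quoted above. Going by that code, each column (model or random) gets its own curve from its own ordering, and the 10 random curves are then averaged into a single baseline column.

```python
import numpy as np
import pandas as pd


def cumlift_for_column(df, score_col, outcome_col="y", treatment_col="w"):
    """Cumulative difference in mean outcome (treated minus control), ranked by score_col.

    Hypothetical helper mirroring the outcome/treatment branch of get_cumlift.
    """
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    ranked.index = ranked.index + 1
    n_tr = ranked[treatment_col].cumsum()
    n_ct = ranked.index.values - n_tr
    y_tr = (ranked[outcome_col] * ranked[treatment_col]).cumsum()
    y_ct = (ranked[outcome_col] * (1 - ranked[treatment_col])).cumsum()
    # Rows seen before the first treated (or first control) unit give 0/0 = NaN.
    return y_tr / n_tr - y_ct / n_ct


# Synthetic data (assumed): random outcome, random treatment flag, random score.
rng = np.random.RandomState(42)
n = 200
toy = pd.DataFrame(
    {"y": rng.rand(n), "w": rng.randint(0, 2, size=n), "model_score": rng.rand(n)}
)

# Ten random score columns, as in get_cumlift.
random_cols = []
for i in range(10):
    name = "__random_{}__".format(i)
    toy[name] = rng.rand(n)
    random_cols.append(name)

cols = ["model_score"] + random_cols
curves = pd.concat([cumlift_for_column(toy, c) for c in cols], axis=1)
curves.columns = cols
curves.loc[0] = 0.0                          # anchor every curve at zero
curves = curves.sort_index().interpolate()   # fills the leading NaN rows
curves["Random"] = curves[random_cols].mean(axis=1)
curves = curves.drop(columns=random_cols)
```

Under these assumptions, every curve ends at the same value on the last row (the overall difference in mean outcomes), because the full population is the same whatever the ordering.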

FYI, also see work in #707

AmyLin0515 · 2023-12-19T18:05:36Z

Hi @ras44! Thanks for the insights. I found that the difference decreased a lot after I added 10 random columns and included them in the sort. However, I don't understand why we need to add these random columns. And if the order ends up being determined by the final (10th) random column, what is the point of adding so many of them?

AmyLin0515 added the bug (Something isn't working) label on Nov 14, 2023
