Data Processing in Pipeline #243

benTC74 · 2024-05-05T09:33:46Z

Hi All,

I am quite new to causal inference and causal ML, so please bear with me if the question sounds stupid.

When using the package, I wanted to standardize the data during training of the nuisance models by using a pipeline. However, my dataset has both categorical (one-hot encoded) and continuous variables, so I cannot simply apply StandardScaler() to all columns. At the same time, I cannot specify by name which columns should receive StandardScaler() in the pipeline, because the DMLDataObject passes a NumPy array to the learners. Would you have any advice on how to proceed? And I assume it would not be ideal to use OneHotEncoder() inside the pipeline, instead of creating the dummies before building the DMLDataObject, right?

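from doubleml import DoubleMLPLR
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

# num_transformer, temp_adjcol_std and dml_data_flex are defined earlier (not shown)
preprocessor_cml = ColumnTransformer(transformers=[
    ('num_transformer', num_transformer, list(temp_adjcol_std.columns))
])

pipeline_cml = make_pipeline(preprocessor_cml, LassoCV(cv=5, max_iter=10000))

# separate, unfitted copies of the pipeline for the two nuisance models
ml_l = clone(pipeline_cml)
ml_m = clone(pipeline_cml)

obj_dml_plr = DoubleMLPLR(dml_data_flex, ml_l=ml_l, ml_m=ml_m)
obj_dml_plr.fit()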

Just out of curiosity: if preprocessing is implemented this way, is it also applied within cross-fitting (so that data leakage is avoided)?
This might not be related to the package, but I am wondering: if I have a categorical treatment variable that is one-hot encoded, should I include the other levels of that variable as controls when I use one level as the treatment in DMLPLR? E.g. for a variable X with possible values [a, b, c, d], X is one-hot encoded into four new columns (X_a, X_b, X_c, X_d). When I set X_a as the treatment, should I exclude X_b, X_c, X_d from the set of control variables?

Thank you very much in advance for all the help!!!

Answered by SvenKlaassen

SvenKlaassen · 2024-05-06T09:55:12Z

Maintainer

Hi,
thank you. These are great questions.

Doesn't the ColumnTransformer also work on NumPy arrays? For example:

import doubleml as dml
import numpy as np
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline

data = make_plr_CCDDHNR2018(alpha=0.5, dim_x=5, return_type='DataFrame')
print(data.head())

ct = ColumnTransformer(
    [("norm", Normalizer(norm='l1'), [0, 1]),  # apply to columns 0 and 1
     ("standard", StandardScaler(), slice(2, 4))],  # apply to columns 2 and 3
     remainder=MinMaxScaler())  # apply to all other columns

np.random.seed(1234)
ml_l = make_pipeline(ct, RandomForestRegressor())
ml_m = make_pipeline(ct, RandomForestRegressor())

obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

dml_plr_obj.fit().summary

With a pipeline, the standardization is refit on the training folds at each cross-fitting step, which avoids data leakage. One could also argue that standardization is such a simple transformation that applying it once to the whole dataset should not noticeably affect the estimation.
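To make the per-fold behaviour concrete, here is a minimal sketch in plain scikit-learn rather than DoubleML itself. It assumes that a simple KFold split stands in for the cross-fitting sample splits, and the synthetic data and variable names are made up for illustration. It checks that the scaler inside each fitted pipeline copy learned its mean from the training fold only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

pipeline = make_pipeline(StandardScaler(), LinearRegression())

# KFold is an assumption here: it plays the role of the cross-fitting splits.
fold_means_match_train = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fitted = clone(pipeline).fit(X[train_idx], y[train_idx])
    scaler = fitted.named_steps["standardscaler"]
    # The scaler only saw the training fold, so its mean equals the
    # training-fold mean rather than the full-sample mean.
    fold_means_match_train.append(
        np.allclose(scaler.mean_, X[train_idx].mean(axis=0))
    )
    # Held-out rows are scaled with the training-fold statistics.
    fitted.predict(X[test_idx])

print(all(fold_means_match_train))
```

Scaling the full dataset up front would instead use statistics that include the held-out rows, which is the (usually small) leakage discussed above.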
On the categorical treatment: since the controls are also used to estimate the treatment probability, I would not include the other dummy columns as controls.
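For the one-hot encoded treatment, the column selection could look like the following sketch. The toy DataFrame, the column names and the `age` control are hypothetical; the only point is how the sibling dummies are kept out of the control set when X_a is the treatment.

```python
import pandas as pd

# Hypothetical toy data: outcome y, categorical treatment X, one control.
df = pd.DataFrame({
    "y": [1.0, 2.0, 0.5, 1.5, 2.5, 0.0],
    "X": ["a", "b", "c", "d", "a", "b"],
    "age": [30, 41, 25, 37, 52, 29],
})

# One-hot encode the treatment into X_a, X_b, X_c, X_d.
dummies = pd.get_dummies(df["X"], prefix="X", dtype=int)
data = pd.concat([df.drop(columns="X"), dummies], axis=1)

treatment = "X_a"
# The other levels of the same variable are not used as controls.
sibling_dummies = [c for c in dummies.columns if c != treatment]
excluded = {"y", treatment, *sibling_dummies}
x_cols = [c for c in data.columns if c not in excluded]

print(x_cols)
```

The resulting list could then be passed on, for example as `dml.DoubleMLData(data, y_col='y', d_cols='X_a', x_cols=x_cols)`. Note that with this setup the comparison group for X_a is the pooled set of all other levels.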

3 replies

benTC74 May 7, 2024
Author

Thank you for the prompt reply!! That definitely makes sense to me. I did not realize that ColumnTransformer also works with positional array indices. Just one more clarifying question: when the ColumnTransformer preprocessing runs, does it first transform the specified columns of the whole dataset, and is the required subset then sliced out and passed to ml_l, ml_m and DML? I am just afraid I am using the wrong index numbers.

SvenKlaassen May 7, 2024
Maintainer

That is a very good point.

The transformer usually transforms the feature input x from fit(x, y) (see documentation).
For the PLR, both regression functions are conditional expectations with respect to the features x only. See line 182 of doubleml-for-py/doubleml/plm/plr.py at commit ba9cc57:

    l_hat = _dml_cv_predict(self._learner['ml_l'], x, y, smpls=smpls, n_jobs=n_jobs_cv,

and line 193 of the same file:

    m_hat = _dml_cv_predict(self._learner['ml_m'], x, d, smpls=smpls, n_jobs=n_jobs_cv,

The ColumnTransformer is therefore applied to the features x, which you can find at obj_dml_data.x (https://github.com/DoubleML/doubleml-for-py/blob/ba9cc5726906d0d5bcef3102435f4e6bf2e8993c/doubleml/plm/plr.py#L161C31-L161C42)

So this can be tricky with multiple treatment columns, as the roles of d and x change.
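One way to avoid guessing the index numbers is to derive them from the feature names. The sketch below is an illustration with made-up names and data: `x_cols` is a hypothetical list that is assumed to be in the same order as the columns of the feature array (with DoubleML and a single treatment, one would take it from `obj_dml_data.x_cols` and check it against `obj_dml_data.x`).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names, assumed to match the column order of x.
x_cols = ["age", "income", "region_north", "region_south"]
continuous = ["age", "income"]

# Translate names into positional indices for use on a NumPy array.
cont_idx = [x_cols.index(name) for name in continuous]

ct = ColumnTransformer(
    [("scale", StandardScaler(), cont_idx)],
    remainder="passthrough",  # leave the one-hot columns untouched
)

# Toy feature array standing in for the x passed to the learners.
x = np.array([
    [30, 50_000, 1, 0],
    [41, 62_000, 0, 1],
    [25, 38_000, 1, 0],
    [52, 71_000, 0, 1],
], dtype=float)

x_t = ct.fit_transform(x)
print(cont_idx)
print(x_t.round(2))
```

Keep in mind that ColumnTransformer places the transformed columns first and the passthrough columns after them, so the output order only matches the input order when the scaled columns already come first, as they do here.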

benTC74 May 8, 2024
Author

Thank you for the reply, really appreciate it!
