DynamicDML() issue: AttributeError: Provided crossfit folds contain training splits that don't contain all treatments DynamicDML #859

Open
samanbanafti opened this issue Mar 6, 2024 · 5 comments

Comments

@samanbanafti

samanbanafti commented Mar 6, 2024

Hello,

When calling DynamicDML() as such:

est = DynamicDML(model_y = model_y, model_t=model_t, discrete_treatment=True)
est.fit(Y, T, X=X, W=None, groups=groups)

Here Y, T, X and groups are in long format and have the following shapes:

((32382,), (32382,), (32382, 8), (32382,))

where n=N*Time=32,382; N=1542 cross-sectional units and Time=21 months and groups has N distinct ids corresponding to the distinct cross-sectional units.

I have already balanced the panel. T is a binary, discrete treatment. The default value for discrete_treatment is False; when instantiating with discrete_treatment=True I get:

AttributeError: Provided crossfit folds contain training splits that don't contain all treatments

arising from

if np.any(np.all(Target == 0, axis=0)) or (not np.any(np.all(Target == 0, axis=1))):
    raise AttributeError("Provided crossfit folds contain training splits that " +
                         "don't contain all treatments")

It appears Target is a one-hot encoding of T; if so, the condition (not np.any(np.all(pd.get_dummies(T, dtype=int) == 0, axis=1))) is True, leading to the AttributeError. I code T as follows: for each cross-sectional unit and month, T=0 if the unit is not yet treated and T=1 from the month it becomes treated onward, while controls have T=0 in all months. I imagine this is fine?
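To make the quoted check concrete, here is a minimal NumPy sketch. It assumes, which this thread does not confirm, that Target is a one-hot encoding with the first (control) category dropped, so that control rows are all zeros; under a full encoding such as the pd.get_dummies call, no row is ever all zeros and the condition is always True. The helper names one_hot and check_fails are hypothetical and are not part of econml.

```python
import numpy as np

def one_hot(t, drop_first):
    """One-hot encode labels; optionally drop the first category's column."""
    levels = np.unique(t)
    encoded = (t[:, None] == levels[None, :]).astype(int)
    return encoded[:, 1:] if drop_first else encoded

def check_fails(target):
    """Mirror of the quoted condition: True means the AttributeError would be raised."""
    some_column_all_zero = np.any(np.all(target == 0, axis=0))  # a treatment level never appears
    no_row_all_zero = not np.any(np.all(target == 0, axis=1))   # no all-zero (control) row
    return bool(some_column_all_zero or no_row_all_zero)

t = np.array([0, 0, 1, 1, 0, 1])

full = one_hot(t, drop_first=False)    # two columns, every row has exactly one 1
dropped = one_hot(t, drop_first=True)  # one column, control rows are all zeros

print(check_fails(full))               # full encoding: no all-zero row exists
print(check_fails(dropped))            # drop-first encoding with both levels present
print(check_fails(dropped[t == 1]))    # training split with treated rows only
print(check_fails(dropped[t == 0]))    # training split with control rows only
```

Under that assumption, the check passes only when a training split contains at least one control row and at least one row of every other treatment level.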

I'm using RandomForestClassifier for model_t and GradientBoostingRegressor for model_y.

The correct instantiation is the one with discrete_treatment=True, so that is the error I am more concerned about; the rest is for full context.

With discrete_treatment=False, I get the following message instead:
Co-variance matrix is underdetermined. Inference will be invalid!

This holds with or without X, which has been standardized so that each feature has zero mean and unit variance.

Thanks,

Saman

@samanbanafti changed the title from "DynamicDML() issue: Co-variance matrix is underdetermined. Inference will be invalid! and/or AttributeError: Provided crossfit folds contain training splits that don't contain all treatments DynamicDML" to "DynamicDML() issue: AttributeError: Provided crossfit folds contain training splits that don't contain all treatments DynamicDML" on Mar 6, 2024
@adavis-85

Was this solved? I'm having the same error using dowhy.fit.

@samanbanafti
Author

No, I have not heard back yet

@adavis-85

I realized my instrument variable wasn't binary, and it had to be. I'm using EconML and DoWhy in tandem, following the sample notebooks online for EconML.

@kbattocchi
Collaborator

Sorry for the slow response - a couple of thoughts:

  1. It would help if you could provide a simplified repro; are a significant number of units treated and untreated at each time point?
  2. The DynamicDML class is intended for scenarios where treatments may be repeated, so it is not necessary to keep T=1 after the time of first treatment unless the units are actually continuing to receive treatment.
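The first question above, whether enough units are treated and untreated at each time point, can be answered with a quick count. The sketch below is only an illustration on a toy balanced panel with staggered adoption; the helper treatment_counts_by_period and the periods array are hypothetical names, and the toy data is not the data from this issue.

```python
import numpy as np

def treatment_counts_by_period(T, periods):
    """Return {period: (n_untreated, n_treated)} for a binary treatment in long format."""
    counts = {}
    for p in np.unique(periods):
        in_period = periods == p
        n_treated = int(np.sum(T[in_period] == 1))
        counts[int(p)] = (int(np.sum(in_period)) - n_treated, n_treated)
    return counts

# Toy balanced panel: 4 units x 3 periods, rows ordered by unit and then by period.
n_units, n_periods = 4, 3
periods = np.tile(np.arange(n_periods), n_units)

# Staggered adoption: unit 0 is never treated; units 1, 2, 3 start at periods 2, 1, 0.
start = np.array([n_periods, 2, 1, 0])
T = (periods >= np.repeat(start, n_periods)).astype(int)

for period, (n0, n1) in treatment_counts_by_period(T, periods).items():
    print(f"period {period}: untreated={n0}, treated={n1}")
```

A period in which one of the two counts is zero or very small is a candidate explanation for training splits that lack a treatment level.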

@benTC74

benTC74 commented May 9, 2024

Hi @kbattocchi and @samanbanafti, I am wondering how this issue can be solved. I encountered the same problem when using CausalForestDML with dowhy.fit and discrete_treatment=True. My treatment is a categorical variable (category dtype) with values such as "High Impact", "Medium Impact" and "Low Impact". It worked when I used the model on a continuous treatment variable, though in that case model_t was not a RandomForestClassifier and discrete_treatment was False.

Code:

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# max_depth, max_features, min_samples_split, X, Y, T, groups and the
# column-name lists used below are defined earlier (not shown).
first_stage_reg = lambda: GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=1000),
    param_grid={
        'max_depth': max_depth,
        'max_features': max_features,
        'min_samples_split': min_samples_split
    }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')

first_stage_class = lambda: GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=1000),
    param_grid={
        'max_depth': max_depth,
        'max_features': max_features,
        'min_samples_split': min_samples_split
    }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')

model_y = first_stage_reg().fit(X, Y).best_estimator_
model_t = first_stage_class().fit(X, T).best_estimator_

est_nonparam = CausalForestDML(model_y=model_y, model_t=model_t, discrete_treatment=True, n_estimators=1000, cv=5)

est_nonparam_dw = est_nonparam.dowhy.fit(
    Y, T, X, W=None, groups=groups,
    outcome_names=target_feature1,
    treatment_names=['RegulatoryIndex'],
    feature_names=Agg_df_imputed_transformed.iloc[
        :, ~Agg_df_imputed_transformed.columns.isin(
            ['RegulatoryIndex'] + target_features + country_indicator)
    ].columns.tolist(),
    inference='blb')

Error:
One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
econml has not been tested with dowhy versions >= 0.11

AttributeError Traceback (most recent call last)
Cell In[232], line 27
23 model_t = first_stage_class().fit(X, T).best_estimator_
25 est_nonparam = CausalForestDML(model_y=model_y, model_t=model_t, discrete_treatment=True, n_estimators=1000, cv=5)
---> 27 est_nonparam_dw = est_nonparam.dowhy.fit(Y, T, X, W=None, groups=groups,
28 outcome_names=target_feature1,
29 treatment_names=['RegulatoryIndex'],
30 feature_names=Agg_df_imputed_transformed.iloc[:, ~Agg_df_imputed_transformed.columns.isin(['RegulatoryIndex']+
31 target_features+
32 country_indicator)].columns.tolist(),
33 inference='blb')

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dowhy.py:180, in DoWhyWrapper.fit(self, Y, T, X, W, Z, outcome_names, treatment_names, feature_names, confounder_names, instrument_names, graph, estimand_type, proceed_when_unidentifiable, missing_nodes_as_confounders, control_value, treatment_value, target_units, **kwargs)
178 for p in self.get_params():
179 init_params[p] = getattr(self.cate_estimator, p)
--> 180 self.estimate_ = self.dowhy_.estimate_effect(self.identified_estimand_,
181 method_name=method_name,
182 control_value=control_value,
183 treatment_value=treatment_value,
184 target_units=target_units,
185 method_params={
186 "init_params": init_params,
187 "fit_params": kwargs,
188 },
189 )
190 return self

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_model.py:360, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
349 causal_estimator = causal_estimator_class(
350 identified_estimand,
351 test_significance=test_significance,
(...)
355 **extra_args,
356 )
358 self._estimator_cache[method_name] = causal_estimator
--> 360 return estimate_effect(
361 self._data,
362 self._treatment,
363 self._outcome,
364 identifier_name,
365 causal_estimator,
366 control_value,
367 treatment_value,
368 target_units,
369 effect_modifiers,
370 fit_estimator,
371 method_params,
372 )

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_estimator.py:719, in estimate_effect(data, treatment, outcome, identifier_name, estimator, control_value, treatment_value, target_units, effect_modifiers, fit_estimator, method_params)
714 return CausalEstimate(
715 None, None, None, None, None, None, control_value=control_value, treatment_value=treatment_value
716 )
718 if fit_estimator:
--> 719 estimator.fit(
720 data=data,
721 effect_modifier_names=effect_modifiers,
722 **method_params["fit_params"] if "fit_params" in method_params else {},
723 )
725 estimate = estimator.estimate_effect(
726 data,
727 treatment_value=treatment_value,
(...)
730 confidence_intervals=estimator._confidence_intervals,
731 )
733 if estimator._significance_test:

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_estimators\econml.py:194, in Econml.fit(self, data, effect_modifier_names, **kwargs)
190 estimator_named_args = estimator_argspec.args + estimator_argspec.kwonlyargs
191 estimator_data_args = {
192 arg: named_data_args[arg] for arg in named_data_args.keys() if arg in estimator_named_args
193 }
--> 194 self.estimator.fit(**estimator_data_args, **kwargs)
196 return self

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dml\causal_forest.py:854, in CausalForestDML.fit(self, Y, T, X, W, sample_weight, groups, cache_values, inference)
852 if X is None:
853 raise ValueError("This estimator does not support X=None!")
--> 854 return super().fit(Y, T, X=X, W=W,
855 sample_weight=sample_weight, groups=groups,
856 cache_values=cache_values,
857 inference=inference)

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dml\_rlearner.py:422, in _RLearner.fit(self, Y, T, X, W, sample_weight, freq_weight, sample_var, groups, cache_values, inference)
385 """
386 Estimate the counterfactual model from data, i.e. estimates function :math:`\\theta(\\cdot)`.
387
(...)
419 self: _RLearner instance
420 """
421 # Replacing fit from _OrthoLearner, to enforce Z=None and improve the docstring
--> 422 return super().fit(Y, T, X=X, W=W,
423 sample_weight=sample_weight, freq_weight=freq_weight, sample_var=sample_var, groups=groups,
424 cache_values=cache_values,
425 inference=inference)

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\_cate_estimator.py:131, in BaseCateEstimator._wrap_fit.<locals>.call(self, Y, T, inference, *args, **kwargs)
129 inference.prefit(self, Y, T, *args, **kwargs)
130 # call the wrapped fit method
--> 131 m(self, Y, T, *args, **kwargs)
132 self._postfit(Y, T, *args, **kwargs)
133 if inference is not None:
134 # NOTE: we call inference fit after calling the main fit method

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\_ortho_learner.py:832, in _OrthoLearner.fit(self, Y, T, X, W, Z, sample_weight, freq_weight, sample_var, groups, cache_values, inference, only_final, check_input)
830 nuisances, fitted_models, new_inds, scores = ray.get(self.nuisances_ref[idx])
831 else:
--> 832 nuisances, fitted_models, new_inds, scores = self._fit_nuisances(
833 Y, T, X, W, Z, sample_weight=sample_weight_nuisances, groups=groups)
834 all_nuisances.append(nuisances)
835 self._models_nuisance.append(fitted_models)

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\_ortho_learner.py:982, in _OrthoLearner._fit_nuisances(self, Y, T, X, W, Z, sample_weight, groups)
979 else:
980 folds = splitter.split(to_split, strata)
--> 982 nuisances, fitted_models, fitted_inds, scores = _crossfit(self._ortho_learner_model_nuisance, folds,
983 self.use_ray, self.ray_remote_func_options, Y, T,
984 X=X, W=W, Z=Z, sample_weight=sample_weight,
985 groups=groups)
986 return nuisances, fitted_models, fitted_inds, scores

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\_ortho_learner.py:284, in _crossfit(models, folds, use_ray, ray_remote_fun_option, *args, **kwargs)
282 nuisance_temp, model_out, score_temp = ray.get(fold_refs[idx])
283 else:
--> 284 nuisance_temp, model_out, score_temp = _fit_fold(model, train_idxs, test_idxs,
285 calculate_scores, accumulated_args, kwargs)
287 if idx == 0:
288 nuisances = tuple([np.full((n,) + nuis.shape[1:], np.nan)
289 for nuis in nuisance_temp])

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\_ortho_learner.py:99, in _fit_fold(model, train_idxs, test_idxs, calculate_scores, args, kwargs)
96 kwargs_train = {key: var[train_idxs] for key, var in kwargs.items()}
97 kwargs_test = {key: var[test_idxs] for key, var in kwargs.items()}
---> 99 model.train(False, None, *args_train, **kwargs_train)
100 nuisance_temp = model.predict(*args_test, **kwargs_test)
102 if not isinstance(nuisance_temp, tuple):

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dml\_rlearner.py:53, in _ModelNuisance.train(self, is_selecting, folds, Y, T, X, W, Z, sample_weight, groups)
51 def train(self, is_selecting, folds, Y, T, X=None, W=None, Z=None, sample_weight=None, groups=None):
52 assert Z is None, "Cannot accept instrument!"
---> 53 self._model_t.train(is_selecting, folds, X, W, T, **
54 filter_none_kwargs(sample_weight=sample_weight, groups=groups))
55 self._model_y.train(is_selecting, folds, X, W, Y, **
56 filter_none_kwargs(sample_weight=sample_weight, groups=groups))
57 return self

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dml\dml.py:91, in _FirstStageSelector.train(self, is_selecting, folds, X, W, Target, sample_weight, groups)
86 if self._discrete_target:
87 # In this case, the Target is the one-hot-encoding of the treatment variable
88 # We need to go back to the label representation of the one-hot so as to call
89 # the classifier.
90 if np.any(np.all(Target == 0, axis=0)) or (not np.any(np.all(Target == 0, axis=1))):
---> 91 raise AttributeError("Provided crossfit folds contain training splits that " +
92 "don't contain all treatments")
93 Target = inverse_onehot(Target)
95 self._model.train(is_selecting, folds, _combine(X, W, Target.shape[0]), Target,
96 **filter_none_kwargs(groups=groups, sample_weight=sample_weight))

AttributeError: Provided crossfit folds contain training splits that don't contain all treatments

Really appreciate if any help can be provided! Thank you very much in advance!!!!
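One way to narrow this down is to check, before fitting, whether every training split actually contains every treatment level. The sketch below is a rough diagnostic, not econml's own logic: the helper folds_missing_levels is hypothetical, and the leave-one-group-out folds are built by hand purely for illustration, so the splitter econml uses internally may produce different splits.

```python
import numpy as np

def folds_missing_levels(T, folds):
    """For each (train_idx, test_idx) pair, list treatment levels absent from the training split."""
    all_levels = set(np.unique(T).tolist())
    report = []
    for train_idx, _ in folds:
        seen = set(np.unique(T[train_idx]).tolist())
        report.append(sorted(all_levels - seen))
    return report

T = np.array(["Low Impact", "Low Impact", "Medium Impact", "Medium Impact",
              "High Impact", "High Impact"])
groups = np.array([0, 0, 1, 1, 2, 2])  # toy case: each group holds a single treatment level

# Hand-built leave-one-group-out folds (an assumption made for this illustration).
folds = [(np.flatnonzero(groups != g), np.flatnonzero(groups == g))
         for g in np.unique(groups)]

for i, missing in enumerate(folds_missing_levels(T, folds)):
    print(f"fold {i}: levels missing from training split = {missing}")
```

If any fold reports a missing level, the first-stage check shown in the traceback would plausibly raise; this is more likely when a treatment level is confined to a few groups. Separately, and only as a guess from the posted code, the non-finite test scores may come from scoring the classifier grid search with 'neg_mean_squared_error', a regression metric that cannot be computed on string class labels.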
