Questions about econml and CausalForestDML #884

Open
benTC74 opened this issue May 10, 2024 · 0 comments
benTC74 commented May 10, 2024

Hi All,

I have a few questions about using CausalForestDML; any help is much appreciated!

  1. When evaluating the model, how do I know whether it is performing well and giving reliable treatment effect estimates? There is no ground truth to compare against, and no metric like R2 in linear regression that shows how well the model explains the data. Without such a metric, how can I explain to others that the model is reliable? (A minimal sketch of the diagnostics I currently look at follows this list.)

  2. Following on from the first question: can a treatment effect estimate always be trusted as long as it is significant (p-value < 0.05)? And how large do the standard error and confidence interval have to be before they are considered too large? For example, my dependent variable ranges from roughly -2000 to +8000. One treatment effect estimate is around -200 with a standard error of about 60 and a confidence interval from about -300 to -90, while another is around -4500 with a standard error of about 1400 and a confidence interval from about -7000 to -1700. Both are significant.

  3. If I observe only negative values, such as -0.8, in nuisance_scores_y, does that mean the model is bad?

  4. When I combine CausalForestDML with dowhy, it takes quite a long time to run even on a small dataset of only a few hundred observations and around 40 features: roughly 40 minutes to 1.5 hours depending on the treatment variable (probably due to the graph creation). I am using RandomForest as the nuisance models with a grid search over 3 parameters. Is that normal? It is even worse when I use Lasso and LogisticRegression as nuisance models (with polynomial terms added, it has been running for a couple of hours and is still going).

  5. Related to the previous question: can I skip running the estimation through dowhy (i.e. not implement the whole estimation process with dowhy), yet still have the trained model connected to dowhy so I can use its estimate refutation functions, such as adding a random common cause or an unobserved common cause?

  6. Also on combining CausalForestDML and dowhy: when I run it with binary or categorical treatment variables, I hit the error below ("KeyError(f"{not_found} not in index")"), which does not happen with continuous treatment variables. Any idea how to solve this if question 5 is not possible? The full traceback is at the end of this post.

  7. Is there a way in econml to check whether my data violates the positivity/overlap assumption in the propensity score models, for both binary and continuous treatment variables? If not, are there any pointers on how to validate this? It would be really good to know. (A rough check I would try is sketched after this list.)

  8. Just a side question, not really econml-related: I have some categorical variables (e.g. number of products, "Many" vs. "Few") that could be either treatment or control and that are static within each group of the dataset (I have 8 groups in the observations), i.e. they always take the same value within a group. Is that a problem for modelling, especially given that my dataset is very small?
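To make questions 1 and 3 concrete, here is a minimal sketch of how I fit the model and the diagnostics I currently look at. The data here is synthetic and the hyperparameters are placeholders (my real data and settings differ); as I understand it, nuisance_scores_y / nuisance_scores_t are the out-of-fold scores of the first-stage models and score() is the final-stage residual-on-residual error, but I am not sure how to turn these into a reliability argument:

```python
# Minimal sketch with synthetic data; my real data and hyperparameters differ.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from econml.dml import CausalForestDML

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
W = rng.normal(size=(n, 3))
T = X[:, 0] + rng.normal(size=n)                            # continuous treatment
Y = 2.0 * T * (X[:, 1] > 0) + X[:, 2] + rng.normal(size=n)  # synthetic outcome

est = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=20, random_state=0),
    model_t=RandomForestRegressor(min_samples_leaf=20, random_state=0),
    cv=5,
    random_state=0,
)
est.fit(Y, T, X=X, W=W)

# Out-of-fold scores of the first-stage (nuisance) models; this is where I
# see the negative values mentioned in question 3.
print("nuisance_scores_y:", est.nuisance_scores_y)
print("nuisance_scores_t:", est.nuisance_scores_t)

# Final-stage score (as I understand it, the error of the residual-on-residual
# regression, so lower is better), but it is not obvious to me how this shows
# that the treatment estimates themselves are trustworthy.
print("final-stage score:", est.score(Y, T, X=X, W=W))
```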

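And for question 7, since I could not find a built-in overlap diagnostic in econml, this is the kind of rough check I would try for a binary treatment (plain scikit-learn, synthetic data, arbitrary thresholds); I do not know what the analogue would be for a continuous treatment:

```python
# Rough overlap/positivity check for a binary treatment: out-of-fold
# propensity scores, then look at how much mass sits near 0 or 1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 10))
T = (X[:, 0] + rng.normal(size=n) > 0).astype(int)  # binary treatment

# Cross-validated P(T=1 | X), so the scores are not in-sample overfits.
propensity = cross_val_predict(
    RandomForestClassifier(min_samples_leaf=10, random_state=0),
    X, T, cv=5, method="predict_proba",
)[:, 1]

for arm in (0, 1):
    q = np.quantile(propensity[T == arm], [0.01, 0.5, 0.99])
    print(f"T={arm}: propensity 1%/50%/99% quantiles = {np.round(q, 3)}")

# Heuristic: a large share of units with propensities outside, say,
# [0.05, 0.95] would flag regions of poor overlap.
outside = np.mean((propensity < 0.05) | (propensity > 0.95))
print("share outside [0.05, 0.95]:", round(float(outside), 3))
```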
Sorry for all of these long questions, but I am very new to this area and really want to understand it. I appreciate your help!

The error from question 6:

KeyError Traceback (most recent call last)
Cell In[21], line 31
27 est_nonparam.tune(Y, T, X=X, W=None, groups=groups)
29 #est_nonparam.fit(Y, T, X=X, W=None, groups=groups)
---> 31 est_nonparam_dw = est_nonparam.dowhy.fit(Y, T,
32 X=X, W=None, groups=groups,
33 outcome_names=target_feature1,
34 treatment_names=['RegulatoryIndex'],
35 feature_names=df_processed.iloc[:, 4:].columns.tolist())

File ~\AppData\Local\anaconda3\Lib\site-packages\econml\dowhy.py:180, in DoWhyWrapper.fit(self, Y, T, X, W, Z, outcome_names, treatment_names, feature_names, confounder_names, instrument_names, graph, estimand_type, proceed_when_unidentifiable, missing_nodes_as_confounders, control_value, treatment_value, target_units, **kwargs)
178 for p in self.get_params():
179 init_params[p] = getattr(self.cate_estimator, p)
--> 180 self.estimate_ = self.dowhy_.estimate_effect(self.identified_estimand_,
181 method_name=method_name,
182 control_value=control_value,
183 treatment_value=treatment_value,
184 target_units=target_units,
185 method_params={
186 "init_params": init_params,
187 "fit_params": kwargs,
188 },
189 )
190 return self

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_model.py:360, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
349 causal_estimator = causal_estimator_class(
350 identified_estimand,
351 test_significance=test_significance,
(...)
355 **extra_args,
356 )
358 self._estimator_cache[method_name] = causal_estimator
--> 360 return estimate_effect(
361 self._data,
362 self._treatment,
363 self._outcome,
364 identifier_name,
365 causal_estimator,
366 control_value,
367 treatment_value,
368 target_units,
369 effect_modifiers,
370 fit_estimator,
371 method_params,
372 )

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_estimator.py:725, in estimate_effect(data, treatment, outcome, identifier_name, estimator, control_value, treatment_value, target_units, effect_modifiers, fit_estimator, method_params)
718 if fit_estimator:
719 estimator.fit(
720 data=data,
721 effect_modifier_names=effect_modifiers,
722 **method_params["fit_params"] if "fit_params" in method_params else {},
723 )
--> 725 estimate = estimator.estimate_effect(
726 data,
727 treatment_value=treatment_value,
728 control_value=control_value,
729 target_units=target_units,
730 confidence_intervals=estimator._confidence_intervals,
731 )
733 if estimator._significance_test:
734 estimator.test_significance(data, estimate.value, method=estimator._significance_test)

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_estimators\econml.py:248, in Econml.estimate_effect(self, data, treatment_value, control_value, target_units, **_)
245 # Changing shape to a list for a singleton value
246 # Note that self._control_value is assumed to be a singleton value
247 self._treatment_value = parse_state(self._treatment_value)
--> 248 est = self.effect(X_test)
249 ate = np.mean(est, axis=0) # one value per treatment value
251 if len(ate) == 1:

File ~\AppData\Local\anaconda3\Lib\site-packages\dowhy\causal_estimators\econml.py:332, in Econml.effect(self, df, *args, **kwargs)
329 def effect_fun(filtered_df, T0, T1, *args, **kwargs):
330 return self.estimator.effect(filtered_df, T0=T0, T1=T1, *args, **kwargs)
--> 332 Xdf = df[self._effect_modifier_names] if df is not None else df
333 return self.apply_multitreatment(Xdf, effect_fun, *args, **kwargs)

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:3767, in DataFrame.__getitem__(self, key)
3765 if is_iterator(key):
3766 key = list(key)
-> 3767 indexer = self.columns._get_indexer_strict(key, "columns")[1]
3769 # take() does not accept boolean indexers
3770 if getattr(indexer, "dtype", None) == bool:

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:5877, in Index._get_indexer_strict(self, key, axis_name)
5874 else:
5875 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5877 self._raise_if_missing(keyarr, indexer, axis_name)
5879 keyarr = self.take(indexer)
5880 if isinstance(key, Index):
5881 # GH 42790 - Preserve name from an Index

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:5938, in Index._raise_if_missing(self, key, indexer, axis_name)
5936 if use_interval_msg:
5937 key = list(key)
-> 5938 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
5940 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
5941 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['...'],\n dtype='object')] are in the [columns]"

benTC74 changed the title from "CausalForestDML Categorical Treatment Variable Error" to "Questions about econml and CausalForestDML" on May 12, 2024