
Handling Data with Interventions #181

Open
chrisquatjr opened this issue Apr 22, 2024 · 4 comments

Comments

@chrisquatjr

Hi,

Thank you for the excellent repository! Exploring the causal discovery tools here has been exciting. I had been using the data from Sachs et al. 2005 directly before realizing it is already available as an internal dataset. I have run into some confusion, however, and thought I would ask here.

For clarity's sake, here is how I am loading in the internal dataset:

import pandas as pd
from causallearn.utils.Dataset import load_dataset

data, labels = load_dataset(dataset_name="sachs")
df_internal = pd.DataFrame(data=data, columns=labels)

From what I can tell, the internal implementation is some subset of the 14 Excel tables one obtains by downloading the data from the paper directly. First, I noticed the internal dataset contains exactly the same columns as all 14 of the Excel tables from the paper. The rows, by contrast, differ substantially: there are on the order of 11 thousand rows across all 14 tables, but only around 7 thousand rows in the internal dataset. (I also confirmed that the first 5 rows of the internal dataset exactly match those of the 1. cd3cd28.xls file, so it does not look like any normalization or processing has altered the values themselves.)

Taken together, it seems the internal dataset is a row-joined subset of the original Sachs dataset. Is this a correct assessment? If so, which of the tables are included, and why aren't all conditions included?
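For what it's worth, here is how I checked the row-join hypothesis. This is only a sketch: the two toy frames below stand in for the 14 Excel tables, which in practice I build with pd.read_excel on each downloaded file.

```python
import pandas as pd

# Toy frames standing in for two of the 14 Excel tables
# (in practice: tables = [pd.read_excel(path) for path in excel_paths]).
tables = [
    pd.DataFrame({"raf": [1.0, 2.0], "mek": [3.0, 4.0]}),
    pd.DataFrame({"raf": [5.0], "mek": [6.0]}),
]

# Row-join (concatenate) all tables into one frame.
combined = pd.concat(tables, ignore_index=True)

# If the internal dataset were a row-joined subset, its length should be
# at most the combined length, and its first rows should match one of
# the source tables exactly.
print(len(combined))  # 3
```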

Please let me know if I have simply missed some tutorial or documentation somewhere. Any assistance would be greatly appreciated.

Overall, my goal is to reproduce the graph seen in Figure 3A. I know the authors used a simulated annealing approach, but I want to try more recent methods.

@jdramsey
Collaborator

jdramsey commented Apr 22, 2024 via email

@chrisquatjr
Author

Thank you for the great explanation! I have been following the paper you suggested and was able to reproduce everything using Tetrad's GUI, which I switched to because I do not see an implementation of FASK in this library (let me know if I simply missed it). I followed the paper up to this point:

"After running FASK, we deleted the intervention variables from the resulting graph keeping only the graph over the measured variables."

I am not sure how to do this in Tetrad. I do not see anything in the manual about deleting or removing variables in this way. The data also does not appear to conform to Tetrad's "status and value" convention. If I adjust the data to conform to that format, will Tetrad automatically know not to include these variables in the graph output?
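In case it helps others reading along, the "delete the intervention variables from the resulting graph" step can also be done as post-processing outside Tetrad. A minimal sketch using networkx, assuming you have exported the learned graph's edge list; the toy edges and the intervention variable names below are illustrative, not the actual FASK output:

```python
import networkx as nx

# Toy directed graph standing in for a FASK result over
# measured + intervention variables.
g = nx.DiGraph()
g.add_edges_from([
    ("cd3_cd28", "raf"),   # intervention -> measured
    ("raf", "mek"),        # measured -> measured
    ("icam2", "plc"),      # intervention -> measured
])

# Assumed intervention variable names, for illustration only.
intervention_vars = {"cd3_cd28", "icam2"}

# Remove the intervention nodes; edges among the remaining
# measured variables are kept as-is.
g.remove_nodes_from(intervention_vars)

print(sorted(g.nodes()))  # ['mek', 'plc', 'raf']
print(list(g.edges()))    # [('raf', 'mek')]
```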

@jdramsey
Collaborator

Oh my gosh, I missed your message! Let me think how to respond.

@jdramsey
Collaborator

jdramsey commented May 25, 2024

Ah I see. Here's the data:

https://github.com/cmu-phil/example-causal-datasets/blob/main/real/sachs/data/sachs.2005.logxplus10.jittered.eperimental.continuous.txt

The intervention variables are all of the variables after 'jnk'; these are experimental variables that have been jittered with a small amount of Gaussian noise.
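Given that convention, the column split is easy to do in pandas. A sketch, with a toy frame standing in for the linked tab-separated file (which would be read with pd.read_csv(path, sep="\t")); the intervention column names here are assumed for illustration:

```python
import pandas as pd

# Toy frame standing in for the linked dataset; intervention
# column names after 'jnk' are assumed for illustration.
df = pd.DataFrame(
    [[1.0] * 5],
    columns=["raf", "mek", "jnk", "cd3_cd28", "icam2"],
)

# Everything after 'jnk' is an intervention variable.
cut = list(df.columns).index("jnk") + 1
measured = df.iloc[:, :cut]
interventions = df.iloc[:, cut:]

print(list(measured.columns))       # ['raf', 'mek', 'jnk']
print(list(interventions.columns))  # ['cd3_cd28', 'icam2']
```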
