Is categorical feature currently supported by causalnex with label encoding? #170

Open
tonyabracadabra opened this issue Sep 1, 2022 · 4 comments
Labels
question Further information is requested

Comments

@tonyabracadabra
Contributor

I know that label-encoding categorical variables makes the algorithm run on them, but is it mathematically valid to draw conclusions about their causal relationships once label encoding has been applied?

@oentaryorj added the "question" label (Further information is requested) on Sep 6, 2022
@tonyabracadabra
Contributor Author

tonyabracadabra commented Sep 27, 2022

Hey folks, are there any updates on this question? @oentaryorj @GabrielAzevedoFerreiraQB Any insights would be helpful. I think we might need to handle the independence test for categorical variables separately, and I am not sure whether that is implemented in the system now.

@GabrielAzevedoFerreiraQB
Contributor

Hey Tony,

Hope you are well! Thanks for the great question!

You're absolutely right.

  • NOTEARS does need continuous variables, as you correctly mentioned.
  • A simple label encoding doesn't always make sense. For example, encoding a variable "countries" directly (that is, assigning integers arbitrarily) would not give NOTEARS any signal from which to learn relationships.
  • However, in certain situations such an encoding is still possible:
    • when the variables are binary
    • when the variables have a natural order, say days of the week (to a certain extent)
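
To make the two workable cases concrete, here is a minimal sketch in plain Python. The data, the variable names (`days`, `smoker`) and the helper functions are hypothetical and are not part of CausalNex. It encodes an ordered variable by its position in an explicitly stated order, encodes a binary variable as 0/1, and contrasts this with an alphabetical label encoding whose integers no longer follow the order of the week.

```python
# Hypothetical example data; names and values are illustrative only.
WEEKDAY_ORDER = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def encode_ordinal(values, order):
    """Map each category to its position in an explicitly given order."""
    index = {category: position for position, category in enumerate(order)}
    return [index[v] for v in values]


def encode_binary(values, positive):
    """Map a two-valued variable to 0/1, with `positive` encoded as 1."""
    return [1 if v == positive else 0 for v in values]


days = ["Tue", "Mon", "Sun", "Fri"]
smoker = ["yes", "no", "no", "yes"]

days_encoded = encode_ordinal(days, WEEKDAY_ORDER)
smoker_encoded = encode_binary(smoker, positive="yes")

# For contrast: alphabetical label encoding of the same days. The integer
# order (Fri < Mon < Sun < Tue) no longer matches the order of the week.
alphabetical = encode_ordinal(days, sorted(set(days)))

print(days_encoded)    # [1, 0, 6, 4]
print(smoker_encoded)  # [1, 0, 0, 1]
print(alphabetical)    # [3, 1, 2, 0]
```

The point of passing the order explicitly is that the integers then carry the ordering the variable actually has, rather than whatever order the encoder happens to pick.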

One thing to note, though, is that NOTEARS is not "scale invariant": if we multiply a variable by a constant, the NOTEARS results change. There are discussions on the best way to handle this, but I'd (personally!) recommend thinking more carefully about normalizing the variables when dealing with encoded discrete variables.
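
As an illustration of what normalizing can look like, the sketch below standardizes a column to zero mean and unit standard deviation using only the Python standard library. This is one common choice, assumed here for illustration, and not necessarily the scheme the maintainers have in mind. The `income_*` data is made up. After standardization, the same variable expressed in two different units yields the same values, so the choice of unit no longer affects what the learner sees.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


# Hypothetical column, expressed in dollars and in thousands of dollars.
income_usd = [30_000.0, 45_000.0, 60_000.0, 75_000.0]
income_kusd = [v / 1000 for v in income_usd]

z_usd = standardize(income_usd)
z_kusd = standardize(income_kusd)

# The two standardized columns agree up to floating-point error.
same = all(abs(a - b) < 1e-9 for a, b in zip(z_usd, z_kusd))
print(same)
```

Note that standardizing removes the dependence on units but does not by itself make an arbitrary integer encoding of a nominal variable meaningful.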

@tonyabracadabra
Contributor Author

Thanks Gabriel for answering my question!

I saw that the release notes say "Added categorical distributed data support for pytorch NOTEARS." What does that mean?

Are there any plans to support causal discovery on mixed-type data, based on newly published papers?

@jinowork

In that case, can I use one-hot encoding for categorical variables?
