I have been exploring deepchecks for setting up some model monitoring checks. It's an awesome package! From what I could understand from the online resources, Cramér's V gives the magnitude of association between 2 variables. From Wikipedia: "Cramér's V varies from 0 to 1 (complete association) and can reach 1 only when each variable is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation." So, this is my understanding: if the Cramér's V value is closer to 1, then the test and train distributions for the particular feature are strongly associated (less drift), and if it is closer to 0, then weakly associated (more drift). Please point out if I am wrong here (since the implementation flags it as the complete opposite).
@arpan-sil Thank you so much for your question!

So, this is right - Cramér's V is closer to 1 the more correlated the 2 variables are. However, our "2 variables" here are not "variable A in train" and "variable A in test".

Cramér's V (and by proxy, chi-square) checks for co-occurrences in the data - meaning, how often did feature A have value x while feature B had value y. It counts all of these co-occurrences and uses them to calculate the chi-square statistic, and Cramér's V is just a simple normalization of that statistic. However, when comparing 2 different datasets (train and test), there are no co-occurrences: it is the same feature in different datasets with no overlap (samples cannot repeat in train and test).

So what do we do? We don't compare the feature's distribution in train to its distribution in test directly, as there are no co-occurrences. Instead, we create another "dummy" variable which has 2 values - "train" and "test" - and then we compare the feature against that variable.

So, in conclusion - what you wrote here is completely correct: Cramér's V is a measure of association. We just use it to check the association between the feature's values and the dataset each sample came from. A high score therefore means the feature is strongly associated with its dataset of origin, which is exactly what drift looks like. I hope this helps, if you have any further questions I would be happy to answer them :)
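To make the "dummy" variable trick concrete, here is a minimal sketch of the idea, not deepchecks' actual implementation: the `cramers_v` helper and the toy train/test data are hypothetical, and the score is the plain Pearson chi-square normalization (no bias correction).

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical sequences (hypothetical helper).

    Builds the contingency table of co-occurrence counts, computes the
    chi-square statistic, and normalizes it into [0, 1].
    """
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)  # count each (x value, y value) pair
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Toy feature with a clearly drifted distribution (assumed data):
train_feature = ["a"] * 80 + ["b"] * 20
test_feature = ["a"] * 20 + ["b"] * 80

# The "dummy" variable records which dataset each sample came from.
values = train_feature + test_feature
origin = ["train"] * len(train_feature) + ["test"] * len(test_feature)

# High association between value and origin => the feature has drifted.
drift_score = cramers_v(values, origin)
print(drift_score)  # 0.6 for this toy data
```

If train and test had identical distributions, the feature's value would tell you nothing about which dataset a sample came from, the contingency table would match its expected counts, and the score would be 0 - which is why a score near 1 signals drift rather than agreement.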