I have been exploring deepchecks for setting up some model monitoring checks. It's an awesome package! From what I could understand from the online resources, Cramér's V gives the magnitude of association between 2 variables. From Wikipedia: "Cramér's V varies from 0 to 1 (complete association) and can reach 1 only when each variable is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation." So, this is my understanding: if the Cramér's V value is closer to 1, then the test and train distributions for the particular feature are strongly associated (less drift), and if it is closer to 0, then weakly associated (more drift). Please point out if I am wrong here (since the implementation flags it as the complete opposite).
@arpan-sil Thank you so much for your question!

So, this is right - Cramér's V is closer to 1 the more correlated the 2 variables are. However, our "2 variables" here are not "variable A in train" and "variable A in test".

Cramér's V (and by proxy, chi-square) checks for co-occurrences in the data - meaning, how often did feature A have value x while feature B had value y. It counts all of these co-occurrences and uses them to calculate the chi-square statistic, and Cramér's V is just a simple normalization of that statistic. However, when comparing 2 different datasets (train and test), there are no co-occurrences: it is the same feature in different datasets with no overlap (samples cannot repeat in train and test).

So what do we do? We don't compare the feature's distribution in train to its distribution in test directly, as there are no co-occurrences. Instead, we create another "dummy" variable which has 2 values - "train" and "test" - and then we compare the feature against that variable.

So, in conclusion - what you wrote here is completely correct: Cramér's V is a measure of association. We just use it to check the association between the feature's values and the dataset each sample came from. A high score therefore means the feature is strongly associated with its dataset of origin, which is exactly what drift looks like. I hope this helps, if you have any further questions I would be happy to answer them :)
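To make the "dummy" variable trick concrete, here is a minimal sketch of the idea, not deepchecks' actual implementation: the `cramers_v` helper and the toy train/test data are hypothetical, and the score is the plain Pearson chi-square normalization (no bias correction).

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical sequences (hypothetical helper).

    Builds the contingency table of co-occurrence counts, computes the
    chi-square statistic, and normalizes it into [0, 1].
    """
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)  # count each (x value, y value) pair
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Toy feature with a clearly drifted distribution (assumed data):
train_feature = ["a"] * 80 + ["b"] * 20
test_feature = ["a"] * 20 + ["b"] * 80

# The "dummy" variable records which dataset each sample came from.
values = train_feature + test_feature
origin = ["train"] * len(train_feature) + ["test"] * len(test_feature)

# High association between value and origin => the feature has drifted.
drift_score = cramers_v(values, origin)
print(drift_score)  # 0.6 for this toy data
```

If train and test had identical distributions, the feature's value would tell you nothing about which dataset a sample came from, the contingency table would match its expected counts, and the score would be 0 - which is why a score near 1 signals drift rather than agreement.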