[QUESTION] Chapter 2: Definition of similarities is subject to information leakage? #125

Open
liganega opened this issue Mar 19, 2024 · 2 comments


This question is referring to the jupyter notebook of Chapter 2.

===

The code below creates 10 new similarity features based on the locations of the districts.
However, it also passes "median_house_value" as the sample weight.

housing_labels = strat_train_set["median_house_value"].copy()
...
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

Isn't this a form of information leakage into the model?
The model is trained to predict the median house value, so its input features should NOT be derived from that target.
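
To make the concern concrete, here is a minimal pure-Python sketch of how a sample weight can move a cluster center and therefore change the similarity feature built from it. It is not the notebook's ClusterSimilarity class: it assumes a single cluster (where a weighted k-means center reduces to the weighted mean) and a Gaussian RBF similarity, and the coordinates, house values, and function names are all made up for illustration.

```python
import math

def weighted_centroid(points, weights=None):
    """Return the weighted mean of 2-D points; uniform weights if none are given."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    lat = sum(w * p[0] for p, w in zip(points, weights)) / total
    lon = sum(w * p[1] for p, w in zip(points, weights)) / total
    return (lat, lon)

def rbf_similarity(point, center, gamma=1.0):
    """Gaussian RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = (point[0] - center[0]) ** 2 + (point[1] - center[1]) ** 2
    return math.exp(-gamma * sq_dist)

# Hypothetical toy districts (latitude, longitude) and made-up target values.
districts = [(34.0, -118.0), (34.0, -119.0), (35.0, -118.0)]
house_values = [100_000.0, 100_000.0, 400_000.0]

# Center computed from locations only vs. center pulled toward expensive districts.
unweighted_center = weighted_centroid(districts)
weighted_center = weighted_centroid(districts, house_values)

# The similarity feature of the first district differs between the two centers,
# so the target values have influenced a feature the model would see.
feature_unweighted = rbf_similarity(districts[0], unweighted_center)
feature_weighted = rbf_similarity(districts[0], weighted_center)
```

In this toy setup the weighted center shifts toward the third, most expensive district, which is the mechanism by which the target could leak into the features if the weighted fit were reused for training.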

liganega commented Mar 19, 2024

Using "median_house_value" as the sample weight does not make sense, because the target value will not be available when making predictions on new data.
The "median_income" feature, on the other hand, would be an appropriate sample weight.

liganega commented

In fact, the sample_weight option is used only to demonstrate how to use the ClusterSimilarity class and is not passed anywhere after that. There is therefore no information leakage during training.

However, it is still misleading to use "median_house_value" as the value for sample_weight. Using "median_income" instead results in almost the same clustering.
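
The difference between the demonstration call and a fit that omits sample_weight can be sketched as follows. CentroidSimilarity is a hypothetical one-cluster stand-in written in plain Python, not the notebook's ClusterSimilarity; it only mimics the shape of the fit(X, y=None, sample_weight=None) interface, and the data are made up.

```python
import math

class CentroidSimilarity:
    """Hypothetical one-cluster stand-in for the notebook's ClusterSimilarity."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma

    def fit(self, X, y=None, sample_weight=None):
        # y is accepted for interface compatibility but is never read.
        weights = [1.0] * len(X) if sample_weight is None else list(sample_weight)
        total = sum(weights)
        # Weighted mean of each coordinate serves as the single cluster center.
        self.center_ = tuple(
            sum(w * row[i] for row, w in zip(X, weights)) / total
            for i in range(len(X[0]))
        )
        return self

    def transform(self, X):
        # Gaussian RBF similarity of every row to the fitted center.
        return [
            math.exp(-self.gamma * sum((a - c) ** 2 for a, c in zip(row, self.center_)))
            for row in X
        ]

    def fit_transform(self, X, y=None, sample_weight=None):
        return self.fit(X, y, sample_weight=sample_weight).transform(X)

# Made-up coordinates and two different made-up label vectors.
coords = [(34.0, -118.0), (34.0, -119.0), (35.0, -118.0)]
labels_a = [100_000.0, 100_000.0, 400_000.0]
labels_b = [400_000.0, 100_000.0, 100_000.0]

# Fit without sample_weight: labels arrive only as y and are ignored.
features_a = CentroidSimilarity().fit_transform(coords, labels_a)
features_b = CentroidSimilarity().fit_transform(coords, labels_b)
labels_ignored = features_a == features_b

# Demonstration-style fit: labels passed as sample_weight change the features.
leaky_a = CentroidSimilarity().fit_transform(coords, sample_weight=labels_a)
leaky_b = CentroidSimilarity().fit_transform(coords, sample_weight=labels_b)
labels_used = leaky_a != leaky_b
```

Under these assumptions, the features depend on the labels only when they are explicitly passed as sample_weight, which matches the observation that the training code is unaffected as long as it never passes that argument.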
