Boston housing dataset removed from sklearn v 1.2+ #37

Open
singlesp opened this issue Apr 3, 2024 · 0 comments

Describe the bug
If scikit-learn 1.2 or later is installed, importing Dominance fails because dominance_analysis imports load_boston, which scikit-learn removed in version 1.2.
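
Whether a given environment is affected can be checked from the installed scikit-learn version. The following is a minimal sketch and not part of dominance_analysis: the helper name `load_boston_removed` is hypothetical, and the 1.2 cutoff is taken from the scikit-learn error message quoted in this issue.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def load_boston_removed(version_string):
    """Return True if this scikit-learn version no longer ships load_boston."""
    # Compare only the leading "major.minor" part, so strings such as
    # "1.2.0", "1.4.1.post1" or "1.2rc1" are all handled.
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 2)


if __name__ == "__main__":
    try:
        installed = version("scikit-learn")
    except PackageNotFoundError:
        print("scikit-learn is not installed")
    else:
        print(installed, "affected:", load_boston_removed(installed))
```

Running the script prints the installed version and whether the import in this report would be expected to fail on it.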

To Reproduce
Steps to reproduce the behavior:

  1. Open a Python environment with scikit-learn 1.2 or later installed.
  2. from dominance_analysis import Dominance

Expected behavior
The import should succeed without raising an error.

Screenshots
  File "/path/dominance_analysis_al857.py", line 53, in <module>
    from dominance_analysis import Dominance
  File "/path/opt/anaconda3/lib/python3.8/site-packages/dominance_analysis/__init__.py", line 1, in <module>
    from dominance_analysis.dominance import *
  File "/path/opt/anaconda3/lib/python3.8/site-packages/dominance_analysis/dominance.py", line 6, in <module>
    from sklearn.datasets import load_boston
  File "/path/opt/anaconda3/lib/python3.8/site-packages/sklearn/datasets/__init__.py", line 156, in __getattr__
    raise ImportError(msg)
ImportError:
load_boston has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and::

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air

Additional context
I recommend switching the package to a different example dataset.
Force-installing a scikit-learn version below 1.2 also works, but is less ideal.
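
One way a package could avoid failing at import time is to look up the dataset loader lazily and fall back to another dataset when the first one is unavailable. The sketch below is an assumption about how that might be done, not the dominance_analysis code: `first_available` is a hypothetical helper, while `load_boston` and `fetch_california_housing` are the scikit-learn names quoted in the error message above.

```python
import importlib


def first_available(module_name, candidates):
    """Return the first attribute in `candidates` that `module_name` provides.

    Returns None if the module cannot be imported or none of the names exist.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    for name in candidates:
        try:
            return getattr(module, name)
        except (AttributeError, ImportError):
            # scikit-learn 1.2+ raises ImportError (not AttributeError)
            # when load_boston is requested, so both are caught here.
            continue
    return None


if __name__ == "__main__":
    # Keeps the old loader on older installs and falls back otherwise.
    # Dropping "load_boston" from the list entirely would follow the
    # recommendation in this issue.
    loader = first_available(
        "sklearn.datasets", ["load_boston", "fetch_california_housing"]
    )
    print("no loader found" if loader is None else loader.__name__)
```

The two loaders return datasets with different features and targets, so any example code that hard-codes Boston column names would still need to be adapted.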
