The code does not seem robust: it assumes that consecutive percentiles have different values, which is not the case for, e.g., the housing_median_age feature. Below is a suggested alternative.
# extra code – just shows that we get a uniform distribution
percentiles = pd.DataFrame(columns=['percentile'],
                           index=[p for p in range(1, 100)],
                           data=[np.percentile(housing["median_income"], p)
                                 for p in range(1, 100)])
# When several percentiles share a value, keep only the highest one
percentiles.drop_duplicates(keep='last', inplace=True)
flattened_median_income = pd.cut(housing["median_income"],
                                 bins=[-np.inf] + percentiles['percentile'].tolist() + [np.inf],
                                 labels=percentiles.index.tolist() + [100])
flattened_median_income.hist(bins=len(percentiles) // 2 + 1)
plt.xlabel("Median income percentile")
plt.ylabel("Number of districts")
plt.show()
# Note: incomes below the 1st percentile are labeled 1, and incomes above the
# 99th percentile are labeled 100. This is why the distribution below ranges
# from 1 to 100 (not 0 to 100).
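The same idea can be packaged as a small function and tried without the housing dataset. This is only a sketch: `percentile_bucket` is a hypothetical helper name, and both series are synthetic stand-ins (a skewed continuous feature and a capped integer feature), not the real `median_income` or `housing_median_age` columns.

```python
import numpy as np
import pandas as pd


def percentile_bucket(values: pd.Series) -> pd.Series:
    """Label each value with a percentile bucket, tolerating repeated percentiles."""
    edges = pd.Series(
        [np.percentile(values, p) for p in range(1, 100)],
        index=range(1, 100),
    )
    # When several percentiles share a value, keep only the highest one
    edges = edges.drop_duplicates(keep="last")
    return pd.cut(
        values,
        bins=[-np.inf] + edges.tolist() + [np.inf],
        labels=edges.index.tolist() + [100],
    )


# Synthetic data (assumption): not taken from the housing dataset
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=1.0, sigma=0.5, size=10_000))
capped = pd.Series(np.minimum(rng.integers(1, 80, size=10_000), 52))

skewed_buckets = percentile_bucket(skewed)  # all percentiles distinct
capped_buckets = percentile_bucket(capped)  # many percentiles coincide
```

With distinct percentiles the function behaves like the original cell. With repeated percentiles it yields fewer, unevenly sized buckets instead of failing.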
Additional context
Having said that, housing_median_age does indeed require a different approach, not only because of its multimodal distribution (as you state in the book) but also because the distribution contains many duplicate values (probably as expected from an age feature that spans a limited number of years), which makes it difficult to split into evenly sized buckets.
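To illustrate the failure mode, here is a minimal sketch on synthetic data. The capped integer series below is an assumption that only imitates an age-like feature; it is not the actual housing_median_age column. When many percentiles coincide, passing them directly to `pd.cut` raises a `ValueError` because the bin edges are not unique.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): integer "ages" capped at 52
rng = np.random.default_rng(42)
ages = pd.Series(np.minimum(rng.integers(1, 80, size=5_000), 52), name="age")

# 99 percentile edges, many of which share the same value
edges = [np.percentile(ages, p) for p in range(1, 100)]
n_unique = len(set(edges))

try:
    pd.cut(ages,
           bins=[-np.inf] + edges + [np.inf],
           labels=list(range(1, 101)))
    raised = False
except ValueError:
    # pandas rejects duplicate bin edges by default (duplicates="raise")
    raised = True

print(f"{n_unique} distinct edges out of {len(edges)}; ValueError raised: {raised}")
```

Deduplicating the edges avoids the error, but the resulting buckets are no longer evenly sized, which matches the point made above.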
Notebook name: 02_end_to_end_machine_learning_project
Section 3.3 “Feature Scaling”
Cell 95.