[IDEA] Chapter 2, Improving the extra code for getting a uniform distribution for a feature. #103

eranr · 2023-11-08T20:15:00Z

Notebook name: 02_end_to_end_machine_learning_project
Section 3.3 “Feature Scaling”
Cell 95.

The code does not seem robust: it assumes that consecutive percentiles have different values, and pd.cut rejects duplicate bin edges. The assumption does not hold for, e.g., the housing_median_age feature. Below is a suggested alternative.

# extra code – just shows that we get a uniform distribution
percentiles = pd.DataFrame(columns=['percentile'],
                           index=[p for p in range(1, 100)],
                           data=[np.percentile(housing["median_income"], p) for p in range(1, 100)])
percentiles.drop_duplicates(keep='last', inplace=True)
flattened_median_income = pd.cut(housing["median_income"],
                                 bins=[-np.inf] + percentiles['percentile'].tolist() + [np.inf],
                                 labels=percentiles.index.tolist() + [100])
flattened_median_income.hist(bins=len(percentiles) // 2 + 1)
plt.xlabel("Median income percentile")
plt.ylabel("Number of districts")
plt.show()
# Note: incomes below the 1st percentile are labeled 1, and incomes above the
# 99th percentile are labeled 100. This is why the distribution below ranges
# from 1 to 100 (not 0 to 100).

Additional context
That said, housing_median_age does require a different approach, not only because of its multimodal distribution (as you state in the book) but also because it contains many duplicate values (probably to be expected from an age feature that spans a limited number of years), which makes it difficult to split into evenly sized buckets.
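
To illustrate the duplicate-percentile problem without the dataset, here is a minimal sketch on synthetic data. The `ages` series is a made-up stand-in for a low-cardinality feature (10 distinct values, 100 rows each), not the real housing_median_age column, and the variable names are only for illustration. It shows that the naive 99-edge binning fails, while dropping duplicate edges (as in the suggestion above) works.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a low-cardinality feature: 10 distinct values,
# each repeated 100 times (this is NOT the real housing data).
ages = pd.Series(np.repeat(np.arange(1, 11), 100), name="age")

# 1st to 99th percentiles; with so few distinct values, many edges repeat.
edges = np.percentile(ages, range(1, 100))

# Naive approach: use all 99 edges as bin boundaries.
try:
    pd.cut(ages, bins=[-np.inf] + edges.tolist() + [np.inf],
           labels=range(1, 101))
    naive_failed = False
except ValueError:
    # pd.cut refuses bin edges that are not unique
    naive_failed = True

# Robust approach: keep only the last percentile for each distinct edge value.
percentiles = pd.Series(edges, index=range(1, 100)).drop_duplicates(keep="last")
flattened = pd.cut(ages,
                   bins=[-np.inf] + percentiles.tolist() + [np.inf],
                   labels=percentiles.index.tolist() + [100])

print("naive binning failed:", naive_failed)
print("distinct edges kept:", len(percentiles), "out of", len(edges))
```

Note that the resulting buckets are still far from evenly sized here, which is the point made above: deduplication avoids the error, but it cannot make a feature with few distinct values uniform.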
