[IDEA] Chapter 2, Improving the extra code for getting a uniform distribution for a feature. #103

Open
eranr opened this issue Nov 8, 2023 · 0 comments

Notebook name: 02_end_to_end_machine_learning_project
Section 3.3 “Feature Scaling”
Cell 95.

The code is not robust because it assumes that consecutive percentiles always have distinct values. That assumption does not hold for every feature, e.g. housing_median_age. Below is a suggested alternative.

# extra code – just shows that we get a uniform distribution
# Compute the 1st to 99th percentiles, indexed by percentile rank
percentiles = pd.Series(np.percentile(housing["median_income"], range(1, 100)),
                        index=range(1, 100), name="percentile")
# Tied percentiles would give duplicate bin edges; keep the highest rank per value
percentiles = percentiles.drop_duplicates(keep="last")
flattened_median_income = pd.cut(housing["median_income"],
                                 bins=[-np.inf] + percentiles.tolist() + [np.inf],
                                 labels=percentiles.index.tolist() + [100])
flattened_median_income.hist(bins=len(percentiles) // 2 + 1)
plt.xlabel("Median income percentile")
plt.ylabel("Number of districts")
plt.show()
# Note: incomes below the 1st percentile are labeled 1, and incomes above the
# 99th percentile are labeled 100. This is why the distribution below ranges
# from 1 to 100 (not 0 to 100).

Additional context
Having said that, housing_median_age does require a different approach, not only because of its multimodal distribution (as you state in the book) but also because it contains many duplicate values (as expected from an age feature that spans a limited number of years), which makes it difficult to split into evenly sized buckets.
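
To illustrate the point about duplicate values, here is a minimal sketch on synthetic data, not the actual housing dataset. The `ages` series, its cap at 52, the sample size, and the seed are all assumptions chosen for illustration. It shows that repeated percentile values make `pd.cut` reject the bin edges, and that dropping the repeats restores a valid binning but leaves buckets of very different sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an age-like feature: integer values with a cap,
# so many rows share the same value (assumption, not the real data).
rng = np.random.default_rng(42)
ages = pd.Series(np.minimum(rng.integers(1, 80, size=5_000), 52), name="age")

# 1st to 99th percentiles, indexed by percentile rank.
percentiles = pd.Series(np.percentile(ages, range(1, 100)), index=range(1, 100))
n_distinct = percentiles.nunique()  # fewer than 99 when percentiles are tied

# With tied percentiles the bin edges repeat, and pd.cut raises a ValueError.
try:
    pd.cut(ages, bins=[-np.inf] + percentiles.tolist() + [np.inf],
           labels=list(range(1, 101)))
    cut_failed = False
except ValueError:
    cut_failed = True

# Dropping repeated edges gives a valid binning, but the buckets are no
# longer equally populated: the capped value dominates a single bucket.
edges = percentiles.drop_duplicates(keep="last")
buckets = pd.cut(ages, bins=[-np.inf] + edges.tolist() + [np.inf],
                 labels=edges.index.tolist() + [100])
bucket_sizes = buckets.value_counts(sort=False)

print("distinct percentile values:", n_distinct)
print("pd.cut failed on raw percentiles:", cut_failed)
print("smallest / largest bucket:", bucket_sizes.min(), "/", bucket_sizes.max())
```

In this sketch the deduplicated binning works mechanically, yet the resulting histogram cannot be flat, because no choice of edges can split a run of identical values across several buckets.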
