09_tabular: ProductSize histogram's y-axis is mislabeled #590

Open
rigdern opened this issue Jun 22, 2023 · 2 comments
rigdern commented Jun 22, 2023

Problem

The book's histogram of ProductSizes in the "Partial Dependence" section has a mislabeled y-axis. Consequently, the histogram communicates the wrong counts for some of the ProductSizes. Here are some ProductSizes it mislabeled:

| ProductSize | Correct count | Book's incorrect count |
| --- | --- | --- |
| Large | 280 | ~500 |
| Mini | 627 | ~100 |

See below for details.

Book's incorrect histogram

The "Partial Dependence" section has a ProductSize histogram that is produced by this code:

p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);

and renders like this:
[Image: the book's ProductSize histogram, with the mislabeled y-axis]

Corrected histogram

We can reveal the mistake in the book's histogram by inspecting a textual histogram from the dataframe:

cond = (df.saleYear<2011) | (df.saleMonth<10)
df_valid = df[~cond]
df_valid.ProductSize.value_counts(dropna=False)

That code produces this textual histogram:

NaN               3930
Medium            1331
Large / Medium    1223
Mini               627
Small              484
Large              280
Compact            113
Name: ProductSize, dtype: int64

See the table at the top of this issue for a comparison between the counts of these ProductSizes and the ones from the book's histogram.

Cause

The code that labels the y-axis assumes that the bottom bar is ProductSize code 0, the next bar is code 1, and so on, but this isn't the case. value_counts(sort=False) does not sort its result by index, so the bars are not guaranteed to appear in category-code order, and here they do not.
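
To make the mismatch concrete, here is a minimal sketch without plotting. The `classes` list is a hypothetical stand-in for `to.classes['ProductSize']`, and the index order of `counts` is an assumed example of an unsorted `value_counts` result (the counts themselves are the ones from the textual histogram above). It compares labeling each bar by its position against labeling it by its category code:

```python
import pandas as pd

# Hypothetical stand-in for to.classes['ProductSize'] (assumed code order).
classes = ['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

# Assumed example of counts whose index (category codes) is NOT in code order,
# as can happen with value_counts(sort=False).
counts = pd.Series([3930, 1331, 1223, 627, 484, 280, 113],
                   index=[0, 3, 2, 5, 4, 1, 6])

# What the book's code effectively does: the label for bar i is classes[i].
positional = {classes[pos]: int(n) for pos, n in enumerate(counts)}

# What the labels should be: look each bar's label up by its category code.
by_code = {classes[code]: int(n) for code, n in counts.items()}

print(positional['Large'], by_code['Large'])
print(positional['Mini'], by_code['Mini'])
```

With positional labeling, 'Large' and 'Mini' pick up the counts of whichever bars happen to sit at positions 1 and 5, which is the kind of error visible in the book's plot.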

Example fix

Here's some code that properly labels the y-axis by looking up each label from its bar's category code, so that the labels follow the order of the bars:

counts = valid_xs_final['ProductSize'].value_counts(sort=False)
p = counts.plot.barh()
c = [to.classes['ProductSize'][i] for i in counts.index.values]
plt.yticks(range(len(c)), c)

[Image: ProductSize histogram with corrected y-axis labels]

rigdern commented Jun 22, 2023

Looks like a fix was submitted in pull request #410.

jhanschoo commented

I can confirm this issue; I ran into it while doing my own notes. My fix was as follows:

p = valid_xs_final['ProductSize'].value_counts(sort=False).sort_index().plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);
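
One caveat with sorting by index: it lines up with `range(len(c))` only if every category code actually occurs in the data. A possible variant, sketched here on toy data rather than the real dataframe, reindexes the counts to every code so that absent categories get a zero-height bar. The `classes` list and the `codes` series are hypothetical stand-ins for `to.classes['ProductSize']` and `valid_xs_final['ProductSize']`:

```python
import pandas as pd

# Hypothetical stand-in for to.classes['ProductSize'] (assumed code order).
classes = ['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

# Toy encoded column in which code 1 ('Large') never occurs.
codes = pd.Series([3, 3, 2, 5, 0, 0, 4, 6, 3])

# sort_index() alone would yield only 6 bars here, shifting every label after '#na#'.
sorted_only = codes.value_counts(sort=False).sort_index()

# Reindexing to all codes keeps one bar per class, in code order, filling gaps with 0.
counts = codes.value_counts(sort=False).reindex(range(len(classes)), fill_value=0)

labelled = dict(zip(classes, counts.tolist()))
print(labelled)
```

After this, `counts.plot.barh()` followed by `plt.yticks(range(len(classes)), classes)` should pair each bar with its own label, since the bars and the labels are then both in code order.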
