
Indicate column where "Number of Distinct values" = "Total Number of rows" #170

Open
dangus-aktivbo opened this issue Aug 25, 2022 · 1 comment

Comments

@dangus-aktivbo

When scanning numeric columns, I want to quickly find out which columns have a distinct value on every row.

dfSummary is useful for scanning columns quickly and figuring out the structural and statistical properties of each column. Normally, when I dig into a dataset, I try to quickly find out whether natural keys, like social security number, housing address, customer id, etc., are duplicated. The simplest way right now is to do a count-distinct (e.g. n_distinct(x) in dplyr) and compare the number of distinct values to the number of rows in the data frame. I'm using dfSummary a lot, and think this would be a super enhancement.
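For reference, the manual check described above can be sketched in base R as follows. The toy data frame and its column names are made up for illustration; `length(unique(x))` stands in for `dplyr::n_distinct(x)` so the snippet has no dependencies.

```r
# Toy data (hypothetical): `customer_id` is a candidate natural key, `city` is not.
df <- data.frame(
  customer_id = c(101, 102, 103, 104),
  city        = c("Oslo", "Bergen", "Oslo", "Bergen"),
  stringsAsFactors = FALSE
)

# Count distinct values per column (base-R counterpart of dplyr::n_distinct(x)).
n_distinct_per_col <- vapply(df, function(x) length(unique(x)), integer(1))

# A column is "all distinct" when its distinct count equals the row count.
all_distinct <- n_distinct_per_col == nrow(df)

# Names of the columns that could serve as a key.
names(all_distinct)[all_distinct]
```

This works, but it has to be run separately from dfSummary, which is the inconvenience this request is about.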

One possible solution is to add a "% distinct" value to the affected columns, since you already have "(% of valid)" in the column header. Another is a flag, such as a string saying "Unique" or "(all unique)". At the moment I have to check the Freqs against the row count myself, which of course is only a minor inconvenience.
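To make the two label options concrete, here is a rough sketch of what such a per-column label could look like. `distinct_label` is a hypothetical helper, not part of summarytools, and computing the percentage over non-missing values only (to mirror "% of valid") is an assumption on my part.

```r
# Hypothetical helper, not part of summarytools: builds the kind of label
# proposed above for a single column.
distinct_label <- function(x) {
  valid <- x[!is.na(x)]  # assumption: ignore missing values, like "% of valid"
  if (length(valid) == 0) return("no valid values")
  n_dist <- length(unique(valid))
  if (n_dist == length(valid)) {
    "(all distinct)"
  } else {
    sprintf("%.1f%% distinct", 100 * n_dist / length(valid))
  }
}

distinct_label(c(1, 2, 3, 4))         # every value differs
distinct_label(c("a", "a", "b", NA))  # 2 distinct values out of 3 valid ones
```

If a strict key check is wanted instead, the comparison would be against the total row count, so that a column containing missing values never gets the "(all distinct)" flag.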

image

@dcomtois
Owner

This is a good idea. I'd go for "All distinct values". However, a new term ("All") will need to be added to the translations dataset, which will require some work. Help is always welcome.


2 participants