ENH: Implement dplyr::glimpse() in pandas #51668

Open
1 of 3 tasks
Holer90 opened this issue Feb 27, 2023 · 11 comments
Assignees
Labels: Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@Holer90

Holer90 commented Feb 27, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

pandas is missing a quick and easy way to get an overview of multi-column data. Fortunately, the R community has found a solution: dplyr::glimpse(). Link to dplyr.

Example:

>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> iris.glimpse()
DataFrame with 150 rows and 5 columns.
sepal_length  <float64>  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5 ...
sepal_width   <float64>  3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3 ...
petal_length  <float64>  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1 ...
petal_width   <float64>  0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0 ...
species       <object>   'setosa', 'setosa', 'setosa', 'setosa', 'setosa', ' ...
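To make the proposed output concrete, here is a minimal sketch of a free function that prints one line per column in roughly the layout shown above. It is not the implementation from the issue: the function name `glimpse`, its `width` and `buf` defaults, and the 50-value preview limit are assumptions made for illustration only.

```python
import io
import sys

import pandas as pd


def glimpse(df, width=80, buf=None):
    """Hypothetical sketch: print name, dtype and first values per column."""
    buf = sys.stdout if buf is None else buf
    names = [str(c) for c in df.columns]
    dtypes = [f"<{t}>" for t in df.dtypes]
    # Pad names and dtypes so the value previews line up.
    name_w = max((len(n) for n in names), default=0)
    dtype_w = max((len(t) for t in dtypes), default=0)
    buf.write(f"DataFrame with {len(df)} rows and {df.shape[1]} columns.\n")
    for i, (name, dtype) in enumerate(zip(names, dtypes)):
        prefix = f"{name:<{name_w}}  {dtype:<{dtype_w}}  "
        # Positional access also works when column labels are duplicated.
        head = df.iloc[:, i].head(50).tolist()
        values = ", ".join(repr(v) for v in head)
        room = max(width - len(prefix), 0)
        if len(values) > room:
            # Trim the preview and mark the cut with " ...".
            values = values[: max(room - 4, 0)] + " ..."
        buf.write(prefix + values + "\n")


# Usage: write into a buffer instead of stdout, then show the result.
demo = pd.DataFrame({"x": [1.5, 2.5], "name": ["a", "b"]})
buffer = io.StringIO()
glimpse(demo, width=60, buf=buffer)
print(buffer.getvalue(), end="")
```

Only the first few values of each column are converted to text, so the cost stays small even for long frames.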

Feature Description

I have implemented the glimpse() function based on the info() function for both DataFrame and Series. I have also slightly extended the functionality to include the following options:

Parameters
----------
index : bool, optional
    Whether to print the column indices.
dtype : bool, optional
    Whether to print the dtypes of the columns.
isna : bool, optional
    Whether to print the null counts of the columns.
notna : bool, optional
    Whether to print the non-null counts of the columns.
nunique : bool, optional
    Whether to print the number of unique values.
unique_values : bool, optional
    Whether to print a glimpse of the unique values instead of the
    first values.
verbose : bool, optional
    Whether to print the headers and count descriptions. By default,
    this is False if dtype is the only optional column enabled and
    True otherwise.
emphasize : bool, optional
    Whether to emphasize the optional information columns. By
    default, this is enabled if verbose is False.
buf : writable buffer, defaults to sys.stdout
    Where to send the output. By default, the output is printed to
    sys.stdout. Pass a writable buffer if you need to further
    process the output.
width : int, optional
    The width at which the output is trimmed. By default, the width
    is determined by the pandas display.width option.   

An example of the extended functionality:

>>> iris.glimpse(unique_values=True, isna=True, notna=True, width=100)
DataFrame with 150 rows and 5 columns.
Column        Dtype    Null    Non-null      Unique values                                          
------        -----    ----    --------      -------------                                          
sepal_length  float64  0 null  150 non-null  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3, 5.8, 5 ...
sepal_width   float64  0 null  150 non-null  3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7, 4.0, 4 ...
petal_length  float64  0 null  150 non-null  1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9, 4.7, 4 ...
petal_width   float64  0 null  150 non-null  0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3, 1.6, 1 ...
species       object   0 null  150 non-null  'setosa', 'versicolor', 'virginica'                    
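The optional columns in the extended example map onto existing pandas methods. The sketch below collects them into a summary frame; the helper name `column_summary`, its `max_values` parameter, and the output column names are hypothetical and only mirror the headers shown above.

```python
import pandas as pd


def column_summary(df, max_values=10):
    """Hypothetical helper: per-column dtype, null counts and unique values."""
    rows = []
    for i, name in enumerate(df.columns):
        col = df.iloc[:, i]
        rows.append(
            {
                "Column": name,
                "Dtype": str(col.dtype),
                "Null": int(col.isna().sum()),        # count of missing values
                "Non-null": int(col.notna().sum()),   # count of present values
                "Unique": int(col.nunique()),         # distinct non-null values
                # First distinct non-null values, in order of appearance.
                "Unique values": col.dropna().drop_duplicates().head(max_values).tolist(),
            }
        )
    return pd.DataFrame(
        rows,
        columns=["Column", "Dtype", "Null", "Non-null", "Unique", "Unique values"],
    )


summary = column_summary(pd.DataFrame({"a": [1.0, 1.0, None], "b": ["x", "y", "x"]}))
print(summary)
```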

Alternative Solutions

The functionality could be implemented in a separate package and monkey-patched into pandas, but this solution would not make the function easily accessible to the vast majority of people using pandas.
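For comparison, the separate-package route could look roughly like the sketch below. It uses pandas' documented accessor registration API (`pd.api.extensions.register_dataframe_accessor`) rather than literal monkey-patching. The class name `GlimpseAccessor`, the `n` parameter, and the choice to return a string instead of printing are assumptions for illustration.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("glimpse")
class GlimpseAccessor:
    """Hypothetical third-party accessor exposing df.glimpse()."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __call__(self, n=5):
        # Making the accessor callable allows df.glimpse() directly.
        df = self._obj
        lines = [f"DataFrame with {len(df)} rows and {df.shape[1]} columns."]
        for i, name in enumerate(df.columns):
            col = df.iloc[:, i]
            values = ", ".join(repr(v) for v in col.head(n).tolist())
            lines.append(f"{name}  <{col.dtype}>  {values}")
        return "\n".join(lines)


print(pd.DataFrame({"a": [1.5, 2.5]}).glimpse())
```

The accessor becomes available only after the package defining it has been imported, which is the accessibility drawback described above.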

Additional Context

I will provide a pull request implementing this functionality shortly.

In siuba, a dplyr implementation in Python, there is an open issue, "Support glimpse function", which shows the desire for this functionality in the Python/pandas community.

Edit: The glimpse function is also implemented in polars, which further highlights the desire for this functionality.

Holer90 added the Enhancement and Needs Triage labels on Feb 27, 2023
@Holer90
Author

Holer90 commented Feb 27, 2023

take

@phofl
Member

phofl commented Feb 27, 2023

Hi, thanks for your report. Please wait for consensus before submitting a pr

@Holer90
Author

Holer90 commented Feb 27, 2023

> Hi, thanks for your report. Please wait for consensus before submitting a pr

Will do.

For reference, the code is (mostly) available in pandas/io/formats/glimpse.py in my fork, in case it is of interest while consensus is being reached.

@pourmoayed

I think this would be a great feature for pandas. In R it gives a helpful summary overview of data frames, and it makes sense to have a similar feature for pandas.

@chriscardillo

Plus one.

I originally opened the issue in the siuba repo. It would be great to see this added here.

@phofl
Member

phofl commented Feb 28, 2023

Just a general comment: It's not only about the feature, we have to be comfortable maintaining it as well (long-term speaking)

@Holer90
Author

Holer90 commented Mar 1, 2023

> Just a general comment: It's not only about the feature, we have to be comfortable maintaining it as well (long-term speaking)

Fully understood. On that point, the implementation mirrors the architecture of the info() function one-to-one, which should make it easier to both maintain and understand.

@cheTesta

cheTesta commented Mar 1, 2023

Isn't this the same as doing df.T, or in full df.transpose()?

@Holer90
Author

Holer90 commented Mar 1, 2023

> Isn't this the same as doing df.T.head().T # or df.transpose().head()?

Would that not print only the first 5 columns? Also, would it not print/show all the data?
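A quick check of the shapes involved illustrates the objection. This is a sketch on a made-up frame; the column names `col0` to `col7` are hypothetical.

```python
import pandas as pd

# A made-up frame with 100 rows and 8 columns.
df = pd.DataFrame({f"col{i}": range(100) for i in range(8)})

transposed = df.T                # one row per original column, all 100 values kept
first_five = df.T.head().T       # head() on the transpose keeps only 5 columns

print(transposed.shape)          # (8, 100)
print(first_five.shape)          # (100, 5)
print(list(first_five.columns))  # ['col0', 'col1', 'col2', 'col3', 'col4']
```

The transpose keeps every value, and `head()` on it drops all but the first five columns, so neither gives a width-limited preview of every column.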

@Holer90
Author

Holer90 commented Jun 2, 2023

@phofl, has there been any discussion regarding this feature?

@JustinKurland

Late to the show here, @Holer90, but I have written a .glimpse function in the pytimetk package that does this just like dplyr. The issue with the polars implementation of .glimpse() is that if you convert your pandas.DataFrame into a polars.DataFrame, the dtypes are not like for like.

Projects
None yet
Development

No branches or pull requests

6 participants