Description
Apologies if this has been submitted or considered in the past; I searched through the GitHub issues and couldn't find anything pertaining to this.
The idea is that, instead of specifying all of the columns you wish to delete from a DataFrame via the `.drop` method, you specify the columns you wish to keep through a `.keep_cols` method, and all other columns are deleted. This would save typing when there are many columns and we only want to keep a small subset of them. The prime use case is method chaining, where using `[[` doesn't really work in the middle of many methods being chained together.
A small, complete example of the issue
import pandas as pd

# Create an example DataFrame
data = [
    [1, 'ABC', 4, 10, 6.3],
    [2, 'BCD', 10, 9, 11.6],
    [3, 'CDE', 7, 4, 10.0],
    [4, 'DEF', 7, 10, 5.4],
    [5, 'EFG', 2, 9, 5.3],
]
data = pd.DataFrame(data,
                    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'])

# Just want columns Id and Rating2
new_data = data.drop(['Name', 'Rating1', 'ThisIsANumber'], axis=1)
new_data.head()

# ** It would be nice to be able to only specify the columns we want
# ** to keep to save typing - similar to dplyr in R
def keep_cols(DataFrame, keep_these):
    """Keep only the columns [keep_these] in a DataFrame, delete
    all other columns.
    """
    # Every column that is not in keep_these gets dropped
    drop_these = list(set(list(DataFrame)) - set(keep_these))
    return DataFrame.drop(drop_these, axis=1)

new_data = data.pipe(keep_cols, ['Id', 'Rating2'])
new_data.head()

# In this specific example there was not much more typing between
# `.drop` and the `keep_cols` function, but often when a `DataFrame`
# has many columns this is not the case!
In this contrived example I created a `keep_cols` function as a rough draft of a `.keep_cols` method on the DataFrame object, and used the `.pipe` method to apply that function to the DataFrame as if it were a method.
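For anyone who wants the method-like spelling today without `.pipe`, here is a minimal sketch using pandas' accessor registration hook, `pd.api.extensions.register_dataframe_accessor`. The accessor name `cols` and its `keep` method are names I made up for illustration; they are not part of pandas.

```python
import pandas as pd


# "cols" is a hypothetical accessor name chosen for this sketch
@pd.api.extensions.register_dataframe_accessor("cols")
class ColsAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def keep(self, keep_these):
        """Return a DataFrame with only the columns in keep_these."""
        # .loc raises a KeyError if any requested column is missing
        return self._obj.loc[:, list(keep_these)]


data = pd.DataFrame(
    [[1, 'ABC', 4, 10, 6.3],
     [2, 'BCD', 10, 9, 11.6]],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

# Reads like a method call in the middle of a chain
new_data = data.sort_values('Rating2').cols.keep(['Id', 'Rating2'])
```

This keeps the columns in the order they are listed in `keep_these`, whereas the `drop`-based `keep_cols` above keeps them in their original order.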
I don't think using `[[` cuts it here. Yes, doing `data[['Id', 'Rating2']]` would work, but when method chaining, people often want to drop columns somewhere in the middle of a bunch of methods.
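For comparison, here is a rough sketch of the chain-friendly selections that I believe already exist in pandas: `.loc`, `.filter(items=...)`, and `.reindex(columns=...)`. The `sort_values` step is only there to stand in for "a bunch of methods".

```python
import pandas as pd

data = pd.DataFrame(
    [[1, 'ABC', 4, 10, 6.3],
     [2, 'BCD', 10, 9, 11.6],
     [3, 'CDE', 7, 4, 10.0]],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)
keep = ['Id', 'Rating2']

# .loc with a list of labels: raises KeyError if a label is missing
via_loc = data.sort_values('Rating2').loc[:, keep].reset_index(drop=True)

# .filter(items=...): silently ignores labels that are not present
via_filter = data.sort_values('Rating2').filter(items=keep).reset_index(drop=True)

# .reindex(columns=...): creates NaN-filled columns for missing labels
via_reindex = data.sort_values('Rating2').reindex(columns=keep).reset_index(drop=True)
```

All three give the same result when every requested column exists; they differ in how they treat a misspelled or missing column name, which is part of why a dedicated, clearly named method would still be nice to have.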
Just in case it's helpful, here's a good article demonstrating the power/beauty of method chaining in Pandas: https://tomaugspurger.github.io/modern-1.html.
Thanks!