
pandas methods

Daisho Komiyama edited this page Jan 7, 2023 · 6 revisions
import pandas as pd

Read a CSV file

data = pd.read_csv(file_path)
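A minimal sketch of reading a CSV and checking what came back. The in-memory CSV text and its two rows are made up for illustration; only the column names follow the ones used on this page. With a real file, pass its path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "mpg,horsepower,car name\n18.0,130,car a\n25.0,?,car b\n"

# read_csv accepts a path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (rows, columns)
print(data.dtypes)  # horsepower is read as text because of the '?' entry
```

Checking `dtypes` right after loading is a quick way to spot columns that were read as text when numbers were expected.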

Drop column(s)

# drop single column
data.drop('car name', axis='columns')
# same as data.drop('car name', axis=1), 1 == 'columns' 

# drop multiple columns
x = data.drop(columns=['mpg', 'origin_europe'])

axis=1 is equivalent to axis='columns'
axis=0 is equivalent to axis='index'
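One detail worth illustrating: `drop` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to keep it. The sketch below uses a made-up two-row frame; the values are placeholders.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the ones used on this page.
data = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "car name": ["car a", "car b"],
    "weight": [3504, 3693],
})

# drop() returns a new DataFrame; `data` itself still has all three columns.
dropped = data.drop("car name", axis="columns")
print(list(dropped.columns))  # ['mpg', 'weight']
print(list(data.columns))     # ['mpg', 'car name', 'weight']

# Reassign to keep the result.
data = data.drop(columns=["car name"])
```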

Create dummy variables

We create three simple true-or-false columns that answer "Is this car American?", "Is this car European?" and "Is this car Asian?". These are used as independent variables without imposing any ordering between the three regions.

data = pd.get_dummies(data, columns=['origin'])

The call above changes the origin column from this:

     model year   origin
0            70  america
1            70     asia
2            70  america

to this, with one true/false indicator column per region (shown here as 1 and 0):

     model year  origin_america  origin_asia  origin_europe
0            70               1            0              0  
1            70               0            1              0  
2            70               1            0              0
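The transformation can be reproduced on a small frame. This is a sketch on made-up rows that match the three shown above. Passing `dtype=int` is an addition here: it requests 0/1 columns explicitly, since recent pandas versions produce True/False by default. Because the toy data contains no European car, no `origin_europe` column appears.

```python
import pandas as pd

# Toy frame matching the three rows shown above.
data = pd.DataFrame({
    "model year": [70, 70, 70],
    "origin": ["america", "asia", "america"],
})

# One indicator column is created per distinct value in 'origin';
# the original 'origin' column is removed from the result.
dummies = pd.get_dummies(data, columns=["origin"], dtype=int)
print(dummies)
```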

Find non-digit values

hpIsDigit = pd.DataFrame(data.horsepower.str.isdigit())

Print hpIsDigit (a DataFrame)

print(hpIsDigit.to_string())

     horsepower
0          True
1          True
2          True
3          True
4          True
5         False
6          True
7          True

So the item at index 5 (False) is a non-digit value.

Print the rows with non-digit values in the horsepower column

data[hpIsDigit['horsepower'] == False]

      mpg  cylinders  displacement horsepower  weight  acceleration  \
32   25.0          4          98.0          ?    2046          19.0   
126  21.0          6         200.0          ?    2875          17.0   
330  40.9          4          85.0          ?    1835          17.3   
336  23.6          4         140.0          ?    2905          14.3   
354  34.5          4         100.0          ?    2320          15.8   
374  23.0          4         151.0          ?    3035          20.5   
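The same check can be tried on a small Series. This is a sketch with made-up values. Note that `str.isdigit()` is True only when every character is a digit, so a decimal such as '98.5' is flagged along with '?'.

```python
import pandas as pd

# Hypothetical horsepower values stored as text.
hp = pd.Series(["130", "?", "98.5", "165"], name="horsepower")

# True where the string consists of digits only.
is_digit = hp.str.isdigit()
print(is_digit.tolist())   # [True, False, False, True]

# Invert the mask to keep only the non-digit entries.
non_digit = hp[~is_digit]
print(non_digit.tolist())  # ['?', '98.5']
```

If a column legitimately contains decimals, `pd.to_numeric(hp, errors='coerce')` followed by `isna()` may be a better way to find the entries that are not numbers.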

Replace missing values with NaN

In this dataset missing values are marked with '?', so replace '?' with NaN.

import numpy as np

data = data.replace('?', np.nan)

Fill the missing values with the median

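# horsepower still holds strings after the replace; convert it to numbers first
data['horsepower'] = pd.to_numeric(data['horsepower'])
medianFiller = lambda x: x.fillna(x.median())
data = data.apply(medianFiller, axis=0)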
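A worked sketch of the replace-and-fill sequence on made-up data. It assumes the column holding '?' was read as text, so it is converted with `pd.to_numeric` before the median is taken; recent pandas versions raise an error when asked for the median of a text column. Using `median(numeric_only=True)` is a swapped-in alternative to the lambda that skips any remaining non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; horsepower arrives as text because of the '?'.
data = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "100"],
})

# Mark the placeholder as missing.
data = data.replace("?", np.nan)

# Convert the text column to numbers so a median can be computed.
data["horsepower"] = pd.to_numeric(data["horsepower"], errors="coerce")

# Fill each numeric column's gaps with that column's median.
data = data.fillna(data.median(numeric_only=True))

print(data["horsepower"].tolist())  # [130.0, 115.0, 100.0]
```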