pandas methods

import pandas as pd

Read a CSV file

data = pd.read_csv(file_path)
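
As a minimal sketch of this call, the example below reads a tiny made-up CSV from an in-memory buffer so it runs without a file on disk. The name `csv_text` and its contents are illustrative assumptions, not the dataset used in this wiki.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "mpg,cylinders,horsepower\n18.0,8,130\n25.0,4,?\n"

# read_csv accepts a path or any file-like object; here it gets a StringIO buffer.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (rows, columns)
print(data.dtypes)  # horsepower is read as text, not numbers, because of the '?'
```

A single non-numeric entry such as '?' is enough to make pandas treat the whole column as text, which is why the later sections look for non-digit values.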

Drop column(s)

# drop a single column (returns a new DataFrame; data itself is unchanged)
data.drop('car name', axis='columns')
# same as data.drop('car name', axis=1), since 1 == 'columns'

# drop multiple columns
x = data.drop(columns=['mpg', 'origin_europe'])

axis=1 is equivalent to axis='columns'
axis=0 is equivalent to axis='index'
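
To make the axis equivalence concrete, here is a small sketch on a made-up frame. The name `df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-column frame used only for illustration.
df = pd.DataFrame(
    {"mpg": [18.0, 25.0], "car name": ["a", "b"], "weight": [3504, 2046]}
)

by_name = df.drop("car name", axis="columns")
by_number = df.drop("car name", axis=1)
by_keyword = df.drop(columns=["car name"])

# All three spellings return the same new frame; df still has the column.
print(by_name.equals(by_number) and by_name.equals(by_keyword))
print("car name" in df.columns)

# axis=0 (or axis='index') drops rows by index label instead.
without_first_row = df.drop(0, axis="index")
print(without_first_row)
```

Because `drop` returns a copy, assign the result to a variable if you want to keep it.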

Create dummy variables

We create three true-or-false columns whose titles amount to "Is this car American?", "Is this car European?" and "Is this car Asian?". They are used as independent variables without imposing any ordering on the three regions.

data = pd.get_dummies(data, columns=['origin'])

The call above replaces the origin column, changing the data from this,

     model year   origin
0            70  america
1            70     asia
2            70  america

to this, with one true-or-false indicator column per region (shown here as 1 and 0).

     model year  origin_america  origin_asia  origin_europe
0            70               1            0              0  
1            70               0            1              0  
2            70               1            0              0
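
The transformation can be reproduced on a small made-up frame, sketched below. The frame `df` is an assumption for illustration, and `dtype=int` is passed to get 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical frame mirroring the three rows shown above.
df = pd.DataFrame(
    {"model year": [70, 70, 70], "origin": ["america", "asia", "america"]}
)

# dtype=int asks for 0/1 integers instead of True/False.
dummies = pd.get_dummies(df, columns=["origin"], dtype=int)
print(dummies)
```

Only categories that actually occur in the data get a column, so this three-row sample produces origin_america and origin_asia but no origin_europe.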

Find non-digit values

hpIsDigit = pd.DataFrame(data.horsepower.str.isdigit())

Print out hpIsDigit (type DataFrame)

print(hpIsDigit.to_string())

     horsepower
0          True
1          True
2          True
3          True
4          True
5         False
6          True
7          True

So the row at index 5 (False) holds a non-digit value.

Print out non-digit values for `horsepower` column

data[hpIsDigit['horsepower'] == False]

      mpg  cylinders  displacement horsepower  weight  acceleration  \
32   25.0          4          98.0          ?    2046          19.0   
126  21.0          6         200.0          ?    2875          17.0   
330  40.9          4          85.0          ?    1835          17.3   
336  23.6          4         140.0          ?    2905          14.3   
354  34.5          4         100.0          ?    2320          15.8   
374  23.0          4         151.0          ?    3035          20.5
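
The same check can be sketched end to end on a small made-up column. The frame `df` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical horsepower column with one placeholder value.
df = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# True where the string consists only of digits.
is_digit = df["horsepower"].str.isdigit()

# Keep the rows that failed the check.
non_digit_rows = df[~is_digit]
print(non_digit_rows)

# Listing the distinct offending values shows what needs cleaning.
print(non_digit_rows["horsepower"].unique())
```

Note that `str.isdigit()` is also False for values such as "130.5", so this check fits integer-like columns best.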

Replace missing value with NaN

In this case, replace '?' with NaN.

import numpy as np
data = data.replace('?', np.nan)

Fill missing values with the median

medianFiller = lambda x: x.fillna(x.median())
data = data.apply(medianFiller, axis=0)
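
The median can only be computed on numeric data, and a column that once held '?' may still be stored as text after the replacement. One possible way to handle this, sketched on a made-up frame, is to convert the column with `pd.to_numeric` first and then fill only the numeric columns. The frame `df` and the name `numeric_cols` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: horsepower arrives as text with a '?' placeholder.
df = pd.DataFrame(
    {"horsepower": ["100", "?", "200", "150"], "weight": [2000, 2500, 3000, 3500]}
)

# errors='coerce' turns anything unparseable (here '?') into NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Fill each numeric column's gaps with that column's own median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
print(df)
```

Restricting the fill to numeric columns leaves text columns such as a car name untouched.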
