
Machine Learning, Classification problem. Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.


Classification - Predict Salary group


Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

Attribute Information:

Listing of attributes:

Labels : >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Scope of this notebook

In this notebook, several classification algorithms are trained on one part of the dataset, and their accuracy scores on the training part and on the held-out part are compared. The goal is purely educational: to see how different algorithms behave on the Adult dataset.

import pandas as pd
import matplotlib.pyplot as plt
columns = ['Age','workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','label']

adults = pd.read_csv('adult.csv',names=columns)
adults_test = pd.read_csv('adult.csv',names=columns)
train_data = adults.drop('label',axis=1)

test_data = adults_test.drop('label',axis=1)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([train_data, test_data])

label = pd.concat([adults['label'], adults_test['label']])
   Age  workclass         fnlwgt  education  education_num  marital_status      occupation         relationship   race   sex     capital_gain  capital_loss  hours_per_week  native_country
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba
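The raw file separates fields with a comma followed by a space and marks unknown values with `?`, which is why a `workclass_ ?` column appears after one-hot encoding later in the notebook. A minimal sketch, assuming that layout, of how `pd.read_csv` could strip the padding and treat `?` as missing. The two sample rows and the shortened column list are made up for illustration:

```python
import io
import pandas as pd

# Two illustrative rows in the same "comma + space" layout as the raw file;
# the second row uses "?" for an unknown workclass.
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
)

columns = ['Age', 'workclass', 'education', 'label']  # shortened for the sketch

df = pd.read_csv(
    sample,
    names=columns,
    skipinitialspace=True,  # drop the space after each comma
    na_values='?',          # treat "?" as a missing value
)

print(df)
print(df['workclass'].isna().sum())
```

With the padding removed, category values compare cleanly (`'State-gov'` rather than `' State-gov'`), and missing entries can be handled explicitly instead of becoming a category of their own.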
full_dataset = pd.concat([adults, adults_test])
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: label, dtype: object
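As written, `adults` and `adults_test` are read from the same file, so the combined data holds every row twice, and copies of one row can end up on both sides of a later random split. A minimal sketch, on a made-up three-row frame, of how such repeats could be detected and dropped with `duplicated` and `drop_duplicates`:

```python
import pandas as pd

# Hypothetical toy frame standing in for one loaded file
part = pd.DataFrame({
    'Age': [39, 50, 38],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private'],
    'label': ['<=50K', '<=50K', '<=50K'],
})

# Stacking the same frame twice mimics reading one file into two variables
combined = pd.concat([part, part])

n_repeats = combined.duplicated().sum()  # rows identical to an earlier row
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(combined), len(deduplicated))
```

Test scores measured on rows that were also seen during training tend to look better than they would on genuinely new data, so a check like this is worth running before the split.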
data_binary = pd.get_dummies(data)

Age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... native_country_ Portugal native_country_ Puerto-Rico native_country_ Scotland native_country_ South native_country_ Taiwan native_country_ Thailand native_country_ Trinadad&Tobago native_country_ United-States native_country_ Vietnam native_country_ Yugoslavia
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 108 columns
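`pd.get_dummies` leaves numeric columns untouched and replaces each text column with one indicator column per distinct value, named `<column>_<value>`. The space inside names such as `workclass_ ?` comes from the leading space in the raw values. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [39, 50, 38],
    'workclass': [' State-gov', ' Private', ' ?'],  # leading spaces kept on purpose
})

encoded = pd.get_dummies(toy)

# The numeric column comes first, followed by one column per category value
print(list(encoded.columns))
print(encoded.shape)
```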

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data_binary,label)
performance = []
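`train_test_split` shuffles the rows and, by default, holds out 25% of them for testing. Because no seed is set in the call above, the scores change from run to run. A minimal sketch, on synthetic data, of fixing the seed with `random_state` and keeping the class ratio equal in both parts with `stratify`. The seed value 0 and the toy arrays are arbitrary choices, not taken from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 20 of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    random_state=0,  # same split on every run
    stratify=y,      # preserve the 20/80 class ratio in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```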
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()
GNB.fit(x_train,y_train)  # x_train holds the one-hot encoded (binary) data
train_score = GNB.score(x_train,y_train)
test_score = GNB.score(x_test,y_test)
print(f'Gaussian Naive Bayes : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Gaussian Naive Bayes', 'training_score':train_score, 'testing_score':test_score})
Gaussian Naive Bayes : Training score - 0.7961753444851661 - Test score - 0.7928259934893435
# LogisticRegression
from sklearn.linear_model import LogisticRegression

logClassifier = LogisticRegression()
logClassifier.fit(x_train,y_train)
train_score = logClassifier.score(x_train,y_train)
test_score = logClassifier.score(x_test,y_test)

print(f'LogisticRegression : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'LogisticRegression', 'training_score':train_score, 'testing_score':test_score})
LogisticRegression : Training score - 0.7986527712372802 - Test score - 0.7952214237454702
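For classifiers, scikit-learn's `score` method returns mean accuracy: the fraction of rows whose predicted label matches the true one. Accuracy is easier to judge next to a baseline that always predicts the most frequent class. A minimal sketch in plain Python, using a made-up label list:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels: 3 of 4 rows belong to the majority class
y_true = ['<=50K', '<=50K', '>50K', '<=50K']

majority_class = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_class] * len(y_true)

print(majority_class, accuracy(y_true, baseline_pred))
```

Running the same calculation on `y_test` would show how much of each reported score is explained by class imbalance alone.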
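# K Nearest Neighbors: compare odd values of n_neighbors from 1 to 19
from sklearn.neighbors import KNeighborsClassifier
knn_scores = []
train_scores = []
test_scores = []

for n in range(1,20,2):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(x_train,y_train)
    train_score = knn.score(x_train,y_train)
    test_score = knn.score(x_test,y_test)
    print(f'KNN : Training score - {train_score} -- Test score - {test_score}')
    # Record the scores so they can be plotted after the loop
    train_scores.append(train_score)
    test_scores.append(test_score)
    knn_scores.append({'algorithm':'KNN', 'training_score':train_score, 'testing_score':test_score})
# Blue: training accuracy, red: test accuracy, for each n_neighbors
plt.scatter(x=range(1, 20, 2),y=train_scores,c='b')
plt.scatter(x=range(1, 20, 2),y=test_scores,c='r')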
KNN : Training score - 0.9999795253987429 -- Test score - 0.9323751612308826
KNN : Training score - 0.946233697098749 -- Test score - 0.7712671211842025
KNN : Training score - 0.8647652586965869 -- Test score - 0.8119894355383576
KNN : Training score - 0.847730390450646 -- Test score - 0.7886493458632762
KNN : Training score - 0.8347085440511046 -- Test score - 0.7997051778146306
KNN : Training score - 0.8288528080915624 -- Test score - 0.7950371598796143
KNN : Training score - 0.8205196453799062 -- Test score - 0.7985381733308765
KNN : Training score - 0.8186769312667636 -- Test score - 0.7991523862170629
KNN : Training score - 0.815093876046764 -- Test score - 0.7985995946194951
KNN : Training score - 0.8123502794783072 -- Test score - 0.7995823352373933
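The loop above compares values of `n_neighbors` by their test scores, which makes the test set part of model selection. One common alternative is to pick the value by cross-validation on the training split and keep the test split for a final check. A sketch under that assumption, on synthetic data from `make_classification` (the dataset size, seed, and 5 folds are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv_means = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data
    cv_means[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(best_k, cv_means[best_k])
```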


knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train,y_train)


train_score = knn.score(x_train,y_train)
test_score = knn.score(x_test,y_test)

print(f'K Neighbors : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'K Neighbors', 'training_score':train_score, 'testing_score':test_score})
K Neighbors : Training score - 0.8647652586965869 - Test score - 0.8119894355383576
[{'algorithm': 'Gaussian Naive Bayes',
  'testing_score': 0.79282599348934346,
  'training_score': 0.79617534448516614},
 {'algorithm': 'LogisticRegression',
  'testing_score': 0.79522142374547022,
  'training_score': 0.79865277123728018},
 {'algorithm': 'K Neighbors',
  'testing_score': 0.81198943553835756,
  'training_score': 0.86476525869658694}]
from sklearn.ensemble import RandomForestClassifier
rndTree = RandomForestClassifier()
rndTree.fit(x_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
train_score = rndTree.score(x_train,y_train)
test_score = rndTree.score(x_test,y_test)

print(f'Random Forests : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Random Forests', 'training_score':train_score, 'testing_score':test_score})
Random Forests : Training score - 0.9960893511598862 - Test score - 0.9484675388489651
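The printed parameters above show the defaults in use for this run (`n_estimators=10`, `random_state=None`), so each run builds a different and fairly small forest. A minimal sketch, on synthetic data, of fixing the seed, using more trees, and reading the fitted `feature_importances_`. The values 100 and 0 are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 features, of which 2 carry signal
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One non-negative weight per feature; the weights sum to 1
importances = forest.feature_importances_
print(importances.round(3), importances.sum())
```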
from sklearn import svm

svc = svm.SVC(kernel='linear')

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training features only (StandardScaler ignores labels)
scaler.fit(x_train)
StandardScaler(copy=True, with_mean=True, with_std=True)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
svc.fit(x_train_scaled,y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
train_score = svc.score(x_train_scaled,y_train)
test_score = svc.score(x_test_scaled,y_test)

print(f'Support Vector Machine: Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Support Vector Machine', 'training_score':train_score, 'testing_score':test_score})
Support Vector Machine: Training score - 0.8533199565938453 - Test score - 0.8501320557705301
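Scaling and the classifier can also be bundled into one estimator, so that the scaler is always fitted on the training split only. A minimal sketch using `make_pipeline`, on synthetic data (sizes and seeds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded census features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit() scales with statistics from X_tr only, then trains the SVM;
# score() reuses those statistics to transform X_te before predicting
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_tr, y_tr)

print(model.score(X_tr, y_tr), model.score(X_te, y_te))
```

A pipeline like this can also be passed to `cross_val_score`, in which case the scaler is refitted inside every fold.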
performance_df = pd.DataFrame(performance)
   algorithm               testing_score  training_score
0  Gaussian Naive Bayes    0.792826       0.796175
1  LogisticRegression      0.795221       0.798653
2  K Neighbors             0.811989       0.864765
3  Random Forests          0.948468       0.996089
4  Support Vector Machine  0.850132       0.853320
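A short sketch of how the results could be ranked and checked for overfitting, by sorting on `testing_score` and adding the gap between training and test accuracy. The values are copied from the table above; the `gap` column is an addition for illustration:

```python
import pandas as pd

performance_df = pd.DataFrame({
    'algorithm': ['Gaussian Naive Bayes', 'LogisticRegression', 'K Neighbors',
                  'Random Forests', 'Support Vector Machine'],
    'testing_score': [0.792826, 0.795221, 0.811989, 0.948468, 0.850132],
    'training_score': [0.796175, 0.798653, 0.864765, 0.996089, 0.853320],
})

# Larger gap = bigger drop from training accuracy to test accuracy
performance_df['gap'] = performance_df['training_score'] - performance_df['testing_score']

ranked = performance_df.sort_values('testing_score', ascending=False)
print(ranked.to_string(index=False))
```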

