Classification - Predict Salary group

Dataset

https://archive.ics.uci.edu/ml/datasets/Adult

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

Attribute Information:

Labels: >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Scope of this notebook

In this notebook, several classification algorithms are trained on the same data and their scores are compared, both as a learning exercise and to see how the different algorithms behave on the Adult dataset.

import pandas as pd
import matplotlib.pyplot as plt

column_names = ['Age','workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','label']

adults = pd.read_csv('adult.csv', names=column_names)
# NOTE: this re-reads the same file, so every row ends up in the data twice.
# The UCI distribution ships a separate test split ('adult.test'), which is
# presumably what was intended here.
adults_test = pd.read_csv('adult.csv', names=column_names)

train_data = adults.drop('label', axis=1)
test_data = adults_test.drop('label', axis=1)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
data = pd.concat([train_data, test_data])
label = pd.concat([adults['label'], adults_test['label']])

data.head()
Age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba
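
One quirk worth knowing about the raw file: fields are separated by ", " with a space, so every categorical value arrives with a leading space (it resurfaces later in dummy-column names such as workclass_ ?). A minimal cleanup sketch, not part of the original notebook:

# Hedged cleanup: strip the leading space the raw Adult file puts in
# front of every string value.
obj_cols = data.select_dtypes(include='object').columns
data[obj_cols] = data[obj_cols].apply(lambda s: s.str.strip())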
# Full dataset including the label column (again pd.concat instead of the
# removed DataFrame.append).
full_dataset = pd.concat([adults, adults_test])
label.head()
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: label, dtype: object
# One-hot encode every categorical column; numeric columns pass through.
data_binary = pd.get_dummies(data)

data_binary.head()
Age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... native_country_ Portugal native_country_ Puerto-Rico native_country_ Scotland native_country_ South native_country_ Taiwan native_country_ Thailand native_country_ Trinadad&Tobago native_country_ United-States native_country_ Vietnam native_country_ Yugoslavia
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 108 columns
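Because the train and test frames are concatenated before pd.get_dummies, both halves share the same 108 columns. Encoded separately, the column sets could differ whenever a category appears in only one split; a hypothetical illustration of the reindex fix:

# Hypothetical illustration: dummy columns diverge if the splits are
# encoded separately; reindexing against the training columns realigns them.
train_raw = pd.DataFrame({'workclass': ['Private', 'State-gov']})
test_raw = pd.DataFrame({'workclass': ['Private', 'Never-worked']})
train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw).reindex(columns=train_enc.columns, fill_value=0)
print(list(train_enc.columns) == list(test_enc.columns))  # True
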

from sklearn.model_selection import train_test_split

# Default split: 75% train / 25% test. No random_state is set, so the
# exact scores below will vary from run to run.
x_train, x_test, y_train, y_test = train_test_split(data_binary, label)
performance = []
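
The split above uses scikit-learn defaults. A hedged variant (not what the notebook ran) that makes results reproducible and keeps the class ratio equal in both halves:

x_train, x_test, y_train, y_test = train_test_split(
    data_binary, label,
    test_size=0.25,      # same proportion as the default
    random_state=42,     # hypothetical seed, for reproducibility
    stratify=label)      # preserve the >50K / <=50K ratio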
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()
GNB.fit(x_train, y_train)  # features are the one-hot encoded matrix
train_score = GNB.score(x_train, y_train)
test_score = GNB.score(x_test, y_test)
print(f'Gaussian Naive Bayes : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Gaussian Naive Bayes', 'training_score':train_score, 'testing_score':test_score})
Gaussian Naive Bayes : Training score - 0.7961753444851661 - Test score - 0.7928259934893435
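
GaussianNB fits a normal distribution per feature, which is an odd fit for 0/1 dummy columns. A variant worth comparing, not in the original notebook, is BernoulliNB, which models binary features directly (its default binarize=0.0 also thresholds the continuous columns):

from sklearn.naive_bayes import BernoulliNB

# Hedged comparison: Bernoulli NB treats each feature as binary.
bnb = BernoulliNB()
bnb.fit(x_train, y_train)
print(f'Bernoulli Naive Bayes : Test score - {bnb.score(x_test, y_test)}')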
# LogisticRegression
from sklearn.linear_model import LogisticRegression


logClassifier = LogisticRegression()
logClassifier.fit(x_train,y_train)
train_score = logClassifier.score(x_train,y_train)
test_score = logClassifier.score(x_test,y_test)

print(f'LogisticRegression : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'LogisticRegression', 'training_score':train_score, 'testing_score':test_score})
LogisticRegression : Training score - 0.7986527712372802 - Test score - 0.7952214237454702
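
On newer scikit-learn versions, LogisticRegression with the default max_iter may warn that it failed to converge on these unscaled features. A hedged fix is to scale inside a pipeline, so the scaler only ever sees training rows:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch assuming a recent scikit-learn: scaling happens inside the
# pipeline and is fit on x_train only.
log_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_pipe.fit(x_train, y_train)
print(log_pipe.score(x_test, y_test))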
from sklearn.neighbors import KNeighborsClassifier

knn_scores = []
train_scores = []
test_scores = []

# Try odd k from 1 to 19 and record train/test accuracy for each.
for n in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(x_train, y_train)
    train_score = knn.score(x_train, y_train)
    test_score = knn.score(x_test, y_test)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f'KNN : Training score - {train_score} -- Test score - {test_score}')
    knn_scores.append({'algorithm':'KNN', 'n_neighbors':n, 'training_score':train_score, 'testing_score':test_score})

# Blue: training accuracy; red: test accuracy.
plt.scatter(x=range(1, 20, 2), y=train_scores, c='b')
plt.scatter(x=range(1, 20, 2), y=test_scores, c='r')

plt.show()
KNN : Training score - 0.9999795253987429 -- Test score - 0.9323751612308826
KNN : Training score - 0.946233697098749 -- Test score - 0.7712671211842025
KNN : Training score - 0.8647652586965869 -- Test score - 0.8119894355383576
KNN : Training score - 0.847730390450646 -- Test score - 0.7886493458632762
KNN : Training score - 0.8347085440511046 -- Test score - 0.7997051778146306
KNN : Training score - 0.8288528080915624 -- Test score - 0.7950371598796143
KNN : Training score - 0.8205196453799062 -- Test score - 0.7985381733308765
KNN : Training score - 0.8186769312667636 -- Test score - 0.7991523862170629
KNN : Training score - 0.815093876046764 -- Test score - 0.7985995946194951
KNN : Training score - 0.8123502794783072 -- Test score - 0.7995823352373933

(Figure: training accuracy (blue) and test accuracy (red) for odd k from 1 to 19.)

# Refit with k=5 and record its scores in the comparison table.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

train_score = knn.score(x_train, y_train)
test_score = knn.score(x_test, y_test)

print(f'K Neighbors : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'K Neighbors', 'training_score':train_score, 'testing_score':test_score})
K Neighbors : Training score - 0.8647652586965869 - Test score - 0.8119894355383576
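
Eyeballing the scatter plot is one way to choose k; a hedged alternative is to cross-validate on the training set only (slow on this many rows, but mechanical):

from sklearn.model_selection import cross_val_score

# Sketch: mean 5-fold CV accuracy per odd k, using training data only.
cv_means = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               x_train, y_train, cv=5).mean()
            for k in range(1, 20, 2)}
best_k = max(cv_means, key=cv_means.get)
print(best_k, cv_means[best_k])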
performance
[{'algorithm': 'Gaussian Naive Bayes',
  'testing_score': 0.79282599348934346,
  'training_score': 0.79617534448516614},
 {'algorithm': 'LogisticRegression',
  'testing_score': 0.79522142374547022,
  'training_score': 0.79865277123728018},
 {'algorithm': 'K Neighbors',
  'testing_score': 0.81198943553835756,
  'training_score': 0.86476525869658694}]
from sklearn.ensemble import RandomForestClassifier

# Defaults from the scikit-learn version used here (n_estimators=10, see
# the repr below); newer versions default to 100 trees.
rndTree = RandomForestClassifier()
rndTree.fit(x_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
rndTree.score(x_test,y_test)
0.94846753884896506
rndTree.score(x_train,y_train)
0.99608935115988617
train_score = rndTree.score(x_train,y_train)
test_score = rndTree.score(x_test,y_test)

print(f'Random Forests : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Random Forests', 'training_score':train_score, 'testing_score':test_score})
Random Forests : Training score - 0.9960893511598862 - Test score - 0.9484675388489651
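
The near-perfect scores here (and KNN's 0.93 test score at k=1) come with a caveat: because the same CSV is read twice and concatenated, every row appears in the data twice, so a random split almost always places one copy in the training set and its duplicate in the test set. Models that memorize (1-NN, deep forests) then score far higher than they would on genuinely unseen rows. A hedged check that removes the duplication before splitting:

# Hedged check, assuming the duplication described above: keep one copy
# of each row before splitting, so no row can leak across the split.
mask = ~data_binary.duplicated().values   # positional boolean mask
x_tr, x_te, y_tr, y_te = train_test_split(
    data_binary[mask], label[mask], random_state=42)
rf_check = RandomForestClassifier(random_state=42)
rf_check.fit(x_tr, y_tr)
print(rf_check.score(x_te, y_te))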
from sklearn import svm

svc = svm.SVC(kernel='linear')

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# NOTE: fitting the scaler on data_binary uses statistics from rows that
# end up in x_test (mild test-set leakage); see the corrected sketch below.
scaler.fit(data_binary, label)
StandardScaler(copy=True, with_mean=True, with_std=True)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
svc.fit(x_train_scaled,y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
svc.score(x_test_scaled,y_test)
0.85013205577053008
train_score = svc.score(x_train_scaled,y_train)
test_score = svc.score(x_test_scaled,y_test)

print(f'Support Vector Machine: Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Support Vector Machine', 'training_score':train_score, 'testing_score':test_score})
Support Vector Machine: Training score - 0.8533199565938453 - Test score - 0.8501320557705301
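
As flagged above, the scaler was fit on all rows, test rows included. A corrected sketch that estimates the scaling statistics from the training rows only, then transforms both splits:

# Hedged correction: compute mean/std from x_train alone.
scaler_fixed = StandardScaler().fit(x_train)
x_train_scaled = scaler_fixed.transform(x_train)
x_test_scaled = scaler_fixed.transform(x_test)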
performance_df = pd.DataFrame(performance)
performance_df
algorithm testing_score training_score
0 Gaussian Naive Bayes 0.792826 0.796175
1 LogisticRegression 0.795221 0.798653
2 K Neighbors 0.811989 0.864765
3 Random Forests 0.948468 0.996089
4 Support Vector Machine 0.850132 0.853320
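
A quick visual wrap-up, a minimal sketch using the matplotlib import from the top of the notebook:

# Sketch: side-by-side bars of training vs. test accuracy per algorithm.
performance_df.plot.bar(x='algorithm', y=['training_score', 'testing_score'])
plt.ylabel('accuracy')
plt.tight_layout()
plt.show()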
