Classification - Predict Salary group

Dataset

https://archive.ics.uci.edu/ml/datasets/Adult

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

Attribute Information:

Labels: >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Scope of this notebook

In this notebook, several classification algorithms are trained on the same data and their scores are compared, both as a learning exercise and to see how the different algorithms behave on the Adult dataset.

import pandas as pd
import matplotlib.pyplot as plt

column_names = ['Age','workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','label']

adults = pd.read_csv('adult.csv', names=column_names)
# NOTE: this re-reads the same file, so every row ends up in the data twice.
# The UCI distribution ships a separate test split ('adult.test'), which is
# presumably what was intended here.
adults_test = pd.read_csv('adult.csv', names=column_names)

train_data = adults.drop('label', axis=1)
test_data = adults_test.drop('label', axis=1)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
data = pd.concat([train_data, test_data])
label = pd.concat([adults['label'], adults_test['label']])

data.head()
Age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba
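
One quirk worth knowing about the raw file: fields are separated by ", " with a space, so every categorical value arrives with a leading space (it resurfaces later in dummy-column names such as workclass_ ?). A minimal cleanup sketch, not part of the original notebook:

# Hedged cleanup: strip the leading space the raw Adult file puts in
# front of every string value.
obj_cols = data.select_dtypes(include='object').columns
data[obj_cols] = data[obj_cols].apply(lambda s: s.str.strip())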
# Full dataset including the label column (again pd.concat instead of the
# removed DataFrame.append).
full_dataset = pd.concat([adults, adults_test])
label.head()
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: label, dtype: object
# One-hot encode every categorical column; numeric columns pass through.
data_binary = pd.get_dummies(data)

data_binary.head()
Age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... native_country_ Portugal native_country_ Puerto-Rico native_country_ Scotland native_country_ South native_country_ Taiwan native_country_ Thailand native_country_ Trinadad&Tobago native_country_ United-States native_country_ Vietnam native_country_ Yugoslavia
0 39 77516 13 2174 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 50 83311 13 0 0 13 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 38 215646 9 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 53 234721 7 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 28 338409 13 0 0 40 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 108 columns
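Because the train and test frames are concatenated before pd.get_dummies, both halves share the same 108 columns. Encoded separately, the column sets could differ whenever a category appears in only one split; a hypothetical illustration of the reindex fix:

# Hypothetical illustration: dummy columns diverge if the splits are
# encoded separately; reindexing against the training columns realigns them.
train_raw = pd.DataFrame({'workclass': ['Private', 'State-gov']})
test_raw = pd.DataFrame({'workclass': ['Private', 'Never-worked']})
train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw).reindex(columns=train_enc.columns, fill_value=0)
print(list(train_enc.columns) == list(test_enc.columns))  # True
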

from sklearn.model_selection import train_test_split

# Default split: 75% train / 25% test. No random_state is set, so the
# exact scores below will vary from run to run.
x_train, x_test, y_train, y_test = train_test_split(data_binary, label)
performance = []
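
The split above uses scikit-learn defaults. A hedged variant (not what the notebook ran) that makes results reproducible and keeps the class ratio equal in both halves:

x_train, x_test, y_train, y_test = train_test_split(
    data_binary, label,
    test_size=0.25,      # same proportion as the default
    random_state=42,     # hypothetical seed, for reproducibility
    stratify=label)      # preserve the >50K / <=50K ratio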
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()
GNB.fit(x_train, y_train)  # features are the one-hot encoded matrix
train_score = GNB.score(x_train, y_train)
test_score = GNB.score(x_test, y_test)
print(f'Gaussian Naive Bayes : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Gaussian Naive Bayes', 'training_score':train_score, 'testing_score':test_score})
Gaussian Naive Bayes : Training score - 0.7961753444851661 - Test score - 0.7928259934893435
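
GaussianNB fits a normal distribution per feature, which is an odd fit for 0/1 dummy columns. A variant worth comparing, not in the original notebook, is BernoulliNB, which models binary features directly (its default binarize=0.0 also thresholds the continuous columns):

from sklearn.naive_bayes import BernoulliNB

# Hedged comparison: Bernoulli NB treats each feature as binary.
bnb = BernoulliNB()
bnb.fit(x_train, y_train)
print(f'Bernoulli Naive Bayes : Test score - {bnb.score(x_test, y_test)}')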
# LogisticRegression
from sklearn.linear_model import LogisticRegression


logClassifier = LogisticRegression()
logClassifier.fit(x_train,y_train)
train_score = logClassifier.score(x_train,y_train)
test_score = logClassifier.score(x_test,y_test)

print(f'LogisticRegression : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'LogisticRegression', 'training_score':train_score, 'testing_score':test_score})
LogisticRegression : Training score - 0.7986527712372802 - Test score - 0.7952214237454702
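
On newer scikit-learn versions, LogisticRegression with the default max_iter may warn that it failed to converge on these unscaled features. A hedged fix is to scale inside a pipeline, so the scaler only ever sees training rows:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch assuming a recent scikit-learn: scaling happens inside the
# pipeline and is fit on x_train only.
log_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_pipe.fit(x_train, y_train)
print(log_pipe.score(x_test, y_test))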
from sklearn.neighbors import KNeighborsClassifier

knn_scores = []
train_scores = []
test_scores = []

# Try odd k from 1 to 19 and record train/test accuracy for each.
for n in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(x_train, y_train)
    train_score = knn.score(x_train, y_train)
    test_score = knn.score(x_test, y_test)
    train_scores.append(train_score)
    test_scores.append(test_score)
    print(f'KNN : Training score - {train_score} -- Test score - {test_score}')
    knn_scores.append({'algorithm':'KNN', 'n_neighbors':n, 'training_score':train_score, 'testing_score':test_score})

# Blue: training accuracy; red: test accuracy.
plt.scatter(x=range(1, 20, 2), y=train_scores, c='b')
plt.scatter(x=range(1, 20, 2), y=test_scores, c='r')

plt.show()
KNN : Training score - 0.9999795253987429 -- Test score - 0.9323751612308826
KNN : Training score - 0.946233697098749 -- Test score - 0.7712671211842025
KNN : Training score - 0.8647652586965869 -- Test score - 0.8119894355383576
KNN : Training score - 0.847730390450646 -- Test score - 0.7886493458632762
KNN : Training score - 0.8347085440511046 -- Test score - 0.7997051778146306
KNN : Training score - 0.8288528080915624 -- Test score - 0.7950371598796143
KNN : Training score - 0.8205196453799062 -- Test score - 0.7985381733308765
KNN : Training score - 0.8186769312667636 -- Test score - 0.7991523862170629
KNN : Training score - 0.815093876046764 -- Test score - 0.7985995946194951
KNN : Training score - 0.8123502794783072 -- Test score - 0.7995823352373933

(Figure: training accuracy (blue) and test accuracy (red) for odd k from 1 to 19.)

# Refit with k=5 and record its scores in the comparison table.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

train_score = knn.score(x_train, y_train)
test_score = knn.score(x_test, y_test)

print(f'K Neighbors : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'K Neighbors', 'training_score':train_score, 'testing_score':test_score})
K Neighbors : Training score - 0.8647652586965869 - Test score - 0.8119894355383576
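
Eyeballing the scatter plot is one way to choose k; a hedged alternative is to cross-validate on the training set only (slow on this many rows, but mechanical):

from sklearn.model_selection import cross_val_score

# Sketch: mean 5-fold CV accuracy per odd k, using training data only.
cv_means = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               x_train, y_train, cv=5).mean()
            for k in range(1, 20, 2)}
best_k = max(cv_means, key=cv_means.get)
print(best_k, cv_means[best_k])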
performance
[{'algorithm': 'Gaussian Naive Bayes',
  'testing_score': 0.79282599348934346,
  'training_score': 0.79617534448516614},
 {'algorithm': 'LogisticRegression',
  'testing_score': 0.79522142374547022,
  'training_score': 0.79865277123728018},
 {'algorithm': 'K Neighbors',
  'testing_score': 0.81198943553835756,
  'training_score': 0.86476525869658694}]
from sklearn.ensemble import RandomForestClassifier

# Defaults from the scikit-learn version used here (n_estimators=10, see
# the repr below); newer versions default to 100 trees.
rndTree = RandomForestClassifier()
rndTree.fit(x_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
rndTree.score(x_test,y_test)
0.94846753884896506
rndTree.score(x_train,y_train)
0.99608935115988617
train_score = rndTree.score(x_train,y_train)
test_score = rndTree.score(x_test,y_test)

print(f'Random Forests : Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Random Forests', 'training_score':train_score, 'testing_score':test_score})
Random Forests : Training score - 0.9960893511598862 - Test score - 0.9484675388489651
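
The near-perfect scores here (and KNN's 0.93 test score at k=1) come with a caveat: because the same CSV is read twice and concatenated, every row appears in the data twice, so a random split almost always places one copy in the training set and its duplicate in the test set. Models that memorize (1-NN, deep forests) then score far higher than they would on genuinely unseen rows. A hedged check that removes the duplication before splitting:

# Hedged check, assuming the duplication described above: keep one copy
# of each row before splitting, so no row can leak across the split.
mask = ~data_binary.duplicated().values   # positional boolean mask
x_tr, x_te, y_tr, y_te = train_test_split(
    data_binary[mask], label[mask], random_state=42)
rf_check = RandomForestClassifier(random_state=42)
rf_check.fit(x_tr, y_tr)
print(rf_check.score(x_te, y_te))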
from sklearn import svm

svc = svm.SVC(kernel='linear')

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# NOTE: fitting the scaler on data_binary uses statistics from rows that
# end up in x_test (mild test-set leakage); see the corrected sketch below.
scaler.fit(data_binary, label)
StandardScaler(copy=True, with_mean=True, with_std=True)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
svc.fit(x_train_scaled,y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
svc.score(x_test_scaled,y_test)
0.85013205577053008
train_score = svc.score(x_train_scaled,y_train)
test_score = svc.score(x_test_scaled,y_test)

print(f'Support Vector Machine: Training score - {train_score} - Test score - {test_score}')

performance.append({'algorithm':'Support Vector Machine', 'training_score':train_score, 'testing_score':test_score})
Support Vector Machine: Training score - 0.8533199565938453 - Test score - 0.8501320557705301
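
As flagged above, the scaler was fit on all rows, test rows included. A corrected sketch that estimates the scaling statistics from the training rows only, then transforms both splits:

# Hedged correction: compute mean/std from x_train alone.
scaler_fixed = StandardScaler().fit(x_train)
x_train_scaled = scaler_fixed.transform(x_train)
x_test_scaled = scaler_fixed.transform(x_test)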
performance_df = pd.DataFrame(performance)
performance_df
algorithm testing_score training_score
0 Gaussian Naive Bayes 0.792826 0.796175
1 LogisticRegression 0.795221 0.798653
2 K Neighbors 0.811989 0.864765
3 Random Forests 0.948468 0.996089
4 Support Vector Machine 0.850132 0.853320
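
A quick visual wrap-up, a minimal sketch using the matplotlib import from the top of the notebook:

# Sketch: side-by-side bars of training vs. test accuracy per algorithm.
performance_df.plot.bar(x='algorithm', y=['training_score', 'testing_score'])
plt.ylabel('accuracy')
plt.tight_layout()
plt.show()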
