In this project for CS345 Software Engineering, I implement two machine learning classification algorithms from scratch: Multinomial Naive Bayes and AdaBoost.
- My API is inspired by the API of scikit-learn, with a `Classifier()` object and two methods, `fit()` and `predict()`. An example call would be:

```python
clf = Classifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
- All the `X` arrays should be two-dimensional, of the form `m x n`, where `m` is the number of rows/samples and `n` is the number of columns/features. The `y` array, i.e. the target, should be a one-dimensional array of length `m`.
- Arguments of the class `NaiveBayesClassifier`:
  - `alpha`: the pseudocount used in Laplace smoothing. The default value is 1.
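The role of `alpha` can be illustrated with a minimal sketch of a Laplace-smoothed probability estimate (an illustration of the technique, not the exact code in this repository):

```python
def smoothed_prob(count, total, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(feature value | class).

    count:    occurrences of this feature value among samples of the class
    total:    number of samples in the class
    n_values: number of distinct values this feature can take
    alpha:    pseudocount (alpha=1 is classic Laplace smoothing)
    """
    return (count + alpha) / (total + alpha * n_values)

# With alpha=1, a value never seen in the class still gets a small
# nonzero probability instead of zeroing out the whole product:
print(smoothed_prob(count=0, total=10, n_values=3))  # 1/13 ≈ 0.0769
```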
- Arguments of the class `AdaBoost`:
  - `learner_num`: the number of individual (weak) learners used.
  - `learner_type`: the type of learner used. By default, `DecisionStump` is used. However, any learner object with `fit()` and `predict()` methods can be used; for example, `NaiveBayesClassifier`.
  - `**learner_kwargs`: the keyword arguments passed to the learner object.
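The keyword-argument forwarding described above can be sketched as follows. This is a minimal illustration, not the repository's implementation; the hypothetical `DummyLearner` stands in for a weak learner such as `DecisionStump` or `NaiveBayesClassifier`:

```python
class DummyLearner:
    """Stand-in for a weak learner with fit()/predict() methods."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def fit(self, X, y): ...
    def predict(self, X): ...

class AdaBoostSketch:
    """Sketch of the constructor: **learner_kwargs is forwarded to
    each weak learner when it is instantiated."""
    def __init__(self, learner_num=50, learner_type=DummyLearner, **learner_kwargs):
        self.learners = [learner_type(**learner_kwargs) for _ in range(learner_num)]

# e.g. an ensemble of 10 learners, each constructed with alpha=0.5:
ensemble = AdaBoostSketch(learner_num=10, learner_type=DummyLearner, alpha=0.5)
print(len(ensemble.learners), ensemble.learners[0].alpha)  # 10 0.5
```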
- To quickly test the algorithms, from the command line, run `$ python3 test_naive_bayes.py <dataset.csv>`. You can follow the same syntax for `test_ada_boost.py`.
- The two available datasets in the repository are `play_tennis.csv` and `will_wait.csv`. Note: the two algorithms only work with categorical features for now.
- When you run `test_naive_bayes.py`, it will print out the leave-one-out cross-validation scores of my implementation and the implementation of scikit-learn for comparison.
- When you run `test_ada_boost.py`, it will print out the leave-one-out cross-validation score of one `DecisionStump` and the max score of `AdaBoost` after 100 trials. This is because my AdaBoost uses (random) resampling instead of reweighting, so some randomness is involved.
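The resampling step can be illustrated with `numpy.random.choice` (a generic sketch of resampling-based boosting, not the repository's exact code): each boosting round draws a new training set of size `m` with replacement, with probabilities proportional to the current sample weights, so misclassified samples are more likely to be drawn.

```python
import numpy as np

rng = np.random.default_rng()

def resample(X, y, weights):
    """Draw a bootstrap sample of (X, y), biased toward high-weight samples."""
    m = len(y)
    idx = rng.choice(m, size=m, replace=True, p=weights / weights.sum())
    return X[idx], y[idx]

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])
w = np.array([0.1, 0.1, 0.1, 0.7])  # the last sample is heavily weighted
X_res, y_res = resample(X, y, w)
print(X_res.shape, y_res.shape)  # (4, 1) (4,)
```

Because the draw is random, repeated runs produce different training sets (and thus different scores), which is why `test_ada_boost.py` reports the max over 100 trials.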
- For building the algorithms: `numpy`
- For testing the algorithms: `pandas`, `scikit-learn`