my first ML_Algorithm
KNN:
the algorithm has to carry around the full dataSet ,implies a large amount of storage
need to calculate the distance measurement for every piece of data in database
doesn't give you any idea of the underlying strcture of the data
1.calculate the distance between input and the current data
2.sort the distances in increasing order
3.take k item with lowest distances to input
4.finnd majority class among these item
5.return majority class as out predict for the class of input
def createDataSet():
will return group,labels
def classify0(idx,dataSet,labels,k):
predict your input with dataSet using Euclid distance , k = vote
def file2matrix(filename):
file to matrix
def show_data2img(DataMat,DataLabels,datatype1,datatype2):
data visualization
def autoNorm(dataSet):
normalization
def datingClassTest():
Test your model and observation your errorRate
def classifyPerson():
verify
def img2vector(filename):
img 2 32*32 vector
def handwritingClassTest():
Test handwriting model
classify0(dataSet) observation errorrate
-
dating
import kNN
datingMat , datingLabels = file2matrix('datingTestSet.txt')
kNN.show_data2img(datingMat,datingLabels , 0 ,1) ==> Data visualization
kNN.datingClassTest()
kNN.classifyPerson()
-
handwriting recognition
import kNN
kNN.handwritingClassTest()
kNN => large computations and large memory
kd_tree allows you make this smaller and take fewer computations