K Nearest Neighbour Cluster Algorithm
$Machine learning = Representation + Optimization + Evaluation$ 机器学习是建立/求解/评估模型的一个过程。
kNN算法 输入: 训练实例集 $T = {(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{N},y_{N}),}$ 其中$x_{i}$为实例的特征向量,$y_{i}$为实例的类别
输出: 实例$x$所属的类$y$ (1) 根据给定的距离度量,在$T$中找到和$x$最近邻的$k$个点,记为邻域$N_{k}(x)$ (2) 在邻域$N_{k}(x)$中根据分类决策规则(如多数表决)决定$x$所属的类$y$ $$ y = \arg \mathop{\max}{c {j}} \mathop{\sum}{x {i} \in N_{k}(x)} I(y_{i}=c_{j}) $$ Here $i=1,2,\cdots,N; j=1,2,\cdots,k$
kNN模型三要素 :k值,距离度量以及分类决策规则。
距离度量 设特征空间$\chi$是$n$维实数向量空间$R^{n}$,$x_{i},x_{j} \in \chi$,则有 $$ L_{p}(x_{i},x_{j})=\left( \sum_{l=1}^{n} \left| x_{i}^{l}- x_{j}^{l}\right|^{p} \right)^{\frac{1}{p}} $$ 此处$p \ge 1$,1对应Manhattan距离,2即为常用的欧式距离,
实际例子(电影分类) 欧式距离定义分类方式。(首先需要归一化数值)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 """ Created on Sun May 9 16:13:12 2020 @author: chen """ from matplotlib.font_manager import FontPropertiesimport matplotlib.lines as mlinesimport matplotlib.pyplot as pltimport timeimport numpy as npimport operator""" 函数说明:创建数据集 Parameters: None Returns: group - 数据集 labels - 分类标签 """ def createDataSet (): group = np.array([[3 ,104 ],[2 ,100 ],[1 ,81 ],[101 ,10 ],[99 ,5 ],[98 ,2 ]]) labels = ['爱情片' ,'爱情片' ,'爱情片' ,'动作片' ,'动作片' ,'动作片' ] return group, labels
定义欧氏距离并生成字典获得排序判断
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 """ 函数说明:kNN算法,分类器 Parameters: inX - 用于分类的数据(测试集) dataSet - 用于训练的数据(训练集)(n*1维列向量) labels - 分类标准(n*1维列向量) k - kNN算法参数,选择距离最小的k个点 Returns: sortedClassCount[0][0] - 分类结果 """ def classify0 (inX, dataSet, labels, k ): dataSetSize = dataSet.shape[0 ] diffMat = np.tile(inX, (dataSetSize, 1 )) - dataSet sqDiffMat = diffMat**2 sqDistances = sqDiffMat.sum (axis=1 ) distances = sqDistances**0.5 sortedDistIndicies = distances.argsort() classCount = {} for i in range (k): voteIlabel = labels[sortedDistIndicies[i]] classCount[voteIlabel] = classCount.get(voteIlabel, 0 ) + 1 sortedClassCount = sorted (classCount.items(), key = operator.itemgetter(1 ), reverse = True ) return sortedClassCount[0 ][0 ]
1 2 3 4 5 6 7 8 9 10 11 def main(): group , labels = createDataSet() test = [18,90] test_class = classify0 (test, group, labels, 3 ) print (test_class) main ()