1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
|
""" Created on Tue May 18 19:02:27 2020
@author: chen """ import numpy as np from functools import reduce
""" 函数说明:创建实验样本
Parameters: None Returns: postingList - 样本词表 classVec - 类别标签向量 """ def loadDataSet(): postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']] classVec = [0, 1, 0, 1, 0, 1] return postingList, classVec
""" 函数说明:将切分的实验样本词条整理成不重复的词汇表
Parameters: dataSet - 整理的样本数据集 Returns: vocabSet - 返回不重复的词汇表 """ def createVocabList(dataSet): vocabSet = set([]) for document in dataSet: vocabSet = vocabSet | set(document) return list(vocabSet)
""" 函数说明:根据vocabList词汇表,将inputSet向量化,向量的每个元素为1或0
Parameters: vocabList - createVocabList返回的列表 inputSet - 切分的词条列表 Returns: returnVec - 文档向量,词集模型 """ def setOfWords2Vec(vocabList, inputSet): returnVec = [0] * len(vocabList) for word in inputSet: if word in vocabList: returnVec[vocabList.index(word)] = 1 else: print("the word: %s is not in my Vocabulary" % word) return returnVec
""" 函数说明:朴素贝叶斯分类器训练函数
Parameters: trainMatrix - 训练文档矩阵,即setOfWords2Vec返回的returnVec构成的矩阵 trainCategory - 训练类标签向量,即loadDataSet返回的classVec Returns: p0Vect - 侮辱类的条件概率数组 p1Vect - 非侮辱类的条件概率数组 pAbusive - 文档属于侮辱类的概率 """ def trainNB0(trainMatrix, trainCategory): numTrainDocs = len(trainMatrix) numWords = len(trainMatrix[0]) pAbusive = sum(trainCategory)/float(numTrainDocs) p0Num = np.ones(numWords) p1Num = np.ones(numWords) p0Denom = 2.0 p1Denom = 2.0 for i in range(numTrainDocs): if trainCategory[i] == 1: p1Num += trainMatrix[i] p1Denom += sum(trainMatrix[i]) else: p0Num += trainMatrix[i] p0Denom += sum(trainMatrix[i]) p1Vect = np.log(p1Num / p1Denom) p0Vect = np.log(p0Num / p0Denom) return p0Vect, p1Vect, pAbusive
""" 函数说明:朴素贝叶斯分类器分类函数
Parameters: vec2Classify - 待分类的词条数组 p0Vec - 侮辱类的条件概率数组 p1Vec - 非侮辱类的条件概率数组 pClass1 - 文档属于侮辱类的概率 Returns: 0 - 属于非侮辱类 1 - 属于侮辱类 """ def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1): p1 = sum(vec2Classify * p1Vec) + np.log(pClass1) p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1) if p1 > p0: return 1 else: return 0
""" 函数说明:测试朴素贝叶斯分类器
Parameters: None Returns: None """ def testingNB(): listOPosts, listclasses = loadDataSet() myVocabList = createVocabList(listOPosts) trainMat = [] for postinDoc in listOPosts: trainMat.append(setOfWords2Vec(myVocabList, postinDoc)) p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listclasses)) testEntry = ['love', 'my', 'dalmation'] thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry)) if classifyNB(thisDoc, p0V, p1V, pAb): print(testEntry, '属于侮辱类') else: print(testEntry, '属于非侮辱类') testEntry = ['stupid', 'garbage'] thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry)) if classifyNB(thisDoc, p0V, p1V, pAb): print(testEntry, '属于侮辱类') else: print(testEntry, '属于非侮辱类') def main(): testingNB() if __name__ == '__main__': main()
|