Using a naive Bayes classifier to decide whether a document contains abusive words: how is "love my dalmatia" classified?

Posted: 2016-05-21 07:47

The naive Bayes classifier

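The basic idea of naive Bayes is this: given a classification problem and an item to classify, compute the probability that the item belongs to each class, and assign it to the class with the highest probability.
The basic formulas follow.
Suppose the item to classify has n features F1, F2, F3, ..., Fn, and there are m classes C1, C2, C3, ..., Cm. Bayesian classification picks the class whose probability is largest.
For Bayes' theorem itself, see any standard reference.
With several features, we solve for the following.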
P(C|F1F2...Fn)= P(F1F2...Fn|C)P(C) / P(F1F2...Fn)
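In words: the probability that an item with these features belongs to class C equals the probability of seeing these features given class C, multiplied by the prior probability of class C, divided by the probability of seeing these features at all.
Since P(F1F2...Fn) is the same for every class, the problem reduces to finding the class that maximizes
P(F1F2...Fn|C)P(C).
Naive Bayes simplifies further: it assumes that all features of the item are independent of one another within a class. The expression then becomes: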
P(F1F2...Fn|C)P(C)= P(F1|C)P(F2|C) ... P(Fn|C)P(C)
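This makes the computation much simpler. The probability of a single feature occurring within a given class can be estimated directly from the sample data, so the left-hand side can be computed from the factors on the right.
In practice the features are rarely all independent of one another, yet results obtained under this assumption are still fairly accurate.
Here we assume full independence. If we instead assume that each feature depends only on the previous feature, or on the previous i features, the problem turns into a Markov model.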
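Before the real example, here is a minimal sketch of the arithmetic behind the factorized formula above. The class names, feature names and all probabilities are made-up numbers chosen only for illustration, and `naive_bayes_scores` is a hypothetical helper, not part of the code that follows.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"abusive": 0.5, "normal": 0.5}
likelihoods = {
    "abusive": {"stupid": 0.30, "dog": 0.10},
    "normal": {"stupid": 0.02, "dog": 0.15},
}

def naive_bayes_scores(features, priors, likelihoods):
    """Return P(F1|C) * ... * P(Fn|C) * P(C) for every class C."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[cls][f]  # independence: multiply per-feature terms
        scores[cls] = score
    return scores

scores = naive_bayes_scores(["stupid", "dog"], priors, likelihoods)
best = max(scores, key=scores.get)  # class with the largest score
print(scores, best)
```

The denominator P(F1F2...Fn) is never computed, because it is the same for both classes and does not change which score is largest.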
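Now let's use Python code to understand this better. The example is a community message board: the task is to classify text so that posts containing abusive language can be blocked.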
# -*- coding: utf-8 -*-
# Load the data: postingList holds 6 posts from the message board, and
# [0,1,0,1,0,1] marks posts 2, 4 and 6 as abusive (the class of each post).
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

# Build the vocabulary
def createVocabList(dataSet):
    vocabSet = set([])  # a set holds no duplicates
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with each document's words, dropping duplicates
    return list(vocabSet)  # convert to a list

# Build a zero vector from the vocabulary, then set the position of every word in the document to 1
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # zero vector as long as vocabList
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
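The code above collects every distinct word across all documents into a vocabulary, then uses the vocabulary and its length to build a word vector for each document: the position of each word that appears in the document is set to 1. At this point we record only whether a word appears, not how many times, and we do not give different words different weights; both points are discussed later.
This gives results like the following, for example the word vector of the first document: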
>>> import bayes
>>> listOPosts,listClasses = bayes.loadDataSet()
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
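Because the vocabulary is built from a set, the order of its words, and therefore the exact positions of the 1s in the vector, can differ from run to run. A minimal sketch of one way to make the vectors reproducible is to sort the vocabulary and look positions up in a dict. The names `create_sorted_vocab` and `set_of_words_vector` and the two toy documents are my own assumptions for illustration, not part of bayes.py.

```python
def create_sorted_vocab(documents):
    """Return the distinct words of all documents in alphabetical order."""
    return sorted({word for doc in documents for word in doc})

def set_of_words_vector(vocab, words):
    """Mark with 1 every vocabulary position whose word occurs in `words`."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> position
    vec = [0] * len(vocab)
    for word in words:
        if word in index:
            vec[index[word]] = 1
    return vec

docs = [["my", "dog", "has", "flea"], ["stop", "my", "dog"]]
vocab = create_sorted_vocab(docs)
vec = set_of_words_vector(vocab, docs[1])
print(vocab)
print(vec)
```

The dict lookup also avoids the repeated linear search that `vocabList.index(word)` performs, which may matter once the vocabulary grows.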
Next we compute the probability of each word within each class.
from numpy import zeros  # trainNB0 relies on numpy arrays

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # 6
    numWords = len(trainMatrix[0])  # length of the vocabulary, 32
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # (0+1+0+1+0+1)/6 = 0.5
    p0Num = zeros(numWords)  # array of numWords (32) zeros
    p1Num = zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  # abusive post: accumulate into class 1
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:  # normal post: accumulate into class 0
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # P(word | class 1) for every word
    p0Vect = p0Num / p0Denom  # P(word | class 0) for every word
    return p0Vect, p1Vect, pAbusive
>>> import bayes
>>> trainMatrix,trainCategory = bayes.getTrainMatrix()
>>> p0V,p1V,pAb = bayes.trainNB0(trainMatrix,trainCategory)
array([ 0.,
array([ 0.
Here I added a helper function:
def getTrainMatrix():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    return trainMat, listClasses
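There is actually a problem here. We have not plugged these values straight into Bayes' formula, because it is perfectly normal for a single word to have probability 0 within a class. Look at the formula:
P(c0|w) = P(w|c0)P(c0)/P(w)
To compute P(w|c0) = P(w0,w1,w2,...,wn|c0), naive Bayes treats the words wi as mutually independent,
so P(w|c0) = P(w0|c0)P(w1|c0)...P(wn|c0)
It is perfectly normal for some word wi never to appear in a given class, in which case P(wi|c0) = 0 and therefore P(c0|w) = 0, which is clearly not what we want.
To reduce this effect, we initialize the count of every word to 1 and each denominator to 2.
The modified part is as follows: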
    p0Num = ones(numWords)  # array of numWords (32) ones; ones comes from numpy
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
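To see what this initialization changes, here is a minimal sketch in plain Python that estimates the word probabilities of one class with and without it. The function `word_probabilities` and the counts are hypothetical and only mirror the idea of the modification above.

```python
def word_probabilities(counts, smooth):
    """Estimate P(word | class) from per-word counts.

    With smooth=True every count starts at 1 and the denominator at 2,
    mirroring the initialization described above.
    """
    start, denom = (1.0, 2.0) if smooth else (0.0, 0.0)
    total = denom + sum(counts)
    return [(start + c) / total for c in counts]

counts = [3, 1, 0]  # hypothetical counts; the third word never occurs in this class
raw = word_probabilities(counts, smooth=False)
smoothed = word_probabilities(counts, smooth=True)
print(raw)       # the third entry is exactly 0.0
print(smoothed)  # every entry is strictly positive
```

With the raw estimate, any document containing the third word would get a product of 0 for this class; with the modified counts that can no longer happen.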
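There is a second problem. Floating-point arithmetic has limited precision, so multiplying many very small numbers together eventually underflows and is rounded to 0, for example:
0....*0.4*0.*0.*0...*0.00001
The usual remedy is to take logarithms. Very small probabilities then become negative numbers of ordinary magnitude, and the product becomes a sum, so the result is no longer rounded away. For example:
>>> log(0.)
So we modify the source code as follows: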
    p1Vect = log(p1Num/p1Denom)  # change to log(); log comes from numpy
    p0Vect = log(p0Num/p0Denom)  # change to log()
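The underflow is easy to reproduce. The following sketch uses 400 made-up probabilities of 0.01 each (an assumption purely for illustration) and compares the plain product with the sum of logarithms.

```python
import math

probs = [0.01] * 400  # hypothetical: 400 word probabilities of 0.01 each

product = 1.0
for p in probs:
    product *= p  # 0.01**400 is far below the smallest positive double

log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number, 400 * ln(0.01)
```

The product carries no information any more, while the log sum can still be compared between classes.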
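Note also that taking the logarithm does not change monotonicity: ln is strictly increasing, so whichever class has the larger product also has the larger logarithm.
With these problems solved, we can write the final classification algorithm. A few more formulas first. We know that
P(w|c0)P(c0) = P(w0|c0)P(w1|c0)...P(wn|c0)P(c0)
Taking logarithms of both sides gives
ln(P(w|c0)P(c0)) = ln(P(w0|c0)) + ln(P(w1|c0)) + ... + ln(P(wn|c0)) + ln(P(c0))
As said before, P(w) is the same for every class. It could be obtained from the law of total probability, but we do not need it.
Because ln preserves monotonicity, to compare P(c|w) across classes we only need ln(P(w0|c)) + ln(P(w1|c)) + ... + ln(P(wn|c)) + ln(P(c)) for each class c.
Each ln(P(wi|c)) has already been computed during training. For a new document w, we check which words wi it contains and build a vector the size of the vocabulary with 1 at those positions and 0 everywhere else. Multiplying this vector element by element with the vector of log-probabilities picks out the ln(P(wi|c)) terms of the words that are present; we add them up and finally add ln(P(c)).
The code is as follows: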
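def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # the element-wise product keeps only the log-probabilities of words present in the document
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1  # abusive
    return 0  # normal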
Finally, here is the test function:
from numpy import array

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
The final result is:
>>> import bayes
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as:
['stupid', 'garbage'] classified as:
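For readers who want to check the two classifications without numpy, here is a minimal standard-library sketch of the same procedure: set-of-words vectors, counts starting at 1, denominators starting at 2, and log-probabilities. It is a re-implementation under those assumptions, not the bayes.py module itself, and the names `train` and `classify` are my own.

```python
import math

POSTS = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
LABELS = [0, 1, 0, 1, 0, 1]  # 1 = abusive, 0 = normal

def train(posts, labels):
    """Return per-class log P(word | class) tables and the prior P(class 1)."""
    vocab = sorted({w for post in posts for w in post})
    counts = {0: dict.fromkeys(vocab, 1.0), 1: dict.fromkeys(vocab, 1.0)}  # counts start at 1
    totals = {0: 2.0, 1: 2.0}  # denominators start at 2
    for post, label in zip(posts, labels):
        present = set(post)  # set-of-words model: presence only
        for w in present:
            counts[label][w] += 1
        totals[label] += len(present)
    log_probs = {
        c: {w: math.log(n / totals[c]) for w, n in counts[c].items()}
        for c in (0, 1)
    }
    p_class1 = sum(labels) / float(len(labels))
    return log_probs, p_class1

def classify(words, log_probs, p_class1):
    """Return 1 if the log score of class 1 is larger, else 0."""
    known = set(words) & set(log_probs[0])  # ignore words outside the vocabulary
    s1 = sum(log_probs[1][w] for w in known) + math.log(p_class1)
    s0 = sum(log_probs[0][w] for w in known) + math.log(1.0 - p_class1)
    return 1 if s1 > s0 else 0

log_probs, p_class1 = train(POSTS, LABELS)
result_love = classify(['love', 'my', 'dalmation'], log_probs, p_class1)
result_stupid = classify(['stupid', 'garbage'], log_probs, p_class1)
print(result_love, result_stupid)
```

Under these assumptions the sketch assigns ['love', 'my', 'dalmation'] to class 0 (normal) and ['stupid', 'garbage'] to class 1 (abusive).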
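As mentioned earlier, the code above only considers whether a word appears, not how many times it appears. The following version takes the number of occurrences into account: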
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # now accumulates instead of setting to 1
    return returnVec
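The difference between the two models shows up as soon as a word repeats. Here is a small sketch using `collections.Counter` from the standard library; the function name `bag_of_words_vector`, the vocabulary and the tokens are assumptions for illustration only.

```python
from collections import Counter

def bag_of_words_vector(vocab, words):
    """Count how many times each vocabulary word occurs in `words`."""
    counts = Counter(words)
    return [counts[w] for w in vocab]  # Counter returns 0 for missing words

vocab = ["book", "python", "the"]
tokens = ["this", "book", "is", "the", "best", "book", "on", "python"]
bag = bag_of_words_vector(vocab, tokens)
presence = [1 if n > 0 else 0 for n in bag]  # the set-of-words view of the same document
print(bag)
print(presence)
```

The word "book" occurs twice, so the bag-of-words vector records 2 where the set-of-words vector records 1.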
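Example: filtering spam email with naive Bayes
The procedure is as follows.
The text here is English, where words are separated by spaces and are easy to split. In Chinese, apart from the punctuation between sentences, there is no such delimiter, which raises a new problem: Chinese word segmentation. That will be covered later; for now, following the book, we tokenize English text.
Python's split() method makes the splitting easy.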
>>> myStr = 'This book is the best book on Python.'
>>> myStr.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python.']
But the last word still carries its punctuation. We solve this with a regular expression; regular expressions are very useful in text classification.
>>> import re
>>> regEx = re.compile('\\W+')
>>> listOfTokens = regEx.split(myStr)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', '']
A trailing empty string remains; we remove it by checking the length of each token.
>>> [tok for tok in listOfTokens if len(tok) > 0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python']
Also, when building the vocabulary we do not want to distinguish upper and lower case, so we convert everything to lower case.
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python']
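Text parsing is a large undertaking: whole books have been written about regular expressions alone, and there is much more worth studying. Here we only do simple parsing and leave the complicated cases aside.
For example, parsing the following text:
>>> emailText = '/support/sites/bin/answer.py?hl=en&answer=174623'
>>> listOfTokens = regEx.split(emailText)
>>> listOfTokens
['http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623']
The result contains strings such as py and en that are not words, so when processing we drop every string shorter than 3 characters.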
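The tokenizing steps above (split on non-word characters, drop short strings, lower-case) can be combined in one small function. This is a sketch of one possible variant that uses `re.findall` instead of `re.split`, so no empty strings are produced in the first place; the name `text_parse` and the `min_len` parameter are my own assumptions.

```python
import re

def text_parse(text, min_len=3):
    """Lower-case `text` and return its word tokens of at least `min_len` characters."""
    tokens = re.findall(r"\w+", text)  # runs of letters, digits and underscores
    return [tok.lower() for tok in tokens if len(tok) >= min_len]

tokens = text_parse("This book is the best book on Python.")
print(tokens)
```

The two-letter words "is" and "on" are dropped by the length filter, just as py and en are in the example above.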
The training and test code follows.
import random
import re
from numpy import array

def textParse(bigString):
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # build the vocabulary
    trainingSet = list(range(50)); testSet = []
    # randomly move 10 of the 50 samples into the test set
    for i in range(10):
        # random.uniform(a, b) returns a random float between the two bounds a and b
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    print('len(trainingSet)', len(trainingSet))  # 40 here; the method is explained below
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error %s" % docList[docIndex])
    print("the error rate is: %s" % (float(errorCount) / len(testSet)))
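We randomly choose part of the data as the training set and use the rest as the test set, here 80% for training and 20% for testing. This approach is called hold-out cross-validation. We ran only one iteration; to estimate the classifier's error rate more accurately, we should repeat the procedure several times and average the error rates.
The final result is: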
>>> import bayes
>>> bayes.spamTest()
('len(trainingSet)', 40)
classification error ['oem', 'adobe', 'microsoft', 'softwares', 'fast', 'order', 'and', 'download', 'microsoft', 'office', 'professional', 'plus', '2007', '2010', '129', 'microsoft', 'windows', 'ultimate', '119', 'adobe', 'photoshop', 'cs5', 'extended', 'adobe', 'acrobat', 'pro', 'extended', 'windows', 'professional', 'thousand', 'more', 'titles']
the error rate is:
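Repeating the hold-out split and averaging the error rates, as suggested above, could be sketched like this. Note the assumptions: spamTest() as written prints its error rate rather than returning it, so `toy_run` below is a purely hypothetical stand-in for one run (a pretend classifier that is wrong on every odd index), and `average_error_rate` and `holdout_split` are names I made up.

```python
import random

def average_error_rate(run_once, n_runs, seed=0):
    """Call `run_once(rng)` n_runs times and return the mean error rate."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rates = [run_once(rng) for _ in range(n_runs)]
    return sum(rates) / len(rates)

def holdout_split(n_samples, n_test, rng):
    """Randomly split range(n_samples) into training and test indices."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]

def toy_run(rng):
    """Hypothetical stand-in for one run: errors are the odd test indices."""
    train_idx, test_idx = holdout_split(50, 10, rng)
    errors = sum(1 for i in test_idx if i % 2 == 1)
    return errors / float(len(test_idx))

mean_rate = average_error_rate(toy_run, n_runs=20)
print(mean_rate)
```

To use this with the real classifier, one would change spamTest() to return float(errorCount)/len(testSet) and pass a wrapper around it in place of `toy_run`.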
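Some emails here were misclassified. Classifying spam as normal mail is tolerable, but classifying normal mail as spam is a real problem. Model evaluation and model correction will be covered later.
Here is another example.
Using personal ads to find regional tendencies
The first example filtered abusive posts from a website and the second filtered spam; this one finds the words associated with a region.
A side note: this example needs feedparser. Install Setuptools first; on Windows you can download the install script directly,
or download the installer version.
However, I am using Python 2.7, which has a bug that has to be patched, otherwise the installation fails with an error.
The file to change is mimetypes.py under lib in the Python installation directory; I have pasted the changes here: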
import sys
import posixpath
import urllib
+from itertools import count
import _winreg
except ImportError:
@@ -239,19 +240,11 @@
def enum_types(mimedb):
while True:
for i in count():
ctype = _winreg.EnumKey(mimedb, i)
yield _winreg.EnumKey(mimedb, i)
except EnvironmentError:
ctype = ctype.encode(default_encoding) # omit in 3.x!
except UnicodeEncodeError:
yield ctype
Now we can try out feedparser. First, check how many entries the feed contains:
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> len(ny['entries'])
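Next we write the RSS-feed classifier and a function for removing high-frequency words.
When parsing text we count how often each word occurs, but some words occur very often without carrying any real meaning, and they distort the weights: in Chinese, particles such as 的 and 得; in English, simple pronouns, common verbs and so on. These high-frequency words should therefore be removed during processing.
The code below adds a function that finds the 30 most frequent words so that they can be removed.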
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:  # walk through the vocabulary
        freqDict[token] = fullText.count(token)  # occurrences of token, collected in a dict
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]
# Almost the same as spamTest(); the difference is that this reads RSS feeds and
# finally returns the vocabulary and the per-class probability of every word
def localWords(feed1, feed0):  # takes two RSS feeds as arguments
    import feedparser
    import random
    from numpy import array
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # build the vocabulary
    top30Words = calcMostFreq(vocabList, fullText)
    # remove the top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet = []
    # build the test set
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print("the error rate is: %s" % (float(errorCount) / len(testSet)))
    return vocabList, p0V, p1V
>>> import bayes
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
>>> vocabList,pSF,pNY = bayes.localWords(ny,sf)
the error rate is:
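To get an accurate estimate of the error rate, we run the experiment above several times and take the average.
The error rate is rather high, but what we care about here is the word probabilities. Instead of removing the top 30 words we could remove the top 100, or we could use a curated stop-word list, that is, a list of the auxiliary words that give sentences their structure; either should improve the error rate somewhat.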
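Both options, removing the N most frequent words and removing words from a stop-word list, can be sketched with `collections.Counter`. The stop-word set below is a deliberately tiny, hypothetical list, and `prune_vocabulary` is an assumed helper name, not a function from bayes.py.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "you", "for"}  # hypothetical, deliberately tiny stop-word list

def prune_vocabulary(vocab, full_text, top_n=0, stop_words=frozenset()):
    """Drop the `top_n` most frequent words and every stop word from `vocab`."""
    frequent = {word for word, _ in Counter(full_text).most_common(top_n)}
    return [w for w in vocab if w not in frequent and w not in stop_words]

full_text = ["the", "dog", "and", "the", "cat", "for", "you", "dog", "the"]
vocab = sorted(set(full_text))
pruned_top = prune_vocabulary(vocab, full_text, top_n=1)
pruned_stop = prune_vocabulary(vocab, full_text, stop_words=STOP_WORDS)
print(pruned_top)
print(pruned_stop)
```

Counting the whole text once with Counter also avoids calling fullText.count(token) separately for every word of the vocabulary.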
Finally, let's analyze the data and show the region-specific words, that is, the words with the highest probability within each class.
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        # keep the words whose log-probability is above the threshold -5.4
        if p0V[i] > -5.4: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -5.4: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print(item[0])
The result is as follows:
>>> import bayes
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> bayes.getTopWords(ny,sf)
the error rate is:
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
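The fixed threshold of -5.4 returns a different number of words depending on the data. As an alternative, here is a sketch that asks for the n words with the highest log-probability using `heapq.nlargest`. The function `top_words`, the vocabulary and the log-probabilities are all made up for illustration.

```python
import heapq

def top_words(vocab, log_probs, n):
    """Return the n vocabulary words with the highest log-probability."""
    pairs = zip(vocab, log_probs)
    best = heapq.nlargest(n, pairs, key=lambda pair: pair[1])
    return [word for word, _ in best]

vocab = ["bike", "park", "rent", "subway"]
log_probs = [-4.1, -6.3, -3.2, -5.0]  # hypothetical log P(word | city) values
print(top_words(vocab, log_probs, 2))
```

With the values returned by localWords, one would call this once with p0V and once with p1V.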
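That more or less completes our use of naive Bayes. Although the assumption that attribute values are mutually independent is not accurate, studies have shown that in some domains the algorithm can fully match methods such as decision trees and neural networks.
When the independence assumption does hold, this classifier compares very favourably with other classification algorithms in both accuracy and efficiency.
A natural next topic within Bayesian methods is Bayesian networks, which I will continue to study in later work.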