The Vector Space Model for Text

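We need to start thinking about how to turn a collection of texts into something quantifiable. The simplest approach is to look at word frequencies.

I will try to avoid the NLTK and Scikits-Learn packages as much as possible, and start by walking through some basic concepts in plain Python.

Basic term frequencies

First, let's review how to get the count of each word in a document: a term-frequency vector.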

# examples taken from here: http://stackoverflow.com/a/1750187

mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']

# mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()  # fresh counter for each document
    for word in doc.split():  # tokenize on whitespace
        tf[word] += 1
    # the order of the items depends on the Python version
    print(list(tf.items()))

[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
[('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]

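Here we introduce a new Python object called Counter. It is only available in Python 2.7 and later. Counters are very flexible; among other things, they let you tally counts inside a loop, as we just did.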
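As a small aside on that flexibility: Counter accepts any iterable of hashable items, so the counting loop can be collapsed into a single call. The sketch below assumes the same whitespace tokenization as above; the variable names `doc` and `tf` are only illustrative.

```python
from collections import Counter

doc = 'Julie loves me more than Linda loves me'

# Counter consumes an iterable directly, so no explicit loop is needed.
tf = Counter(doc.split())

# Looking up a missing word returns 0 instead of raising KeyError.
print(tf['loves'])       # a word that occurs in the document
print(tf['basketball'])  # a word that does not occur

# most_common(n) returns the n highest counts as (word, count) pairs.
print(tf.most_common(2))
```

Because a missing word simply counts as 0, a Counter can be queried for any word in the corpus, not only the ones the document contains.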

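Counting the words in each document is our first attempt at quantifying documents. But for anyone who has already met the notion of a "vector" in the vector space model, the results of this first attempt cannot be compared with one another, because they do not live in the same vocabulary space.

What we really want is for every document's representation to have the same length, where that length is determined by the total vocabulary of our corpus.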

def build_lexicon(corpus):
    """Collect every distinct word that occurs anywhere in the corpus."""
    lexicon = set()
    for doc in corpus:
        lexicon.update(doc.split())
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    """Count how often term occurs in the whitespace-tokenized document."""
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
for doc in mydoclist:
    print('The doc is "' + doc + '"')
    # one count per vocabulary word, so every vector has the same length
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print('The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string))
    doc_term_matrix.append(tf_vector)

    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...
    # try it! type(mydoclist.index(doc) + 1)

print('All combined, here is our master document term matrix: ')
print(doc_term_matrix)

Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
The doc is "He likes basketball more than baseball"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
All combined, here is our master document term matrix:
[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]

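OK, this looks reasonable. If you have any machine learning experience, what you just saw is the construction of a feature space. Every document now lives in the same feature space, which means we can represent the whole corpus in a space of the same dimensionality without losing too much information.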
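One practical detail about that feature space: a Python set has no guaranteed iteration order, so the column order of the matrix can change between runs or Python versions. A possible variation, sketched here under the assumption that alphabetical column order is acceptable, is to sort the vocabulary first. The helpers `build_sorted_lexicon` and `tf_vector` are hypothetical names introduced for this illustration, not part of the original code.

```python
def build_sorted_lexicon(corpus):
    """Return the corpus vocabulary as a sorted list (hypothetical helper)."""
    return sorted({word for doc in corpus for word in doc.split()})

def tf_vector(document, lexicon):
    """Count each lexicon term in the whitespace-tokenized document."""
    tokens = document.split()
    return [tokens.count(term) for term in lexicon]

mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']

lexicon = build_sorted_lexicon(mydoclist)
matrix = [tf_vector(doc, lexicon) for doc in mydoclist]

print(lexicon)
for row in matrix:
    print(row)
```

Every row has one entry per vocabulary word, so the rows can be compared position by position regardless of which words a given document actually contains.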

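Normalizing vectors so that their L2 norm is 1

Once your data is in a common feature space, you can start applying machine learning methods: classification, clustering, and so on. In practice, though, we still run into problems. Not all words carry the same amount of information.

If some words occur too frequently within a single document, they will distort our analysis. We want to rescale each term-frequency vector so that it becomes more representative. In other words, we need to normalize the vectors.

We really don't have time to go into the math behind this. For now, just accept that we need to make sure each vector has an L2 norm of 1. Here is some code that shows how this is done.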
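For readers who do want the one formula behind that statement, here is a brief sketch in standard notation. The symbols v, v_i, and n are introduced here for illustration and do not appear in the original text; the vector v is assumed to be non-zero.

```latex
\|v\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2},
\qquad
\hat{v} = \frac{v}{\|v\|_2},
\qquad
\|\hat{v}\|_2 = 1
```

As a worked example, the first term-frequency vector above, [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], has a sum of squares of 4 + 1 + 4 + 1 + 1 + 1 = 12, so its L2 norm is the square root of 12, roughly 3.46. After normalization a count of 2 becomes about 0.58 and a count of 1 becomes about 0.29.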

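import math
import numpy as np

def l2_normalizer(vec):
    # sum of squared components; its square root is the L2 norm
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

# normalize every row of the document term matrix
doc_term_matrix_l2 = [l2_normalizer(vec) for vec in doc_term_matrix]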
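To check that the normalization does what is claimed, one can recompute the norm of each normalized row. The sketch below is self-contained and makes two assumptions: it uses the built-in `sum` instead of `np.sum` so that it needs only the standard library, and it copies the document term matrix from the output shown earlier.

```python
import math

def l2_normalizer(vec):
    """Scale vec so that its L2 norm is 1 (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(el ** 2 for el in vec))
    return [el / norm for el in vec]

# Document term matrix copied from the output earlier in the text.
doc_term_matrix = [
    [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1],
    [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
]

doc_term_matrix_l2 = [l2_normalizer(vec) for vec in doc_term_matrix]

# Recompute the L2 norm of every normalized row; each should be 1
# up to floating-point rounding.
norms = [math.sqrt(sum(el ** 2 for el in vec)) for vec in doc_term_matrix_l2]
for i, norm in enumerate(norms):
    print('Row %d has L2 norm %.6f' % (i + 1, norm))
```

Note that an all-zero vector would cause a division by zero here; that case does not arise for the three example documents.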
