iNoteMate - Blog Post

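Text data preparation is different for every problem. It starts with simple steps such as loading the data, but quickly becomes harder as the cleaning work grows more specific to the task. It helps to have a systematic routine: knowing where to begin and in what order to work through the steps from raw data to data that is ready for modeling. In this tutorial, you will discover step by step how to prepare movie review text data for sentiment analysis. After completing this tutorial, you will know:

How to load text data and clean it to remove punctuation and other non-words.
How to develop a vocabulary, tailor it, and save it to file.
How to prepare movie reviews using cleaning and a predefined vocabulary, and save them to new files ready for modeling.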

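6.1 Tutorial Overview

This tutorial is divided into the following parts:

Movie Review Dataset
Load Text Data
Clean Text Data
Develop Vocabulary
Save Prepared Data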

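6.2 Movie Review Dataset

The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected for use in their own natural language processing research. The dataset was originally released in 2002, but an updated and cleaned-up version was released in 2004, referred to as v2.0. It comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset.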

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

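The data has been cleaned up somewhat, for example:

The dataset contains only English reviews.
All text has been converted to lowercase.
There is white space around punctuation such as periods, commas, and brackets.
Text has been split into one sentence per line.

The data has been used for a few related natural language processing tasks. For classification, the accuracy of classical models (such as support vector machines) on this data is in the range of the high 70s to low 80s in percent (e.g. 78% to 82%). With more sophisticated data preparation, accuracy may reach 86% under 10-fold cross-validation. This gives us a ballpark of mid-80s percent accuracy as a reference point if we want to use this dataset in experiments with modern methods.

... depending on the choice of downstream polarity classifier, we can achieve a highly statistically significant improvement (from 82.8% to 86.4%)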

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

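You can download the dataset from here:

Movie Review Polarity Dataset (review_polarity.tar.gz, 3MB). http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

After unzipping the file, you will have a directory called txt_sentoken with two subdirectories, neg and pos, containing the text of the negative and positive reviews respectively. Reviews are stored one per file, with a naming convention of cv000 to cv999 within each of neg and pos. Next, let's look at loading the text data.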
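The unpacking step and a quick check of the directory layout can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded to the current working directory under the name review_polarity.tar.gz; the helper names extract_archive() and count_reviews() are hypothetical and not part of the original tutorial.

```python
import os
import tarfile


def extract_archive(archive_path, dest="."):
    # unpack a gzip-compressed tar archive into the destination directory
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)


def count_reviews(root="txt_sentoken"):
    # count the .txt files in each class subdirectory
    counts = {}
    for label in ("neg", "pos"):
        folder = os.path.join(root, label)
        counts[label] = sum(1 for name in os.listdir(folder) if name.endswith(".txt"))
    return counts


# assumed local file name of the downloaded archive
archive = "review_polarity.tar.gz"
if os.path.exists(archive):
    extract_archive(archive)
    print(count_reviews())
```

If the layout matches the description above, each of the two counts should be 1,000.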

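6.3 Load Text Data

In this section, we will look at loading individual text files, then processing the directories of files. We will assume that the review data has been downloaded and is available in the current working directory in the folder txt_sentoken. We can open a file, read its text (which is ASCII), and then close the file. This is standard file handling. For example, we can load the first negative review file, cv000_29416.txt, as follows: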

# load one file
filename = 'txt_sentoken/neg/cv000_29416.txt'
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()

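Listing 6.1: Example of loading a single movie review

This loads the document as ASCII text and preserves any white space, such as new lines. We can turn this into a function called load_doc() that takes the filename of the document to load as an argument and returns the text.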

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

Listing 6.2: Function to load a document into memory
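A variant of load_doc() can use a with-statement so the file is closed even if reading fails. This is a minimal sketch rather than the tutorial's own version; the encoding parameter is an assumption added here, and its default relies on the reviews being plain ASCII as described above (ASCII text is also valid UTF-8).

```python
def load_doc(filename, encoding="utf-8"):
    # the with-statement closes the file automatically,
    # even when read() raises an exception
    with open(filename, "r", encoding=encoding) as file:
        return file.read()
```

Passing the encoding explicitly avoids depending on the platform's default encoding, which differs between systems.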

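We have two directories, each with 1,000 documents. We can process each directory in turn by first getting the list of files in the directory using the listdir() function, then loading each file in turn. For example, we can load each document in the negative directory using the load_doc() function to do the actual loading.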

from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# specify directory to load
directory = 'txt_sentoken/neg'
# walk through all files in the folder
for filename in listdir(directory):
    # skip files that do not have the right extension
    if not filename.endswith(".txt"):
        continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load document
    doc = load_doc(path)
    print('Loaded %s' % filename)

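Listing 6.3: Example of loading all movie reviews

Running the example prints the filename of each review after it is loaded.

Loaded cv618_9469.txt
Loaded cv793_15235.txt
Loaded cv493_14135.txt
Loaded cv047_18722.txt
Loaded cv420_28631.txt
Loaded cv436_20564.txt
Loaded cv975_11920.txt

Listing 6.4: Example output of loading all movie reviews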

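We can also turn the processing of the documents into a function and use it later as a template for developing functions that process all documents in a folder. For example, below we define a process_docs() function that does the same thing as the code above, but can be called whenever it is needed.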

from os import listdir
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load all docs in a directory
def process_docs(directory):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load document
        doc = load_doc(path)
        print('Loaded %s' % filename)
# specify directory to load
directory = 'txt_sentoken/neg'
process_docs(directory)

Listing 6.5: Example of loading all movie reviews with a function
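Because listdir() returns names in arbitrary order, the printed order can differ between runs and machines. The following sketch shows one possible variation that returns the loaded documents in sorted order instead of printing them; load_docs() is a hypothetical helper name, not a function from the tutorial.

```python
from os import listdir
from os.path import join


def load_doc(filename):
    # read the whole file as text
    with open(filename, "r") as file:
        return file.read()


def load_docs(directory):
    # return (filename, text) pairs in a reproducible, sorted order
    docs = []
    for filename in sorted(listdir(directory)):
        # skip anything that is not a review file
        if not filename.endswith(".txt"):
            continue
        docs.append((filename, load_doc(join(directory, filename))))
    return docs
```

Returning the documents rather than printing them makes the function easier to reuse in the later steps.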

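Now that we know how to load the movie review text data, let's look at cleaning it.

6.4 Clean Text Data

In this section, we will look at what data cleaning we might want to do to the movie review data. We will assume that we will be using a bag-of-words model or perhaps a word embedding that does not require too much preparation.

6.4.1 Split into Tokens

First, let's load one document and look at the raw tokens split by white space. We will use the load_doc() function developed in the previous section to load the document, then use the split() function to split the loaded document into tokens separated by white space.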

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
print(tokens)

Listing 6.6: Load a movie review and split by white space

Running the example gives a long list of raw tokens from the document.

'.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare', 'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10', ')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento', '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']

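Listing 6.7: Example output of splitting a review by white space

Just looking at the raw tokens gives us a lot of ideas of things to try, such as:

Remove punctuation from words (e.g. 'what's').
Remove tokens that are just punctuation (e.g. '-').
Remove tokens that contain numbers (e.g. '10/10').
Remove tokens that have one character (e.g. 'a').
Remove tokens that don't carry much meaning (e.g. 'and').

Some ideas:

We can filter out punctuation from tokens using regular expressions.
We can remove tokens that are just punctuation or contain numbers by using an isalpha() check on each token.
We can remove English stop words using the list provided by NLTK (the stop word corpus must be downloaded once, e.g. with nltk.download('stopwords')).
We can filter out short tokens by checking their length.

Below is an updated version of cleaning this review.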

from nltk.corpus import stopwords
import string
import re
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)
# split into tokens by white space
tokens = text.split()
# prepare regex for char filtering
re_punc = re.compile( '[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
tokens = [re_punc.sub( '' , w) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words( 'english' ))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

Listing 6.8: Load and clean one movie review

Running the example gives a much cleaner list of tokens.

'horror', 'teen', 'slasher', 'flick', 'packaged', 'look', 'way', 'someone', 'apparently', 'assuming', 'genre', 'still', 'hot', 'kids', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'sitting', 'shelves', 'ever', 'since', 'whatever', 'skip', 'wheres', 'joblo', 'coming', 'nightmare', 'elm', 'street', 'blair', 'witch', 'crow', 'crow', 'salvation', 'lost', 'highway', 'memento', 'others', 'stir', 'echoes']

Listing 6.9: Example output of cleaning one movie review
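To see what each filter contributes, the same steps can be applied one at a time to a short string. This is a small sketch on made-up input; the three-word stop list is an illustrative stand-in for NLTK's English list so the example runs without any downloads.

```python
import re
import string

text = "the crow : salvation ( 4/10 ) - where's joblo coming from ? a nightmare"
# tiny illustrative stop word list (assumption; the tutorial uses NLTK's list)
stop_words = {"the", "from", "a"}

# step 1: split on white space
tokens = text.split()
# step 2: strip punctuation characters from inside each token
re_punc = re.compile("[%s]" % re.escape(string.punctuation))
stripped = [re_punc.sub("", w) for w in tokens]
# step 3: drop tokens that are empty or contain non-letters
alpha = [w for w in stripped if w.isalpha()]
# step 4: drop stop words
no_stop = [w for w in alpha if w not in stop_words]
# step 5: drop one-character tokens
final = [w for w in no_stop if len(w) > 1]

for name, step in [("split", tokens), ("alpha", alpha), ("no_stop", no_stop), ("final", final)]:
    print(name, len(step), step)
```

Note that stripping punctuation turns '4/10' into '410', which is why the isalpha() check has to come after the regular expression step.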

We can put this into a function called clean_doc() and test it on another review, this time a positive review:

from nltk.corpus import stopwords
import string
import re
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile( '[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub( '' , w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words( 'english' ))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

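Listing 6.10: Function to clean movie reviews

Again, the cleaning procedure seems to produce a good set of tokens, at least comparable to the result for the negative review.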

'surroundings', 'one', 'creepy', 'place', 'even', 'acting', 'hell', 'solid', 'dreamy', 'depp', 'turning', 'typically', 'strong', 'performance', 'deftly', 'handling', 'british', 'accent', 'ians', 'holm', 'joe', 'goulds', 'secret', 'richardson', 'dalmatians', 'log', 'great', 'supporting', 'roles', 'big', 'surprise', 'graham', 'cringed', 'first', 'time', 'opened', 'mouth', 'imagining', 'attempt', 'irish', 'accent', 'actually', 'wasnt', 'half', 'bad', 'film', 'however', 'good', 'strong', 'violencegore', 'sexuality', 'language', 'drug', 'content']

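Listing 6.11: Example output of the function to clean movie reviews

6.5 Develop Vocabulary

When working with predictive models of text, such as a bag-of-words model, there is pressure to reduce the size of the vocabulary, because the larger the vocabulary, the sparser the representation of each word or document. Part of preparing text for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model. We can do this by loading all of the documents in the dataset and building a set of words. We may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to file for later use.

We can keep track of the vocabulary in a Counter, which is a dictionary that maps each word to its count and comes with some additional convenience functions. We need a new function that processes a document and adds it to the vocabulary. The function loads the document by calling load_doc(), cleans it with the previously defined clean_doc() function, then adds all of the cleaned tokens to the Counter and updates the counts. This last step is done by calling the update() function on the Counter object. Below is a function called add_doc_to_vocab() that takes a document filename and a Counter vocabulary as arguments.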

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

Listing 6.12: Function to add a movie review to the vocabulary
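The way update() accumulates counts across documents can be seen on a couple of toy token lists. This is a short illustrative sketch; the tokens are made up.

```python
from collections import Counter

vocab = Counter()
# tokens from a first (made-up) document
vocab.update(["film", "plot", "film"])
# tokens from a second document are added to the existing counts
vocab.update(["film", "acting"])

print(vocab["film"])         # 3
print(vocab.most_common(1))  # [('film', 3)]
# a word that was never seen has a count of zero rather than raising KeyError
print(vocab["joblo"])        # 0
```

One thing to watch for: update() must be given a list of tokens. Passing a single string would count its individual characters instead.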

Finally, we can take our template above for processing all documents in a directory, process_docs(), and update it to call add_doc_to_vocab().

def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

Listing 6.13: Updated process_docs() function

We can put all of this together and build a full vocabulary from all documents in the dataset.

import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

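Listing 6.14: Example of cleaning all reviews and building a vocabulary

Running the example creates a vocabulary from all documents in the dataset, including positive and negative reviews. We can see that there are 37,589 unique words across all reviews (the original book reports over 46,000, but my run on the dataset produced 37,589), and that the top three words are film, one, and movie.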

37589

[('film', 8849), ('one', 5514), ('movie', 5429), ('like', 3543), ('even', 2554), ('good', 2313), ('time', 2280), ('story', 2110), ('would', 2041), ('much', 2022), ('also', 1965), ('get', 1920), ('character', 1902), ('two', 1824), ('characters', 1813), ('first', 1766), ('see', 1726), ('way', 1668), ('well', 1654), ('make', 1590), ('really', 1556), ('films', 1513), ('little', 1487), ('life', 1467), ('plot', 1448), ('people', 1418), ('could', 1395), ('bad', 1372), ('scene', 1372), ('never', 1360), ('best', 1298), ('new', 1275), ('many', 1267), ('scenes', 1262), ('man', 1255), ('know', 1207), ('movies', 1180), ('great', 1138), ('another', 1111), ('love', 1087), ('action', 1073), ('go', 1073), ('us', 1065), ('director', 1054), ('something', 1047), ('end', 1044), ('still', 1037), ('seems', 1032), ('back', 1031), ('made', 1025)]

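Listing 6.15: Example output of building a vocabulary

The least common words, those that appear only once across all reviews, contribute almost nothing to prediction, and some of the most common words are probably not useful for sentiment analysis either. How much to trim the vocabulary is a trade-off that should be tested with a specific predictive model. Generally, words that appear only once or a few times across 2,000 reviews are unlikely to be predictive and can be removed from the vocabulary, greatly cutting down on the tokens we need to model. We can do this by stepping through the words and their counts and keeping only those with a count at or above a chosen threshold. Here we use a minimum of 5 occurrences.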

# keep tokens that occur at least min_occurrence times
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

Listing 6.16: Example of filtering the vocabulary by occurrence count
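Since the right threshold is a judgment call, it can help to see how the vocabulary size changes across several candidate values before committing to one. The sketch below does this on a small made-up Counter; filter_vocab() is a hypothetical helper, and with the real data you would pass the vocab built from the reviews.

```python
from collections import Counter

# toy counts standing in for the real vocabulary
vocab = Counter({"film": 9, "plot": 5, "joblo": 1, "werewolf": 4, "acting": 7})


def filter_vocab(vocab, min_occurrence):
    # keep words whose count is at or above the threshold
    return [word for word, count in vocab.items() if count >= min_occurrence]


# compare the vocabulary size for a few candidate thresholds
for threshold in (1, 2, 5, 10):
    print(threshold, len(filter_vocab(vocab, threshold)))
```

On the toy counts this prints sizes of 5, 4, 3, and 0, showing how quickly a stricter threshold shrinks the vocabulary.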

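This reduces the vocabulary from 37,589 to 13,864 words, a huge drop. Perhaps a minimum of 5 occurrences is too aggressive; you can experiment with different values. We can then save the chosen vocabulary of words to a new file. I like to save the vocabulary as ASCII with one word per line. Below is a function called save_list() that saves a list of items, in this case tokens, to a file with one item per line.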

def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

Listing 6.17: Function to save the vocabulary to file
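A quick way to gain confidence in the file format is a round trip: save a list, read it back, and compare. The following is a sketch under assumptions; load_vocab() is a hypothetical helper, and a temporary directory is used so that no real vocab.txt is overwritten.

```python
import os
import tempfile


def save_list(lines, filename):
    # write one item per line
    with open(filename, "w") as file:
        file.write("\n".join(lines))


def load_vocab(filename):
    # read the file back and split on white space into a set of words
    with open(filename, "r") as file:
        return set(file.read().split())


tokens = ["american", "werewolf", "paris"]
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
save_list(tokens, path)
print(load_vocab(path) == set(tokens))  # True
```

The round trip only works this simply because every saved token is a single word with no internal white space.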

The complete example of defining and saving the vocabulary is listed below.

import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile( '[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub( '' , w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words( 'english' ))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)
# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)
# save list to file
def save_list(lines, filename):
    data = '\n' .join(lines)
    file = open(filename, 'w' )
    file.write(data)
    file.close()
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs( 'txt_sentoken/neg' , vocab)
process_docs( 'txt_sentoken/pos' , vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens that occur at least min_occurrence times
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt' )

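Listing 6.18: Example of building and saving the final vocabulary

Running the example creates the vocabulary and then saves the chosen words to file. It is a good idea to take a look at, and even study, your chosen vocabulary in order to get ideas for better preparing this data, or text data in the future.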

american
werewolf
paris
failed
attempt
recapture
humor
horror

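Listing 6.19: Sample of the saved vocabulary file

Next, we can look at using the vocabulary to create a prepared version of the movie review dataset.

6.6 Save Prepared Data

We can use the data cleaning and the chosen vocabulary to prepare each movie review and save the prepared versions of the reviews ready for modeling. This decouples data preparation from modeling, allowing you to focus on modeling and circle back to data preparation if you have new ideas. We can start by loading the vocabulary from vocab.txt.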

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

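Listing 6.20: Load the saved vocabulary

Next, we can clean each movie review, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews in a new file. One approach is to save all the positive reviews in one file and all the negative reviews in another, with each filtered review on its own line and its tokens separated by white space. First, we define a function to process a document: clean it, filter it, and return it as a single line that can be saved to file. The doc_to_line() function below does this, taking a filename and the vocabulary (as a set) as arguments. It calls the previously defined load_doc() function to load the document and clean_doc() to tokenize it.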

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

Listing 6.21: Function to filter a review by the vocabulary
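The filtering step at the heart of doc_to_line() can be tried in isolation on an in-memory token list. This is a small sketch with made-up tokens; tokens_to_line() is a hypothetical helper that skips the loading and cleaning steps.

```python
def tokens_to_line(tokens, vocab):
    # keep only tokens that are in the vocabulary, preserving their order
    kept = [w for w in tokens if w in vocab]
    # join the surviving tokens into a single space-separated line
    return " ".join(kept)


# a set gives fast membership tests, which matters with thousands of words
vocab = {"film", "great", "plot"}
tokens = ["great", "film", "joblo", "plot", "wheres"]
print(tokens_to_line(tokens, vocab))  # great film plot
```

A review with no tokens in the vocabulary produces an empty string, which later becomes an empty line in the saved file.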

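Next, we define a new version of process_docs() that steps through all reviews in a folder, converts each document to a line by calling doc_to_line(), and returns the list of lines.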

# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines

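Listing 6.22: Updated process_docs() function to filter all documents by the vocabulary

We can then call process_docs() for both the positive and negative review directories, then call save_list() from the previous section to save each list of processed reviews to a file.

The complete code listing is provided below: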

import string
import re
from os import listdir
from nltk.corpus import stopwords
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r' )
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile( '[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub( '' , w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words( 'english' ))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# save list to file
def save_list(lines, filename):
    data = '\n' .join(lines)
    file = open(filename, 'w' )
    file.write(data)
    file.close()
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' ' .join(tokens)
# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs( 'txt_sentoken/neg' , vocab)
save_list(negative_lines, 'negative.txt' )
# prepare positive reviews
positive_lines = process_docs( 'txt_sentoken/pos' , vocab)
save_list(positive_lines, 'positive.txt' )

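Listing 6.23: Example of cleaning and filtering all reviews by the vocabulary and saving the results to file

Running the example saves two new files, negative.txt and positive.txt, that contain the prepared negative and positive reviews respectively. The data is ready for use in a bag-of-words or even a word embedding model.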
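When the modeling step begins, the two prepared files need to be read back together with their class labels. The following is a minimal sketch of one way to do that; load_lines() and load_prepared() are hypothetical helpers, and the choice of 0 for negative and 1 for positive is an assumption, not something the tutorial prescribes.

```python
def load_lines(filename):
    # one prepared review per line; splitting on '\n' keeps empty reviews
    with open(filename, "r") as file:
        return file.read().split("\n")


def load_prepared(negative_file="negative.txt", positive_file="positive.txt"):
    # load both classes and build a parallel list of labels
    negative = load_lines(negative_file)
    positive = load_lines(positive_file)
    docs = negative + positive
    # assumed label convention: 0 = negative, 1 = positive
    labels = [0] * len(negative) + [1] * len(positive)
    return docs, labels
```

Splitting on the newline character rather than on all white space keeps exactly one entry per review, matching the format written by save_list().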

Chapter 6: How to Prepare Movie Review Data for Sentiment Analysis

Author: oliverwang
Published: 2022-01-24 11:14
