23 项目:开发神经图像字幕生成模型

字幕生成是一个具有挑战性的人工智能问题:必须为给定的图片生成文本描述。它既需要计算机视觉方面的方法来理解图像的内容,也需要自然语言处理领域的语言模型,将对图像的理解转化为顺序正确的单词。最近,深度学习方法在这个问题上取得了不错的成果。

深度学习方法在字幕生成领域的最新结果已经远远好于经典方法。这些方法最令人印象深刻的地方在于:给定一张图片,可以定义单个端到端模型来预测字幕,而不需要复杂的数据准备流程或由多个专门设计的模型组成的管道。在本教程中,您将了解如何从头开发一个图片字幕深度学习模型。完成本教程后,您将了解:

  • 如何准备图片和文本数据来训练深度学习模型。
  • 如何设计和训练深度学习字幕生成模型。
  • 如何评估训练好的字幕生成模型,并用它为全新的图片生成字幕。

23.1 教程概述

本教程分为以下几部分:

  1. 图片和字幕数据集
  2. 准备图片数据
  3. 准备文本数据
  4. 开发深度学习模型
  5. 评估模型
  6. 生成新字幕

23.2 图片和字幕数据集

在本教程中,我们将使用Flickr8k数据集。此数据集之前已在第25章中介绍过。数据集可免费获得:您必须填写一份申请表,数据集的下载链接将通过电子邮件发送给您。我很乐意直接为您提供链接,但电子邮件中明确要求:请不要重新分发数据集。您可以使用以下链接来请求数据集:

在短时间内,您将收到一封电子邮件,其中包含指向两个文件的链接:

  • Flickr8k_Dataset.zip(1GB):所有图片的存档。
  • Flickr8k_text.zip(2.2MB):图片所有文字描述的存档。

下载数据集并将其解压缩到当前工作目录中。您将得到两个目录:

  • Flicker8k_Dataset:包含8,092张JPEG格式的图片(是的,目录名称拼写为'Flicker'而不是'Flickr')。
  • Flickr8k_text:包含许多不同来源的图片描述文件。

数据集具有预定义的训练数据集(6,000张图片)、开发数据集(1,000张图片)和测试数据集(1,000张图片)。可用于评估模型性能的一个标准是BLEU分数。作为参考,下面是一些性能较好的模型在测试数据集上评估得到的大致BLEU分数区间(取自2017年论文Where to put the Image in an Image Caption Generator):

  • BLEU-1:0.401 ~ 0.578
  • BLEU-2:0.176 ~ 0.390
  • BLEU-3:0.099 ~ 0.260
  • BLEU-4:0.059 ~ 0.170

我们后期在评估模型时也会描述BLEU指标。接下来,我们来看看如何加载图像。

23.3 准备图片数据

我们将使用预先训练的模型来解释图片的内容。有很多模型可供选择,这里我们选择使用2014年赢得ImageNet大赛的Oxford Visual Geometry Group(VGG)模型,Keras直接提供这种预训练模型。请注意,第一次使用此模型时,Keras将从Internet下载模型权重,大小约500兆字节,这可能需要几分钟,具体取决于您的互联网连接。另外,第23章介绍了VGG预训练模型的使用。

我们可以将此模型用作更广泛的图片字幕模型的一部分。问题是,它是一个大型模型,每次我们想要测试一个新的(下游)语言模型配置时,都让每张图片从头通过该模型,显然是重复工作。为了方便起见,我们可以用预训练模型预先计算好图片特征并保存到文件,稍后再加载这些特征,作为数据集中给定图片的解释提供给我们的模型。这与让图片通过完整的VGG模型没有什么不同,只不过我们提前做了一次而已。

这是一种优化,可以让模型训练得更快并消耗更少的内存。我们可以用VGG16类在Keras中加载VGG模型。我们将从加载的模型中删除最后一层,因为该层用于预测图片的分类;我们对图像分类不感兴趣,而是对分类之前图片的内部表示感兴趣,这些就是模型从图片中提取的特征。

Keras还提供了将加载的图片整形为模型所需尺寸的工具(例如3通道、224×224像素的图像)。下面定义的extract_features()函数,给定目录名称,加载每张图片,为VGG做准备,并从VGG模型中收集预测的特征。图像特征是一个4,096元素的一维向量。

该函数返回一个从图像标识符到图像特征的字典。

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a NumPy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

代码清单23.1:提取图片特征的函数

调用此函数为测试模型准备好全部图片数据,并将生成的字典保存到名为features.pkl的文件中。下面列出了完整的示例。

from os import listdir
from os import path
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    model.summary()
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = path.join(directory, name)
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

代码清单23.2:提取图片特征的完整示例

运行此数据准备步骤可能需要一段时间,具体取决于您的硬件:使用CPU处理可能需要一个小时,GPU则快得多,例如在1080 Ti上大约不到1分钟。运行结束时,提取的特征将存储在features.pkl中供以后使用,该文件大小约为一百多MB。
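作为一个可选的完整性检查(假设features.pkl已由上面的脚本生成),可以把该文件重新加载回来,并确认每个特征向量的形状为(1, 4096):

from pickle import load

# load the extracted features back from file (sanity check)
all_features = load(open('features.pkl', 'rb'))
print('Loaded Features: %d' % len(all_features))
# inspect one feature vector, expected shape: (1, 4096)
feature = next(iter(all_features.values()))
print(feature.shape)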

23.4 准备文本数据

数据集包含每张图片的多个描述,描述文本需要一些最小的清洗。注意,第25章描述了如何准备这些文本数据。首先,我们将加载包含所有描述的文件。

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

代码清单23.3:将图片描述加载到内存中的示例

每张图片都有唯一的标识符,此标识符同时用于图片文件名和描述文本文件中。接下来,我们将逐条处理图片描述列表。下面定义的load_descriptions()函数,给定加载的文档文本,返回图片标识符到描述的映射字典,每个图片标识符映射到一个或多个文本描述的列表。

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d' % len(descriptions))

代码清单23.4:从图片标识符中拆分描述的示例

接下来,我们需要清理描述文本。描述已经被标记化,易于使用。我们将通过以下方式清理文本,以减小词汇表的大小:

  • 将所有单词转换为小写。
  • 删除所有标点符号。
  • 删除所有长度不超过一个字符的单词(例如a)。
  • 删除包含数字的所有单词。

下面定义的clean_descriptions()函数,给定图像标识符到描述的字典,遍历每个描述并清理文本。

import re
import string

def clean_descriptions(descriptions):
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [re_punc.sub('', w) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)
# clean descriptions
clean_descriptions(descriptions)

代码清单23.5:清理描述文本的示例

清理后,我们可以统计词汇表的大小。理想情况下,我们想要一个既富有表现力又尽可能小的词汇表:较小的词汇量意味着较小的模型,训练也更快。作为参考,我们可以将清理后的描述转换为一个集合并打印其大小,以了解数据集词汇表的规模。

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a list of all description strings
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print( 'Vocabulary Size: %d' % len(vocabulary))

代码清单23.6:定义描述文本词汇表的示例

最后,我们可以将图像标识符到描述的字典保存到名为descriptions.txt的文件中,每行一个图像标识符和一条描述。下面定义了save_descriptions()函数:给定包含标识符到描述映射的字典和一个文件名,将映射保存到文件。

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
# save descriptions
save_descriptions(descriptions, 'descriptions.txt')

代码清单23.7:将清洗后的描述保存到文件的示例

将所有内容放在一起,下面提供了完整的清单。

import string
import re

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping
def clean_descriptions(descriptions):
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    for _, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [re_punc.sub('', w) for w in desc]
            # remove hanging 's' and 'a'
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)
# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
    # build a list of all description strings
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(d.split()) for d in descriptions[key]]
    return all_desc
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')

代码清单23.8:完整的文本数据准备示例

运行该示例,首先打印加载的图片描述的数量(8,092)和清洗后词汇表的大小(8,763个单词)。

Loaded: 8092

Vocabulary Size: 8763

代码清单23.9:准备文本数据的示例输出

最后,将清洗后的描述写入descriptions.txt。查看这个文件,可以看到这些描述已经为建模做好了准备。注意,您文件中描述的顺序可能有所不同。

2252123185_487f21e336 bunch on people are seated in stadium

2252123185_487f21e336 crowded stadium is full of people watching an event

2252123185_487f21e336 crowd of people fill up packed stadium

2252123185_487f21e336 crowd sitting in an indoor stadium

2252123185_487f21e336 stadium full of people watch game

代码清单23.10:干净的图片描述中的文本样本

23.5 开发深度学习模型

在本节中,我们定义深度学习模型,并将其拟合到训练数据集上。本节分为以下几部分:

  1. 加载数据。
  2. 定义模型。
  3. 拟合模型。
  4. 完整的例子。

23.5.1 加载数据

首先,我们必须加载准备好的图片和文本数据,以便用它来拟合模型。我们将用训练数据集中所有图片和字幕的数据来拟合模型;在训练过程中,我们会监视模型在开发数据集上的表现,并根据该性能决定何时将模型保存到文件。

训练和开发数据集已分别在Flickr_8k.trainImages.txtFlickr_8k.devImages.txt文件中预定义,这两个文件都包含图片文件名列表,从这些文件名中,我们可以提取图片标识符并使用这些标识符来过滤每组的图片和描述。下面的函数load_set()将在给定训练集或开发集文件名的情况下加载一组预定义的标识符。

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

代码清单23.11:加载预定义图片标识符集合的函数

现在,我们可以使用预定义的训练或开发标识符集来加载图片和描述了。下面的load_clean_descriptions()函数,从descriptions.txt中为给定的标识符集加载已清洗的文本描述,并返回标识符到文本描述列表的映射字典。

我们将开发的模型会为给定图片生成字幕,且字幕一次生成一个单词:先前生成的单词序列会作为输入再提供给模型。因此,我们需要一个“第一个单词”来启动生成过程,以及一个“最后一个单词”来表示字幕的结束。为此,我们使用字符串startseq和endseq,这对标记会在加载时添加到描述中。在对文本进行编码之前执行此操作非常重要,只有这样这两个标记才能被正确编码。

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

代码清单23.12:加载干净的图片描述的函数

接下来,我们可以加载给定数据集的图片特征。下面定义了一个名为load_photo_features()的函数,它加载全部图片的特征,然后返回指定图片标识符集合对应的子集。这不是很高效;尽管如此,它可以让我们先把流程跑起来,优化可以等功能完成之后再说。

# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features

代码清单23.13:加载预先计算的图片特征的函数

我们可以暂停一下,测试到目前为止开发的所有代码。完整的示例如下所示。

from pickle import load
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

代码清单23.14:加载准备好的数据的完整示例

运行此示例,首先从训练数据集中加载6,000个图片标识符,然后用这些标识符过滤并加载已清洗的描述文本和预先计算的图片特征。输出如下。

Dataset: 6000

Descriptions: train=6000

Photos: train=6000

代码清单23.15:加载准备好的数据的示例输出

描述文本需要先编码为数字,然后才能作为输入提供给模型,或与模型的预测进行比较。编码数据的第一步是创建从单词到唯一整数值的一致映射。Keras提供了Tokenizer类,可以从加载的描述数据中学习此映射。下面定义了to_lines()函数,用于将描述字典转换为字符串列表;以及create_tokenizer()函数,它创建一个Tokenizer实例,并在描述文本上调用fit_on_texts()进行拟合,生成分词器。

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc
# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

代码清单23.16:准备Tokenizer的示例

我们现在可以对文本进行编码。每个描述将被分成单词:模型根据图片和第一个单词生成下一个单词,然后将描述的前两个单词与图片一起作为输入提供给模型,以生成再下一个单词,模型就是以这种方式训练的。例如,输入序列little girl running in field将被分成6个输入-输出对来训练模型:

X1, X2 (text sequence), y (word)

photo startseq, little

photo startseq, little, girl

photo startseq, little, girl, running

photo startseq, little, girl, running, in

photo startseq, little, girl, running, in, field

photo startseq, little, girl, running, in, field, endseq

代码清单23.17:如何将图片描述转换为输入和输出序列的示例

之后,当模型用于生成描述时,生成的单词会被连接到输入序列的末尾,并递归地作为输入提供给模型,以生成图片的字幕。下面的create_sequences()函数,给定tokenizer、最大序列长度以及所有描述和图片特征的字典,将数据转换为用于训练模型的输入-输出数据对。模型有两个输入数组:一个是图片特征,另一个是编码文本;模型有一个输出,即文本序列中编码的下一个单词。

输入文本被编码为整数,并送入词嵌入层;图片特征则直接输入模型的另一部分。模型输出的预测是词汇表中所有单词上的概率分布,因此,输出数据是每个单词的one-hot编码:除实际单词的位置为1以外,其余所有位置均为0。
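作为说明,下面用一个小片段演示to_categorical()产生的one-hot编码(这只是一个帮助理解的玩具示例,假设词汇量为5):

from keras.utils import to_categorical

# one-hot encode the word with integer index 3, given a vocabulary of 5 words
encoded = to_categorical([3], num_classes=5)[0]
print(encoded)  # [0. 0. 0. 1. 0.]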

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)

代码清单23.18:创建输入和输出序列的函数

我们需要计算最长描述中的最大字数,定义辅助max_length()函数来完成这个过程,如下:

# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

代码清单23.19:计算最大序列长度的函数

我们现在有足够的数据来加载训练和开发数据集,并将加载的数据转换为输入-输出对,以拟合深度学习模型。

23.5.2 定义模型

我们将以Marc Tanti等人2017年提出的合并模型为基础来定义深度学习模型。注意,图像字幕的合并模型在第22章中介绍。我们将分三部分描述模型:

  • 图片特征提取器。这是在ImageNet数据集上预训练的16层VGG模型。我们已经用该模型(去掉输出层)预处理了图片,并将提取的图片特征作为本模型的输入。
  • 序列处理器。这是一个单词嵌入层,用于处理文本输入,后面接一个长短期记忆(LSTM)递归神经网络层。
  • 解码器(姑且这么称呼)。特征提取器和序列处理器都输出固定长度的向量,这些向量被合并在一起,由Dense层处理以进行最终预测。

图片特征提取器模型要求输入的图片特征是一个4,096元素的向量,经Dense层处理后产生图片的256元素表示。序列处理器模型期望预定义长度(34个单词)的输入序列,先输入到使用掩码忽略填充值的嵌入层,接下来是具有256个记忆单元的LSTM层。

两个输入模型都产生256元素的向量。此外,由于这种模型配置学习得非常快,两个输入模型都使用了50%的Dropout正则化,目的是减少训练数据集上的过拟合。解码器模型使用加法运算合并来自两个输入模型的向量,然后输入到一个具有256个神经元的Dense层,再馈送到最终的输出Dense层:该层对序列中的下一个单词在整个词汇表上做softmax预测,输出一个词汇表长度的概率分布。下面的define_model()函数定义并返回准备好的模型。

# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

代码清单23.20:定义标题生成模型的函数

创建模型图,有助于更好地理解网络结构和两个输入流。

23.1:定义的标题生成模型的图。

23.5.3 拟合模型

现在我们知道了如何定义模型,可以将它在训练数据集上拟合。该模型学习速度快,很快就会在训练数据集上过拟合。因此,我们在开发数据集上监控训练中模型的性能;当模型在开发数据集上的性能在某个迭代结束时得到改善,我们就将整个模型保存到文件。

运行结束后,可以使用在开发数据集上性能最好的已保存模型作为最终模型。我们可以通过Keras的ModelCheckpoint实现这一点:让它监视验证数据集上的最小损失,并将模型保存到文件名中同时包含训练损失和验证损失的文件中。

# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

代码清单23.21:检查点配置示例

我们可以通过fit()函数的callbacks参数指定检查点,还要通过fit()函数的validation_data参数指定开发数据集。模型训练20个迭代,但考虑到训练数据的数量,每个迭代在CPU上可能需要30分钟。

# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

代码清单23.22:拟合字幕生成模型的示例

23.5.4 完整的例子

下面列出了在训练数据上拟合模型的完整示例。请注意,运行此示例可能需要一台具有8GB或更多RAM的计算机。如果需要,请参阅附录了解如何使用AWS。

from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
# load photo features
def load_photo_features(filename, dataset):
    # load all features
    all_features = load(open(filename, 'rb'))
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc
# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
# define the captioning model
def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)
# load test set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)
# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

代码清单23.23:训练字幕生成模型的完整示例

运行该示例,首先打印已加载的训练和开发数据集的摘要。

Dataset: 6000

Descriptions: train=6000

Photos: train=6000

Vocabulary Size: 7579

Description Length: 34

注意:在GPU或内存低于8GB的机器上运行可能会出现MemoryError,建议在使用CPU且内存较大的计算机上训练(或使用下一节介绍的渐进式加载)。

代码清单23.24:拟合字幕生成模型的示例输出

在对模型进行汇总之后,我们可以了解训练和验证(开发)输入-输出对的总数。然后拟合模型,并在训练过程中将最佳模型保存到.h5文件。请注意,即使在现代CPU上,每个迭代也可能需要20分钟,您可以考虑在GPU上运行该示例,例如在AWS上;有关如何进行此设置的详细信息,请参阅附录。当我运行该示例时,最佳模型在第2个epoch结束时保存,训练数据集损失为3.245,开发数据集损失为3.612。

23.5.5 使用渐进式加载训练

注意:如果您在上一节中没有遇到问题,请跳过这一节。这一节是为那些没有足够内存来训练前一节所述模型的人准备的(例如,无论出于什么原因无法使用AWS EC2);不过,渐进式加载如今也是训练模型的一个常用做法。

训练字幕模型确实假设您有很多RAM。上一节中的代码并不节省内存,它假设您运行在拥有32GB或64GB RAM的大型EC2实例上;如果您在只有8GB RAM的工作站上运行代码,则无法训练模型。解决方法是使用渐进式加载。这一点在上一章的最后一节“逐步加载”(How to Prepare a Photo Caption Dataset for Training a Deep Learning Model)中进行了详细讨论,我建议在继续之前先阅读那一节。

如果您想使用渐进式加载来训练这个模型,本节将向您展示如何进行。第一步是定义一个可以用作数据生成器的函数。我们采用最简单的约定:数据生成器每批只产生一张照片的数据,即由这张照片及其全部描述生成的所有序列。

下面的data_generator()函数就是这个数据生成器,其参数为文本描述、照片特征、tokenizer和最大描述长度。在这里,我假设您可以将这些训练数据放入内存,8GB的内存应该够用。

关于生成器的工作原理,请阅读第25章中数据生成器的内容。

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length):
   # loop for ever over images
   while 1:
      for key, desc_list in descriptions.items():
         # retrieve the photo feature
         photo = photos[key][0]
         in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo)
         yield [[in_img, in_seq], out_word]

代码清单23.5.5-1

可以看到,我们调用create_sequences()函数为单张照片(而不是整个数据集)创建一批数据。这意味着我们必须更新create_sequences()函数,删除“遍历所有图片”的外层for循环。更新后的函数如下:

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc_list, photo):
   X1, X2, y = list(), list(), list()
   # walk through each description for the image
   for desc in desc_list:
      # encode the sequence
      seq = tokenizer.texts_to_sequences([desc])[0]
      # split one sequence into multiple X,y pairs
      for i in range(1, len(seq)):
         # split into input and output pair
         in_seq, out_seq = seq[:i], seq[i]
         # pad input sequence
         in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
         # encode output sequence
         out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
         # store
         X1.append(photo)
         X2.append(in_seq)
         y.append(out_seq)
   return array(X1), array(X2), array(y)

代码清单23.5.5-2

我们现在几乎准备好了所有需要的东西。

注意,这是一个非常基本的数据生成器:训练数据和测试数据不会在拟合模型之前就在内存中全部展开,而是只在需要时才为每张照片创建样本数据,所以生成器模式会节约大量的内存。还有很多好办法可以进一步改善这个数据生成器,包括:

  1. 每个迭代随机打乱照片的顺序(示例草图见本列表之后)。
  2. 使用照片id列表,并按需加载文本和照片数据,以进一步减少内存占用。
  3. 每批产生超过一张照片的样本数据。
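例如,下面是实现上述第1点(每个迭代随机打乱照片顺序)的一个简单草图;这只是一种可能的写法,沿用上文的单照片版create_sequences():

from random import shuffle

# data generator that shuffles the photo order at the start of each pass
def data_generator(descriptions, photos, tokenizer, max_length):
   # loop for ever over images
   while 1:
      # shuffle the photo identifiers each epoch
      keys = list(descriptions.keys())
      shuffle(keys)
      for key in keys:
         desc_list = descriptions[key]
         # retrieve the photo feature
         photo = photos[key][0]
         in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo)
         yield [[in_img, in_seq], out_word]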

您可以直接调用数据生成器,对它做一次完整性检查,如下所示:

# test the data generator
generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

代码清单23.5.5-3

运行此完整性检查将显示一个批次的数据形状;在本例中,第一张照片产生了47个训练样本。

(47, 4096)

(47, 34)

(47, 7579)

代码清单23.5.5-4

最后,我们可以使用模型的fit_generator()函数,借助这个数据生成器来训练模型。

在这个简单的例子中,我们将不再加载开发数据集和使用模型检查点,而是在每个训练迭代之后简单地保存一次模型。然后,您可以在训练结束后加载并评估每个保存的模型,找出损失最小的模型,在下一节中使用(训练后挑选最佳模型的示例草图见下文)。

使用数据生成器训练模型的代码如下:

# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
   # create the data generator
   generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
   # fit for one epoch
   model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
   # save model
   model.save('model_' + str(i) + '.h5')

代码清单23.5.5-5
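训练结束后,下面是一个挑选最佳模型的简单草图。这是一个假设性示例:假设已按前文方式加载了开发集的test_descriptions和test_features,并沿用上面的data_generator();每个步长评估一张照片的全部序列:

from keras.models import load_model

# evaluate each saved model on the dev set and keep the one with the lowest loss
best_loss, best_name = None, None
for i in range(20):
   name = 'model_' + str(i) + '.h5'
   model = load_model(name)
   # one pass over the dev set, one photo per step
   generator = data_generator(test_descriptions, test_features, tokenizer, max_length)
   loss = model.evaluate_generator(generator, steps=len(test_descriptions))
   print('%s: loss=%.3f' % (name, loss))
   if best_loss is None or loss < best_loss:
      best_loss, best_name = loss, name
print('Best model: %s' % best_name)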

就是这样。现在可以使用渐进式加载来训练模型,并节省大量RAM,不过这也可能会慢一些。下面列出了使用渐进式加载(数据生成器)训练字幕生成模型的完整更新示例。

from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
   # open the file as read only
   file = open(filename, 'r')
   # read all text
   text = file.read()
   # close the file
   file.close()
   return text

# load a pre-defined list of photo identifiers
def load_set(filename):
   doc = load_doc(filename)
   dataset = list()
   # process line by line
   for line in doc.split('\n'):
      # skip empty lines
      if len(line) < 1:
         continue
      # get the image identifier
      identifier = line.split('.')[0]
      dataset.append(identifier)
   return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
   # load document
   doc = load_doc(filename)
   descriptions = dict()
   for line in doc.split('\n'):
      # split line by white space
      tokens = line.split()
      # split id from description
      image_id, image_desc = tokens[0], tokens[1:]
      # skip images not in the set
      if image_id in dataset:
         # create list
         if image_id not in descriptions:
            descriptions[image_id] = list()
         # wrap description in tokens
         desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
         # store
         descriptions[image_id].append(desc)
   return descriptions

# load photo features
def load_photo_features(filename, dataset):
   # load all features
   all_features = load(open(filename, 'rb'))
   # filter features
   features = {k: all_features[k] for k in dataset}
   return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
   all_desc = list()
   for key in descriptions.keys():
      [all_desc.append(d) for d in descriptions[key]]
   return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
   lines = to_lines(descriptions)
   tokenizer = Tokenizer()
   tokenizer.fit_on_texts(lines)
   return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
   lines = to_lines(descriptions)
   return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, desc_list, photo):
   X1, X2, y = list(), list(), list()
   # walk through each description for the image
   for desc in desc_list:
      # encode the sequence
      seq = tokenizer.texts_to_sequences([desc])[0]
      # split one sequence into multiple X,y pairs
      for i in range(1, len(seq)):
         # split into input and output pair
         in_seq, out_seq = seq[:i], seq[i]
         # pad input sequence
         in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
         # encode output sequence
         out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
         # store
         X1.append(photo)
         X2.append(in_seq)
         y.append(out_seq)
   return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
   # feature extractor model
   inputs1 = Input(shape=(4096,))
   fe1 = Dropout(0.5)(inputs1)
   fe2 = Dense(256, activation='relu')(fe1)
   # sequence model
   inputs2 = Input(shape=(max_length,))
   se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
   se2 = Dropout(0.5)(se1)
   se3 = LSTM(256)(se2)
   # decoder model
   decoder1 = add([fe2, se3])
   decoder2 = Dense(256, activation='relu')(decoder1)
   outputs = Dense(vocab_size, activation='softmax')(decoder2)
   # tie it together [image, seq] [word]
   model = Model(inputs=[inputs1, inputs2], outputs=outputs)
   # compile model
   model.compile(loss='categorical_crossentropy', optimizer='adam')
   # summarize model
   model.summary()
   plot_model(model, to_file='model.png', show_shapes=True)
   return model

# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, photos, tokenizer, max_length):
   # loop for ever over images
   while 1:
      for key, desc_list in descriptions.items():
         # retrieve the photo feature
         photo = photos[key][0]
         in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo)
         yield [[in_img, in_seq], out_word]

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# define the model
model = define_model(vocab_size, max_length)
# train the model, run epochs manually and save after each epoch
epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
   # create the data generator
   generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
   # fit for one epoch
   model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
   # save model
   model.save('model_' + str(i) + '.h5')

代码清单23.5.5-6

23.6 评估模型

一旦模型训练完成,我们就可以在测试数据集上评估其预测性能:为测试数据集中的所有图片生成描述,并用标准的代价函数评估这些预测。首先,我们需要能够用训练好的模型为图片生成描述。

这包括传入描述的开始标记startseq,生成一个单词,然后将已生成的单词序列作为输入递归调用模型,直到出现序列结束标记endseq或达到最大描述长度。下面的generate_desc()函数实现了这个过程:给定训练好的模型和准备好的图片作为输入,生成文本描述。它调用word_for_id()函数,将预测的整数映射回单词。

# map an integer to a word
def word_for_id(integer, tokenizer):
   for word, index in tokenizer.word_index.items():
      if index == integer:
         return word
   return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
   # seed the generation process
   in_text = 'startseq'
   # iterate over the whole length of the sequence
   for i in range(max_length):
      # integer encode input sequence
      sequence = tokenizer.texts_to_sequences([in_text])[0]
      # pad input
      sequence = pad_sequences([sequence], maxlen=max_length)
      # predict next word
      yhat = model.predict([photo,sequence], verbose=0)
      # convert probability to integer
      yhat = argmax(yhat)
      # map integer to word
      word = word_for_id(yhat, tokenizer)
      # stop if we cannot map the word
      if word is None:
         break
      # append as input for generating the next word
      in_text += ' ' + word
      # stop if we predict the end of the sequence
      if word == 'endseq':
         break
   return in_text

代码清单23.25:用于生成图片描述的函数

在生成和比较图片描述时,我们需要去除序列的开始和结束标记。下面的cleanup_summary()函数完成此操作。

# remove start/end sequence tokens from a summary
def cleanup_summary(summary):
    # remove start of sequence token
    index = summary.find('startseq ')
    if index > -1:
        summary = summary[len('startseq '):]
    # remove end of sequence token
    index = summary.find(' endseq')
    if index > -1:
        summary = summary[:index]
    return summary

代码清单23.26:删除序列单词开始和结束的函数
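下面是cleanup_summary()的一个简单用法示例(假设输入来自generate_desc()的输出):

# example: strip the start/end tokens from a generated description
summary = cleanup_summary('startseq dog is running across the beach endseq')
print(summary)  # dog is running across the beach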

我们将为测试数据集中的所有图片生成预测。下面的evaluate_model()函数根据给定的图片描述和图片特征数据集评估训练好的模型:收集真实描述和预测描述,并使用语料库BLEU分数来评价生成的文本与参考描述的接近程度。

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
   actual, predicted = list(), list()
   # step over the whole set
   for key, desc_list in descriptions.items():
      # generate description
      yhat = generate_desc(model, tokenizer, photos[key], max_length)
      # store actual and predicted
      references = [d.split() for d in desc_list]
      actual.append(references)
      predicted.append(yhat.split())
   # calculate BLEU score
   print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
   print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
   print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
   print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

代码清单23.27:用于评估字幕生成模型的函数

BLEU分数用于文本翻译中,以一个或多个参考翻译来评估翻译文本的好坏。在这里,我们将每个生成的描述与该图片的所有参考描述进行比较,并计算1、2、3和4元(n-gram)的累积BLEU分数。NLTK Python库在corpus_bleu()函数中实现了BLEU分数的计算;分数越接近1.0越好,越接近0越差。请注意,第24章介绍了BLEU分数和NLTK API。
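为了直观理解corpus_bleu()的输入格式,下面是一个玩具示例:参考是“参考描述列表”的列表(每个候选对应一组参考描述),候选是分词后的单词列表;这里的句子只是为演示而虚构的:

from nltk.translate.bleu_score import corpus_bleu

# one candidate description with two reference descriptions
references = [[['dog', 'runs', 'on', 'the', 'beach'], ['a', 'dog', 'is', 'running']]]
candidates = [['dog', 'is', 'running', 'on', 'the', 'beach']]
print('BLEU-1: %f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))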

我们可以将所有这些与上一节中的数据加载函数放在一起。我们首先需要加载训练数据集来准备Tokenizer,以便将生成的单词编码为模型的输入序列。至关重要的是,必须使用与训练模型时完全相同的编码方案来编码生成的单词。然后,我们用这些函数加载测试数据集。下面列出了完整的示例。

from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
   # open the file as read only
   file = open(filename, 'r')
   # read all text
   text = file.read()
   # close the file
   file.close()
   return text

# load a pre-defined list of photo identifiers
def load_set(filename):
   doc = load_doc(filename)
   dataset = list()
   # process line by line
   for line in doc.split('\n'):
      # skip empty lines
      if len(line) < 1:
         continue
      # get the image identifier
      identifier = line.split('.')[0]
      dataset.append(identifier)
   return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
   # load document
   doc = load_doc(filename)
   descriptions = dict()
   for line in doc.split('\n'):
      # split line by white space
      tokens = line.split()
      # split id from description
      image_id, image_desc = tokens[0], tokens[1:]
      # skip images not in the set
      if image_id in dataset:
         # create list
         if image_id not in descriptions:
            descriptions[image_id] = list()
         # wrap description in tokens
         desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
         # store
         descriptions[image_id].append(desc)
   return descriptions

# load photo features
def load_photo_features(filename, dataset):
   # load all features
   all_features = load(open(filename, 'rb'))
   # filter features
   features = {k: all_features[k] for k in dataset}
   return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
   all_desc = list()
   for key in descriptions.keys():
      [all_desc.append(d) for d in descriptions[key]]
   return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
   lines = to_lines(descriptions)
   tokenizer = Tokenizer()
   tokenizer.fit_on_texts(lines)
   return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
   lines = to_lines(descriptions)
   return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
   for word, index in tokenizer.word_index.items():
      if index == integer:
         return word
   return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
   # seed the generation process
   in_text = 'startseq'
   # iterate over the whole length of the sequence
   for i in range(max_length):
      # integer encode input sequence
      sequence = tokenizer.texts_to_sequences([in_text])[0]
      # pad input
      sequence = pad_sequences([sequence], maxlen=max_length)
      # predict next word
      yhat = model.predict([photo,sequence], verbose=0)
      # convert probability to integer
      yhat = argmax(yhat)
      # map integer to word
      word = word_for_id(yhat, tokenizer)
      # stop if we cannot map the word
      if word is None:
         break
      # append as input for generating the next word
      in_text += ' ' + word
      # stop if we predict the end of the sequence
      if word == 'endseq':
         break
   return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
   actual, predicted = list(), list()
   # step over the whole set
   for key, desc_list in descriptions.items():
      # generate description
      yhat = generate_desc(model, tokenizer, photos[key], max_length)
      # store actual and predicted
      references = [d.split() for d in desc_list]
      actual.append(references)
      predicted.append(yhat.split())
   # calculate BLEU score
   print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
   print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
   print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
   print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# prepare tokenizer on train set

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# prepare test set

# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model_4.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

代码清单23.28:评估标题生成模型的完整示例

运行该示例将打印BLEU分数。可以看到,这些分数处于该问题上熟练模型的预期范围之内,并接近其上限。所选的模型配置绝不是经过优化的。

注意:鉴于神经网络的随机性,您的具体结果可能会有所不同。可以考虑多运行几次示例。

BLEU-1: 0.552647

BLEU-2: 0.294615

BLEU-3: 0.200016

BLEU-4: 0.091348

代码清单23.29:评估标题生成模型的示例输出

23.7 生成新字幕

现在我们知道了如何开发和评估字幕生成模型,那么如何使用它呢?为全新图片生成字幕所需的几乎所有内容都在模型文件中;此外还需要Tokenizer,用于在生成描述序列时为模型编码生成的单词,以及定义模型时使用的输入序列最大长度(例如34)。

我们可以硬编码最大序列长度。对于文本编码,我们可以创建Tokenizer并将其保存到文件中,以便在需要时快速加载,而无需整个Flickr8k数据集;另一种方法是使用我们自己的词汇表文件和训练期间建立的单词到整数的映射。我们可以像以前一样创建Tokenizer,并将其保存为pickle文件tokenizer.pkl。下面列出了完整的示例。

from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
   # open the file as read only
   file = open(filename, 'r')
   # read all text
   text = file.read()
   # close the file
   file.close()
   return text

# load a pre-defined list of photo identifiers
def load_set(filename):
   doc = load_doc(filename)
   dataset = list()
   # process line by line
   for line in doc.split('\n'):
      # skip empty lines
      if len(line) < 1:
         continue
      # get the image identifier
      identifier = line.split('.')[0]
      dataset.append(identifier)
   return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
   # load document
   doc = load_doc(filename)
   descriptions = dict()
   for line in doc.split('\n'):
      # split line by white space
      tokens = line.split()
      # split id from description
      image_id, image_desc = tokens[0], tokens[1:]
      # skip images not in the set
      if image_id in dataset:
         # create list
         if image_id not in descriptions:
            descriptions[image_id] = list()
         # wrap description in tokens
         desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
         # store
         descriptions[image_id].append(desc)
   return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
   all_desc = list()
   for key in descriptions.keys():
      [all_desc.append(d) for d in descriptions[key]]
   return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
   lines = to_lines(descriptions)
   tokenizer = Tokenizer()
   tokenizer.fit_on_texts(lines)
   return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

代码清单23.30:准备和保存Tokenizer的完整示例

现在,可以在需要时直接加载tokenizer,而无需加载整个训练数据集。接下来,让我们为一张新图片生成描述。下面是我在Flickr上随机选择的一张新图片(在宽松许可下可用)1。

1https://www.flickr.com/photos/bambe1964/7837618434/

23.2:一只狗在海滩的图片。图片来自bambe1964,保留一些权利。

我们将使用模型为它生成描述。下载图片,并以文件名example.jpg保存到本地目录。首先,我们必须从tokenizer.pkl加载Tokenizer,并定义填充输入所需的生成序列最大长度。

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34

代码清单23.31:加载保存的Tokenizer的示例

然后我们必须像以前一样加载模型。

# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')

代码清单23.32:加载已保存模型的示例

接下来,我们加载要描述的图片并提取特征。我们可以重新定义模型,把VGG-16模型加为其中一部分来实现这一目标;也可以先用VGG模型预测特征,再将其作为输入提供给现有模型。我们采用后者,使用数据准备期间的extract_features()函数的修改版本,只处理单张图片,代码如下。

# extract features from each photo in the directory
def extract_features(filename):
   # load the model
   model = VGG16()
   # re-structure the model
   model.layers.pop()
   model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
   # load the photo
   image = load_img(filename, target_size=(224, 224))
   # convert the image pixels to a numpy array
   image = img_to_array(image)
   # reshape data for the model
   image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
   # prepare the image for the VGG model
   image = preprocess_input(image)
   # get features
   feature = model.predict(image, verbose=0)
   return feature

# load and prepare the photograph
photo = extract_features('example.jpg')

代码清单23.33:提取所提供图片的函数的示例

然后,我们使用在评估模型时定义的generate_desc()函数生成描述,下面列出了为全新独立图片生成描述的完整示例。

from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from each photo in the directory
def extract_features(filename):
   # load the model
   model = VGG16()
   # re-structure the model
   model.layers.pop()
   model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
   # load the photo
   image = load_img(filename, target_size=(224, 224))
   # convert the image pixels to a numpy array
   image = img_to_array(image)
   # reshape data for the model
   image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
   # prepare the image for the VGG model
   image = preprocess_input(image)
   # get features
   feature = model.predict(image, verbose=0)
   return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
   for word, index in tokenizer.word_index.items():
      if index == integer:
         return word
   return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
   # seed the generation process
   in_text = 'startseq'
   # iterate over the whole length of the sequence
   for i in range(max_length):
      # integer encode input sequence
      sequence = tokenizer.texts_to_sequences([in_text])[0]
      # pad input
      sequence = pad_sequences([sequence], maxlen=max_length)
      # predict next word
      yhat = model.predict([photo,sequence], verbose=0)
      # convert probability to integer
      yhat = argmax(yhat)
      # map integer to word
      word = word_for_id(yhat, tokenizer)
      # stop if we cannot map the word
      if word is None:
         break
      # append as input for generating the next word
      in_text += ' ' + word
      # stop if we predict the end of the sequence
      if word == 'endseq':
         break
   return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)

代码清单23.34:为新图片生成描述的完整示例

在这种情况下,生成的描述如下:

dog is running across the beach

代码清单23.35:为新图片生成标题的示例输出

注意:鉴于神经网络的随机性,您的具体结果可能会有所不同。可以考虑多运行几次示例。

23.8 扩展

本节列出了一些您可能希望探索的教程扩展想法。

  • 替代预训练图像模型。本章使用小型的16层VGG模型进行特征提取,可以尝试在ImageNet数据集上性能更好的更大模型,例如Inception。
  • 更小的词汇量。模型开发中使用了大约八千个单词的词汇表,其中许多单词可能是拼写错误,或仅在整个数据集中出现过一次。尝试优化词汇表并缩小其规模,也许减半。
  • 预训练的单词向量。该模型把单词向量的学习作为模型拟合的一部分。使用在训练数据集上预训练、或在更大的文本语料库(例如新闻文章或维基百科)上训练的单词向量,可能获得更好的性能。
  • 调优的Word2Vec向量。使用Word2Vec在描述数据上预训练单词向量,并比较训练期间允许和不允许微调向量的模型性能。
  • 调整模型。该模型的配置没有针对该问题进行调优,尝试其他配置,看看是否可以获得更好的性能。
  • 注入架构。尝试用于生成字幕的注入(inject)架构,并将其性能与本教程中使用的合并(merge)架构进行比较。
  • 替代框架。尝试该问题的替代框架,例如一次性从图片生成整个描述序列。
  • 预训练语言模型。预先训练一个语言模型来生成描述文本,然后在字幕生成模型中使用它,并评估对模型训练时间和性能的影响。
  • 截断描述。仅在描述不超过特定单词数的样本上训练模型,或尝试将长描述截断为某个最优长度,评估对训练时间和模型性能的影响。
  • 替代度量。尝试BLEU之外的其他性能指标,例如ROUGE。比较相同描述在不同指标下的分数,以便对这些度量在实践中的差异形成直觉。
