[TensorFlow 2.0 Tutorial] Text Classification of Movie Reviews


In this article we classify movie review text as positive or negative. This is an important and very common binary classification problem in machine learning.

The article demonstrates a basic application of transfer learning using TensorFlow Hub and Keras.

We will use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. The training and test sets are balanced, meaning they contain an equal number of positive and negative reviews.

As before, we use the high-level Keras API to build and train the model in TensorFlow. TensorFlow Hub is a library and platform for transfer learning; we will use one of its pre-trained text embedding models.

First, import the Python libraries used in this article:

import numpy as np
import tensorflow as tf
from tensorflow import keras

import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version:", tf.__version__)
print("Eager mode:", tf.executing_eagerly())
print("Hub version:", hub.__version__)
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
Version: 2.0.0-alpha0
Eager mode: True
Hub version: 0.4.0
GPU is NOT AVAILABLE

Download the IMDB dataset

The TensorFlow Datasets library provides the IMDB dataset. The following code uses tensorflow_datasets to download it:

# Split the original IMDB training set 60/40 into training and validation sets.
train_validation_split = tfds.Split.TRAIN.subsplit([6, 4])

# as_supervised=True returns (example, label) pairs instead of a feature dict.
(train_data, validation_data), test_data = tfds.load(
    name="imdb_reviews", 
    split=(train_validation_split, tfds.Split.TEST),
    as_supervised=True)

Partial output:

Downloading and preparing dataset imdb_reviews (80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/0.1.0...
WARNING: Logging before flag parsing goes to stderr.
W0605 04:08:56.784977 140114589456192 deprecation.py:323] From /root/anaconda3/lib/python3.7/site-packages/tensorflow_datasets/core/file_format_adapter.py:247: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
...

Explore the data

First, let's take a moment to look at the format of the data. Each example consists of a movie review and a corresponding label. The review text has not been preprocessed in any way; the label is an integer, 0 or 1, where 0 means a negative review and 1 a positive one. Let's print the first 10 examples (the captured output below shows only the first 3 reviews):

train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch
<tf.Tensor: id=235, shape=(3,), dtype=string, numpy=
array([b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not really Dickens at all.<br /><br />'Oliver!', on the other hand, is much closer to the mark. The mockery of officialdom is perfectly interpreted, from the blustering beadle to the drunken magistrate. The classic stand-off between the beadle and Mr Brownlow, in which the law is described as 'a ass, a idiot' couldn't have been better done. Harry Secombe is an ideal choice.<br /><br />But the blinding cruelty is also there, the callous indifference of the state, the cold, hunger, poverty and loneliness are all presented just as surely as The Master would have wished.<br /><br />And then there is crime. Ron Moody is a treasure as the sleazy Jewish fence, whilst Oliver Reid has Bill Sykes to perfection.<br /><br />Perhaps not surprisingly, Lionel Bart - himself a Jew from London's east-end - takes a liberty with Fagin by re-interpreting him as a much more benign fellow than was Dicken's original. In the novel, he was utterly ruthless, sending some of his own boys to the gallows in order to protect himself (though he was also caught and hanged). Whereas in the movie, he is presented as something of a wayward father-figure, a sort of charitable thief rather than a corrupter of children, the latter being a long-standing anti-semitic sentiment. Otherwise, very few liberties are taken with Dickens's original. All of the most memorable elements are included. Just enough menace and violence is retained to ensure narrative fidelity whilst at the same time allowing for children' sensibilities. Nancy is still beaten to death, Bullseye narrowly escapes drowning, and Bill Sykes gets a faithfully graphic come-uppance.<br /><br />Every song is excellent, though they do incline towards schmaltz. Mark Lester mimes his wonderfully. Both his and my favourite scene is the one in which the world comes alive to 'who will buy'. It's schmaltzy, but it's Dickens through and through.<br /><br />I could go on. I could commend the wonderful set-pieces, the contrast of the rich and poor. There is top-quality acting from more British regulars than you could shake a stick at.<br /><br />I ought to give it 10 points, but I'm feeling more like Scrooge today. Soak it up with your Christmas dinner. No original has been better realised.",
       b"Oh yeah! Jenna Jameson did it again! Yeah Baby! This movie rocks. It was one of the 1st movies i saw of her. And i have to say i feel in love with her, she was great in this move.<br /><br />Her performance was outstanding and what i liked the most was the scenery and the wardrobe it was amazing you can tell that they put a lot into the movie the girls cloth were amazing.<br /><br />I hope this comment helps and u can buy the movie, the storyline is awesome is very unique and i'm sure u are going to like it. Jenna amazed us once more and no wonder the movie won so many awards. Her make-up and wardrobe is very very sexy and the girls on girls scene is amazing. specially the one where she looks like an angel. It's a must see and i hope u share my interests",
       b"I saw this film on True Movies (which automatically made me sceptical) but actually - it was good. Why? Not because of the amazing plot twists or breathtaking dialogue (of which there is little) but because actually, despite what people say I thought the film was accurate in it's depiction of teenagers dealing with pregnancy.<br /><br />It's NOT Dawson's Creek, they're not graceful, cool witty characters who breeze through sexuality with effortless knowledge. They're kids and they act like kids would. <br /><br />They're blunt, awkward and annoyingly confused about everything. Yes, this could be by accident and they could just be bad actors but I don't think so. Dermot Mulroney gives (when not trying to be cool) a very believable performance and I loved him for it. Patricia Arquette IS whiny and annoying, but she was pregnant and a teenagers? The combination of the two isn't exactly lavender on your pillow. The plot was VERY predictable and but so what? I believed them, his stress and inability to cope - her brave, yet slightly misguided attempts to bring them closer together. I think the characters, acted by anyone else, WOULD indeed have been annoying and unbelievable but they weren't. It reflects the surreality of the situation they're in, that he's sitting in class and she walks on campus with the baby. I felt angry at her for that, I felt angry at him for being such a child and for blaming her. I felt it all.<br /><br />In the end, I loved it and would recommend it.<br /><br />Watch out for the scene where Dermot Mulroney runs from the disastrous counselling session - career performance."],
      dtype=object)>

And the corresponding labels of the first 10 examples:

train_labels_batch
<tf.Tensor: id=231, shape=(10,), dtype=int64, numpy=array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0])>
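
The labels are just integers; here is a small sketch pairing each label with a snippet of its review (class_names is our own helper, not part of the dataset):

class_names = ["negative", "positive"]  # 0 = negative review, 1 = positive review

for text, label in zip(train_examples_batch[:3], train_labels_batch[:3]):
    snippet = text.numpy()[:60].decode("utf-8", errors="ignore")
    print("%s | %s..." % (class_names[label.numpy()], snippet))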

Build the model

A neural network model is generally built by stacking layers, and we typically need to make three main architectural decisions:

  • How should the text be represented?
  • How many layers should the model use?
  • How many hidden units (i.e. neurons, also called nodes) should each layer have?

In this example, the input data consists of sentences of text, and the labels to predict are 0 or 1.

One way to represent the text is to convert each sentence into an embedding vector. We can use a pre-trained text embedding model as the first layer, which has three advantages:

  • We don't have to worry about text preprocessing.
  • We benefit from transfer learning.
  • The embedding has a fixed size, which makes processing simpler.

In this example, we will use a pre-trained text embedding model from TensorFlow Hub called google/tf2-preview/gnews-swivel-20dim/1.

TensorFlow Hub offers three other pre-trained models that can also be swapped in for this example:

  • google/tf2-preview/gnews-swivel-20dim-with-oov/1: the same as google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% of the vocabulary converted to OOV buckets.
  • google/tf2-preview/nnlm-en-dim50/1: a much larger model with a vocabulary of about 1M words and 50-dimensional embeddings.
  • google/tf2-preview/nnlm-en-dim128/1: an even larger model, also with about a 1M-word vocabulary, and 128-dimensional embeddings.

Next, let's create a Keras layer that uses the TensorFlow Hub model to embed the text, and try it out on a few input examples. Note that the output shape of the embedding is fixed at (num_examples, embedding_dimension), regardless of the length of the input text.

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])
<tf.Tensor: id=416, shape=(3, 20), dtype=float32, numpy=
array([[ 3.9819887 , -4.4838037 ,  5.177359  , -2.3643482 , -3.2938678 ,
        -3.5364532 , -2.4786978 ,  2.5525482 ,  6.688532  , -2.3076782 ,
        -1.9807833 ,  1.1315885 , -3.0339816 , -0.7604128 , -5.743445  ,
         3.4242578 ,  4.790099  , -4.03061   , -5.992149  , -1.7297493 ],
       [ 3.4232912 , -4.230874  ,  4.1488533 , -0.29553518, -6.802391  ,
        -2.5163853 , -4.4002395 ,  1.905792  ,  4.7512794 , -0.40538004,
        -4.3401685 ,  1.0361497 ,  0.9744097 ,  0.71507156, -6.2657013 ,
         0.16533905,  4.560262  , -1.3106939 , -3.1121316 , -2.1338716 ],
       [ 3.8508697 , -5.003031  ,  4.8700504 , -0.04324996, -5.893603  ,
        -5.2983093 , -4.004676  ,  4.1236343 ,  6.267754  ,  0.11632943,
        -3.5934832 ,  0.8023905 ,  0.56146765,  0.9192484 , -7.3066816 ,
         2.8202746 ,  6.2000837 , -3.5709393 , -4.564525  , -2.305622  ]],
      dtype=float32)>

Now we can use the layer created above to build the full neural network model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________

As you can see, our classifier model is built by stacking several layers in sequence:

  1. The first layer is a TensorFlow Hub layer. It uses a pre-trained saved model to map a sentence to an embedding vector: the model splits the sentence into tokens, embeds each token, and then combines the token embeddings. The resulting dimensions are (num_examples, embedding_dimension).
  2. The fixed-length output vector of the first layer is piped through a fully connected (Dense) layer with 16 hidden units (a quick check of the parameter counts follows this list).
  3. The last layer is a single-node output layer with a sigmoid activation function; it outputs a floating-point value between 0 and 1, representing a probability or confidence level.
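
As a quick sanity check on the Param # column in the summary above, a Dense layer has inputs × units weights plus one bias per unit:

# Parameters of a Dense layer = inputs * units + units (one bias per unit).
print(20 * 16 + 16)  # 336 -> dense, fed by the 20-dimensional embedding
print(16 * 1 + 1)    # 17  -> dense_1, the single-unit output layer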

Next, we need to compile the model.

Choose a loss function and an optimizer

The model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability, we will use the binary_crossentropy loss function.

This is not the only possible choice of loss function: you could, for example, use mean_squared_error. In general, though, binary_crossentropy is better suited to dealing with probabilities, since it measures the distance between probability distributions, in our case between the ground-truth distribution and the predictions.

When we study regression problems (for example, predicting house prices), we can use the mean squared error loss function instead.
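
To make "the distance between probability distributions" concrete, here is a minimal NumPy sketch (with made-up toy numbers) of the quantity binary_crossentropy computes:

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, the quantity Keras minimizes."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# Toy example: two confident correct predictions and one uncertain one.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.6])
print(binary_crossentropy(y_true, y_pred))  # ~0.24; lower is better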

Now, configure the model with an optimizer and a loss function, and specify the metric to monitor:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Train the model

We train the model for 20 epochs, that is, 20 passes over all the samples in the training data, in mini-batches of 512 samples. During training, we validate on the 10,000 samples of the validation set to monitor the model's loss and accuracy.

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Partial output:

Epoch 1/20
30/30 [==============================] - 7s 245ms/step - loss: 0.7742 - accuracy: 0.4614 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/20
30/30 [==============================] - 6s 210ms/step - loss: 0.6759 - accuracy: 0.5783 - val_loss: 0.6483 - val_accuracy: 0.6273
Epoch 3/20
30/30 [==============================] - 7s 221ms/step - loss: 0.6084 - accuracy: 0.6863 - val_loss: 0.5836 - val_accuracy: 0.7120
Epoch 4/20
30/30 [==============================] - 7s 219ms/step - loss: 0.5588 - accuracy: 0.7369 - val_loss: 0.5470 - val_accuracy: 0.7414
Epoch 5/20
30/30 [==============================] - 6s 208ms/step - loss: 0.5192 - accuracy: 0.7703 - val_loss: 0.5117 - val_accuracy: 0.7655
...
Epoch 20/20
30/30 [==============================] - 6s 214ms/step - loss: 0.1667 - accuracy: 0.9414 - val_loss: 0.2963 - val_accuracy: 0.8758
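
fit returns a History object whose history attribute records the per-epoch metrics. A minimal sketch (assuming matplotlib is installed) to plot training versus validation accuracy:

import matplotlib.pyplot as plt

history_dict = history.history  # keys: 'loss', 'accuracy', 'val_loss', 'val_accuracy'
epochs = range(1, len(history_dict['accuracy']) + 1)

plt.plot(epochs, history_dict['accuracy'], 'bo-', label='Training accuracy')
plt.plot(epochs, history_dict['val_accuracy'], 'r--', label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()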

Evaluate the model

Use the evaluate function to check how the model performs on the test set. It returns two values: the loss and the accuracy.

results = model.evaluate(test_data.batch(512), verbose=0)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
loss: 0.324
accuracy: 0.873

As you can see, this fairly simple approach achieves an accuracy of about 87%. With more advanced methods, the model should be able to approach 95% accuracy.
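
Because the hub layer consumes raw strings, the trained model can score new reviews directly. A quick sketch with two invented example sentences (not from the dataset):

# Two made-up reviews; scores near 1 suggest positive, near 0 negative.
examples = np.array(["This movie was absolutely wonderful, I loved every minute.",
                     "A dull, predictable film. I walked out halfway through."])
predictions = model.predict(examples)  # sigmoid outputs in [0, 1]
for text, p in zip(examples, predictions):
    print("%.3f  %s" % (p[0], text))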

Further reading

For a more general way of working with text inputs, and for a more detailed analysis of accuracy and loss during training, see here.

For the complete code of this article, see here.

