[TensorFlow 2.0 Tutorial] Text classification of movie reviews


This article demonstrates a basic application of transfer learning with TensorFlow Hub and Keras.


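As before, we use tf.keras, the high-level API for building and training models in TensorFlow. TensorFlow Hub is a library and platform for transfer learning; we will use one of its pre-trained text embedding models.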


import numpy as np
import tensorflow as tf
from tensorflow import keras

import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version:", tf.__version__)
print("Eager mode:", tf.executing_eagerly())
print("Hub version:", hub.__version__)
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
Version: 2.0.0-alpha0
Eager mode: True
Hub version: 0.4.0


The TensorFlow Datasets library provides the IMDB dataset. The code below uses it to download the dataset:

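# Split the training set 6:4 into training and validation subsets.
train_validation_split = tfds.Split.TRAIN.subsplit([6, 4])

(train_data, validation_data), test_data = tfds.load(
    name="imdb_reviews",
    split=(train_validation_split, tfds.Split.TEST),
    # Return (text, label) tuples instead of feature dictionaries.
    as_supervised=True)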
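
The effect of the 6:4 split can be estimated with a little arithmetic. The sketch below assumes the IMDB training split holds 25,000 reviews and that the split is exactly proportional; the helper `split_sizes` is hypothetical and is not part of TensorFlow Datasets. The batch size of 512 is the one used in the training call later in this article.

```python
import math

# Assumption: the IMDB training split holds 25,000 labeled reviews.
TRAIN_EXAMPLES = 25_000
BATCH_SIZE = 512  # batch size used in the model.fit call of this article


def split_sizes(total, weights):
    """Divide `total` examples proportionally to the integer `weights`."""
    weight_sum = sum(weights)
    return [total * w // weight_sum for w in weights]


train_size, validation_size = split_sizes(TRAIN_EXAMPLES, [6, 4])
# Number of batches needed to see every training example once.
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)

print(train_size, validation_size, steps_per_epoch)
```

Under these assumptions the training subset has about 15,000 reviews, which gives 30 batches per epoch and agrees with the 30/30 step counter in the training log.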


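First, let's take a moment to look at the format of the data. Each sample contains the text of a movie review and a corresponding label. The review text has not been preprocessed in any way, and the label is an integer, 0 or 1, where 0 marks a negative review and 1 a positive review. Let's print the first 10 samples: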

train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
<tf.Tensor: id=235, shape=(3,), dtype=string, numpy=
array([b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not really Dickens at all.<br /><br />'Oliver!', on the other hand, is much closer to the mark. The mockery of officialdom is perfectly interpreted, from the blustering beadle to the drunken magistrate. The classic stand-off between the beadle and Mr Brownlow, in which the law is described as 'a ass, a idiot' couldn't have been better done. Harry Secombe is an ideal choice.<br /><br />But the blinding cruelty is also there, the callous indifference of the state, the cold, hunger, poverty and loneliness are all presented just as surely as The Master would have wished.<br /><br />And then there is crime. Ron Moody is a treasure as the sleazy Jewish fence, whilst Oliver Reid has Bill Sykes to perfection.<br /><br />Perhaps not surprisingly, Lionel Bart - himself a Jew from London's east-end - takes a liberty with Fagin by re-interpreting him as a much more benign fellow than was Dicken's original. 
In the novel, he was utterly ruthless, sending some of his own boys to the gallows in order to protect himself (though he was also caught and hanged). Whereas in the movie, he is presented as something of a wayward father-figure, a sort of charitable thief rather than a corrupter of children, the latter being a long-standing anti-semitic sentiment. Otherwise, very few liberties are taken with Dickens's original. All of the most memorable elements are included. Just enough menace and violence is retained to ensure narrative fidelity whilst at the same time allowing for children' sensibilities. Nancy is still beaten to death, Bullseye narrowly escapes drowning, and Bill Sykes gets a faithfully graphic come-uppance.<br /><br />Every song is excellent, though they do incline towards schmaltz. Mark Lester mimes his wonderfully. Both his and my favourite scene is the one in which the world comes alive to 'who will buy'. It's schmaltzy, but it's Dickens through and through.<br /><br />I could go on. I could commend the wonderful set-pieces, the contrast of the rich and poor. There is top-quality acting from more British regulars than you could shake a stick at.<br /><br />I ought to give it 10 points, but I'm feeling more like Scrooge today. Soak it up with your Christmas dinner. No original has been better realised.",
       b"Oh yeah! Jenna Jameson did it again! Yeah Baby! This movie rocks. It was one of the 1st movies i saw of her. And i have to say i feel in love with her, she was great in this move.<br /><br />Her performance was outstanding and what i liked the most was the scenery and the wardrobe it was amazing you can tell that they put a lot into the movie the girls cloth were amazing.<br /><br />I hope this comment helps and u can buy the movie, the storyline is awesome is very unique and i'm sure u are going to like it. Jenna amazed us once more and no wonder the movie won so many awards. Her make-up and wardrobe is very very sexy and the girls on girls scene is amazing. specially the one where she looks like an angel. It's a must see and i hope u share my interests",
       dtype=string)>


<tf.Tensor: id=231, shape=(10,), dtype=int64, numpy=array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0])>



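  • How should the text be represented?
  • How many layers should the model use?
  • How many hidden units (that is, neurons, also called nodes) should each layer have?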



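  • We do not need to worry about text preprocessing.
  • We can benefit from transfer learning.
  • The embedding has a fixed size, so it is simpler to process.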

In this example we use a pre-trained text embedding model from TensorFlow Hub named google/tf2-preview/gnews-swivel-20dim/1.

TensorFlow Hub offers three other pre-trained models that can also be tried in this example.

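Next, we create a Keras layer that uses the TensorFlow Hub model to embed text, and test it on a few input samples. Note that regardless of the length of the input text, the output shape of the embedding is fixed: (num_examples, embedding_dimension).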
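
To see why the output shape does not depend on the length of the text, here is a toy sketch in plain Python. The word vectors, the whitespace tokenizer, and the averaging step are all assumptions made for illustration; the real TensorFlow Hub model uses its own vocabulary and its own way of combining token embeddings.

```python
# Toy 3-dimensional word vectors; hypothetical values, not from any real model.
TOY_VECTORS = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 1.0],
    "boring": [-1.0, 0.0, -2.0],
}
EMBEDDING_DIMENSION = 3


def embed_sentence(sentence):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    tokens = sentence.lower().split()
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0] * EMBEDDING_DIMENSION
    # zip(*known) iterates over the vectors one dimension at a time.
    return [sum(column) / len(known) for column in zip(*known)]


def embed_batch(sentences):
    """Return one fixed-size vector per sentence."""
    return [embed_sentence(s) for s in sentences]


batch = embed_batch(["great movie", "a boring boring boring movie"])
# Both rows have EMBEDDING_DIMENSION entries although the inputs differ in length.
print(len(batch), [len(row) for row in batch])
```

Because the token vectors are averaged, a two-word review and a five-word review both end up as a vector of the same size, so the batch has shape (num_examples, embedding_dimension).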

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
# Embed the first three reviews to check the output shape.
hub_layer(train_examples_batch[:3])
<tf.Tensor: id=416, shape=(3, 20), dtype=float32, numpy=
array([[ 3.9819887 , -4.4838037 ,  5.177359  , -2.3643482 , -3.2938678 ,
        -3.5364532 , -2.4786978 ,  2.5525482 ,  6.688532  , -2.3076782 ,
        -1.9807833 ,  1.1315885 , -3.0339816 , -0.7604128 , -5.743445  ,
         3.4242578 ,  4.790099  , -4.03061   , -5.992149  , -1.7297493 ],
       [ 3.4232912 , -4.230874  ,  4.1488533 , -0.29553518, -6.802391  ,
        -2.5163853 , -4.4002395 ,  1.905792  ,  4.7512794 , -0.40538004,
        -4.3401685 ,  1.0361497 ,  0.9744097 ,  0.71507156, -6.2657013 ,
         0.16533905,  4.560262  , -1.3106939 , -3.1121316 , -2.1338716 ],
       [ 3.8508697 , -5.003031  ,  4.8700504 , -0.04324996, -5.893603  ,
        -5.2983093 , -4.004676  ,  4.1236343 ,  6.267754  ,  0.11632943,
        -3.5934832 ,  0.8023905 ,  0.56146765,  0.9192484 , -7.3066816 ,
         dtype=float32)>


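model = tf.keras.Sequential()
model.add(hub_layer)  # pre-trained text embedding, listed as keras_layer in the summary
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()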

Model: "sequential"
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
dense (Dense)                (None, 16)                336       
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
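
The parameter counts in the summary can be checked by hand. A Dense layer has one weight per (input, unit) pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check; the embedding layer's count is simply copied from the summary.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


embedding_params = 400_020            # copied from the keras_layer row of the summary
hidden_params = dense_params(20, 16)  # 20-dimensional embedding into 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units into 1 output node
total_params = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total_params)
```

The results, 336, 17 and 400,373, match the Param # column and the total reported by the summary.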


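  1. The first layer is a TensorFlow Hub layer. It uses a pre-trained saved model to map a sentence to an embedding vector. The pre-trained model splits the sentence into tokens, embeds each token, and then combines the token embeddings into a single vector. The resulting dimensions are (num_examples, embedding_dimension).
  2. The fixed-length output vector of the first layer is then passed through a fully connected (Dense) layer with 16 hidden units.
  3. The last layer is a single-node output layer with a sigmoid activation function. Its output is a floating-point value between 0 and 1 that represents a probability, or confidence level.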
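
The last two layers can be sketched in plain Python. To keep the numbers readable, this example assumes a 4-dimensional embedding and 2 hidden units instead of 20 and 16, and every weight is made up. The 0.5 threshold used to turn the probability into a label is a common convention, not something this article's model prescribes.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer; weights[j] holds the input weights of unit j."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


# Hypothetical embedding vector and weights, chosen only for illustration.
embedding_vector = [0.5, -1.0, 2.0, 0.0]
hidden_weights = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, -1.0]]
output_biases = [-0.5]

hidden = dense(embedding_vector, hidden_weights, hidden_biases, relu)
probability = dense(hidden, output_weights, output_biases, sigmoid)[0]
label = 1 if probability >= 0.5 else 0  # assumed decision threshold

print(hidden, round(probability, 3), label)
```

The ReLU layer zeroes out the negative pre-activation of the second hidden unit, and the sigmoid squashes the final sum into the interval (0, 1), so the output can be read as the model's confidence that the review is positive.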




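binary_crossentropy is not the only choice of loss function; you could, for example, choose mean_squared_error. In general, though, binary_crossentropy is better suited to probabilities: it measures the distance between probability distributions, which in our case means the distance between the true distribution and the predictions.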
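
A small numeric comparison illustrates the difference between the two losses. The functions below are simplified per-example versions written for this sketch, not the Keras implementations. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross-entropy between a 0/1 label and a predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def squared_error(y_true, p):
    """Squared difference between the label and the predicted probability."""
    return (y_true - p) ** 2


# The true label is 1 (positive review); compare three predictions.
for p in (0.9, 0.5, 0.01):
    print(p, round(binary_crossentropy(1, p), 3), round(squared_error(1, p), 3))
```

For a confident wrong prediction such as p = 0.01, the squared error can never exceed 1, while the cross-entropy keeps growing as p approaches 0. That stronger penalty is one reason cross-entropy is usually preferred when the model outputs a probability.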






history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20, validation_data=validation_data.batch(512))


Epoch 1/20
30/30 [==============================] - 7s 245ms/step - loss: 0.7742 - accuracy: 0.4614 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/20
30/30 [==============================] - 6s 210ms/step - loss: 0.6759 - accuracy: 0.5783 - val_loss: 0.6483 - val_accuracy: 0.6273
Epoch 3/20
30/30 [==============================] - 7s 221ms/step - loss: 0.6084 - accuracy: 0.6863 - val_loss: 0.5836 - val_accuracy: 0.7120
Epoch 4/20
30/30 [==============================] - 7s 219ms/step - loss: 0.5588 - accuracy: 0.7369 - val_loss: 0.5470 - val_accuracy: 0.7414
Epoch 5/20
30/30 [==============================] - 6s 208ms/step - loss: 0.5192 - accuracy: 0.7703 - val_loss: 0.5117 - val_accuracy: 0.7655
Epoch 20/20
30/30 [==============================] - 6s 214ms/step - loss: 0.1667 - accuracy: 0.9414 - val_loss: 0.2963 - val_accuracy: 0.8758



results = model.evaluate(test_data.batch(512), verbose=0)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
loss: 0.324, accuracy: 0.873
[0.32395469411849975, 0.87264]





