I'm running this convolutional neural network model on Google Colab; my goal is text classification. Here are my code and the error:
```python
# imports (assumed from earlier notebook cells, reproduced here for completeness)
from numpy import array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
food_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/food', vocab, False)
location_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/location', vocab, False)
price_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/price', vocab, False)
service_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/service', vocab, False)
time_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/time', vocab, False)
test_docs = food_docs + location_docs + price_docs + service_docs + time_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
```
Here is my model summary output:
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 41, 100)           415400
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 34, 32)            25632
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 17, 32)            0
_________________________________________________________________
flatten_1 (Flatten)          (None, 544)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5450
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11
=================================================================
Total params: 446,493
Trainable params: 446,493
Non-trainable params: 0
_________________________________________________________________
None
```
And here is the error that occurs when I run the last cell:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-fa9c5ed3e39a> in <module>()
      2 model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
      3 # fit network
----> 4 model.fit(Xtrain, ytrain, epochs=10, verbose=2)

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/data_adapter.py in _check_data_cardinality(data)
   1527         label, ", ".join(str(i.shape[0]) for i in nest.flatten(single_data)))
   1528     msg += "Make sure all arrays contain the same number of samples."
-> 1529     raise ValueError(msg)
   1530
   1531

ValueError: Data cardinality is ambiguous:
    x sizes: 9473
    y sizes: 1800
Make sure all arrays contain the same number of samples.
```
I'm fairly new to using CNNs, so any help would be greatly appreciated! Thank you.
Answer:
Your training data has only 1,800 labels, but your training input has 9,473 samples:
```python
>>> ytrain = np.array([0 for _ in range(900)] + [1 for _ in range(900)])
>>> ytrain.shape
(1800,)
```
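The input side, for comparison (the row count comes from the traceback's `x sizes`, and the sequence length from the model summary):

```python
>>> Xtrain.shape
(9473, 41)
```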
Assuming you actually want your labels to be 50% zeros and 50% ones, you need to change this to something like:
```python
ytrain = np.array([0 for _ in range(len(Xtrain)//2)] + [1 for _ in range(len(Xtrain)//2)])
```
This creates an array where the first half of Xtrain is labeled 0 and the other half is labeled 1.
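One catch for the dataset in the question: 9,473 is an odd count, so two halves of `len(Xtrain)//2` still come up one label short, which is what motivates the update below:

```python
>>> len(Xtrain)
9473
>>> len(Xtrain) // 2 * 2   # 4736 zeros + 4736 ones
9472
```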
Update
For a dataset that doesn't divide evenly, the following may work better: it splits at the middle index, so it should also handle odd lengths:
```python
length = len(Xtrain)
middle_index = length // 2
ytrain = np.array([0 for _ in range(len(Xtrain[:middle_index]))] +
                  [1 for _ in range(len(Xtrain[middle_index:]))])
```
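To sanity-check the split end to end, here is a minimal sketch; the zeros matrix is a stand-in for the real padded sequences, with its shape taken from the traceback and the model summary:

```python
import numpy as np

# Stand-in for the real padded training matrix: 9473 rows (traceback)
# by 41 columns (sequence length in the model summary).
Xtrain = np.zeros((9473, 41))

length = len(Xtrain)
middle_index = length // 2
ytrain = np.array([0 for _ in range(len(Xtrain[:middle_index]))] +
                  [1 for _ in range(len(Xtrain[middle_index:]))])

# The first half gets 4736 zeros, the second half 4737 ones,
# so the cardinalities now match and model.fit() will accept them.
print(Xtrain.shape[0], ytrain.shape[0])  # 9473 9473
```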