I am getting a ValueError when trying to fit an artificial neural network (ANN) on Kaggle's Titanic dataset. A RandomForest works fine, but when I try the ANN the code throws the error below. Can you point out why I am getting this error? I have pasted the code below:
import numpy as np
import pandas as pd

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

y = train_data["Survived"]
y = np.array(y.values.tolist())

features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
X = my_imputer.fit_transform(X)
my_imputer = SimpleImputer()
X_test = my_imputer.fit_transform(X_test)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
X_test = sc.transform(X_test)

import keras
from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu', input_dim=6))
classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu'))
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X, y, batch_size=10, epochs=100)
The error is as follows:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-f7e7c8ad52f1> in <module>
----> 1 classifier.fit(X, y, batch_size=10, epochs=100)

/opt/conda/lib/python3.7/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1152             sample_weight=sample_weight,
   1153             class_weight=class_weight,
-> 1154             batch_size=batch_size)
   1155
   1156         # Prepare validation data.

/opt/conda/lib/python3.7/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
    577             feed_input_shapes,
    578             check_batch_axis=False,  # Don't enforce the batch size.
--> 579             exception_prefix='input')
    580
    581         if y is not None:

/opt/conda/lib/python3.7/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
    143                             ': expected ' + names[i] + ' to have shape ' +
    144                             str(shape) + ' but got array with shape ' +
--> 145                             str(data_shape))
    146     return data
    147

ValueError: Error when checking input: expected dense_11_input to have shape (6,) but got array with shape (7,)
I tried converting the y variable to an array, but I still get the same error.
Answer:
It looks like you passed the wrong input shape to the first Dense layer of your network. You have 7 columns, so it should be (7,). As a rule of thumb, for 1-D (tabular) data you can use X.shape[1]. So:
classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu', input_dim=X.shape[1]))
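If you want to double-check the number of input columns before building the model, here is a minimal sketch, assuming X and X_test have already been built as in your code:

print(X.shape)       # e.g. (891, 7) -- 7 feature columns after pd.get_dummies
print(X_test.shape)  # the second dimension should match X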
The reason you end up with 7 columns instead of 6 is that you one-hot encoded the Sex variable with pd.get_dummies. What was originally ['male', 'female', 'female', ...] has become two columns: one for male and one for female. That is what pd.get_dummies does (see the output below):
     Pclass  SibSp  Parch   Age     Fare  Sex_female  Sex_male
0         3      1      0  22.0   7.2500           0         1
1         1      1      0  38.0  71.2833           1         0
2         3      0      0  26.0   7.9250           1         0
3         1      1      0  35.0  53.1000           1         0
4         3      0      0  35.0   8.0500           0         1
..      ...    ...    ...   ...      ...         ...       ...
886       2      0      0  27.0  13.0000           0         1
887       1      0      0  19.0  30.0000           1         0
888       3      1      2   NaN  23.4500           1         0
889       1      0      0  26.0  30.0000           0         1
890       3      0      0  32.0   7.7500           0         1
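If it helps to see the encoding in isolation, here is a minimal standalone sketch with made-up values (not your actual data):

import pandas as pd

# A tiny made-up frame just to show the encoding
df = pd.DataFrame({"Sex": ["male", "female", "female"]})
print(pd.get_dummies(df))
# The single 'Sex' column becomes two indicator columns, 'Sex_female' and 'Sex_male'
# (recent pandas versions may print them as True/False instead of 1/0)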
In general it is easier to set input_dim=X.shape[1], because then you don't have to set it by hand or even know how many columns there are. It simply says that input_dim should be the number of columns in X, whatever that happens to be.
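Putting it together, a minimal sketch of the model built on top of your existing preprocessing (same layers and hyperparameters as in your code, with only input_dim changed):

from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
# input_dim is read from the data, so it stays correct even if the feature set changes
classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu',
                     input_dim=X.shape[1]))
classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu'))
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fit on the preprocessed training data (X and y prepared as in your code)
classifier.fit(X, y, batch_size=10, epochs=100)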