I scraped some data from Spotify to try to classify the musical genres of different songs. I split the data into a test set and a remainder set, then further split the remainder into a training set and a validation set.
When I run the model (I'm trying to classify 112 genres), I get 30% accuracy on the validation set. That's not great, of course, but it's to be expected with 112 genres and limited data. What really puzzles me is that when I apply the model to the test data, accuracy drops to 1%.
I'm not sure why this happens: as far as I can tell, the validation and test data should be comparable. I train the model on the entirely separate training data.
I must be making a mistake somewhere, either letting the model peek at the validation data (where it performs better) or messing up my test data.
Or maybe applying the model twice messes things up?
Do you have any ideas about what might be going on, or how to debug this?
Thanks so much! Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Re-read the data and keep only the relevant features.
track_df = pd.read_csv('track_df_corr.csv')
features = ['acousticness', 'speechiness', 'key', 'liveness',
            'instrumentalness', 'energy', 'tempo', 'loudness',
            'danceability', 'valence', 'duration_mins', 'year', 'genre']
track_df = track_df[features]

# First make a big split of all the data into train and test.
train, test = train_test_split(track_df, test_size=0.2, random_state=0)

# Then create training and validation sets from the train data.
# "full" is the data before preprocessing.
X_full = train
X_test_full = test

# Target for the training data; the classifier needs numbers,
# so keep only the integer codes from pd.factorize by using [0].
y = X_full.genre
y = pd.factorize(y)[0]

# Since we later want to evaluate on the test data, we also need a y_test.
y_test = X_test_full.genre
y_test = pd.factorize(y_test)[0]

# Remove the to-be-predicted variable, which is now stored in y / y_test.
X_full.drop(['genre'], axis=1, inplace=True)
X_test_full.drop(['genre'], axis=1, inplace=True)  # not sure if necessary, but cannot hurt

# Break off a validation set from the training data (X_full).
# Remember we still have X_test_full as an entirely independent test set.
X_train_full, X_valid_full, y_train, y_valid = \
    train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)

# General preprocessing: identify categorical columns (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10
                    and X_train_full[cname].dtype == "object"]

# Select numerical columns.
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only.
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Preprocessing for NUMERICAL data.
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for CATEGORICAL data: a pipeline chaining an imputer and a one-hot encoder.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data.
# ColumnTransformer applies each (name, transformer, columns) tuple
# to the corresponding subset of the DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Define the model.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline ("clf" stands for classifier).
# Calling fit on the pipeline fits each step in turn (here: preprocessor and model).
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

# Fit the model on the TRAINING data.
clf.fit(X_train, y_train)

# --------------------------------------------------------
# Evaluate the model on the VALIDATION data.
preds = clf.predict(X_valid)
clf.score(X_valid, y_valid)  # mean accuracy on the given data and labels
# This yields a value around 30%.

# --------------------------------------------------------
# Apply the model to the TESTING data.
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
# This yields a value around 1%.
Answer:
The problem I see is that you encode the training and test labels with pd.factorize. Since you call pd.factorize on y and y_test independently, the resulting encodings will not correspond to each other. You should use a LabelEncoder instead: once you fit the encoder on the training labels, you can transform y_test with the same encoding scheme.
Here is an example illustrating the issue:
from sklearn.preprocessing import LabelEncoder

l = [1, 4, 6, 1, 4]
le = LabelEncoder()
le.fit(l)

le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1, 6, 4])
# array([0, 2, 1], dtype=int64)
Here we get consistent encodings. However, if we apply pd.factorize to each list separately, Pandas obviously cannot guess which encodings should match:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1, 6, 4])[0]
# array([0, 1, 2], dtype=int64)
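Applied to the code in the question, the fix could look like the following minimal sketch. It reuses the variable names from your script (X_full, X_test_full) and simply replaces the two independent pd.factorize calls with one shared LabelEncoder:

from sklearn.preprocessing import LabelEncoder

# Fit ONE encoder on the training labels...
le = LabelEncoder()
y = le.fit_transform(X_full.genre)

# ...and reuse the SAME fitted encoder on the test labels,
# so identical genres map to identical integer codes.
y_test = le.transform(X_test_full.genre)

One caveat: le.transform raises an error if the test set contains a genre that never occurs in the training data. If that can happen with your split, a common workaround is to fit the encoder on track_df.genre before splitting; this only fixes the label-to-integer mapping and leaks no feature information into training.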