I scraped some data from Spotify to try to classify the musical genres of different songs. I split the data into a test set and a remainder set, then further split the remainder into a training set and a validation set.
When I run the model (I'm trying to classify 112 genres), I get 30% accuracy on the validation set. That's not great, of course, but it's to be expected with 112 genres and limited data. What really puzzles me is that when I apply the model to the test data, accuracy drops to 1%.
I'm not sure why this happens: as far as I can tell, the validation and test data should be comparable. I train the model on the entirely separate training data.
I must be making a mistake somewhere, either letting the model peek at the validation data (where it performs better) or messing up my test data.
Or maybe applying the model twice messes things up?
Do you have any ideas about what might be going on, or how to debug this?
Thanks so much! Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Re-read the data and keep only the relevant features.
track_df = pd.read_csv('track_df_corr.csv')
features = ['acousticness', 'speechiness', 'key', 'liveness',
            'instrumentalness', 'energy', 'tempo', 'loudness',
            'danceability', 'valence', 'duration_mins', 'year', 'genre']
track_df = track_df[features]

# First make a big split of all the data into train and test.
train, test = train_test_split(track_df, test_size=0.2, random_state=0)

# Then create training and validation sets from the train data.
# "full" is the data before preprocessing.
X_full = train
X_test_full = test

# Target for the training data; the classifier needs numbers,
# so keep only the integer codes from pd.factorize by using [0].
y = X_full.genre
y = pd.factorize(y)[0]

# Since we later want to evaluate on the test data, we also need a y_test.
y_test = X_test_full.genre
y_test = pd.factorize(y_test)[0]

# Remove the to-be-predicted variable, which is now stored in y / y_test.
X_full.drop(['genre'], axis=1, inplace=True)
X_test_full.drop(['genre'], axis=1, inplace=True)  # not sure if necessary, but cannot hurt

# Break off a validation set from the training data (X_full).
# Remember we still have X_test_full as an entirely independent test set.
X_train_full, X_valid_full, y_train, y_valid = \
    train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)

# General preprocessing: identify categorical columns (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10
                    and X_train_full[cname].dtype == "object"]

# Select numerical columns.
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only.
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Preprocessing for NUMERICAL data.
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for CATEGORICAL data: a pipeline chaining an imputer and a one-hot encoder.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data.
# ColumnTransformer applies each (name, transformer, columns) tuple
# to the corresponding subset of the DataFrame.
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Define the model.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline ("clf" stands for classifier).
# Calling fit on the pipeline fits each step in turn (here: preprocessor and model).
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

# Fit the model on the TRAINING data.
clf.fit(X_train, y_train)

# --------------------------------------------------------
# Evaluate the model on the VALIDATION data.
preds = clf.predict(X_valid)
clf.score(X_valid, y_valid)  # mean accuracy on the given data and labels
# This yields a value around 30%.

# --------------------------------------------------------
# Apply the model to the TESTING data.
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
# This yields a value around 1%.
Answer:
The problem I see is that you encode the training and test labels with pd.factorize. Since you call pd.factorize on y and y_test independently, the resulting encodings will not correspond to each other. You should use a LabelEncoder instead: once you fit the encoder on the training labels, you can transform y_test with the same encoding scheme.
Here is an example illustrating the issue:
from sklearn.preprocessing import LabelEncoder

l = [1, 4, 6, 1, 4]
le = LabelEncoder()
le.fit(l)

le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1, 6, 4])
# array([0, 2, 1], dtype=int64)
Here we get consistent encodings. However, if we apply pd.factorize to each list separately, Pandas obviously cannot guess which encodings should match:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1, 6, 4])[0]
# array([0, 1, 2], dtype=int64)
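Applied to the code in the question, the fix could look like the following minimal sketch. It reuses the variable names from your script (X_full, X_test_full) and simply replaces the two independent pd.factorize calls with one shared LabelEncoder:

from sklearn.preprocessing import LabelEncoder

# Fit ONE encoder on the training labels...
le = LabelEncoder()
y = le.fit_transform(X_full.genre)

# ...and reuse the SAME fitted encoder on the test labels,
# so identical genres map to identical integer codes.
y_test = le.transform(X_test_full.genre)

One caveat: le.transform raises an error if the test set contains a genre that never occurs in the training data. If that can happen with your split, a common workaround is to fit the encoder on track_df.genre before splitting; this only fixes the label-to-integer mapping and leaks no feature information into training.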