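When I try to split my data into training and test sets, I get the error shown below. I believe it occurs because the stratify parameter should only be given categorical data, not numeric data, but here OFFENSE_CODE is effectively a category whose classes just happen to be represented by numbers. So how can I stratify by OFFENSE_CODE?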
from sklearn import model_selection
x = df.loc[:, ['YEAR', 'MONTH', 'DAY_OF_WEEK']]
X_train, x_test, Y_train, y_test = model_selection.train_test_split(x, df['OFFENSE_CODE'], stratify=df['OFFENSE_CODE'], random_state=2, test_size=0.3)
Here is a sample of the dataset:
INCIDENT_NUMBER  OFFENSE_CODE  OFFENSE_CODE_GROUP               \
I192067438       613           Larceny
I192067437       3831          Motor Vehicle Accident Response
I192067435       3115          Investigate Person
I192067434       3301          Verbal Disputes
I192067433       3301          Verbal Disputes

OFFENSE_DESCRIPTION  DISTRICT  REPORTING_AREA  SHOOTING  \
LARCENY SHOPLIFTING  A1        112             NaN
PROPERTY DAMAGE      A1                        NaN
INVESTIGATE PERSON   C11       336             NaN
VERBAL DISPUTE       E18       492             NaN
VERBAL DISPUTE       D14       769             NaN

OCCURRED_ON_DATE     YEAR  MONTH  DAY_OF_WEEK  HOUR  UCR_PART    \
2019-08-25 19:55:02  2019  8      Sunday       19    Part One
2019-08-25 18:20:00  2019  8      Sunday       18    Part Three
2019-08-25 20:45:00  2019  8      Sunday       20    Part Three
2019-08-25 20:32:00  2019  8      Sunday       20    Part Three
2019-08-25 20:30:00  2019  8      Sunday       20    Part Three

STREET         Lat        Long        Location                     CODES
WASHINGTON ST  42.355123  -71.060880  (42.35512339, -71.06087980)  tyer613a
NaN            42.352389  -71.062603  (42.35238871, -71.06260312)  tyer3831a
NORTON ST      42.306265  -71.068646  (42.30626521, -71.06864556)  tyer3115a
DERRY RD       42.265933  -71.113774  (42.26593347, -71.11377415)  tyer3301a
PARSONS ST     NaN        NaN         (0.00000000, 0.00000000)     tyer3301a
I also tried:
y = df.loc[:, 'OFFENSE_CODE'].apply(str)
X_train, x_test, Y_train, y_test = model_selection.train_test_split(x, y, stratify=y, random_state=2, test_size=0.3)
It returned the same error:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Answer:
Convert the column to strings, then do the sampling:
df['OFFENSE_CODE'].apply(str)
Don't forget to assign the result back to the column:
df['OFFENSE_CODE'] = df['OFFENSE_CODE'].apply(str)
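Note that the error message itself is about class sizes rather than dtype: train_test_split accepts integer labels for stratify, but every class needs at least 2 members so that it can appear in both splits. If the string conversion alone does not make the error go away, one possible workaround is to drop (or merge) the classes that occur only once before splitting. The following is a minimal sketch, assuming a made-up toy DataFrame in place of the real dataset and assuming that discarding rare offense codes is acceptable for your use case:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset;
# offense code 613 appears only once, which triggers the ValueError.
df = pd.DataFrame({
    "YEAR": [2019] * 9,
    "MONTH": [8, 8, 8, 8, 7, 7, 7, 7, 6],
    "OFFENSE_CODE": [3301, 3301, 3301, 3301, 3115, 3115, 3115, 3115, 613],
})

# Count the rows per class and keep only classes with at least 2 members.
counts = df["OFFENSE_CODE"].value_counts()
keep = counts[counts >= 2].index
filtered = df[df["OFFENSE_CODE"].isin(keep)]

x = filtered[["YEAR", "MONTH"]]
y = filtered["OFFENSE_CODE"]

# Stratifying on the integer labels works once no class has a single member.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, stratify=y, random_state=2, test_size=0.3
)
```

Running df['OFFENSE_CODE'].value_counts() on the real data first shows which codes are affected, so you can decide whether to drop them or group them into an "other" class instead.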