Let x contain the following variables:

print(x)
       Restaurant  Cuisines  Average_Cost    Rating     Votes   Reviews      Area
0        3.526361  0.693147      5.303305  1.504077  2.564949  1.609438  7.214504
1        1.386294  4.127134      4.615121  1.504077  2.484907  1.609438  5.905362
2        2.772589  1.386294      5.017280  1.526056  4.605170  3.433987  6.131226
3        3.912023  2.833213      5.525453  1.547563  5.176150  4.564348  7.643483
4        3.526361  2.708050      5.303305  1.435085  5.948035  5.046646  6.126869
...           ...       ...           ...       ...       ...       ...       ...
11089    3.912023  0.693147      5.525453  1.648659  5.789960  5.046646  3.135494
11090    1.386294  6.028279      4.615121  1.526056  3.610918  2.833213  7.643483
11091    1.386294  2.397895      4.615121  1.504077  3.828641  2.944439  5.814131
11092    1.386294  6.028279      4.615121  1.410987  3.218876  2.302585  5.905362
11093    1.386294  6.028279      4.615121  1.029619  0.000000  0.000000  5.564520

11094 rows × 7 columns
Let y be the multi-class target variable:

print(y.value_counts())
30 minutes     7406
45 minutes     2665
65 minutes      923
120 minutes      62
20 minutes       20
80 minutes       14
10 minutes        4
Name: Delivery_Time, dtype: int64
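After exploring y, we can see that the 30 minutes class has far more samples than any other class.
To balance the classes, I tried oversampling the data with SMOTETomek, but I got an error: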
from imblearn.combine import SMOTETomek
smk = SMOTETomek(ratio = 1)
x_res, y_res = smk.fit_sample(x,y)
Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-426e8b86623d> in <module>()
      1 from imblearn.combine import SMOTETomek
      2 smk = SMOTETomek(ratio = 1)
----> 3 x_res, y_res = smk.fit_sample(x,y)

2 frames
/usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
    311         if type_y != 'binary':
    312             raise ValueError(
--> 313                 '"sampling_strategy" can be a float only when the type '
    314                 'of target is binary. For multi-class, use a dict.')
    315     target_stats = _count_class_sample(y)

ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.
Answer:
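I think you should keep the class proportions of the target variable as they are. SMOTE may give a boost and better results on the test dataset, but the model may then fail on new user input (real-time data).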
Whether to apply SMOTE is up to you. If you do, you can use the following code:
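from imblearn.over_sampling import SMOTE

# Oversample only the minority class, and only on the training split.
smote = SMOTE(sampling_strategy="minority")
# fit_resample replaces the older fit_sample name in current imbalanced-learn releases.
X, Y = smote.fit_resample(x_train_data, y_train_data)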
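As for the error itself: it says a float sampling_strategy only works for a binary target, and that a multi-class target needs a dict. Below is a minimal sketch of building such a dict from the class counts and passing it to SMOTETomek. The helper names build_sampling_strategy and resample_multiclass, and the target_count parameter, are my own and not part of imbalanced-learn. I am also assuming that a SMOTE instance passed through the smote= argument is used as configured, and that every class has at least 2 samples. k_neighbors is reduced because the smallest class here (10 minutes, 4 rows) has fewer samples than SMOTE's default of 5 neighbours allows.

```python
from collections import Counter


def build_sampling_strategy(y, target_count=None):
    """Map each under-represented class label to the sample count it should reach."""
    counts = Counter(y)
    # Default target: bring every class up to the size of the largest one.
    target = max(counts.values()) if target_count is None else target_count
    # Only classes below the target are listed, since over-samplers reject
    # a target smaller than a class's current size.
    return {label: target for label, n in counts.items() if n < target}


def resample_multiclass(x, y, target_count=None, random_state=0):
    """Hypothetical helper: SMOTE followed by Tomek-link cleaning, driven by a dict."""
    # Imported lazily so the dict helper above works without imbalanced-learn.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE

    strategy = build_sampling_strategy(y, target_count)
    # SMOTE needs more samples in a class than k_neighbors (default 5),
    # so shrink k when the smallest class is tiny.
    k = min(5, min(Counter(y).values()) - 1)
    smote = SMOTE(sampling_strategy=strategy, k_neighbors=k,
                  random_state=random_state)
    smk = SMOTETomek(smote=smote, random_state=random_state)
    return smk.fit_resample(x, y)
```

Calling x_res, y_res = resample_multiclass(x_train_data, y_train_data) would then replace the failing SMOTETomek(ratio = 1) call. Growing a class from 4 rows up to the majority size produces mostly synthetic data, so a smaller target_count may be a safer choice, and resampling should only ever touch the training split.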