I am trying to balance my data: my target variable is multi-class, and I want to balance it by oversampling.

Let x contain the following variables (output of print(x)):

           Restaurant  Cuisines  Average_Cost    Rating     Votes   Reviews      Area
    0        3.526361  0.693147      5.303305  1.504077  2.564949  1.609438  7.214504
    1        1.386294  4.127134      4.615121  1.504077  2.484907  1.609438  5.905362
    2        2.772589  1.386294      5.017280  1.526056  4.605170  3.433987  6.131226
    3        3.912023  2.833213      5.525453  1.547563  5.176150  4.564348  7.643483
    4        3.526361  2.708050      5.303305  1.435085  5.948035  5.046646  6.126869
    ...           ...       ...           ...       ...       ...       ...       ...
    11089    3.912023  0.693147      5.525453  1.648659  5.789960  5.046646  3.135494
    11090    1.386294  6.028279      4.615121  1.526056  3.610918  2.833213  7.643483
    11091    1.386294  2.397895      4.615121  1.504077  3.828641  2.944439  5.814131
    11092    1.386294  6.028279      4.615121  1.410987  3.218876  2.302585  5.905362
    11093    1.386294  6.028279      4.615121  1.029619  0.000000  0.000000  5.564520

    11094 rows × 7 columns

Let y be the multi-class target variable (output of print(y.value_counts())):

    30 minutes     7406
    45 minutes     2665
    65 minutes      923
    120 minutes      62
    20 minutes       20
    80 minutes       14
    10 minutes        4
    Name: Delivery_Time, dtype: int64

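After exploring y, we can see that the 30 minutes class has far more samples than any other class.

To balance the classes, I tried oversampling the data with SMOTETomek, but I got an error: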

    from imblearn.combine import SMOTETomek
    smk = SMOTETomek(ratio = 1)
    x_res, y_res = smk.fit_sample(x,y)

Error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-54-426e8b86623d> in <module>()
          1 from imblearn.combine import SMOTETomek
          2 smk = SMOTETomek(ratio = 1)
    ----> 3 x_res, y_res = smk.fit_sample(x,y)

    2 frames
    /usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
        311     if type_y != 'binary':
        312         raise ValueError(
    --> 313             '"sampling_strategy" can be a float only when the type '
        314             'of target is binary. For multi-class, use a dict.')
        315     target_stats = _count_class_sample(y)

    ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.

Answer:

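I think you should keep the class proportions of the target variable as they are. SMOTE may give inflated, better-looking results on the test dataset, but the model can still fail on new input from users (real-time data).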
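One way to act on this concern is to split the data first and oversample only the training part, so the test part keeps the original class proportions. The following is a minimal sketch, not a definitive recipe. It assumes scikit-learn and imbalanced-learn 0.4 or later are installed; the function names class_proportions and split_then_oversample, the 0.2 test size, and the "not majority" strategy are illustrative choices that do not come from the question.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the labels, e.g. {"a": 0.75, "b": 0.25}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def split_then_oversample(x, y, test_size=0.2, seed=0):
    """Hypothetical helper: stratified split, then SMOTE on the training split only."""
    # Imported lazily so class_proportions works without these packages.
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # stratify=y keeps the class proportions of y in both splits
    # (every class needs at least 2 samples for this to work).
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, stratify=y, random_state=seed
    )

    # SMOTE needs more samples in a class than k_neighbors, so shrink
    # k_neighbors when the rarest training class is very small.
    smallest = min(Counter(y_train).values())
    k = max(1, min(5, smallest - 1))

    # Oversample every class except the largest; the test split stays untouched.
    smote = SMOTE(sampling_strategy="not majority", k_neighbors=k, random_state=seed)
    x_train_res, y_train_res = smote.fit_resample(x_train, y_train)
    return x_train_res, y_train_res, x_test, y_test
```

Comparing class_proportions(y) with class_proportions(y_test) is a quick check that the evaluation data still looks like the real distribution.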

Whether to apply SMOTE is up to you. If you do, you can use the following code:

    from imblearn.over_sampling import SMOTE

    # "minority" resamples only the smallest class
    smote = SMOTE(sampling_strategy="minority")
    X, Y = smote.fit_resample(x_train_data, y_train_data)
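The error message in the question asks for a dict when the target is multi-class. A minimal sketch of that route with SMOTETomek follows, assuming imbalanced-learn 0.4 or later, where the parameter is named sampling_strategy and the method is fit_resample. The helper names oversampling_targets and resample_multiclass are hypothetical, and raising every class to the majority count is only one possible choice of targets. Because the rarest class here has 4 samples, the sketch also lowers k_neighbors below SMOTE's default of 5, which would otherwise need more samples than that class has.

```python
from collections import Counter


def oversampling_targets(labels, target_count=None):
    """Map each class to the number of samples it should have after oversampling.

    By default every class is raised to the size of the largest class.
    A class is never asked to shrink, since over-samplers reject that.
    """
    counts = Counter(labels)
    if target_count is None:
        target_count = max(counts.values())
    return {cls: max(n, target_count) for cls, n in counts.items()}


def resample_multiclass(x, y, seed=0):
    """Hypothetical helper: SMOTETomek with a dict strategy for multi-class y."""
    # Imported lazily so oversampling_targets works without imbalanced-learn.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE

    strategy = oversampling_targets(y)

    # SMOTE needs more samples in a class than k_neighbors; this still
    # fails if some class has a single sample.
    smallest = min(Counter(y).values())
    k = max(1, min(5, smallest - 1))

    # When a SMOTE object is passed in, it carries the sampling strategy itself.
    smote = SMOTE(sampling_strategy=strategy, k_neighbors=k, random_state=seed)
    smk = SMOTETomek(sampling_strategy=strategy, smote=smote, random_state=seed)
    return smk.fit_resample(x, y)
```

With the data from the question, the call would be x_res, y_res = resample_multiclass(x, y). The Tomek-link cleaning step removes some samples afterwards, so the final class counts can end up slightly below the targets.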
