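cross_val_score in scikit-learn returns a list of nan scores

I am trying to handle class imbalance in a multilabel dataset using cross-validation, but when I run the classifier, scikit-learn's cross_val_score returns a list of nan values. Here is the code: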

import pandas as pd
import numpy as np

# Save the sample data shown below in a variable named `dict` to run this line.
data = pd.DataFrame.from_dict(dict, orient='index')

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

# Turn the lists of tags into a binary indicator matrix (one column per tag).
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(data['Tags'])

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

tfidf = TfidfVectorizer(stop_words=stop_words, max_features=40000, ngram_range=(1, 3))
X = tfidf.fit_transform(data['cleaned_title'])

from skmultilearn.model_selection import IterativeStratification
k_fold = IterativeStratification(n_splits=10, order=1)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score

class_weight = {0: 1, 1: 10}
lr = LogisticRegression(class_weight=class_weight, n_jobs=-1)
scores = cross_val_score(lr, X, y, cv=k_fold, scoring='f1_micro')
scores

These are the first 10 rows of the data, obtained with data.head(10).to_dict():

{0: {'Tags': ['python', 'list', 'loops', 'for-loop', 'indexing'],
     'cleaned_title': 'for loop we use any local variable   what if we use any number present in a list  ',
     'cleaned_text_of_ques': 'in the for   loop we use any local variable   what if we use any number in a list   what will be the output   a   [ 1 2 3 4 5 6 ] b   [ ] for a[ 1 ] in a   b append a[ 1 ]   print b  '},
 1: {'Tags': ['python', 'loops', 'tkinter', 'algorithm-animation'],
     'cleaned_title': 'contain a mainloop [ duplicate ]',
     'cleaned_text_of_ques': 'my code be a bubble sort that i be try to visualise   but i be struggle to find a way to make a block of code only be use once   i also think that if i could only mainloop a section that would'},
 2: {'Tags': ['android', 'android-lifecycle', 'activity-lifecycle', 'onsaveinstancestate'],
     'cleaned_title': 'when onrestoreinstancestate be not call  ',
     'cleaned_text_of_ques': 'docs describe when onrestoreinstancestate be call   this method be call after onstart     when the activity be be re   initialize from a previously save state   give here in savedinstancestate  '},
 3: {'Tags': ['python', 'r', 'bash', 'conda', 'spyder'],
     'cleaned_title': 'point conda r to already   instal version of r',
     'cleaned_text_of_ques': 'my problem have to do with the fact that rstudio and conda be point to different version of r  my r and rstudio be instal independent of anaconda   and everything be work great   in my    '},
 4: {'Tags': ['android', 'firebase', 'firebase-realtime-database', 'android-recyclerview'],
     'cleaned_title': 'how to use a recycleview with several different layout   accord to the datum collect in firebase   [ close ]',
     'cleaned_text_of_ques': 'i have a problem   there be day that i do research and test code   but nothing work   my application will have a window where i will post datum take in firebase   use a recycleview   with the'},
 5: {'Tags': ['html', 'css', 'layout'],
     'cleaned_title': 'how to create side by side layout of an image and label  ',
     'cleaned_text_of_ques': 'i have be try for a while now and can not seem to achive the bellow design    exploreitem   background   color     353258       rgba 31   31   31   1       border   1px solid   4152f1   color  '},
 6: {'Tags': ['php', 'jquery', 'file'],
     'cleaned_title': 'php jquery ajax   _ files[ file   ] undefined index error',
     'cleaned_text_of_ques': 'i have a form that upload image file and it be not work   i have try submit and click event   the error appear when i have remove the if statement   thank in advance for your help  '},
 7: {'Tags': ['python', 'pandas', 'dataframe'],
     'cleaned_title': 'how to update value in pandas dataframe in a for loop  ',
     'cleaned_text_of_ques': 'i be try to make a data frame that can store variable coeff value after each iteration   i be able to plot the graph after each iteration   but when i try to insert the value in the data frame'},
 8: {'Tags': ['xpath', 'web-scraping', 'scrapy'],
     'cleaned_title': 'scrapy   how can i handle a random number of element  ',
     'cleaned_text_of_ques': 'i have a scrapy crawler that i can comfortably acquire the first desire paragraph   but sometimes there be a second or third paragraph   response xpath f string   h2[contains text           card   ] '},
 9: {'Tags': ['bootstrap-4', 'tabs', 'collapse'],
     'cleaned_title': 'collapse three column with bootstrap',
     'cleaned_text_of_ques': 'i be try to make three tab with cross   reference with one tab visible at the time   i be use the bootstrap v4 collapse scheme with functionality support by jquery   here be the example   https  '}}

This is the cross_val_score result I get in the scores variable: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]). Each value should be in the range 0-1. However, this happens with every model I try.


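Answer:

You have a multilabel dataset, which means that after the transformation your y variable has more than one column, and logistic regression cannot be fitted to that directly:

lr.fit(X,y)
ValueError: y should be a 1d array, got an array of shape (10, 32) instead.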
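The same failure can be surfaced from inside cross_val_score rather than by calling fit by hand. The sketch below is an illustration only: it uses a made-up tag list and random features (tags_demo, X_demo and y_demo are hypothetical names, not the asker's data) and passes error_score="raise" so that the exception from fitting propagates instead of being turned into a score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: 20 samples, each with a list of tags.
tags_demo = [["python", "list"], ["android"], ["python", "pandas"], ["html", "css"]] * 5
y_demo = MultiLabelBinarizer().fit_transform(tags_demo)  # one column per distinct tag
X_demo = np.random.RandomState(0).rand(len(tags_demo), 4)  # random stand-in features

print(y_demo.shape)  # 2-D indicator matrix, not a 1-D label vector

try:
    # error_score="raise" re-raises the error from estimator.fit
    # instead of recording a placeholder score for the fold.
    cross_val_score(
        LogisticRegression(),
        X_demo,
        y_demo,
        cv=KFold(n_splits=4),
        scoring="f1_micro",
        error_score="raise",
    )
    fit_error = None
except ValueError as exc:
    fit_error = exc

print(type(fit_error).__name__, fit_error)
```

Running a failing cross-validation once with error_score="raise" is a quick way to find out what is actually going wrong in the folds.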

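That is why you get nan: cross_val_score catches the error raised while fitting each fold and records nan as that fold's score (its error_score parameter defaults to np.nan). You need to choose a classifier that supports multilabel targets; see the scikit-learn help page for the options. Also, I am not sure whether IterativeStratification works with multilabel datasets, so if you use KFold instead, it works:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5)
clf = DecisionTreeClassifier()  # decision trees accept a 2-D multilabel y
scores = cross_val_score(clf, X, y, cv=kf, scoring='f1_micro')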
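If logistic regression with class weights is still wanted, one possible alternative (not part of the answer above) is to wrap it in OneVsRestClassifier, which the question imports but never uses. It fits one binary logistic regression per label column, so each sub-problem sees a 1-D target. The sketch below assumes synthetic data from make_multilabel_classification as a stand-in for the TF-IDF features and binarized tags; X_syn and y_syn are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data standing in for the question's X and y.
X_syn, y_syn = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, random_state=0
)

# One binary classifier per label; the class weights from the question
# are applied inside every per-label problem (0 = tag absent, 1 = tag present).
base = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
ovr = OneVsRestClassifier(base)

scores = cross_val_score(ovr, X_syn, y_syn, cv=KFold(n_splits=5), scoring="f1_micro")
print(scores)
```

Whether this scores better than the decision tree on the real data is an open question; the point is only that the wrapper lets a single-output classifier handle a multilabel y.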
