TypeError when encoding data with scikit-learn's LabelEncoder

I ran into a problem while encoding data with scikit-learn's LabelEncoder.

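dataset.csv has two columns: text and label. I tried to read the text from the dataset into one list and the labels into another, then add both lists to a DataFrame, but it does not seem to work.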

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, string

data = open('dataset.csv').read()
labels = []
texts = []
for i, line in enumerate(data.split("\n")):
    content = line.split("\",")
    texts.append(content[0])
    labels.append(content[1:])

trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'], test_size=0.2, random_state=0)

encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['texts'])
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

accuracy = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf)
print(accuracy)

The error message is as follows:

Traceback (most recent call last):
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 105, in _encode
    res = _encode_python(values, uniques, encode)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 59, in _encode_python
    uniques = sorted(set(values))
TypeError: unhashable type: 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Classifier.py", line 21, in <module>
    train_y = encoder.fit_transform(train_y)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 236, in fit_transform
    self.classes_, y = _encode(y, encode=True)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 107, in _encode
    raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number

Answer:

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, string
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

data = open('dataset.csv').read()
labels = []
texts = []
for i, line in enumerate(data.split("\n")):
    content = line.split("\",")
    texts.append(str(content[0]))
    # content[1:] is a list; str() turns it into a hashable string
    # (e.g. "['label']") that LabelEncoder accepts
    labels.append(str(content[1:]))

trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'], test_size=0.2, random_state=0)

# Fit the encoder once on all labels so that the training and validation
# sets share the same label-to-integer mapping
encoder = preprocessing.LabelEncoder()
encoder.fit(trainDF['label'])
train_y = encoder.transform(train_y)
valid_y = encoder.transform(valid_y)

# Bag-of-words counts -> TF-IDF weights -> RBF-kernel SVM
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC(kernel='rbf'))])
text_clf.fit(train_x, train_y)
predicted = text_clf.predict(valid_x)

print(confusion_matrix(valid_y, predicted))
print(classification_report(valid_y, predicted))
print(accuracy_score(valid_y, predicted))
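
For context on why the error appears: `content[1:]` is a list slice, so every label stored in the question's code is itself a list, and the `sorted(set(values))` call shown in the traceback cannot hash lists. The sketch below reproduces that failure with the standard library only and shows one possible alternative, parsing each row with `csv.reader` so that each label is a plain string. The sample rows are made up for illustration, and the `"text",label` layout is an assumption, since the real contents of dataset.csv are not shown.

```python
import csv
import io

# Hypothetical sample standing in for dataset.csv (the real layout is not shown).
sample = '"good movie",pos\n"bad movie",neg\n"great plot",pos\n'

# Parsing as in the question: content[1:] is a slice, so each label is a list.
list_labels = [line.split('",')[1:] for line in sample.splitlines()]

# The traceback shows LabelEncoder calling sorted(set(values)); lists are
# unhashable, so the same call fails here.
try:
    sorted(set(list_labels))
except TypeError as exc:
    error_message = str(exc)

# Alternative: csv.reader handles the quoting and yields one string per field.
rows = [row for row in csv.reader(io.StringIO(sample)) if row]  # skip blank lines
texts = [row[0] for row in rows]
labels = [row[1] for row in rows]

# With scalar string labels, the call that failed above now succeeds.
classes = sorted(set(labels))
```

Labels collected this way are plain strings, which is the input type the "argument must be a string or number" message asks for, so they could be passed to `encoder.fit_transform` without wrapping them in `str()`.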
