Using columns of different types as the training dataset

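Previously I used only one column (string data) as my training set. Now I want to include a second, corresponding column (the Amount column, of float type) alongside the Details column. In the Amount column, negative values denote debits and positive values denote credits. How should I go about this? I tried merging the two columns, but that forced me to convert the float amounts to strings, which makes no sense for my dataset. I want to include the Amount column to see whether the model can learn from these variations, which matters a lot in this case. Thanks in advance.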

Details                    | Amount | Category
---------------------------|--------|-----------------------------
Tanishq Jwellery Bangalore | -990   | jwellery
ODESK***BAL-28APR13        | 240    | Others
AEGON RELIGARE LIFE IN     | 456    | Others
INTERNET PAYMENT #999999   | -250   | Transfer in for Card Payment
WWW.VISTAPRINT.IN          | 245    | Print
Khazana Jwellery           | -9000  | jwellery
INTERNET PAYMENT #999999   | 785    | Transfer in for Card Payment
Indian Oil                 | 344    | Fuel
Touch foot wear            | -782   | Clothing

Part of my script:

    import time

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import scipy as sp
    from sklearn import preprocessing
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # TRAIN DATA
    data = pd.read_csv('ds1.csv', delimiter=',', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')
    data = data[data.Category != "Others"]
    target_one = data['Category']
    target_list = data['Category'].unique()

    # TEST DATASET
    test_data = pd.read_csv('ds2.csv', delimiter='\t', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')

    x_train, y_train = (data.Details, data.Category)
    x_test, y_test = (test_data.Details, test_data.Category)

    # Bag-of-words features (unigrams and bigrams) from the Details text
    vect = CountVectorizer(ngram_range=(1, 2))
    X_train = vect.fit_transform(x_train)
    X_test = vect.transform(x_test)

    # Time the fit and predict steps
    start = time.perf_counter()
    mnb = MultinomialNB(alpha=0.13)
    mnb.fit(X_train, y_train)
    result = mnb.predict(X_test)
    print(time.perf_counter() - start)
    print(accuracy_score(y_test, result))

Answer:

If you just want to stack the Amount column onto the text feature matrix produced by CountVectorizer, do the following before fitting MultinomialNB:

    import numpy as np

    # Amount as a column vector, one row per transaction
    X_amount = data["Amount"].to_numpy().reshape(-1, 1)
    X_train = X_train.toarray()
    X_train = np.hstack((X_train, X_amount))

    X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
    X_test = X_test.toarray()
    X_test = np.hstack((X_test, X_test_amount))

Or, if you want to keep X_train as a sparse matrix:

    from scipy import sparse

    # Stack the amount column onto the sparse text features; keep CSR format
    X_amount = data["Amount"].to_numpy().reshape(-1, 1)
    X_train = sparse.hstack((X_train, X_amount), format="csr")

    X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
    X_test = sparse.hstack((X_test, X_test_amount), format="csr")
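To sanity-check the sparse variant, here is a small self-contained sketch on three rows taken from the sample table. The toy lists `details` and `amounts` are assumptions made for illustration, and the amount column is wrapped in `csr_matrix` so that every stacked block is sparse. It only shows that stacking adds exactly one column and keeps the rows aligned.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (illustrative only), taken from the sample table
details = ["Indian Oil", "Khazana Jwellery", "Touch foot wear"]
amounts = np.array([344.0, -9000.0, -782.0])

vect = CountVectorizer(ngram_range=(1, 2))
X_text = vect.fit_transform(details)                      # sparse, shape (3, n_terms)
X_amount = sparse.csr_matrix(amounts.reshape(-1, 1))      # sparse column, shape (3, 1)

# Horizontal stack: text features first, amount as the last column
X = sparse.hstack((X_text, X_amount), format="csr")

n_terms = len(vect.vocabulary_)
last_col = X[:, -1].toarray().ravel()                     # the amounts, row by row
print(X_text.shape, X.shape)
print(last_col)
```

The last column still holds the signed amounts, which matters for the next point.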

However, I expect you will then run into ValueError: Input X must be non-negative, because MultinomialNB is designed for non-negative feature values…
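One possible way around the non-negativity constraint, offered as a rough sketch and not as the answer's own method, is to split the signed amount into two non-negative columns: one carrying the magnitude of debits and one carrying the magnitude of credits. The helper `amount_features`, the `log1p` scaling, and the toy DataFrame are all assumptions made for illustration. MultinomialNB models count-like features, so a continuous amount is only a loose fit; a different classifier may suit mixed text and numeric features better.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy rows (illustrative only), taken from the sample table in the question
data = pd.DataFrame({
    "Details": ["Tanishq Jwellery Bangalore", "Khazana Jwellery",
                "Indian Oil", "Touch foot wear"],
    "Amount": [-990.0, -9000.0, 344.0, -782.0],
    "Category": ["jwellery", "jwellery", "Fuel", "Clothing"],
})


def amount_features(amount):
    """Hypothetical helper: map signed amounts to two non-negative columns."""
    a = np.asarray(amount, dtype=float)
    debit = np.log1p(np.clip(-a, 0, None))   # magnitude of negative amounts, else 0
    credit = np.log1p(np.clip(a, 0, None))   # magnitude of positive amounts, else 0
    return sparse.csr_matrix(np.column_stack((debit, credit)))


vect = CountVectorizer(ngram_range=(1, 2))
X_text = vect.fit_transform(data["Details"])

# Text counts plus the two amount columns; every entry is >= 0
X_train = sparse.hstack((X_text, amount_features(data["Amount"])), format="csr")

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, data["Category"])
pred = mnb.predict(X_train)
print(pred)
```

The same `amount_features` call would have to be applied to the test set's Amount column before calling `predict`, so that train and test matrices have the same columns.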
