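Previously I used only one column (string data) as my training set. Now I want to include another, corresponding column (the Amount column, of type float) in the training set, to be considered together with the Details column. In the Amount column, negative values are debits and positive values are credits. How should I go about this? I tried merging the two columns, but that forced me to convert the float amounts to strings, which makes no sense for my dataset. I want to include the Amount column to see whether the model can learn from these variations, which matters a lot in this case. Thanks in advance.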
    Details                     | Amount | Category
    -------------------------------------------------------------
    Tanishq Jwellery Bangalore  | -990   | jwellery
    ODESK***BAL-28APR13         | 240    | Others
    AEGON RELIGARE LIFE IN      | 456    | Others
    INTERNET PAYMENT #999999    | -250   | Transfer in for Card Payment
    WWW.VISTAPRINT.IN           | 245    | Print
    Khazana Jwellery            | -9000  | jwellery
    INTERNET PAYMENT #999999    | 785    | Transfer in for Card Payment
    Indian Oil                  | 344    | Fuel
    Touch foot wear             | -782   | Clothing
Part of my script:
    import time

    import pandas as pd
    import numpy as np
    import scipy as sp
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # TRAIN DATA
    data = pd.read_csv('ds1.csv', delimiter=',', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')
    data = data[data.Category != "Others"]
    target_one = data['Category']
    target_list = data['Category'].unique()

    # TEST DATASET
    test_data = pd.read_csv('ds2.csv', delimiter='\t', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')

    x_train, y_train = (data.Details, data.Category)
    x_test, y_test = (test_data.Details, test_data.Category)

    # Bag-of-words features (unigrams and bigrams) from the Details text only
    vect = CountVectorizer(ngram_range=(1, 2))
    X_train = vect.fit_transform(x_train)
    X_test = vect.transform(x_test)

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    mnb = MultinomialNB(alpha=0.13)
    mnb.fit(X_train, y_train)
    result = mnb.predict(X_test)
    print(time.perf_counter() - start)
    accuracy_score(y_test, result)
Answer:
If you simply want to stack the Amount column onto the text feature matrix produced by CountVectorizer, do this before fitting MultinomialNB:
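    import numpy as np

    # .values is used here because .as_matrix() was removed in pandas 1.0
    X_amount = data["Amount"].values.reshape(-1, 1)
    X_train = X_train.toarray()               # densify the sparse text features
    X_train = np.hstack((X_train, X_amount))  # append Amount as the last column

    X_test_amount = test_data["Amount"].values.reshape(-1, 1)
    X_test = X_test.toarray()
    X_test = np.hstack((X_test, X_test_amount))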
Or, if you want to keep X_train as a sparse matrix:
    from scipy import sparse  # import the submodule explicitly

    # .values is used here because .as_matrix() was removed in pandas 1.0
    X_amount = data["Amount"].values.reshape(-1, 1)
    X_train = sparse.hstack((X_train, X_amount))

    X_test_amount = test_data["Amount"].values.reshape(-1, 1)
    X_test = sparse.hstack((X_test, X_test_amount))
However, I think you will end up with a ValueError: Input X must be non-negative error, because MultinomialNB is intended for non-negative feature values…
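One possible way around that non-negativity error, offered only as a sketch and not something the answer itself proposes, is to split the signed amount into two non-negative columns (debit size and credit size) before stacking. The helper name `split_signed_amount`, the `log1p` scaling, and the handful of inline rows (copied from the sample table in the question) are my own assumptions for illustration; whether this actually helps accuracy would need to be checked on the real data.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def split_signed_amount(amounts):
    """Hypothetical helper: return an (n, 2) non-negative array
    with columns [debit magnitude, credit magnitude]."""
    amounts = np.asarray(amounts, dtype=float)
    debit = np.clip(-amounts, 0, None)   # size of negative (debit) amounts, else 0
    credit = np.clip(amounts, 0, None)   # size of positive (credit) amounts, else 0
    return np.column_stack((debit, credit))


# A few rows taken from the sample table in the question.
details = ["Tanishq Jwellery Bangalore", "INTERNET PAYMENT #999999",
           "Khazana Jwellery", "Indian Oil", "Touch foot wear"]
amounts = [-990, -250, -9000, 344, -782]
labels = ["jwellery", "Transfer in for Card Payment", "jwellery", "Fuel", "Clothing"]

vect = CountVectorizer(ngram_range=(1, 2))
X_text = vect.fit_transform(details)

# Assumption: log1p keeps large amounts from swamping the word counts.
X_amount = np.log1p(split_signed_amount(amounts))

# Keep everything sparse; every entry is now >= 0, as MultinomialNB requires.
X_train = sparse.hstack((X_text, sparse.csr_matrix(X_amount)), format="csr")

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, labels)
```

The same `split_signed_amount` and `log1p` steps would have to be applied to the test set's Amount column before calling `mnb.predict`. Another option is to switch to a classifier that accepts negative features, but that goes beyond what was asked here.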