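I am trying to build a machine learning model. However, I am having trouble understanding where to apply the encoding. Please see the steps and functions below, which reproduce the process I have been following.
First, I split the dataset into training and test sets: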
    # Import packages (including the resampling utility)
    import pandas as pd
    from sklearn.naive_bayes import MultinomialNB
    import string
    from nltk.corpus import stopwords
    import re
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.tokenize import RegexpTokenizer
    from sklearn.utils import resample
    from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

    # Split the dataset into training and test sets
    # Testing Count Vectorizer
    X = df[['Text']]
    y = df['Label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

    # Back to a single data frame
    training_set = pd.concat([X_train, y_train], axis=1)
Now I apply (under)sampling:
    # Separate the classes
    spam = training_set[training_set.Label == 1]
    not_spam = training_set[training_set.Label == 0]

    # Undersample the majority class
    undersample = resample(not_spam,
                           replace=True,
                           n_samples=len(spam),  # set the number of samples equal to the minority class
                           random_state=40)

    # Back to a new training set
    undersample_train = pd.concat([spam, undersample])
Then I apply the chosen algorithm:
    full_result = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    X, y = BOW(undersample_train)
    full_result = full_result.append(training_naive(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index=True)
where BOW is defined as follows:
    def BOW(data):
        df_temp = data.copy(deep=True)
        df_temp = basic_preprocessing(df_temp)

        count_vectorizer = CountVectorizer(analyzer=fun)
        count_vectorizer.fit(df_temp['Text'])

        list_corpus = df_temp["Text"].tolist()
        list_labels = df_temp["Label"].tolist()

        X = count_vectorizer.transform(list_corpus)

        return X, list_labels
basic_preprocessing is defined as follows:
    def basic_preprocessing(df):
        df_temp = df.copy(deep=True)
        df_temp = df_temp.rename(index=str, columns={'Clean_Titles_2': 'Text'})
        df_temp.loc[:, 'Text'] = [text_prepare(x) for x in df_temp['Text'].values]

        #le = LabelEncoder()
        #le.fit(df_temp['medical_specialty'])
        #df_temp.loc[:, 'class_label'] = le.transform(df_temp['medical_specialty'])

        tokenizer = RegexpTokenizer(r'\w+')
        df_temp["Tokens"] = df_temp["Text"].apply(tokenizer.tokenize)

        return df_temp
where text_prepare is defined as follows:
    def text_prepare(text):
        REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
        BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
        STOPWORDS = set(stopwords.words('english'))

        text = text.lower()
        text = REPLACE_BY_SPACE_RE.sub('', text)  # remove the symbols matched by REPLACE_BY_SPACE_RE (the replacement here is an empty string, not a space)
        text = BAD_SYMBOLS_RE.sub('', text)  # delete the symbols matched by BAD_SYMBOLS_RE from text
        words = text.split()
        i = 0
        while i < len(words):
            if words[i] in STOPWORDS:
                words.pop(i)
            else:
                i += 1
        text = ' '.join(map(str, words))  # text with stopwords removed

        return text
and
    def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
        clf = MultinomialNB()  # Multinomial Naive Bayes
        clf.fit(X_train_naive, y_train_naive)

        res = pd.DataFrame(columns=['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])

        y_pred = clf.predict(X_test_naive)

        f1 = f1_score(y_pred, y_test_naive, average='weighted')
        pres = precision_score(y_pred, y_test_naive, average='weighted')
        rec = recall_score(y_pred, y_test_naive, average='weighted')
        acc = accuracy_score(y_pred, y_test_naive)

        res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
                          'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index=True)

        return res
As you can see, the order is:
- define text_prepare for text cleaning;
- define basic_preprocessing;
- define BOW;
- split the dataset into training and test sets;
- apply the sampling;
- apply the algorithm.
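What I do not understand is how to encode the text correctly so that the algorithm works. My dataset is called df, with the following columns: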
    Label   Text                                    Year
    1       bla bla bla                             2000
    0       add some words                          2012
    1       this is just an example                 1998
    0       unfortunately the code does not work    2018
    0       where should I apply the encoding?      2000
    0       What am I missing here?                 2005
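The order in which I apply BOW must be wrong, because I get this error: ValueError: could not convert string to float: 'Expect a good results if ... '
I followed the steps (and code) from this link: kaggle.com/ruzarx/oversampling-smote-and-adasyn. However, the sampling part there is wrong, because resampling should be done only on the training set, after the split. The principle should be: (1) split into training and test sets; (2) resample the training set so that the model is trained on balanced data; (3) apply the model to the test set and evaluate it there.
I am happy to provide further information, data and/or code, but I think I have included all the most relevant steps.
Thank you very much.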
Answer:
You need a BOW function for the test data that reuses the count vectorizer fitted during the training phase, rather than fitting a new one.
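A minimal sketch of that idea, assuming toy lists of strings in place of your df and a default CountVectorizer in place of your custom analyzer. The helper names bow_fit and bow_transform are hypothetical, not taken from your code. The vectorizer is fitted once on the training texts and only applied (never refitted) to the test texts, so both matrices share the same vocabulary and the classifier receives numbers instead of raw strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def bow_fit(train_texts):
    """Fit a CountVectorizer on the training texts; return it with the encoded matrix."""
    vectorizer = CountVectorizer()
    X_train_bow = vectorizer.fit_transform(train_texts)
    return vectorizer, X_train_bow


def bow_transform(vectorizer, texts):
    """Encode texts with an already fitted vectorizer (no refitting)."""
    return vectorizer.transform(texts)


# Toy data (an assumption) standing in for the undersampled training set and the test set
train_texts = ["win free money now", "free prize claim now",
               "meeting at noon today", "lunch with the team today"]
train_labels = [1, 1, 0, 0]
test_texts = ["claim your free money", "team meeting today"]
test_labels = [1, 0]

# Fit on the training texts only, then reuse the same vectorizer for the test texts
vectorizer, X_train_bow = bow_fit(train_texts)
X_test_bow = bow_transform(vectorizer, test_texts)

# Train and predict on the encoded matrices, not on the raw text columns
clf = MultinomialNB()
clf.fit(X_train_bow, train_labels)
y_pred = clf.predict(X_test_bow)
```

In your code this would mean passing the encoded training matrix and the encoded test matrix to training_naive, instead of the raw X_train and X_test data frames.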
Consider using a pipeline to make the code less verbose.
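As a rough illustration, and again assuming toy data and default settings rather than your exact preprocessing, a scikit-learn Pipeline can chain the vectorizer and the classifier so that fitting on the training texts and transforming the test texts happen in the right order automatically:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (an assumption) standing in for the undersampled training set and the test set
train_texts = ["win free money now", "free prize claim now",
               "meeting at noon today", "lunch with the team today"]
train_labels = [1, 1, 0, 0]
test_texts = ["claim your free money", "team meeting today"]
test_labels = [1, 0]

# The step names "bow" and "nb" are arbitrary labels chosen for this sketch
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# fit() fits the vectorizer and the classifier on the training texts only
pipeline.fit(train_texts, train_labels)

# predict() transforms the test texts with the fitted vectorizer, then classifies them
y_pred = pipeline.predict(test_texts)
acc = accuracy_score(test_labels, y_pred)
```

The undersampling would still be done on the training rows before calling pipeline.fit. Your text_prepare function could presumably be plugged in through the preprocessor argument of CountVectorizer, though that is untested here.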