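How can I use bag-of-words feature vectors to make predictions with a machine learning algorithm?

I am developing a program that predicts the matching business unit from text data. I have built a vocabulary to find occurrences of words associated with a particular unit, but I am not sure how to feed that data into a machine learning model to make predictions.

There are four units the program can predict: MicrosoftTech, JavaTech, Pythoneers and JavascriptRoots. The vocabulary contains words that indicate a specific unit. For example, JavaTech: Java, Spring, Android; MicrosoftTech: .Net, csharp; and so on. I currently use a bag-of-words model with this custom vocabulary to count how often each of those words occurs.

This is my code for getting the word counts: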

def bagOfWords(description, vocabulary):
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i,word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag

Suppose the vocabulary is: [java, spring, .net, csharp, python, numpy, nodejs, javascript], and the description is: "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."

Running the code then outputs: Bag: [3,1,0,0,0,0,0,0]

How can I use this data to make predictions with a machine learning algorithm?

My code so far:

import pandas as pd
import numpy as np
import warnings
import tkinter as tk
from tkinter import filedialog
from nltk.tokenize import TweetTokenizer

warnings.filterwarnings("ignore", category=FutureWarning)

root= tk.Tk()
canvas1 = tk.Canvas(root, width = 300, height = 300, bg = 'lightsteelblue')
canvas1.pack()

def getExcel ():
    global df
    vocabularysheet = pd.read_excel (r'Filepath\filename.xlsx')
    vocabularydf = pd.DataFrame(vocabularysheet, columns = ['Word'])
    vocabulary = vocabularydf.values.tolist()
    unitlabelsdf = pd.DataFrame(vocabularysheet, columns = ['Unit'])
    unitlabels = unitlabelsdf.values.tolist()
    for voc in vocabulary:
        index = vocabulary.index(voc)
        voc = vocabulary[index][0]
        vocabulary[index] = voc
    for label in unitlabels:
        index = unitlabels.index(label)
        label = unitlabels[index][0]
        unitlabels[index] = label
    import_file_path = filedialog.askopenfilename()
    testdatasheet = pd.read_excel (import_file_path)
    descriptiondf = pd.DataFrame(testdatasheet, columns = ['Description'])
    descriptiondf = descriptiondf.replace('\n',' ', regex=True).replace('\xa0',' ', regex=True).replace('•', ' ', regex=True).replace('u200b', ' ', regex=True)
    description = descriptiondf.values.tolist()
    tokenized_description = tokanize(description)
    for x in tokenized_description:
        index = tokenized_description.index(x)
        tokenized_description[index] = bagOfWords(x, vocabulary)

def tokanize(description):
    for d in description:
        index = description.index(d)
        tknzr = TweetTokenizer()
        tokenized_description = list(tknzr.tokenize((str(d).lower())))
        description[index] = tokenized_description
    return description

def wordFilter(tokenized_description):
    bad_chars = [';', ':', '!', "*", ']', '[', '.', ',', "'", '"']
    if(tokenized_description in bad_chars):
        return False
    else:
        return True

def bagOfWords(description, vocabulary):
    bag = np.zeros(len(vocabulary)).astype(int)
    for sw in description:
        for i,word in enumerate(vocabulary):
            if word == sw:
                bag[i] += 1
    print("Bag: ", bag)
    return bag

browseButton_Excel = tk.Button(text='Import Excel File', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
predictionButton = tk.Button(text='Button', command=getExcel, bg='green', fg='white', font=('helvetica', 12, 'bold'))
canvas1.create_window(150, 150, window=browseButton_Excel)
root.mainloop()

Answer:

You already know how to prepare the training dataset.

Here is an example I put together to illustrate:

import re

voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
desc1 = "Company X is looking for a Java Developer. Requirements: Has worked with Java. 3+ years experience with Java, Maven and Spring."
desc2 = "Company Y is looking for a csharp Developer. Requirements: Has worked with csharp. 5+ years experience with csharp, Net."

def tokens(text):
    # bagOfWords compares whole tokens, so split each description
    # into lowercase words before counting
    return re.findall(r"\w+", text.lower())

x_train = []
y_train = []
x_train.append(bagOfWords(tokens(desc1), voca))
y_train.append(units.index("JavaTech"))
x_train.append(bagOfWords(tokens(desc2), voca))
y_train.append(units.index("MicrosoftTech"))

This gives us two training samples:

[array([3, 1, 0, 0, 0, 0, 0, 0]), array([0, 0, 1, 3, 0, 0, 0, 0])] [1, 0]
array([3, 1, 0, 0, 0, 0, 0, 0]) => 1 (JavaTech)
array([0, 0, 1, 3, 0, 0, 0, 0]) => 0 (MicrosoftTech)
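With more than a couple of descriptions, it may be easier to keep each text next to its unit and build both lists in one pass. The following is only a sketch under assumptions: `vectorize` and `labeled` are names made up for this illustration, and the regex tokenizer is a simple stand-in for the TweetTokenizer used in the question.

```python
import re
from collections import Counter

voca = ["java", "spring", "net", "csharp", "python", "numpy", "nodejs", "javascript"]
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]

def vectorize(text, vocabulary):
    """Return the count of each vocabulary word in text (hypothetical helper)."""
    # Assumption: lowercase alphanumeric runs are good enough as tokens here
    counts = Counter(re.findall(r"\w+", text.lower()))
    # A Counter returns 0 for words that never occurred
    return [counts[word] for word in vocabulary]

# Each entry pairs a description with the unit it belongs to
labeled = [
    ("Company X is looking for a Java Developer. Requirements: Has worked "
     "with Java. 3+ years experience with Java, Maven and Spring.", "JavaTech"),
    ("Company Y is looking for a csharp Developer. Requirements: Has worked "
     "with csharp. 5+ years experience with csharp, Net.", "MicrosoftTech"),
]

x_train = [vectorize(text, voca) for text, _ in labeled]
y_train = [units.index(unit) for _, unit in labeled]  # label = position in units

print(x_train)
print(y_train)
```

Because each label is derived with `units.index`, the integer classes stay tied to the order of `units`.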

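The model has to predict one of the four units you defined, so this is a multi-class classification problem. A neural-network classifier for such a problem typically uses 'softmax' as the activation function of the output layer and a 'crossentropy' loss function. Here is a simple deep learning model written with the Keras API of TensorFlow.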

import tensorflow as tf
import numpy as np

units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
x_train = np.array([[3, 1, 0, 0, 0, 0, 0, 0],
                    [1, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 1, 1, 0, 0, 0, 0],
                    [0, 0, 2, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 2, 1, 0, 0],
                    [0, 0, 0, 0, 1, 2, 0, 0],
                    [0, 0, 0, 0, 0, 0, 1, 1],
                    [0, 0, 0, 0, 0, 0, 1, 0]])
y_train = np.array([0, 0, 1, 1, 2, 2, 3, 3])

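The model consists of one hidden layer with 256 units, followed by a dropout layer, and an output layer with 4 units (one per business unit).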

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
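To make the softmax output and the cross-entropy loss a little more concrete, here is a plain-Python sketch of what the two compute for a single sample. The `logits` values are invented for illustration, and the function names merely mirror the Keras loss name; this is not how TensorFlow implements them internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log-probability of the true class index."""
    return -math.log(probs[label])

logits = [2.0, 0.5, 0.1, -1.0]  # hypothetical raw scores for the four units
probs = softmax(logits)

loss_if_true_is_0 = sparse_categorical_crossentropy(probs, 0)  # confident and right
loss_if_true_is_3 = sparse_categorical_crossentropy(probs, 3)  # confident and wrong

# A model that has learned nothing yet outputs roughly uniform probabilities
uniform_loss = sparse_categorical_crossentropy(softmax([0.0, 0.0, 0.0, 0.0]), 2)

print(probs)
print(loss_if_true_is_0, loss_if_true_is_3, uniform_loss)
```

The uniform case gives ln 4, about 1.386, which is close to the first-epoch loss of 1.3978 in the training log, as one would expect from a freshly initialised four-class model.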

I set the number of epochs to 50. You need to watch the loss and accuracy while training runs; in practice, 10 epochs were not enough. Now let's train the model.

model.fit(x_train, y_train, epochs=50)
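For a sense of scale, the layer sizes above fix how many weights are being trained. The arithmetic below is a small sketch that assumes the 8-element bag-of-words input used here: a Dense layer has one weight per input-output pair plus one bias per output, and Dropout adds no weights.

```python
n_features = 8   # length of each bag-of-words vector
n_hidden = 256   # units in the hidden Dense layer
n_classes = 4    # one output per business unit
n_samples = 8    # rows in x_train

hidden_params = n_features * n_hidden + n_hidden  # weights + biases
output_params = n_hidden * n_classes + n_classes  # weights + biases
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
print(total_params / n_samples)  # trainable weights per training row
```

With this many weights and only eight training rows, the network can likely memorise the training set, so a high training accuracy on its own says little about unseen descriptions.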

This is the prediction part. newSample is just a sample I made up.

newSample = np.array([[2, 2, 0, 0, 0, 0, 0, 0]])
prediction = model.predict(newSample)
print (prediction)
print (units[np.argmax(prediction)])

Finally, I got the following result:

[[0.96280855 0.00981709 0.0102595  0.01711495]]
MicrosoftTech

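These are the probabilities for each unit, in the order of the units list. The highest probability is for MicrosoftTech. Note that the two Java rows in x_train are labelled 0, which is the index of MicrosoftTech in units, so a Java-heavy sample such as newSample is reported as MicrosoftTech. For the printed names to match, each label in y_train must be the units index of that row's unit (1 for the Java rows, 0 for the .NET rows).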

MicrosoftTech : 0.96280855
JavaTech : 0.00981709
....
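The pairing of names and probabilities can also be done in code rather than by hand. A minimal sketch, using the probabilities printed above as a plain list in place of the array returned by model.predict; `ranked` is just an illustrative name.

```python
units = ["MicrosoftTech", "JavaTech", "Pythoneers", "JavascriptRoots"]
# Values copied from the printed prediction; model.predict returns one row per sample
prediction = [[0.96280855, 0.00981709, 0.0102595, 0.01711495]]

# Pair each unit with its probability and sort from most to least likely
ranked = sorted(zip(units, prediction[0]), key=lambda pair: pair[1], reverse=True)

for unit, prob in ranked:
    print(f"{unit:16s} {prob:.4f}")

best_unit, best_prob = ranked[0]
```

Sorting the pairs gives the same top answer as np.argmax and also shows how far behind the other units are.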

This is the output of the training step. You can see that the loss keeps decreasing, which is why I increased the number of epochs.

Epoch 1/50
8/8 [==============================] - 0s 48ms/step - loss: 1.3978 - acc: 0.0000e+00
Epoch 2/50
8/8 [==============================] - 0s 356us/step - loss: 1.3618 - acc: 0.1250
Epoch 3/50
8/8 [==============================] - 0s 201us/step - loss: 1.3313 - acc: 0.3750
Epoch 4/50
8/8 [==============================] - 0s 167us/step - loss: 1.2965 - acc: 0.7500
Epoch 5/50
8/8 [==============================] - 0s 139us/step - loss: 1.2643 - acc: 0.8750
................
Epoch 45/50
8/8 [==============================] - 0s 122us/step - loss: 0.3500 - acc: 1.0000
Epoch 46/50
8/8 [==============================] - 0s 140us/step - loss: 0.3376 - acc: 1.0000
Epoch 47/50
8/8 [==============================] - 0s 134us/step - loss: 0.3257 - acc: 1.0000
Epoch 48/50
8/8 [==============================] - 0s 137us/step - loss: 0.3143 - acc: 1.0000
Epoch 49/50
8/8 [==============================] - 0s 141us/step - loss: 0.3032 - acc: 1.0000
Epoch 50/50
8/8 [==============================] - 0s 177us/step - loss: 0.2925 - acc: 1.0000
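The last six epochs of this log can be used to check whether the loss is still falling at a useful rate. This is only a sketch: the helper names are made up, and the 1% threshold is an arbitrary value chosen for illustration.

```python
# Loss values for epochs 45-50, copied from the training log
last_losses = [0.3500, 0.3376, 0.3257, 0.3143, 0.3032, 0.2925]

def relative_drops(losses):
    """Fractional decrease in loss from each epoch to the next."""
    return [(prev - cur) / prev for prev, cur in zip(losses, losses[1:])]

def still_improving(losses, min_rel_drop=0.01):
    """True if every epoch lowered the loss by more than min_rel_drop."""
    return all(drop > min_rel_drop for drop in relative_drops(losses))

drops = relative_drops(last_losses)
print([f"{d:.1%}" for d in drops])
print(still_improving(last_losses))
```

Each of these epochs still lowers the loss by a few percent, which is consistent with the observation that more epochs were worth running.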
