How to use machine learning to assign labels/scores to data

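I have a data frame made up of many rows containing tweets. I would like to classify them using a machine learning technique (supervised or unsupervised). Since the dataset is unlabelled, I thought of selecting some rows (50%) to label manually (+1 positive, -1 negative, 0 neutral), and then using machine learning to assign labels to the remaining rows. To do this, I did the following: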

Original dataset

Date         ID     Tweet
01/20/2020   4141   The cat is on the table
01/20/2020   4142   The sky is blue
01/20/2020   53     What a wonderful day
...
05/12/2020   532    In this extraordinary circumstance we are together
05/13/2020   12     It was a very bad decision
05/22/2020   565    I know you are the best

Split the dataset into 50% train and 50% test. I manually labelled 50% of the data as follows:

Date         ID     Tweet                                                PosNegNeu
01/20/2020   4141   The cat is on the table                               0
01/20/2020   4142   The weather is bad today                             -1
01/20/2020   53     What a wonderful day                                  1
...
05/12/2020   532    In this extraordinary circumstance we are together    1
05/13/2020   12     It was a very bad decision                           -1
05/22/2020   565    I know you are the best                               1

Then I extracted the word frequencies (after removing stop words):

               Frequency
bad            2
circumstance   1
best           1
day            1
today          1
wonderful      1

...
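A frequency table like the one above could be produced along these lines. This is only a minimal sketch: the `STOP_WORDS` set is a hypothetical, deliberately tiny list made up for illustration, and `word_frequencies` is an illustrative helper, not something from the question.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "it", "in", "on", "we",
              "you", "i", "this", "what", "very", "know"}

def word_frequencies(tweets, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase and keep runs of letters/apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", tweet.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

labelled_tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "It was a very bad decision",
]
freq = word_frequencies(labelled_tweets)
print(freq.most_common(3))
```

Counting per label (one `Counter` for each of +1, 0 and -1) instead of over the whole set would show which words actually separate the classes.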

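I would like to try to assign labels to the other data based on:

words in the frequency table, for example "if a tweet contains 'bad', assign -1; if a tweet contains 'wonderful', assign 1" (i.e. I should create a list of strings and rules);
sentence similarity (for example using Levenshtein distance).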
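Both ideas can be sketched in a few lines of plain Python. The `KEYWORD_LABELS` rules and the helper names below are assumptions made for illustration, and the labelled sample is a subset of the rows shown earlier. The edit distance is the standard dynamic-programming formulation, written out here rather than taken from a library.

```python
# Hypothetical keyword rules: word -> score (+1 positive, -1 negative).
KEYWORD_LABELS = {"bad": -1, "wonderful": 1, "best": 1}

def label_by_keywords(tweet, rules=KEYWORD_LABELS):
    """Sum the scores of matching keywords and clip the result to -1, 0 or +1."""
    words = tweet.lower().split()
    score = sum(rules.get(w.strip(".,:;!?'\""), 0) for w in words)
    return (score > 0) - (score < 0)

def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def label_by_nearest(tweet, labelled):
    """Copy the label of the labelled tweet with the smallest edit distance."""
    nearest_text, nearest_label = min(
        labelled, key=lambda pair: levenshtein(tweet.lower(), pair[0].lower())
    )
    return nearest_label

labelled = [
    ("The cat is on the table", 0),
    ("What a wonderful day", 1),
    ("It was a very bad decision", -1),
]

print(label_by_keywords("It is such a wonderful day"))
print(label_by_nearest("My cat 'Sylvester' is on the table", labelled))
```

Character-level edit distance mostly rewards tweets that share long literal substrings, so it is probably a weak proxy for sentiment. The keyword rules, in turn, ignore negation ("not bad"). Both are best treated as baselines.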

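I know there are several ways to do this, and probably better ones, but I am having trouble classifying/assigning labels to my data, and I cannot do it manually.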

My expected output, for example with the following test dataset

Date         ID     Tweet
06/12/2020   43     My cat 'Sylvester' is on the table
07/02/2020   75     Laura's pen is black
07/02/2020   763    It is such a wonderful day
...
11/06/2020   1415   No matter what you need to do
05/15/2020   64     I disagree with you: I think it is a very bad decision
12/27/2020   565    I know you can improve

should be something like this

Date         ID     Tweet                                                    PosNegNeu
06/12/2020   43     My cat 'Sylvester' is on the table                        0
07/02/2020   75     Laura's pen is black                                      0
07/02/2020   763    It is such a wonderful day                                1
...
11/06/2020   1415   No matter what you need to do                             0
05/15/2020   64     I disagree with you: I think it is a very bad decision   -1
12/27/2020   565    I know you can improve                                    0

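A better approach might be to consider n-grams rather than single words, or to build a corpus/vocabulary that assigns a score and then a sentiment. Any advice would be greatly appreciated, as this is my first exercise in machine learning. I think k-means clustering could also be applied, to try to group more similar sentences. If you could provide a complete example (with my data would be great, but other data is fine too), I would really appreciate it.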
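As a rough illustration of the supervised route with n-grams, the sketch below trains a small multinomial naive Bayes classifier on unigrams and bigrams from the six labelled tweets. The `ngrams` helper and the `NaiveBayes` class are hand-written for this example and are not a library API. With so little training data the predictions are fragile, so this only shows the mechanics.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as strings."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text))
        self.vocabulary = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for feature in ngrams(text):
                if feature in self.vocabulary:  # ignore unseen n-grams
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]
train_labels = [0, -1, 1, 1, -1, 1]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("It is such a wonderful day"))
print(model.predict("I disagree with you: I think it is a very bad decision"))
```

In practice a library implementation would normally replace the hand-written class, and part of the labelled half should be held out to estimate accuracy before labelling the rest.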

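Answer:

In this context, I would suggest analysing the polarity of the sentences or tweets. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text data has been found, it can be assigned as a separate column in the data frame. The sentence polarity can then be used for further analysis.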

Initial code

from textblob import TextBlob

# compute the (polarity, subjectivity) sentiment of each tweet
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

Intermediate result

       Date  ...                                  sentiment
0  1/1/2020  ...                                 (0.0, 0.0)
1  2/1/2020  ...                                 (0.0, 0.0)
2  3/2/2020  ...                                 (0.0, 0.1)
3  4/2/2020  ...  (-0.6999999999999998, 0.6666666666666666)
4  5/2/2020  ...                                 (0.5, 0.6)

[5 rows x 4 columns]

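From the sentiment column in the output above, we can see that it holds two values: polarity and subjectivity.

Polarity is a float in the range [-1.0, 1.0], where 0 indicates neutral, +1 a very positive sentiment and -1 a very negative sentiment.

Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions and speculations, whereas an objective sentence is factual.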

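Note that the sentiment column is a tuple. We can therefore split it into two columns, like df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can create a new data frame to which I will append the split columns, as shown below:

# work on a copy so the original df is left untouched
df_new = df.copy()
df_new['polarity'] = df1['polarity'].astype(float)
# take subjectivity from its own column, not from polarity
df_new['subjectivity'] = df1['subjectivity'].astype(float)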

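Finally, based on the sentence polarity found earlier, we can now add a label to the data frame indicating whether the tweet is positive, negative or neutral.

import numpy as np

# conditions are checked in order and paired with the labels in choiceList
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)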
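One possible variation, offered only as a sketch: comparing a float polarity to exactly 0 labels a tweet as neutral only when no sentiment word was detected at all. A small tolerance band around zero may be preferable. The `label_from_polarity` helper and the `neutral_band` value of 0.05 are assumptions chosen for illustration, not part of textblob.

```python
def label_from_polarity(polarity, neutral_band=0.05):
    """Map a polarity in [-1, 1] to a label.

    Values with |polarity| <= neutral_band are treated as neutral.
    The default band of 0.05 is an arbitrary, illustrative choice.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# Polarities taken from the intermediate result shown earlier.
polarities = [0.0, 0.0, 0.0, -0.6999999999999998, 0.5]
labels = [label_from_polarity(p) for p in polarities]
print(labels)
```

On a data frame, the same function could be applied with df_new['polarity'].apply(label_from_polarity). The band width would need tuning against the manually labelled half.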

In the end, the result will look like this:

Final result

       Date  ID                 Tweet  ... polarity  subjectivity     label
0  1/1/2020   1  the weather is sunny  ...      0.0      0.000000   neutral
1  2/1/2020   2       tom likes harry  ...      0.0      0.000000   neutral
2  3/2/2020   3       the sky is blue  ...      0.0      0.100000   neutral
3  4/2/2020   4    the weather is bad  ...     -0.7      0.666667  negative
4  5/2/2020   5         i love apples  ...      0.5      0.600000  positive

[5 rows x 7 columns]

Data

import pandas as pd

# create a dictionary
data = {"Date": ["1/1/2020", "2/1/2020", "3/2/2020", "4/2/2020", "5/2/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["the weather is sunny",
                  "tom likes harry", "the sky is blue",
                  "the weather is bad", "i love apples"]}

# convert data to dataframe
df = pd.DataFrame(data)

Complete code

# create some dummy data
import numpy as np
import pandas as pd
from textblob import TextBlob

# create a dictionary
data = {"Date": ["1/1/2020", "2/1/2020", "3/2/2020", "4/2/2020", "5/2/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["the weather is sunny",
                  "tom likes harry", "the sky is blue",
                  "the weather is bad", "i love apples"]}

# convert data to dataframe
df = pd.DataFrame(data)

# compute the (polarity, subjectivity) sentiment of each tweet
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

# split the sentiment column into two
df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index)

# append cols to a copy of the original dataframe
df_new = df.copy()
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
print(df_new)

# add label to dataframe based on condition
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
