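I have a dataframe made up of many rows containing tweets. I would like to classify them using a machine learning technique (supervised or unsupervised). Since the dataset is unlabeled, I thought of selecting some rows (50%) to label manually (+1 for positive, -1 for negative, 0 for neutral) and then using machine learning to assign labels to the remaining rows. To do this, I did the following:
Original dataset
    Date        ID    Tweet
    01/20/2020  4141  The cat is on the table
    01/20/2020  4142  The sky is blue
    01/20/2020  53    What a wonderful day
    ...
    05/12/2020  532   In this extraordinary circumstance we are together
    05/13/2020  12    It was a very bad decision
    05/22/2020  565   I know you are the best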
I split the dataset into a 50% training set and a 50% test set, and manually labeled the training half as follows:
    Date        ID    Tweet                                               PosNegNeu
    01/20/2020  4141  The cat is on the table                             0
    01/20/2020  4142  The weather is bad today                            -1
    01/20/2020  53    What a wonderful day                                1
    ...
    05/12/2020  532   In this extraordinary circumstance we are together  1
    05/13/2020  12    It was a very bad decision                          -1
    05/22/2020  565   I know you are the best                             1
Then I extracted the word frequencies (after removing stopwords):
    Word          Frequency
    bad           2
    circumstance  1
    best          1
    day           1
    today         1
    wonderful     1
    ...
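I would like to try assigning labels to the other data based on:
- the words in the frequency table, for example "if a tweet contains 'bad', assign -1; if a tweet contains 'wonderful', assign 1" (i.e. I should create a list of strings and a rule);
- sentence similarity (for example using Levenshtein distance).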
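A minimal sketch of both ideas, using only the standard library. The rule dictionary KEYWORD_LABELS, the helper names, and the three labeled rows are assumptions for illustration (the words and tweets are taken from the tables above); this is one possible reading of the two bullet points, not a tuned method.

```python
import re

# Hypothetical keyword rules built from the frequency table (an assumption).
KEYWORD_LABELS = {"bad": -1, "wonderful": 1, "best": 1}


def label_by_keywords(tweet, rules=KEYWORD_LABELS):
    """Sum the rule values of matching words and return the sign: 1, -1 or 0."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    score = sum(rules.get(tok, 0) for tok in tokens)
    return (score > 0) - (score < 0)


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def label_by_nearest(tweet, labeled):
    """Copy the label of the labeled tweet with the smallest edit distance."""
    _, nearest_label = min(
        labeled, key=lambda pair: levenshtein(tweet.lower(), pair[0].lower())
    )
    return nearest_label


# A few hand-labeled rows from the training table.
labeled = [
    ("The cat is on the table", 0),
    ("What a wonderful day", 1),
    ("It was a very bad decision", -1),
]

print(label_by_keywords("I think it is a very bad decision"))
print(label_by_nearest("What a wonderful day!", labeled))
```

Character-level edit distance mostly rewards tweets that share surface wording, so it is likely to be a weak signal for sentiment on its own.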
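I know there are several ways to do this, probably better ones too, but I am having trouble classifying and assigning labels to my data, and I cannot do it all manually.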
My expected output, for example with the following test dataset
    Date        ID    Tweet
    06/12/2020  43    My cat 'Sylvester' is on the table
    07/02/2020  75    Laura's pen is black
    07/02/2020  763   It is such a wonderful day
    ...
    11/06/2020  1415  No matter what you need to do
    05/15/2020  64    I disagree with you: I think it is a very bad decision
    12/27/2020  565   I know you can improve
should look like this
    Date        ID    Tweet                                                   PosNegNeu
    06/12/2020  43    My cat 'Sylvester' is on the table                      0
    07/02/2020  75    Laura's pen is black                                    0
    07/02/2020  763   It is such a wonderful day                              1
    ...
    11/06/2020  1415  No matter what you need to do                           0
    05/15/2020  64    I disagree with you: I think it is a very bad decision  -1
    12/27/2020  565   I know you can improve                                  0
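A better approach might be to use n-grams rather than single words, or to build a corpus/vocabulary that assigns a score and then a sentiment. Any suggestion would be greatly appreciated, as this is my first exercise in machine learning. I think k-means clustering could also be applied, to try to group more similar sentences. If you could provide a complete example (with my data would be great, but other data is fine too), I would really appreciate it.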
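To make the n-gram idea concrete, here is a small sketch that counts word bigrams with the standard library. The ngrams helper and its tokenizing regex are assumptions for illustration; the two tweets come from the tables above.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a lower-cased text as space-joined strings."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweets = [
    "It was a very bad decision",
    "I think it is a very bad decision",
]

# Count every bigram across all tweets.
bigram_counts = Counter(gram for tweet in tweets for gram in ngrams(tweet, 2))
print(bigram_counts.most_common(3))
```

A bigram such as "very bad" keeps a modifier attached to the word it modifies, which single-word counts lose.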
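Answer:
In this context, I suggest analyzing the polarity of the sentences or tweets. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text has been found, it can be assigned to a separate column of the dataframe. The sentence polarity can then be used for further analysis.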
Initial code
    from textblob import TextBlob

    # sentiment is a (polarity, subjectivity) tuple for each tweet
    df['sentiment'] = df['Tweet'].apply(lambda tweet: TextBlob(tweet).sentiment)
    print(df)
Intermediate result
           Date  ...                                  sentiment
    0  1/1/2020  ...                                 (0.0, 0.0)
    1  2/1/2020  ...                                 (0.0, 0.0)
    2  3/2/2020  ...                                 (0.0, 0.1)
    3  4/2/2020  ...  (-0.6999999999999998, 0.6666666666666666)
    4  5/2/2020  ...                                 (0.5, 0.6)
    [5 rows x 4 columns]
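In the output above, the sentiment column holds two values: polarity and subjectivity.
Polarity is a float in the range [-1.0, 1.0], where 0 indicates a neutral sentiment, +1 a very positive sentiment, and -1 a very negative sentiment.
Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions, or speculations, whereas an objective sentence is factual.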
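Note that each entry in the sentiment column is a tuple, so the column can be split in two with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). The split columns can then be appended to a new dataframe as follows:
    # copy so that the original df is not modified
    df_new = df.copy()
    df_new['polarity'] = df1['polarity'].astype(float)
    # subjectivity must come from its own column, not from polarity
    df_new['subjectivity'] = df1['subjectivity'].astype(float)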
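The splitting step can also be tried without TextBlob installed. The sketch below uses plain tuples copied from the intermediate result as a stand-in for the sentiment column; passing the column names explicitly and using join are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Stand-in for the TextBlob output: plain (polarity, subjectivity) tuples.
df = pd.DataFrame({
    "Tweet": ["the weather is bad", "i love apples"],
    "sentiment": [(-0.6999999999999998, 0.6666666666666666), (0.5, 0.6)],
})

# Build both columns in one step; naming them explicitly avoids relying on
# the tuple type to supply the column names.
scores = pd.DataFrame(
    df["sentiment"].tolist(),
    index=df.index,
    columns=["polarity", "subjectivity"],
)

# join() returns a new frame, so the original df is left untouched.
df_split = df.join(scores)
print(df_split[["Tweet", "polarity", "subjectivity"]])
```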
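Finally, based on the sentence polarity found earlier, a label can be added to the dataframe to indicate whether each tweet is positive, negative, or neutral.
    import numpy as np

    # one boolean condition per label, in the same order as choiceList
    conditionList = [
        df_new['polarity'] == 0,
        df_new['polarity'] > 0,
        df_new['polarity'] < 0,
    ]
    choiceList = ['neutral', 'positive', 'negative']
    df_new['label'] = np.select(conditionList, choiceList, default='no_label')
    print(df_new)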
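The question uses numeric labels (+1, -1, 0) rather than strings. A minimal sketch of that mapping follows, with the polarity values taken from the example data. The neutral band and its threshold eps = 0.1 are assumptions for illustration, not recommended settings.

```python
import numpy as np
import pandas as pd

polarity = pd.Series([0.0, 0.0, 0.0, -0.7, 0.5], name="polarity")

# np.sign maps negatives to -1, zero to 0 and positives to +1.
numeric_label = np.sign(polarity).astype(int)

# Optional neutral band: treat |polarity| <= eps as neutral.
# eps = 0.1 is an arbitrary illustrative threshold (an assumption).
eps = 0.1
banded_label = np.select(
    [polarity > eps, polarity < -eps],
    [1, -1],
    default=0,
)

print(numeric_label.tolist())
print(banded_label.tolist())
```

A band around zero may be useful because a polarity that is only slightly above or below 0 would otherwise be counted as positive or negative.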
In the end, the result looks like this, with subjectivity taken from the second element of each sentiment tuple:
Final result
           Date  ID                 Tweet  ...  polarity  subjectivity     label
    0  1/1/2020   1  the weather is sunny  ...       0.0      0.000000   neutral
    1  2/1/2020   2       tom likes harry  ...       0.0      0.000000   neutral
    2  3/2/2020   3       the sky is blue  ...       0.0      0.100000   neutral
    3  4/2/2020   4    the weather is bad  ...      -0.7      0.666667  negative
    4  5/2/2020   5         i love apples  ...       0.5      0.600000  positive
    [5 rows x 7 columns]
Data
    import pandas as pd

    # create a dictionary
    data = {
        "Date": ["1/1/2020", "2/1/2020", "3/2/2020", "4/2/2020", "5/2/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["the weather is sunny", "tom likes harry", "the sky is blue",
                  "the weather is bad", "i love apples"],
    }
    # convert data to dataframe
    df = pd.DataFrame(data)
Full code
    import numpy as np
    import pandas as pd
    from textblob import TextBlob

    # create some dummy data
    data = {
        "Date": ["1/1/2020", "2/1/2020", "3/2/2020", "4/2/2020", "5/2/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["the weather is sunny", "tom likes harry", "the sky is blue",
                  "the weather is bad", "i love apples"],
    }
    # convert data to dataframe
    df = pd.DataFrame(data)

    # sentiment is a (polarity, subjectivity) tuple for each tweet
    df['sentiment'] = df['Tweet'].apply(lambda tweet: TextBlob(tweet).sentiment)
    print(df)

    # split the sentiment column into two
    df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index)

    # append cols to a copy of the original dataframe
    df_new = df.copy()
    df_new['polarity'] = df1['polarity'].astype(float)
    # subjectivity must come from its own column, not from polarity
    df_new['subjectivity'] = df1['subjectivity'].astype(float)
    print(df_new)

    # add label to dataframe based on condition
    conditionList = [
        df_new['polarity'] == 0,
        df_new['polarity'] > 0,
        df_new['polarity'] < 0,
    ]
    choiceList = ['neutral', 'positive', 'negative']
    df_new['label'] = np.select(conditionList, choiceList, default='no_label')
    print(df_new)
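Since the question already has a hand-labeled half of the data, the polarity-based labels can be checked against it before being trusted on the unlabeled rows. A minimal sketch follows; the accuracy helper and both label lists are hypothetical values made up for illustration, not results from the data above.

```python
def accuracy(predicted, manual):
    """Fraction of positions where the predicted label equals the manual one."""
    if len(predicted) != len(manual):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("no labels to compare")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


# Hypothetical labels for five tweets: hand labels vs. polarity-based labels.
manual_labels = [0, -1, 1, 1, -1]
predicted_labels = [0, -1, 1, 0, -1]

print(accuracy(predicted_labels, manual_labels))
```

If the agreement is low on the hand-labeled rows, the labeled half could instead be used to train a supervised classifier, as the question originally proposed.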