I'm trying to run a very basic sentiment analysis on the CNBC website. I cobbled together the code below, and it runs fine.
from bs4 import BeautifulSoup
import urllib.request
from pandas import DataFrame

resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
substring = 'https://www.cnbc.com/'

# note: the list starts with the literal string 'review', which ends up as row 0
df = ['review']
for link in soup.find_all('a', href=True):
    print(link['href'])
    # keep only links that start with the CNBC base URL
    if (link['href'].find(substring) == 0):
        # append
        df.append(link['href'])
#print(link['href'])
#list(df)

# convert list to data frame
df = DataFrame(df)
#type(df)
#list(df)

# add column name
df.columns = ['review']

# clean up: strip digits (regex=True is required on pandas >= 2.0)
df['review'] = df['review'].str.replace(r'\d+', '', regex=True)
# Get rid of special characters
df['review'] = df['review'].str.replace(r'[^\w\s]+', '', regex=True)

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))
df['result']
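When I run the code above, I get positive and negative results, but they are not shown next to the original 'review' values. How can I display each sentiment in the data frame side by side with the text of each link? Thanks!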
Answer:
Oh man, I totally blew it! This is just a simple merge!!
import pandas as pd

# join the two columns on their shared row index
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final
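Since both columns already live in the same data frame and share the same index, selecting them together should give the same side-by-side view as the merge. The following is only a sketch on made-up stand-in rows (the two URLs and labels are illustrative, not scraped output), and the names `merged` and `selected` are introduced here for the comparison:

```python
import pandas as pd

# Hypothetical stand-in rows; the real frame is built from the scraped links.
df = pd.DataFrame({
    "review": ["https://www.cnbc.com/business/", "https://www.cnbc.com/finance/"],
    "result": ["neutral", "neutral"],
})

# Index-aligned merge of the two columns, as in the answer.
merged = pd.merge(df["review"], df["result"], left_index=True, right_index=True)

# Selecting both columns directly keeps the same rows in the same order.
selected = df[["review", "result"]]

print(merged)
print(selected)
```

The merge form remains useful when the two columns come from different frames that only share an index.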
Result:
0   review                                              neutral
1   https://www.cnbc.com/business/                      neutral
2   https://www.cnbc.com/2020/09/15/stocks-making-...   neutral
3   https://www.cnbc.com/2020/09/15/stocks-making-...   neutral
4   https://www.cnbc.com/maggie-fitzgerald/             neutral
..  ...                                                 ...
90  https://www.cnbc.com/finance/                       neutral
91  https://www.cnbc.com/2020/09/10/citi-ceo-micha...   neutral
92  https://www.cnbc.com/central-banks/                 neutral
93  https://www.cnbc.com/2020/09/10/watch-ecb-pres...   neutral
94  https://www.cnbc.com/finance/?page=2                neutral
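Every row shown above is labeled neutral. For reference, the `convert` function from the question labels any compound score from 0 up to and including 0.2 as neutral, so only scores below 0 or above 0.2 get a different label. A minimal sketch of those thresholds, using made-up scores rather than real VADER output (`scores` and `labels` are names introduced here):

```python
def convert(x):
    """Map a compound score to a label, using the thresholds from the question."""
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

# Made-up compound scores chosen to sit on and around the two cut-offs.
scores = [-0.5, 0.0, 0.2, 0.21, 0.8]
labels = [convert(s) for s in scores]

for score, label in zip(scores, labels):
    print(score, label)
```

Note that the band is asymmetric: a slightly negative score is already negative, while a slightly positive one stays neutral.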