Improving the speed of a Python algorithm

I am using the Sentiment140 dataset for Twitter sentiment analysis.

Code:

Get the words from the tweets:

    tweet_tokens = []
    [tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)]

Get the unknown words from the tokens:

    words_without_embs = []
    [[words_without_embs.append(w) for w in tweet if w not in word2vec] for tweet in tweet_tokens]
    len(words_without_embs)

The last part of the code computes each vector as the mean of the left and right words (the context):

    vectors = {} # alg
    for word in words_without_embs:
      mean_vectors = []
      for tweet in tweet_tokens:
        if word in tweet:
          idx = tweet.index(word)
          try:
            mean_vector = np.mean([word2vec.get_vector(tweet[idx-1]), word2vec.get_vector(tweet[idx+1])], axis=0)
            mean_vectors.append(mean_vector)
          except:
            pass
        if tweet == tweet_tokens[-1]: # last iteration
          mean_vector_all_tweets = np.mean(mean_vectors, axis=0)
          vectors[word] = mean_vector_all_tweets

There are 1058532 words, and the last part of the code runs very slowly, at about 250 words per minute.

How can I speed up this algorithm?

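Answer:

More common (and probably better) strategies for dealing with unknown words include:

Training or using a model such as FastText, which can offer guessed vectors for out-of-vocabulary (OOV) words.
Acquiring more training data, so that vectors for more of the unknown words can be learned from real usage.
Ignoring unknown words entirely.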
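As a rough illustration of the last option, unknown tokens can be filtered out before any vectors are looked up. This is only a sketch: `drop_unknown` is a hypothetical helper, and a plain set stands in for `word2vec`, on the assumption that the real model supports the same `in` test the question already uses.

```python
def drop_unknown(tweet_tokens, known_words):
    """Return the tweets with every out-of-vocabulary token removed."""
    return [[w for w in tweet if w in known_words] for tweet in tweet_tokens]


# Toy data; "flibber" plays the role of an unknown word.
tweets = [["good", "flibber", "day"], ["flibber", "bad"]]
vocab = {"good", "day", "bad"}  # stand-in for word2vec
filtered = drop_unknown(tweets, vocab)
```

Each tweet is visited once, so the cost grows with the total number of tokens rather than with the number of unknown words.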

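It looks like you have decided to synthesize new vectors for OOV words by averaging all of their immediate neighbors. I don't think this will work especially well. In many downstream uses of word vectors, it just tends to overweight the word's in-context neighbors, and the same effect can be had far more simply by ignoring the unknown word entirely.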

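But, given what you want to do, the best approach is to collect the neighboring words during the same pass that identifies words_without_embs.

For example, make words_without_embs a dict (or perhaps a defaultdict), where each key is a word that needs a vector and each value is a list of all the neighboring words found so far.

Then, a single loop over tweet_tokens both fills the keys of words_without_embs with the words that need vectors and fills the values with all the neighboring words seen so far.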
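A minimal sketch of that single pass might look like the following. The name `collect_neighbors` is hypothetical, and a plain set again stands in for `word2vec`. Unlike the code in the question, this version records every occurrence of a word in a tweet (not just the first), keeps a neighbor even when the word on the other side is missing or unknown, and does not wrap around to the end of the tweet when the unknown word is the first token. Those are assumptions about the intended behavior, not something stated in the question.

```python
from collections import defaultdict


def collect_neighbors(tweet_tokens, known_words):
    """Map each unknown word to the known words seen directly beside it."""
    neighbors = defaultdict(list)
    for tweet in tweet_tokens:
        for i, word in enumerate(tweet):
            if word in known_words:
                continue
            found = neighbors[word]  # creates the key even if no neighbor qualifies
            if i > 0 and tweet[i - 1] in known_words:
                found.append(tweet[i - 1])
            if i + 1 < len(tweet) and tweet[i + 1] in known_words:
                found.append(tweet[i + 1])
    return neighbors


# Toy data; "flibber" and "zorp" play the role of unknown words.
tweets = [["good", "flibber", "day"], ["flibber", "bad"], ["zorp"]]
vocab = {"good", "day", "bad"}  # stand-in for word2vec
neighbors = collect_neighbors(tweets, vocab)
```

The work is proportional to the total number of tokens, instead of the number of unknown words multiplied by the number of tweets.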

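Then, a final loop over the keys of words_without_embs simply takes each existing list of neighbor words and averages their vectors. (There is no longer any need to make multiple passes over tweet_tokens.)

But again: all of this work may not outperform the baseline practice of simply dropping unknown words.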
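The averaging step could be sketched as below. `average_neighbors` is a hypothetical helper, and a small dict of NumPy arrays stands in for the real model; with the model from the question, `word2vec.get_vector` would presumably be passed as `get_vector`. Note that this averages all collected neighbors at once, which matches the question's mean of per-tweet means only when every occurrence contributes the same number of neighbors.

```python
import numpy as np


def average_neighbors(neighbors, get_vector):
    """Build one vector per unknown word as the mean of its neighbors' vectors."""
    vectors = {}
    for word, context in neighbors.items():
        if not context:
            continue  # no known neighbor was ever seen, so there is nothing to average
        vectors[word] = np.mean([get_vector(w) for w in context], axis=0)
    return vectors


# Toy two-dimensional vectors standing in for the real embeddings.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "day": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}
neighbors = {"flibber": ["good", "day", "bad"], "zorp": []}
vectors = average_neighbors(neighbors, toy_vectors.__getitem__)
```

Words with an empty neighbor list are skipped rather than given a meaningless vector, so callers still have to decide how to treat them.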
