While working through a skip-gram model implementation in TensorFlow on a movie dataset, I came across this function:
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        rand_sentence = np.random.choice(sentences)
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return(batch_data, label_data)
I have been trying to understand this code for a few days now, but I still can't figure it out. If you want the broader context, the complete code is here: link.
In the code, the 10,000 most common words are given numeric ids, and we pass the sentences to the function above in that numeric form. Since this is a skip-gram model, we have to look at neighboring words. But how does this algorithm achieve that? Doesn't

window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]

create windows of words that are adjacent by frequency rather than by their order in the sentence?
I would appreciate some clarification here.
Thanks a lot!
Answer:
Consider the following sentence as tokens:
sentence = ["the","book","is","on","the","table"]
Assume window_size is 3. The code that builds window_sequences can be rewritten like this:
for ix in range(len(sentence)):
    x = sentence[ix]                        # so x is the ix-th word of the sentence
    from_index = max(ix - window_size, 0)   # initial index of the window
    to_index = ix + window_size + 1         # final index of the window (excluded)
    window = sentence[from_index:to_index]  # select the words of the sentence in that range
Now let's run this code for a couple of values of ix:
ix=0, x="the", from_index=0, to_index=4, window = ["the", "book", "is", "on"]ix=3, x="on", from_index=0, to_index=7, window = ["the", "book", "is", "on", "the", "table"]
As you can see, it is building windows of words that are simply contiguous slices of the original sentence.
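If it helps, here is a small self-contained sketch you can run (it reuses the example sentence and window_size above and the same list comprehensions as generate_batch_data; everything else is only for illustration):

sentence = ["the", "book", "is", "on", "the", "table"]
window_size = 3

# Same comprehension as in generate_batch_data, applied to string tokens
window_sequences = [sentence[max(ix - window_size, 0):(ix + window_size + 1)]
                    for ix, x in enumerate(sentence)]

# Position of the center word inside each window: near the start of the
# sentence the window is truncated on the left, so the center sits at ix;
# from then on it is always at position window_size
label_indices = [ix if ix < window_size else window_size
                 for ix, x in enumerate(window_sequences)]

for ix, (win, lab) in enumerate(zip(window_sequences, label_indices)):
    print(ix, win, "center:", win[lab])

Printing the windows this way makes it easy to verify that each window is just a slice of the sentence around position ix, with the center word picked out by label_indices.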
What may have tripped you up while reading this code is that the words of the sentence have been replaced by numeric ids, and the more frequent a word is, the lower its id.
So the previous sentence would look like this:
sentence = [2,45,7,13,2,67]
The ids are not sorted by frequency; they keep exactly the same order the words have in the sentence. Only their surface form has changed from string to int, and you can follow the code just as easily on the string version.
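As a final check, here is a minimal sketch, assuming the numeric sentence above, that applies the same skip_gram branch of generate_batch_data to produce the (target word, surrounding word) pairs:

sentence = [2, 45, 7, 13, 2, 67]
window_size = 3

window_sequences = [sentence[max(ix - window_size, 0):(ix + window_size + 1)]
                    for ix, x in enumerate(sentence)]
label_indices = [ix if ix < window_size else window_size
                 for ix, x in enumerate(window_sequences)]

# skip-gram: pair each center word with every other word in its window
batch_and_labels = [(x[y], x[:y] + x[(y+1):])
                    for x, y in zip(window_sequences, label_indices)]
tuple_data = [(x, y_) for x, y in batch_and_labels for y_ in y]

print(tuple_data[:6])
# [(2, 45), (2, 7), (2, 13), (45, 2), (45, 7), (45, 13)]

Each pair is (center id, context id), which is exactly what the function later splits into batch_data and label_data; the ids are paired by their positions in the sentence, not by their frequency ranks.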