While working through a skip-gram model implementation in TensorFlow on a movie dataset, I came across this function:
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        rand_sentence = np.random.choice(sentences)
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return(batch_data, label_data)
I have been trying to understand this code for a few days now, but I still can't figure it out. If you want the broader context, the complete code is here: link.
In the code, the 10,000 most common words are given numeric ids, and we pass the sentences to the function above in that numeric form. Since this is a skip-gram model, we have to look at neighboring words. But how does this algorithm achieve that? Doesn't

window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]

create windows of words that are adjacent by frequency rather than by their order in the sentence?
I would appreciate some clarification here.
Thanks a lot!
Answer:
Consider the following sentence as tokens:
sentence = ["the","book","is","on","the","table"]
Assume window_size is 3. The code that builds window_sequences can be rewritten like this:
for ix in range(len(sentence)):
    x = sentence[ix]                        # so x is the ix-th word of the sentence
    from_index = max(ix - window_size, 0)   # initial index of the window
    to_index = ix + window_size + 1         # final index of the window (excluded)
    window = sentence[from_index:to_index]  # select the words of the sentence in that range
Now let's run this code for a couple of values of ix:
ix=0, x="the", from_index=0, to_index=4, window = ["the", "book", "is", "on"]ix=3, x="on", from_index=0, to_index=7, window = ["the", "book", "is", "on", "the", "table"]
As you can see, it is building windows of words that are simply contiguous slices of the original sentence.
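If it helps, here is a small self-contained sketch you can run (it reuses the example sentence and window_size above and the same list comprehensions as generate_batch_data; everything else is only for illustration):

sentence = ["the", "book", "is", "on", "the", "table"]
window_size = 3

# Same comprehension as in generate_batch_data, applied to string tokens
window_sequences = [sentence[max(ix - window_size, 0):(ix + window_size + 1)]
                    for ix, x in enumerate(sentence)]

# Position of the center word inside each window: near the start of the
# sentence the window is truncated on the left, so the center sits at ix;
# from then on it is always at position window_size
label_indices = [ix if ix < window_size else window_size
                 for ix, x in enumerate(window_sequences)]

for ix, (win, lab) in enumerate(zip(window_sequences, label_indices)):
    print(ix, win, "center:", win[lab])

Printing the windows this way makes it easy to verify that each window is just a slice of the sentence around position ix, with the center word picked out by label_indices.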
What may have tripped you up while reading this code is that the words of the sentence have been replaced by numeric ids, and the more frequent a word is, the lower its id.
So the previous sentence would look like this:
sentence = [2,45,7,13,2,67]
The ids are not sorted by frequency; they keep exactly the same order the words have in the sentence. Only their surface form has changed from string to int, and you can follow the code just as easily on the string version.
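As a final check, here is a minimal sketch, assuming the numeric sentence above, that applies the same skip_gram branch of generate_batch_data to produce the (target word, surrounding word) pairs:

sentence = [2, 45, 7, 13, 2, 67]
window_size = 3

window_sequences = [sentence[max(ix - window_size, 0):(ix + window_size + 1)]
                    for ix, x in enumerate(sentence)]
label_indices = [ix if ix < window_size else window_size
                 for ix, x in enumerate(window_sequences)]

# skip-gram: pair each center word with every other word in its window
batch_and_labels = [(x[y], x[:y] + x[(y+1):])
                    for x, y in zip(window_sequences, label_indices)]
tuple_data = [(x, y_) for x, y in batch_and_labels for y_ in y]

print(tuple_data[:6])
# [(2, 45), (2, 7), (2, 13), (45, 2), (45, 7), (45, 13)]

Each pair is (center id, context id), which is exactly what the function later splits into batch_data and label_data; the ids are paired by their positions in the sentence, not by their frequency ranks.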