Generating a hierarchical text representation in a TensorFlow dataset

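I am currently trying to use tf.data.Dataset to train scalably on a text-like dataset, but I am struggling to find a way to generate a hierarchical 4D representation of multi-sentence strings using built-in TensorFlow functions. In the past I would have used something like the following: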

import pandas as pd
import numpy as np
from nltk.tokenize import sent_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

max_sent_length = 50
max_sents       = 5
max_nb_words    = 100
min_freq        = 0

text = ["This game is a bit hard to get the hang of, but when you do it's great.",
        "I played it a while but it was alright. The steam was a bit of trouble. The more they move these game to steam the more of a hard time I have activating and playing a game. But in spite of that it was fun, I liked it. Now I am looking forward to anno 2205 I really want to play my way to the moon."]
df = pd.DataFrame({"text":text})

tokenizer = Tokenizer(num_words= 100, filters='.')
tokenizer.fit_on_texts(df['text'].values)
encoded_docs = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Total %s unique tokens.' % len(word_index))

# limit vocabulary size by token frequency
vocab = [k for k in tokenizer.word_counts.keys() if tokenizer.word_counts[k] > min_freq]
print('Vocabulary size with frequency > %d = %d' % (min_freq, len(vocab)))
max_nb_words = min(max_nb_words, len(vocab)) + 1 # index 0 is not used
print('Max number of words = %d' % max_nb_words)

def create_array(input_text=text, max_sents=5, max_num_words=1000, max_sent_length=50, tokenizer=tokenizer):
    # one (1, max_sents, max_sent_length) array per document, zero-padded
    data = np.zeros((1, max_sents, max_sent_length), dtype='float32')
    for j, sent in enumerate(sent_tokenize(input_text)):
        if j < max_sents:
            wordTokens = text_to_word_sequence(sent, filters='.', lower=True, split=' ')
            k = 0
            for _, word in enumerate(wordTokens):
                if k < max_sent_length:
                    if (word in tokenizer.word_index) and (tokenizer.word_index[word] <= max_num_words):
                        data[0, j, k] = tokenizer.word_index[word]
                    else:
                        data[0, j, k] = max_num_words
                    k = k + 1
    return data

my_list = [create_array(i, tokenizer=tokenizer, max_sent_length=max_sent_length, max_sents=max_sents)
           for i in df['text'].tolist()]
my_list

The expected output is:

[array([[[14.,  6., 15.,  1., 10., 11.,  2., 16.,  3., 17., 18.,  7.,
          19., 20., 21., 22., 23.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.]]], dtype=float32),
 array([[[ 4., 24.,  5.,  1., 25.,  7.,  5.,  8., 26.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 3., 12.,  8.,  1., 10.,  9., 27.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 3., 13., 28., 29., 30.,  6.,  2., 12.,  3., 13.,  9.,  1.,
          11., 31.,  4., 32., 33., 34., 35.,  1.,  6.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [ 7., 36., 37.,  9., 38.,  5.,  8., 39.,  4., 40.,  5.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.],
         [41.,  4., 42., 43., 44.,  2., 45., 46.,  4., 47., 48.,  2.,
          49., 50., 51.,  2.,  3., 52.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.]]], dtype=float32)]
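To make the "4D" target concrete: the list above holds one array of shape (1, max_sents, max_sent_length) per document, and stacking those arrays yields a single 4D array. The sketch below uses zero-filled stand-ins rather than the real token ids, and the names per_doc, stacked and squeezed are illustrative only.

```python
import numpy as np

max_sents, max_sent_length = 5, 50

# Stand-ins for the per-document arrays in my_list (zeros, not real token ids).
per_doc = [np.zeros((1, max_sents, max_sent_length), dtype="float32") for _ in range(2)]

# np.stack adds a new leading axis: (documents, 1, sentences, words).
stacked = np.stack(per_doc)

# np.concatenate joins along the existing leading axis: (documents, sentences, words).
squeezed = np.concatenate(per_doc)

print(stacked.shape, squeezed.shape)
```

Whether the singleton axis is kept depends on what the downstream model expects; both layouts carry the same values.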

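I tried to use the information in this post to create the 4D array inside a py_func:

... (code not shown) ...

But I got the following error: TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor("args_0:0", shape=(), dtype=string).

Is there a better way to encode and stack each individual sentence to produce the desired 4D array?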

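Answer:

You had a few issues. I created a custom dummy dataset and fixed the errors. This "dataset" is just random letters:

... (code not shown) ...

I made the following corrections:

The inp argument of py_function expects a list
You used square brackets instead of parentheses when trying to append to a list
set_shape is optional, so I removed it
Removed the numpy() method call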
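The first correction can be illustrated with a small self-contained sketch: passing the tensor wrapped in a list as inp avoids the "Expected list for 'input' argument" error from the question. The toy vocabulary WORD_INDEX, the helper names encode and tf_encode, and the naive split on periods are assumptions made for illustration and are not the answer's code. Inside the wrapped function the input arrives as an eager tensor, so this sketch decodes it with .numpy() there; the answer's own code may handle this differently.

```python
import numpy as np
import tensorflow as tf

MAX_SENTS, MAX_SENT_LENGTH = 5, 50

# Hypothetical toy vocabulary; the question builds its own with a Keras Tokenizer.
WORD_INDEX = {"the": 1, "game": 2, "was": 3, "fun": 4}

def encode(doc):
    """Runs eagerly: turn one document into a (1, MAX_SENTS, MAX_SENT_LENGTH) array."""
    text = doc.numpy().decode("utf-8")  # doc is an eager string tensor here
    data = np.zeros((1, MAX_SENTS, MAX_SENT_LENGTH), dtype="float32")
    # Naive sentence split on "." (assumption; the question uses nltk's sent_tokenize).
    sentences = [s for s in text.split(".") if s.strip()]
    for j, sent in enumerate(sentences[:MAX_SENTS]):
        for k, word in enumerate(sent.lower().split()[:MAX_SENT_LENGTH]):
            data[0, j, k] = WORD_INDEX.get(word, 0)  # unknown words map to 0 in this sketch
    return data

def tf_encode(doc):
    # inp must be a list of tensors, even when there is only one input.
    return tf.py_function(encode, inp=[doc], Tout=tf.float32)

docs = ["The game was fun. The game was hard.", "Fun game."]
ds = tf.data.Dataset.from_tensor_slices(docs).map(tf_encode).batch(2)
batch = next(iter(ds))
print(batch.shape)
```

Batching the per-document arrays produces the extra leading axis, so each batch is 4D: (batch, 1, sentences, words).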

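... (code not shown) ...

Final result:

... (code not shown) ...

You posted another sample dataset while I was posting my answer, so here is the same thing applied to your example:

... (code not shown) ...

Output:

... (code not shown) ...

Based on your last comment, here is an updated version:

... (code not shown) ...

... (code not shown) ...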
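Since the question asks about built-in TensorFlow functions, here is a rough sketch of how the same layout might be produced without py_function, using tf.strings.split, a ragged tensor padded with to_tensor, and a static lookup table. This is not the answer's method. The toy vocabulary, OOV_ID, the encode helper, and the naive split on "." are all assumptions; a real pipeline would need a proper sentence tokenizer and a vocabulary built from the data.

```python
import tensorflow as tf

MAX_SENTS, MAX_SENT_LENGTH = 5, 50

# Hypothetical toy vocabulary; "" is the padding token and maps to 0.
keys = tf.constant(["", "the", "game", "was", "fun"])
values = tf.constant([0, 1, 2, 3, 4], dtype=tf.int64)
OOV_ID = 5  # id assigned to words outside the toy vocabulary
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values), default_value=OOV_ID)

def encode(doc):
    # Naive sentence split on "." (assumption; not equivalent to sent_tokenize).
    sents = tf.strings.split(tf.strings.lower(doc), sep=".")
    # Whitespace split of each sentence gives a RaggedTensor: [sentences, words].
    words = tf.strings.split(sents)
    # Pad with "" and truncate to a fixed [MAX_SENTS, MAX_SENT_LENGTH] grid.
    dense = words.to_tensor(default_value="", shape=[MAX_SENTS, MAX_SENT_LENGTH])
    return table.lookup(dense)  # int64 ids, 0 where padded

docs = ["The game was fun. The game was hard.", "Fun game."]
ds = tf.data.Dataset.from_tensor_slices(docs).map(encode).batch(2)
batch = next(iter(ds))
print(batch.shape)
```

Each batch here has shape (batch, sentences, words); tf.expand_dims can add a singleton axis if the 4D layout from the question is required.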
