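I am trying to use text data as input to a linear regression model. I use the Universal Sentence Encoder from TensorFlow Hub as a pretrained model to convert my text into vectors, but it returns a tf.Tensor, and now I cannot split the data into training and test sets for a scikit-learn linear regression model (my target feature is continuous).
The following gives me the embeddings, i.e. a vector of shape (1, 512) for each text in the text column of my pandas DataFrame: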
import tensorflow_hub as hub
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(model_url)
embeddings = model(train['excerpt'])
The data looks like this:
id excerpt target
0 c12129c31 When the young people returned to the ballroom... -0.340259
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372
2 b69ac6792 As Roger had predicted, the snow departed as q... -0.580118
3 dd1000b26 And outside before the palace a great garden w... -1.054013
4 37c1b32fb Once upon a time there were Three Bears who li... 0.247197
The embeddings look like this:
tf.Tensor: shape=(2834, 512), dtype=float32, numpy=array([[-0.06747025, 0.02054032, -0.01223458, ..., 0.03468879, -0.04216784, 0.01212691],
[-0.01053216, 0.01346854, 0.01992477, ..., 0.03078162, -0.0226634 , 0.04429556],
[-0.10778417, 0.01735378, 0.00803178, ..., 0.00345916, 0.00552441, -0.02448413],
...,
[ 0.0364146 , 0.02996029, -0.06757646, ..., -0.00335971, -0.01381749, -0.08319554],
[ 0.0042374 , 0.02291174, -0.04473154, ..., -0.02009053, -0.00428826, -0.06476445],
[-0.0141812 , 0.03879716, 0.03304171, ..., 0.06709221, -0.05016331, 0.00868828]], dtype=float32)
Now I want to use these embeddings as input to a linear regression model, or any other regression model, in scikit-learn. But when I split the data with train_test_split(), I get this error:
TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`) and scalar tf.int32/tf.int64 tensors are valid indices, got array([1434, 2653, 2620, ..., 749, 2114, 2389])
This is how I split the data:
X_train,X_test,y_train,y_test = train_test_split(embeddings,train['target'],test_size =0.2, shuffle =True)
Answer:
You are passing a tensor to train_test_split. Pass a NumPy array instead, like this:
X_train,X_test,y_train,y_test = train_test_split(embeddings.numpy(), train['target'],test_size =0.2, shuffle =True)
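The reason this helps is that train_test_split selects rows with an array of shuffled integer indices, which a NumPy array supports and, as the error message says, a tf.Tensor does not. For completeness, here is a rough sketch of the remaining steps through to a fitted regressor. It is only an illustration under stated assumptions: the random array X stands in for embeddings.numpy(), the synthetic y stands in for train['target'], and random_state=42 is an arbitrary choice that was not in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: random data standing in for embeddings.numpy(), with the same
# shape as in the question (2834 rows, 512 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(2834, 512)).astype(np.float32)

# Assumption: a synthetic continuous target standing in for train['target'].
y = X @ rng.normal(size=512) + rng.normal(scale=0.1, size=2834)

# NumPy arrays accept the integer index arrays that train_test_split uses.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Fit an ordinary least squares model and score it on the held-out rows.
reg = LinearRegression()
reg.fit(X_train, y_train)
r2 = reg.score(X_test, y_test)  # coefficient of determination (R^2)

print(X_train.shape, X_test.shape)
```

With the real data, X would be embeddings.numpy() and y would be train['target']; any other scikit-learn regressor can be swapped in for LinearRegression in the same way.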