Shuffling data from a dictionary to generate test and training data

I want to split data that I get from a dictionary and a separate array into training and test data. I have tried various approaches, but without success. Because of the way this data is preprocessed in my pipeline, I initially need to keep the features in the form of a dictionary. Does anyone in the community have a suggestion?

Dictionary (feature values):

{'input1': array([42., 50., 68., ..., 60., 46., 60.]),
 'input2': array([[-2.00370455, -2.35689664, -1.96147382, ...,  2.11014128,  2.59383321,  1.24209607],
                  [-1.97130549, -2.19063663, -2.02996445, ...,  2.32125568,  2.27316046,  1.48600614],
                  [-2.01526666, -2.40440917, -1.94321752, ...,  2.15266657,  2.68460488,  1.23534095],
                  ...,
                  [-2.1359458 , -2.52428007, -1.75701785, ...,  2.25480819,  2.68114281,  1.75468981],
                  [-1.95868206, -2.23297167, -1.96401751, ...,  2.07427239,  2.60306072,  1.28556955],
                  [-1.80507278, -2.62199521, -2.08697271, ...,  2.34080577,  2.48254585,  1.52028871]])}

Target values

y = array([0.83, 0.4 , 0.53, ..., 0.  , 0.94, 1. ])
Shape: (3000,)

Creating the dictionary

# dictionary values
input1 = embeddings.numpy()
input2 = df['feature'].values
y = df['target'].values
full_model_inputs = [input1, embeddings]
original_model_inputs = dict(input1 = input1, input2 = input2)

Splitting the data

x_train, x_test, y_train, y_test = train_test_split(
    [original_model_inputs['input1'], original_model_inputs['input2']],
    y, test_size = 0.2, random_state = 6)

x_train, x_test, y_train, y_test = train_test_split(original_model_inputs, y, test_size = 0.2, random_state = 6)

Error message

ValueError: Found input variables with inconsistent numbers of samples: [2, 3000]
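For reference, both calls fail because `train_test_split` sees an object of length 2 (the two-element list in the first call, the two-key dictionary in the second) while `y` has 3000 samples. A minimal sketch of one possible workaround, passing each array as its own positional argument so all three are shuffled with the same indices (the variable names on the left are illustrative, not part of the original code):

from sklearn.model_selection import train_test_split

# split both feature arrays and the targets in one call so they stay aligned
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(
    original_model_inputs["input1"],   # shape (3000,)
    original_model_inputs["input2"],   # shape (3000, 3840)
    y,                                 # shape (3000,)
    test_size=0.2,
    random_state=6,
)

# rebuild the dictionaries the rest of the pipeline expects
x_train = dict(input1=x1_train, input2=x2_train)
x_test = dict(input1=x1_test, input2=x2_test)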

Input1:

[55., 46., 46., ..., 60., 60., 45.]
Shape: (3000,)

Input2:

[[-2.00370455, -2.35689664, -1.96147382, ...,  2.11014128,  2.59383321,  1.24209607],
 [-1.97130549, -2.19063663, -2.02996445, ...,  2.32125568,  2.27316046,  1.48600614],
 [-2.01526666, -2.40440917, -1.94321752, ...,  2.15266657,  2.68460488,  1.23534095],
 ...,
 [-2.1359458 , -2.52428007, -1.75701785, ...,  2.25480819,  2.68114281,  1.75468981],
 [-1.95868206, -2.23297167, -1.96401751, ...,  2.07427239,  2.60306072,  1.28556955],
 [-1.80507278, -2.62199521, -2.08697271, ...,  2.34080577,  2.48254585,  1.52028871]]
Shape: (3000, 3840)

Building the model

input1 = Input(shape = (1, ))
input2 = Input(shape = (3840, ))

# first branch handles the first input
x = Dense(units = 128, activation="relu")(input1)
x = BatchNormalization()(x)
x = Dense(units = 128, activation="relu")(x)
x = BatchNormalization()(x)
x = Model(inputs=input1, outputs=x)

# second branch handles the second input (the embeddings)
y = Dense(units = 128, activation="relu")(input2)
y = BatchNormalization()(y)
y = Dense(units = 128, activation="relu")(y)
y = BatchNormalization()(y)
y = Model(inputs=input2, outputs=y)

# combine the outputs of the two branches
combined = Concatenate()([x.output, y.output])
out = Dense(128, activation='relu')(combined)
out = Dropout(0.5)(out)
out = Dense(1)(out)

# the model accepts the inputs of both branches and outputs a single value
model = Model(inputs = [x.input, y.input], outputs = out)
model.compile(loss='mse', optimizer = Adam(lr = 0.001), metrics = ['mse'])
model.fit([X1, X2], Y, epochs=3)

Answer:

Put your dictionary into a pandas DataFrame; this keeps the dimensions of your data intact and splits it the way you want:

df = pd.DataFrame({"input1": original_model_inputs["input1"],
                   "input2": list(original_model_inputs["input2"])})
X_train, X_test, y_train, y_test = train_test_split(df, y)
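As a quick sanity check of what this call returns (the 3000-sample count comes from the question; with the default test_size of 0.25 the split is 2250/750):

# X_train and X_test are DataFrames with one row per sample;
# each cell of the "input2" column holds one 3840-dimensional row
print(X_train.shape, X_test.shape)   # (2250, 2) (750, 2)
print(y_train.shape, y_test.shape)   # (2250,) (750,)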

Converting back to the original format:

X_train = X_train.to_dict("list")
X_test = X_test.to_dict("list")

Edit

To keep your pipeline working, you will probably want to add the following two lines:

X_train = {k: np.array(v) for k, v in X_train.items()}
X_test = {k: np.array(v) for k, v in X_test.items()}
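Putting the pieces together, here is a self-contained sketch of the full round trip on synthetic data (the shapes 3000 and 3840 are taken from the question; the random values are only stand-ins for the real features and targets):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-ins for the real features and targets
original_model_inputs = {
    "input1": np.random.rand(3000),         # shape (3000,)
    "input2": np.random.rand(3000, 3840),   # shape (3000, 3840)
}
y = np.random.rand(3000)

# one DataFrame row per sample; each "input2" cell holds a 3840-dim row
df = pd.DataFrame({"input1": original_model_inputs["input1"],
                   "input2": list(original_model_inputs["input2"])})

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=6)

# back to dictionaries of numpy arrays, as the pipeline expects
X_train = {k: np.array(v) for k, v in X_train.to_dict("list").items()}
X_test = {k: np.array(v) for k, v in X_test.to_dict("list").items()}

print(X_train["input1"].shape)   # (2400,)
print(X_train["input2"].shape)   # (2400, 3840)

The resulting dictionaries can then be fed to the two-input model from the question, e.g. model.fit([X_train["input1"], X_train["input2"]], y_train, epochs=3).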
