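I am trying to run linear regression on the Ames housing price dataset from Kaggle.
I first cleaned the data by hand and dropped many of the features. I then trained with the following implementation.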
    import numpy as np
    import tensorflow as tf

    # x_train, y_train, x_valid, y_valid, x_test and y_test come from the
    # manual cleaning step described above.
    train_size = np.shape(x_train)[0]
    valid_size = np.shape(x_valid)[0]
    test_size = np.shape(x_test)[0]
    num_features = np.shape(x_train)[1]

    graph = tf.Graph()
    with graph.as_default():
        # Inputs
        tf_train_dataset = tf.constant(x_train)
        tf_train_labels = tf.constant(y_train)
        tf_valid_dataset = tf.constant(x_valid)
        tf_test_dataset = tf.constant(x_test)

        # Variables
        weights = tf.Variable(tf.truncated_normal([num_features, 1]))
        biases = tf.Variable(tf.zeros([1]))

        # Loss computation
        train_prediction = tf.matmul(tf_train_dataset, weights) + biases
        loss = tf.losses.mean_squared_error(tf_train_labels, train_prediction)

        # Optimizer
        # Gradient descent optimizer with learning rate alpha
        alpha = tf.constant(0.000000003, dtype=tf.float64)
        optimizer = tf.train.GradientDescentOptimizer(alpha).minimize(loss)

        # Predictions
        valid_prediction = tf.matmul(tf_valid_dataset, weights) + biases
        test_prediction = tf.matmul(tf_test_dataset, weights) + biases
This is how I run the graph:
    num_steps = 10001

    def accuracy(prediction, labels):
        # Despite the name, this returns the mean squared error.
        return ((prediction - labels) ** 2).mean(axis=None)

    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            _, l, predictions = session.run([optimizer, loss, train_prediction])
            if (step % 1000 == 0):
                print('Loss at step %d: %f' % (step, l))
                print('Validation accuracy: %.1f%%' % accuracy(valid_prediction.eval(), y_valid))
        t_pred = test_prediction.eval()
        print('Test accuracy: %.1f%%' % accuracy(t_pred, y_test))
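Here is what I have tried:
- Increasing the learning rate. If I raise it above the value I am using now, the model fails to converge: the loss blows up to infinity.
- Increasing the number of iterations to 10,000,000. The longer I iterate, the more slowly the loss converges (which is understandable), but I am still far from a reasonable value. The loss is typically a 10-digit number.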
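For a sense of scale: a mean squared error is measured in squared dollars, so its square root is easier to interpret. The sketch below converts two hypothetical 10-digit loss values into dollars. The values and the helper name `rmse_from_mse` are illustrative assumptions, not numbers taken from the run above.

```python
import math


def rmse_from_mse(mse):
    """Convert a mean squared error (squared dollars) into an RMSE (dollars)."""
    return math.sqrt(mse)


# Hypothetical loss values: the smallest and largest 10-digit integers.
for mse in (1000000000, 9999999999):
    print('MSE %d -> RMSE of about %.0f dollars' % (mse, rmse_from_mse(mse)))
```

A 10-digit loss therefore corresponds to a typical error of roughly 32,000 to 100,000 dollars per house.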
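Is something wrong with my graph setup? Or is linear regression simply a poor choice for this dataset, so that I should try a different algorithm? Any help or suggestions would be much appreciated!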
Answer:
Working code
    import csv

    import numpy as np
    import tensorflow as tf

    # Read the raw training file; the first row is the header.
    with open('train.csv', 'rt') as f:
        reader = csv.reader(f)
        your_list = list(reader)

    def toFloatNoFail(data):
        # Cells that cannot be parsed as a number become 0.
        try:
            return float(data)
        except ValueError:
            return 0

    data = [[toFloatNoFail(x) for x in row] for row in your_list[1:]]
    data = np.array(data).astype(float)

    # The last column is the label; everything before it is a feature.
    x_train = data[:, :-1]
    print(x_train.shape)
    y_train = data[:, -1:]
    print(y_train.shape)

    num_features = np.shape(x_train)[1]

    # Inputs
    tf_train_dataset = tf.constant(x_train, dtype=tf.float32)
    tf_train_labels = tf.constant(y_train, dtype=tf.float32)

    # Variables
    weights = tf.Variable(tf.truncated_normal([num_features, 1], dtype=tf.float32))
    biases = tf.Variable(tf.constant(0.0, dtype=tf.float32))

    train_prediction = tf.matmul(tf_train_dataset, weights) + biases
    # Squared difference of the logs, so errors are measured relative to the price.
    loss = tf.reduce_mean(tf.square(tf.log(tf_train_labels) - tf.log(train_prediction)))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    num_steps = 10001

    def accuracy(prediction, labels):
        return ((prediction - labels) ** 2).mean(axis=None)

    with tf.Session() as session:
        tf.global_variables_initializer().run()
        print('Initialized')
        for step in range(num_steps):
            _, l, predictions = session.run([optimizer, loss, train_prediction])
            if (step % 1000 == 0):
                print('Loss at step %d: %f' % (step, l))
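Explanation of the key change
Your loss function was not scaled relative to the price. The loss function above reflects the fact that what you actually care about is the price error in proportion to the original price. An error of $5,000 on a house worth $1,000,000 should not count as heavily as an error of $5,000 on a house worth $5,000.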
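The comparison above can be checked numerically. The following is a minimal NumPy sketch, not part of the TensorFlow graph; the helper names `squared_error` and `squared_log_error` and the two example predictions are assumptions chosen to match the $5,000 error in the text.

```python
import numpy as np


def squared_error(label, prediction):
    """Plain squared error, in squared dollars."""
    return (prediction - label) ** 2


def squared_log_error(label, prediction):
    """Squared difference of the logs, which depends only on the ratio."""
    return (np.log(label) - np.log(prediction)) ** 2


# Hypothetical (label, prediction) pairs; both predictions are $5,000 too high.
examples = {
    'expensive house': (1000000.0, 1005000.0),
    'cheap house': (5000.0, 10000.0),
}

for name, (label, prediction) in examples.items():
    print('%s: squared error = %.0f, squared log error = %.6f'
          % (name, squared_error(label, prediction),
             squared_log_error(label, prediction)))
```

Both houses get the same squared error (25,000,000), while the squared log error is about 0.48 for the cheap house and about 0.000025 for the expensive one, so the log-based loss penalizes the proportionally larger miss far more.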
The new loss function is:
    loss = tf.reduce_mean(tf.square(tf.log(tf_train_labels) - tf.log(train_prediction)))
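Written out, this loss is the mean squared difference of the logarithms. The symbols are introduced here for illustration only: $y_i$ is the label (sale price) of training example $i$, $\hat{y}_i$ is the model's prediction for it, and $N$ is the number of training examples.

```latex
\[
L \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(\log y_i - \log \hat{y}_i\bigr)^2
  \;=\; \frac{1}{N}\sum_{i=1}^{N}\Bigl(\log\frac{y_i}{\hat{y}_i}\Bigr)^2
\]
```

The second form shows that each term depends only on the ratio $y_i / \hat{y}_i$, which is why the error is measured in proportion to the price. The expression is defined only when both $y_i$ and $\hat{y}_i$ are positive, so a non-positive prediction would make the loss undefined.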