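I wrote a simple Python script that performs linear regression with Theano and should, in theory, run on the GPU. On startup it prints "Using gpu device", but according to the profiler every operation is CPU-specific (Elemwise rather than GpuElemwise, no GpuFromHost, and so on).
I have checked the variables and THEANO_FLAGS, and everything looks correct. I cannot find the problem, especially since the Theano tutorials run correctly on the GPU with the same settings :).
Here is the code: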
    # Linear regression
    import numpy
    import theano
    import theano.tensor as T

    input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
    output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

    TS = theano.shared(input_data, "training-set")
    E = theano.shared(output_data, "expected")
    W1 = theano.shared(numpy.zeros((1, 2)))

    O = T.dot(TS, W1.T)
    cost = T.mean(T.sqr(E - O.T))
    gradient = T.grad(cost=cost, wrt=W1)
    update = [[W1, W1 - gradient * 0.0001]]
    train = theano.function([], cost, updates=update, allow_input_downcast=True)

    for i in range(1000):
        train()
- THEANO_FLAGS=cuda.root=/usr/local/cuda
- device=gpu
- floatX=float32
- lib.cnmem=.5
- profile=True
- CUDA_LAUNCH_BLOCKING=1
Output:

    Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
    Function profiling
    ==================
      Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
      Time in 1000 calls to Function.__call__: 3.348637e-02s
      Time in Function.fn.__call__: 2.419019e-02s (72.239%)
      Time in thunks: 1.839781e-02s (54.941%)
      Total compile time: 1.350801e-01s
        Number of Apply nodes: 18
        Theano Optimizer time: 1.101730e-01s
           Theano validate time: 2.029657e-03s
        Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
           Import time 2.320528e-03s

    Time in all call to theano.grad() 8.740902e-03s
    Time since theano import 0.881s

    Class
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
      71.7%    71.7%    0.013s    6.59e-06s    Py    2000    2    theano.tensor.basic.Dot
      12.3%    83.9%    0.002s    3.22e-07s    C     7000    7    theano.tensor.elemwise.Elemwise
       5.7%    89.6%    0.001s    3.50e-07s    C     3000    3    theano.tensor.elemwise.DimShuffle
       4.0%    93.6%    0.001s    3.65e-07s    C     2000    2    theano.tensor.subtensor.Subtensor
       3.6%    97.2%    0.001s    3.31e-07s    C     2000    2    theano.compile.ops.Shape_i
       1.7%    98.9%    0.000s    3.06e-07s    C     1000    1    theano.tensor.opt.MakeVector
       1.1%   100.0%    0.000s    2.10e-07s    C     1000    1    theano.tensor.elemwise.Sum
       ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

    Ops
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
      71.7%    71.7%    0.013s    6.59e-06s    Py    2000    2    dot
       4.0%    75.6%    0.001s    3.65e-07s    C     2000    2    Subtensor{int64}
       3.5%    79.1%    0.001s    6.35e-07s    C     1000    1    InplaceDimShuffle{1,0}
       3.3%    82.4%    0.001s    6.06e-07s    C     1000    1    Elemwise{mul,no_inplace}
       2.4%    84.8%    0.000s    4.38e-07s    C     1000    1    Shape_i{0}
       2.3%    87.1%    0.000s    4.29e-07s    C     1000    1    Elemwise{Composite{((i0 * i1) / i2)}}
       2.3%    89.3%    0.000s    2.08e-07s    C     2000    2    InplaceDimShuffle{x,x}
       1.8%    91.1%    0.000s    3.25e-07s    C     1000    1    Elemwise{Cast{float64}}
       1.7%    92.8%    0.000s    3.06e-07s    C     1000    1    MakeVector{dtype='int64'}
       1.5%    94.3%    0.000s    2.78e-07s    C     1000    1    Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
       1.4%    95.7%    0.000s    2.53e-07s    C     1000    1    Elemwise{Sub}[(0, 1)]
       1.2%    96.9%    0.000s    2.24e-07s    C     1000    1    Shape_i{1}
       1.1%    98.0%    0.000s    2.10e-07s    C     1000    1    Sum{acc_dtype=float64}
       1.1%    99.1%    0.000s    1.98e-07s    C     1000    1    Elemwise{Sqr}[(0, 0)]
       0.9%   100.0%    0.000s    1.66e-07s    C     1000    1    Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
       ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

    Apply
    ------
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      37.8%    37.8%    0.007s    6.95e-06s    1000     3    dot(<TensorType(float64, matrix)>, training-set.T)
      33.9%    71.7%    0.006s    6.24e-06s    1000    14    dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
       3.5%    75.1%    0.001s    6.35e-07s    1000     0    InplaceDimShuffle{1,0}(training-set)
       3.3%    78.4%    0.001s    6.06e-07s    1000    11    Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
       3.0%    81.4%    0.001s    5.58e-07s    1000     8    Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
       2.4%    83.8%    0.000s    4.38e-07s    1000     2    Shape_i{0}(expected)
       2.3%    86.2%    0.000s    4.29e-07s    1000    12    Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
       1.8%    87.9%    0.000s    3.25e-07s    1000     6    Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
       1.7%    89.6%    0.000s    3.06e-07s    1000     4    MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
       1.6%    91.2%    0.000s    3.03e-07s    1000    10    InplaceDimShuffle{x,x}(Subtensor{int64}.0)
       1.5%    92.7%    0.000s    2.78e-07s    1000    16    Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
       1.4%    94.1%    0.000s    2.53e-07s    1000     5    Elemwise{Sub}[(0, 1)](expected, dot.0)
       1.2%    95.3%    0.000s    2.24e-07s    1000     1    Shape_i{1}(expected)
       1.1%    96.5%    0.000s    2.10e-07s    1000    15    Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
       1.1%    97.6%    0.000s    1.98e-07s    1000    13    Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
       0.9%    98.5%    0.000s    1.72e-07s    1000     7    Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
       0.9%    99.4%    0.000s    1.66e-07s    1000    17    Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
       0.6%   100.0%    0.000s    1.13e-07s    1000     9    InplaceDimShuffle{x,x}(Subtensor{int64}.0)
       ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
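Answer:
As mentioned in the comments, setting the allow_input_downcast parameter to True is not enough: it only affects values passed to the function when it is called, not data already stored in shared variables. You need to make sure that all data assigned to shared variables is of type float32. Here the integer matrices and the numpy.zeros weights default to 64-bit types, which is why the profile shows float64 ops running on the CPU. As of January 6, 2016, Theano still cannot perform GPU computations with any data type other than float32, as detailed here. You therefore have to convert your data to float32.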
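To make the dtype issue concrete, here is a minimal NumPy-only sketch of checking and converting the arrays before they are handed to theano.shared. The helper name `as_float32` is hypothetical, and the sketch uses numpy.array instead of numpy.matrix purely for illustration; it does not touch Theano itself.

```python
import numpy as np


def as_float32(data):
    """Return `data` as a float32 ndarray (hypothetical helper, not a Theano API)."""
    return np.asarray(data, dtype=np.float32)


# Same training data as in the question, built from Python ints.
input_data = np.array([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
weights = np.zeros((1, 2))

# Integer literals produce an integer array; numpy.zeros defaults to float64.
print(input_data.dtype)
print(weights.dtype)

# Convert explicitly so that every array is float32 before it is shared.
input32 = as_float32(input_data)
weights32 = as_float32(weights)
print(input32.dtype, weights32.dtype)
```

Printing the dtype of each array before wrapping it in a shared variable is a cheap way to spot a stray 64-bit array.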
Here is the corrected version of your script:
    import numpy
    import theano
    import theano.tensor as T

    input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
    output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

    TS = theano.shared(input_data.astype('float32'), "training-set")
    E = theano.shared(output_data.astype('float32'), "expected")
    W1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))

    O = T.dot(TS, W1.T)
    cost = T.mean(T.sqr(E - O.T))
    gradient = T.grad(cost=cost, wrt=W1)
    update = [[W1, W1 - gradient * 0.0001]]
    train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)

    for i in range(1000):
        train()

    train.profile.print_summary()
Here are the profiling results:

      Message: LearnTheano.py:18
      Time in 1000 calls to Function.__call__: 2.642968e-01s
      Time in Function.fn.__call__: 2.460811e-01s (93.108%)
      Time in thunks: 1.877530e-01s (71.039%)
      Total compile time: 2.483290e+01s
        Number of Apply nodes: 17
        Theano Optimizer time: 2.818849e-01s
           Theano validate time: 3.435850e-03s
        Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
           Import time 1.241469e-02s

    Time in all call to theano.grad() 1.206994e-02s

    Class
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
      34.8%    34.8%    0.065s    3.27e-05s    C    2000    2    theano.sandbox.cuda.blas.GpuGemm
      28.8%    63.5%    0.054s    1.80e-05s    C    3000    3    theano.sandbox.cuda.basic_ops.GpuElemwise
      12.9%    76.4%    0.024s    2.42e-05s    C    1000    1    theano.sandbox.cuda.basic_ops.GpuCAReduce
      10.3%    86.7%    0.019s    1.93e-05s    C    1000    1    theano.sandbox.cuda.basic_ops.GpuFromHost
       7.2%    93.9%    0.014s    1.36e-05s    C    1000    1    theano.sandbox.cuda.basic_ops.HostFromGpu
       1.8%    95.7%    0.003s    1.13e-06s    C    3000    3    theano.sandbox.cuda.basic_ops.GpuDimShuffle
       1.5%    97.2%    0.003s    2.81e-06s    C    1000    1    theano.tensor.elemwise.Elemwise
       1.1%    98.4%    0.002s    1.08e-06s    C    2000    2    theano.compile.ops.Shape_i
       1.1%    99.5%    0.002s    1.02e-06s    C    2000    2    theano.sandbox.cuda.basic_ops.GpuSubtensor
       0.5%   100.0%    0.001s    9.96e-07s    C    1000    1    theano.tensor.opt.MakeVector
       ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

    Ops
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
      25.3%    25.3%    0.047s    4.74e-05s    C    1000    1    GpuGemm{no_inplace}
      12.9%    38.1%    0.024s    2.42e-05s    C    1000    1    GpuCAReduce{pre=sqr,red=add}{1,1}
      12.8%    51.0%    0.024s    2.41e-05s    C    1000    1    GpuElemwise{mul,no_inplace}
      10.3%    61.3%    0.019s    1.93e-05s    C    1000    1    GpuFromHost
       9.5%    70.8%    0.018s    1.79e-05s    C    1000    1    GpuGemm{inplace}
       8.2%    79.0%    0.015s    1.55e-05s    C    1000    1    GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
       7.7%    86.7%    0.014s    1.44e-05s    C    1000    1    GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
       7.2%    93.9%    0.014s    1.36e-05s    C    1000    1    HostFromGpu
       1.5%    95.4%    0.003s    2.81e-06s    C    1000    1    Elemwise{Cast{float32}}
       1.1%    96.5%    0.002s    1.02e-06s    C    2000    2    GpuSubtensor{int64}
       1.0%    97.5%    0.002s    9.00e-07s    C    2000    2    GpuDimShuffle{x,x}
       0.8%    98.3%    0.002s    1.59e-06s    C    1000    1    GpuDimShuffle{1,0}
       0.7%    99.1%    0.001s    1.38e-06s    C    1000    1    Shape_i{0}
       0.5%    99.6%    0.001s    9.96e-07s    C    1000    1    MakeVector
       0.4%   100.0%    0.001s    7.76e-07s    C    1000    1    Shape_i{1}
       ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

    Apply
    ------
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      25.3%    25.3%    0.047s    4.74e-05s    1000     3    GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
      12.9%    38.1%    0.024s    2.42e-05s    1000     5    GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
      12.8%    51.0%    0.024s    2.41e-05s    1000    13    GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
      10.3%    61.3%    0.019s    1.93e-05s    1000     7    GpuFromHost(Elemwise{Cast{float32}}.0)
       9.5%    70.8%    0.018s    1.79e-05s    1000    16    GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
       8.2%    79.0%    0.015s    1.55e-05s    1000    12    GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
       7.7%    86.7%    0.014s    1.44e-05s    1000    15    GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
       7.2%    93.9%    0.014s    1.36e-05s    1000    14    HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
       1.5%    95.4%    0.003s    2.81e-06s    1000     6    Elemwise{Cast{float32}}(MakeVector.0)
       0.8%    96.3%    0.002s    1.59e-06s    1000     0    GpuDimShuffle{1,0}(training-set)
       0.7%    97.0%    0.001s    1.38e-06s    1000     2    Shape_i{0}(expected)
       0.7%    97.7%    0.001s    1.30e-06s    1000     8    GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
       0.6%    98.3%    0.001s    1.08e-06s    1000    11    GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
       0.5%    98.8%    0.001s    9.96e-07s    1000     4    MakeVector(Shape_i{0}.0, Shape_i{1}.0)
       0.4%    99.2%    0.001s    7.76e-07s    1000     1    Shape_i{1}(expected)
       0.4%    99.6%    0.001s    7.40e-07s    1000     9    GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
       0.4%   100.0%    0.001s    7.25e-07s    1000    10    GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
       ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
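One rough way to read the two profiles side by side is to add up the time share of the GPU-related op classes. The sketch below does that by hand: `gpu_time_share` and the two row lists are hypothetical names, the percentages are copied from the Class tables in the question and in this answer, and the matching rule assumes that GPU classes in this backend are the ones whose short name starts with Gpu, plus the HostFromGpu transfer op.

```python
def gpu_time_share(class_rows):
    """Sum the <% time> column over GPU-related classes.

    class_rows: iterable of (percent_time, class_name) pairs copied by hand
    from the "Class" table of a Theano profile (hypothetical input format).
    """
    total = 0.0
    for percent, class_name in class_rows:
        short_name = class_name.rsplit(".", 1)[-1]
        # Assumption: GPU classes are named Gpu*, plus the HostFromGpu transfer.
        if short_name.startswith("Gpu") or short_name == "HostFromGpu":
            total += percent
    return total


# Class table from the question (data not converted to float32).
before = [
    (71.7, "theano.tensor.basic.Dot"),
    (12.3, "theano.tensor.elemwise.Elemwise"),
    (5.7, "theano.tensor.elemwise.DimShuffle"),
    (4.0, "theano.tensor.subtensor.Subtensor"),
    (3.6, "theano.compile.ops.Shape_i"),
    (1.7, "theano.tensor.opt.MakeVector"),
    (1.1, "theano.tensor.elemwise.Sum"),
]

# Class table from this answer (shared variables stored as float32).
after = [
    (34.8, "theano.sandbox.cuda.blas.GpuGemm"),
    (28.8, "theano.sandbox.cuda.basic_ops.GpuElemwise"),
    (12.9, "theano.sandbox.cuda.basic_ops.GpuCAReduce"),
    (10.3, "theano.sandbox.cuda.basic_ops.GpuFromHost"),
    (7.2, "theano.sandbox.cuda.basic_ops.HostFromGpu"),
    (1.8, "theano.sandbox.cuda.basic_ops.GpuDimShuffle"),
    (1.5, "theano.tensor.elemwise.Elemwise"),
    (1.1, "theano.compile.ops.Shape_i"),
    (1.1, "theano.sandbox.cuda.basic_ops.GpuSubtensor"),
    (0.5, "theano.tensor.opt.MakeVector"),
]

print(round(gpu_time_share(before), 1))
print(round(gpu_time_share(after), 1))
```

The first profile contains no GPU classes at all, while in the second nearly all of the time is attributed to Gpu* ops and host/device transfers, which is the sign that the graph was actually moved to the GPU.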