I am trying to implement a Siamese network in Caffe, composed of two ImageNet networks that do not share weights. My basic goal is to feed one image to each network and ultimately compute the distance between them to determine similarity. Below is my prototxt file. My main question is: what should I set "num_output" to? I have two training classes: 0 means the pair is dissimilar, 1 means it is similar.
```
name: "Siamese_ImageNet"
layers { name: "data" type: IMAGE_DATA top: "data" top: "label" image_data_param { source: "train1.txt" batch_size: 32 new_height: 256 new_width: 256 } include: { phase: TRAIN } }
layers { name: "data" type: IMAGE_DATA top: "data" top: "label" image_data_param { source: "test1.txt" batch_size: 32 new_height: 256 new_width: 256 } include: { phase: TEST } }
layers { name: "data_p" type: IMAGE_DATA top: "data_p" top: "label_p" image_data_param { source: "train2.txt" batch_size: 32 new_height: 256 new_width: 256 } include: { phase: TRAIN } }
layers { name: "data_p" type: IMAGE_DATA top: "data_p" top: "label_p" image_data_param { source: "test2.txt" batch_size: 32 new_height: 256 new_width: 256 } include: { phase: TEST } }
layers { name: "conv1" type: CONVOLUTION bottom: "data" top: "conv1" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 96 kernel_size: 11 stride: 4 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layers { name: "relu1" type: RELU bottom: "conv1" top: "conv1" }
layers { name: "pool1" type: POOLING bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "norm1" type: LRN bottom: "pool1" top: "norm1" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } }
layers { name: "conv2" type: CONVOLUTION bottom: "norm1" top: "conv2" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu2" type: RELU bottom: "conv2" top: "conv2" }
layers { name: "pool2" type: POOLING bottom: "conv2" top: "pool2" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "norm2" type: LRN bottom: "pool2" top: "norm2" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } }
layers { name: "conv3" type: CONVOLUTION bottom: "norm2" top: "conv3" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 384 pad: 1 kernel_size: 3 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layers { name: "relu3" type: RELU bottom: "conv3" top: "conv3" }
layers { name: "conv4" type: CONVOLUTION bottom: "conv3" top: "conv4" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu4" type: RELU bottom: "conv4" top: "conv4" }
layers { name: "conv5" type: CONVOLUTION bottom: "conv4" top: "conv5" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu5" type: RELU bottom: "conv5" top: "conv5" }
layers { name: "pool5" type: POOLING bottom: "conv5" top: "pool5" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "fc6" type: INNER_PRODUCT bottom: "pool5" top: "fc6" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu6" type: RELU bottom: "fc6" top: "fc6" }
layers { name: "drop6" type: DROPOUT bottom: "fc6" top: "fc6" dropout_param { dropout_ratio: 0.5 } }
layers { name: "fc7" type: INNER_PRODUCT bottom: "fc6" top: "fc7" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 inner_product_param { num_output: 2 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu7" type: RELU bottom: "fc7" top: "fc7" }
layers { name: "drop7" type: DROPOUT bottom: "fc7" top: "fc7" dropout_param { dropout_ratio: 0.5 } }
layers { name: "conv1_p" type: CONVOLUTION bottom: "data_p" top: "conv1_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 96 kernel_size: 11 stride: 4 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layers { name: "relu1_p" type: RELU bottom: "conv1_p" top: "conv1_p" }
layers { name: "pool1_p" type: POOLING bottom: "conv1_p" top: "pool1_p" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "norm1_p" type: LRN bottom: "pool1_p" top: "norm1_p" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } }
layers { name: "conv2_p" type: CONVOLUTION bottom: "norm1_p" top: "conv2_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 256 pad: 2 kernel_size: 5 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu2_p" type: RELU bottom: "conv2_p" top: "conv2_p" }
layers { name: "pool2_p" type: POOLING bottom: "conv2_p" top: "pool2_p" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "norm2_p" type: LRN bottom: "pool2_p" top: "norm2_p" lrn_param { local_size: 5 alpha: 0.0001 beta: 0.75 } }
layers { name: "conv3_p" type: CONVOLUTION bottom: "norm2_p" top: "conv3_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 384 pad: 1 kernel_size: 3 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 0 } } }
layers { name: "relu3_p" type: RELU bottom: "conv3_p" top: "conv3_p" }
layers { name: "conv4_p" type: CONVOLUTION bottom: "conv3_p" top: "conv4_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 384 pad: 1 kernel_size: 3 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu4_p" type: RELU bottom: "conv4_p" top: "conv4_p" }
layers { name: "conv5_p" type: CONVOLUTION bottom: "conv4_p" top: "conv5_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 convolution_param { num_output: 256 pad: 1 kernel_size: 3 group: 2 weight_filler { type: "gaussian" std: 0.01 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu5_p" type: RELU bottom: "conv5_p" top: "conv5_p" }
layers { name: "pool5_p" type: POOLING bottom: "conv5_p" top: "pool5_p" pooling_param { pool: MAX kernel_size: 3 stride: 2 } }
layers { name: "fc6_p" type: INNER_PRODUCT bottom: "pool5_p" top: "fc6_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 inner_product_param { num_output: 4096 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu6_p" type: RELU bottom: "fc6_p" top: "fc6_p" }
layers { name: "drop6_p" type: DROPOUT bottom: "fc6_p" top: "fc6_p" dropout_param { dropout_ratio: 0.5 } }
layers { name: "fc7_p" type: INNER_PRODUCT bottom: "fc6_p" top: "fc7_p" blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0 inner_product_param { num_output: 2 weight_filler { type: "gaussian" std: 0.005 } bias_filler { type: "constant" value: 1 } } }
layers { name: "relu7_p" type: RELU bottom: "fc7_p" top: "fc7_p" }
layers { name: "drop7_p" type: DROPOUT bottom: "fc7_p" top: "fc7_p" dropout_param { dropout_ratio: 0.5 } }
layers { name: "loss" type: CONTRASTIVE_LOSS contrastive_loss_param { margin: 1.0 } bottom: "fc7" bottom: "fc7_p" bottom: "label" top: "loss" }
```
My training file structure (0 means dissimilar, 1 means similar):
```
train1.txt:
/aer/img1_1.jpg 0
/aer/img1_2.jpg 1
/aer/img1_3.jpg 1

train2.txt:
/tpd/img2_1.jpg 0
/tpd/img2_2.jpg 1
/tpd/img2_3.jpg 1
```
Answer:
What should I set "num_output" to?
Before deciding what to set num_output to, let's first clarify what it means. You can think of the two halves of the Siamese network, data -> fc7 and data_p -> fc7_p, as two feature extractors. Each extracts a feature vector (fc7 and fc7_p) from the image in its corresponding data layer. num_output therefore defines the dimension of the extracted feature vector.
During training, the ContrastiveLoss layer tries to minimize the distance between the two extracted feature vectors when the images are similar (label == 1) and to maximize it when they are dissimilar (label == 0). In other words, the smaller the distance between the feature vectors, the more similar the images.
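To make this concrete, here is a minimal NumPy sketch of the quantity Caffe's ContrastiveLoss layer computes (the Hadsell et al. formulation, with label 1 = similar as in this setup). This is an illustration, not the actual Caffe implementation:

```python
import numpy as np

def contrastive_loss(f1, f2, labels, margin=1.0):
    """Contrastive loss over a batch of feature-vector pairs.

    f1, f2 : (batch, dim) arrays, e.g. the fc7 and fc7_p outputs.
    labels : 1 = similar pair, 0 = dissimilar pair.
    """
    d = np.linalg.norm(f1 - f2, axis=1)  # Euclidean distance per pair
    # similar pairs are pulled together: penalty grows with distance
    similar_term = labels * d ** 2
    # dissimilar pairs are pushed apart, but only up to the margin
    dissimilar_term = (1 - labels) * np.maximum(margin - d, 0) ** 2
    return np.mean(similar_term + dissimilar_term) / 2

# identical similar pair and a far-apart dissimilar pair both incur zero loss
f1 = np.array([[1.0, 0.0], [0.0, 0.0]])
f2 = np.array([[1.0, 0.0], [5.0, 0.0]])
print(contrastive_loss(f1, f2, np.array([1, 0])))  # -> 0.0
```

Note the margin: once a dissimilar pair is farther apart than `margin`, it contributes nothing more to the loss, so training effort concentrates on pairs that are still confusable.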
So what is the optimal dimension of the feature vector, i.e. what should you set num_output to, so that it best captures the information indicating similarity? There is probably no single exact value; it depends on the encoding quality of the feature extractor (you can regard the feature vector as an encoding of the image) and on how hard it is to recognize the similarity of the images. Basically, if the network (feature extractor) is deep enough and recognizing similarity is not too hard, you can choose a relatively small num_output, e.g. 200, since a larger network encodes features better and makes them more discriminative. If not, try a larger value, e.g. 500 or 1000, or a more sophisticated network.
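For example, using a 500-dimensional feature vector would only require changing num_output in the final feature layers, fc7 and fc7_p (this is just the fc7 definition from the prototxt above with the dimension changed; the value 500 is illustrative, not a verified choice):

```
layers {
  name: "fc7" type: INNER_PRODUCT bottom: "fc6" top: "fc7"
  blobs_lr: 1 blobs_lr: 2 weight_decay: 1 weight_decay: 0
  inner_product_param {
    num_output: 500  # dimension of the extracted feature vector
    weight_filler { type: "gaussian" std: 0.005 }
    bias_filler { type: "constant" value: 1 }
  }
}
# ...and likewise for fc7_p
```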
If you want to try a MultinomialLogisticLoss layer instead of the ContrastiveLoss layer, you should first fuse the two feature vectors fc7 and fc7_p into one using a layer such as CONCAT, then feed the result into a SOFTMAX_LOSS layer, like this:
```
... # original layers
layers {
  name: "concat"
  type: CONCAT
  bottom: "fc7"
  bottom: "fc7_p"
  top: "fc_concat"  # concatenate fc7 and fc7_p along the channel axis
}
layer {
  name: "fc_cls"
  type: INNER_PRODUCT
  bottom: "fc_concat"
  top: "fc_cls"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 2  # a binary classification problem in this case
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}
layer {
  name: "accuracy"
  type: ACCURACY
  bottom: "fc_cls"
  bottom: "label"
  top: "accuracy"
  include { phase: TEST }
}
layer {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "fc_cls"
  bottom: "label"
  top: "loss"
}
```
Update:
Which method is better for comparing similarity at deployment, Contrastive Loss or Softmax Loss?
Softmax loss is simple and easy to deploy, but it only gives you a binary prediction: similar or dissimilar. The probability distribution it produces over the two classes (similar, dissimilar) is often too hard (too peaked), e.g. [0.9*, 0.0*] or [0.0*, 0.9*], and in many cases does not reflect the input's degree of similarity well.
With contrastive loss, on the other hand, you obtain a discriminative feature vector for each image. You can use that vector to compute a probability of similarity, as the CVPR 2005 paper Learning a Similarity Metric Discriminatively, with Application to Face Verification does in Section 4.1 (the key point is to compute a multivariate normal density using the feature vectors generated from images of the same subject). You can also apply a threshold to trade off the model's false positive and false negative rates, and plot an ROC curve to evaluate the model better.
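Sweeping that threshold to obtain ROC points can be sketched as follows (a minimal NumPy illustration with made-up distances; in practice the distances would come from your trained network's fc7/fc7_p outputs):

```python
import numpy as np

def roc_points(distances, labels, thresholds):
    """For each threshold t, predict 'similar' when distance < t.

    labels: 1 = genuinely similar pair, 0 = dissimilar pair.
    Returns (false_positive_rate, true_positive_rate) arrays for an ROC curve.
    """
    labels = np.asarray(labels, dtype=bool)
    distances = np.asarray(distances)
    fpr, tpr = [], []
    for t in thresholds:
        pred = distances < t
        tpr.append(np.mean(pred[labels]))    # similar pairs correctly accepted
        fpr.append(np.mean(pred[~labels]))   # dissimilar pairs wrongly accepted
    return np.array(fpr), np.array(tpr)

# toy example with well-separated distances
d = [0.1, 0.2, 0.9, 1.1]   # pair distances (hypothetical)
y = [1, 1, 0, 0]           # ground-truth similarity
fpr, tpr = roc_points(d, y, thresholds=[0.5])
print(fpr, tpr)  # -> [0.] [1.]
```

Raising the threshold accepts more pairs (higher true positive rate but also higher false positive rate); lowering it does the opposite, which is exactly the trade-off the ROC curve visualizes.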
By the way, to explore more CNN architectures for predicting similarity, you can refer to the CVPR 2015 paper Learning to Compare Image Patches via Convolutional Neural Networks.