A PyTorch Implementation and Walkthrough of the SSD Algorithm

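First, the GitHub repository: https://github.com/acm5656/ssd_pytorch

And the GitHub repository of the code it is based on: https://github.com/amdegroot/ssd.pytorch

Why reimplement it in PyTorch? Much of the code written by experts is really not beginner-friendly; you can stare at it for ages and still not understand it. So, for learning and practice, I tried reimplementing it and am sharing the result to help other beginners. Experts who take a look are welcome to offer suggestions~

Readers interested in how SSD works can refer to my other post, https://www.cnblogs.com/cmai/p/10076050.html, which explains the SSD model. This post covers the code only and does not repeat the theory.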

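First, the project layout:

VOCdevkit: holds the training data

weights: holds the weight files

Config.py: default configuration values

Test.py: runs detection on a single image

Train.py: the training script

augmentation.py: data augmentation, mainly used to enlarge the training data

detection.py: filters the raw detection results and passes them to Test.py

l2norm.py: performs L2 normalization

loss_function.py: computes the loss

ssd_net_vgg.py: the SSD model implementation

utils.py: utility functions

voc0712.py: overrides the dataset class, loading the VOC data and normalizing its format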

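Building the model

The model is built in ssd_net_vgg.py. Only one point about this class needs mentioning: the VGG part of the network must be constructed the way I do it here, otherwise loading the pre-trained model fails. The specific reason is not covered here.

In the implementation, the loc and conf predictions are extracted separately. This does not affect normal use; it just makes the loss computation easier to code.

Computing the default boxes

The code is in utils.py: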

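from itertools import product
from math import sqrt

import torch

import Config


def default_prior_box():
    """Generate the default boxes (cx, cy, w, h) for every feature layer."""
    mean_layer = []
    for k, f in enumerate(Config.feature_map):
        mean = []
        for i, j in product(range(f), repeat=2):
            # Effective feature-map size of this layer
            f_k = Config.image_size / Config.steps[k]
            # Box center, normalized to [0, 1]
            cx = (j + 0.5) / f_k
            cy = (i + 0.5) / f_k
            # Square box at this layer's scale
            s_k = Config.sk[k] / Config.image_size
            mean += [cx, cy, s_k, s_k]
            # Square box at the geometric mean of this scale and the next one
            s_k_prime = sqrt(s_k * Config.sk[k + 1] / Config.image_size)
            mean += [cx, cy, s_k_prime, s_k_prime]
            # One wide and one tall box per aspect ratio
            for ar in Config.aspect_ratios[k]:
                mean += [cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)]
                mean += [cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)]
        if Config.use_cuda:
            mean = torch.Tensor(mean).cuda().view(Config.feature_map[k], Config.feature_map[k], -1).contiguous()
        else:
            mean = torch.Tensor(mean).view(Config.feature_map[k], Config.feature_map[k], -1).contiguous()
        # Keep every coordinate inside the image
        mean.clamp_(max=1, min=0)
        mean_layer.append(mean)
    return mean_layer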

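This function generates the default boxes, and their number matches the paper. It returns a list of six tensors, one per feature layer, each holding the default boxes of that layer. For the exact counts, see my earlier post on the SSD paper; the formulas are explained there as well.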
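To make the counts concrete, here is a small pure-Python sketch that mirrors the loops of default_prior_box without torch. The constants are assumptions: they are the usual SSD300 settings (as in the reference ssd.pytorch configuration), not values read from this project's Config.py, so substitute your own if they differ. The helper names boxes_per_layer and boxes_for_cell are made up for this illustration.

```python
from math import sqrt

# Assumed SSD300 settings; replace with the values in your own Config.py
IMAGE_SIZE = 300
FEATURE_MAP = [38, 19, 10, 5, 3, 1]
STEPS = [8, 16, 32, 64, 100, 300]
SK = [30, 60, 111, 162, 213, 264, 315]
ASPECT_RATIOS = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]


def boxes_per_layer():
    """Number of default boxes produced by each feature layer."""
    counts = []
    for k, f in enumerate(FEATURE_MAP):
        # Two square boxes plus two boxes per aspect ratio in every cell
        per_cell = 2 + 2 * len(ASPECT_RATIOS[k])
        counts.append(f * f * per_cell)
    return counts


def boxes_for_cell(k, i, j):
    """Default boxes (cx, cy, w, h) of cell (row i, column j) on layer k."""
    f_k = IMAGE_SIZE / STEPS[k]
    cx, cy = (j + 0.5) / f_k, (i + 0.5) / f_k
    s_k = SK[k] / IMAGE_SIZE
    s_k_prime = sqrt(s_k * SK[k + 1] / IMAGE_SIZE)
    boxes = [(cx, cy, s_k, s_k), (cx, cy, s_k_prime, s_k_prime)]
    for ar in ASPECT_RATIOS[k]:
        boxes.append((cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)))
        boxes.append((cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)))
    return boxes


counts = boxes_per_layer()
print(counts, sum(counts))  # [5776, 2166, 600, 150, 36, 4] 8732
```

Under these assumed settings the total is 8732, the same number that appears in the comments of the loss code.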

Computing the loss

The loss is implemented in loss_function.py. The core code is:

import torch
import torch.nn as nn
import torch.nn.functional as F

import Config
import utils


class LossFun(nn.Module):
    def __init__(self):
        super(LossFun, self).__init__()

    def forward(self, prediction, targets, priors_boxes):
        loc_data, conf_data = prediction
        # Flatten the per-layer outputs and concatenate them along the box dimension
        # (21 = 20 VOC classes + background)
        loc_data = torch.cat([o.view(o.size(0), -1, 4) for o in loc_data], 1)
        conf_data = torch.cat([o.view(o.size(0), -1, 21) for o in conf_data], 1)
        priors_boxes = torch.cat([o.view(-1, 4) for o in priors_boxes], 0)
        if Config.use_cuda:
            loc_data = loc_data.cuda()
            conf_data = conf_data.cuda()
            priors_boxes = priors_boxes.cuda()
        # Batch size
        batch_num = loc_data.size(0)
        # Number of default boxes
        box_num = loc_data.size(1)
        # Regression targets: the ground truth encoded relative to each prior box
        target_loc = torch.Tensor(batch_num, box_num, 4)
        target_loc.requires_grad_(requires_grad=False)
        # Class target of each default box
        target_conf = torch.LongTensor(batch_num, box_num)
        target_conf.requires_grad_(requires_grad=False)
        if Config.use_cuda:
            target_loc = target_loc.cuda()
            target_conf = target_conf.cuda()
        # A batch may hold several images; each iteration fills in the loc and conf
        # targets for the 8732 boxes of one image
        for batch_id in range(batch_num):
            target_truths = targets[batch_id][:, :-1].data
            target_labels = targets[batch_id][:, -1].data
            if Config.use_cuda:
                target_truths = target_truths.cuda()
                target_labels = target_labels.cuda()
            # Match priors to ground truth and encode the targets used by the loc loss.
            # NOTE: the match() listing later in this post also takes a `variances`
            # argument, which this call does not pass; the two must be made consistent.
            utils.match(0.5, target_truths, priors_boxes, target_labels, target_loc, target_conf, batch_id)
        pos = target_conf > 0
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        # Equivalent to multiplying the smooth L1 loss by x_ij in the paper:
        # only positive boxes contribute
        pre_loc_xij = loc_data[pos_idx].view(-1, 4)
        tar_loc_xij = target_loc[pos_idx].view(-1, 4)
        # Smooth L1 loss between the predicted and the encoded target offsets
        # (reduction='sum' replaces the deprecated size_average=False)
        loss_loc = F.smooth_l1_loss(pre_loc_xij, tar_loc_xij, reduction='sum')
        batch_conf = conf_data.view(-1, 21)
        # Per-box confidence loss, following the conf term in the paper
        loss_c = utils.log_sum_exp(batch_conf) - batch_conf.gather(1, target_conf.view(-1, 1))
        loss_c = loss_c.view(batch_num, -1)
        # Zero out the positives so that only negatives are ranked
        loss_c[pos] = 0
        # Rank the remaining negatives by loss and keep the required number of them
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(3 * num_pos, max=pos.size(1) - 1)
        # Gather the positives and the selected negatives
        neg = idx_rank < num_neg.expand_as(idx_rank)
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)].view(-1, 21)
        targets_weighted = target_conf[(pos + neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')
        # Normalize both losses by the number of positive boxes
        N = num_pos.data.sum().double()
        loss_l = loss_loc.double()
        loss_c = loss_c.double()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c
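The two consecutive sort calls in forward can be puzzling at first: the first sort orders the boxes by descending loss, and the second turns that ordering into a rank for every box, so that `idx_rank < num_neg` keeps exactly the hardest negatives. Below is a minimal pure-Python sketch of the same idea for a single image. The function name select_hard_negatives and the sample numbers are made up for illustration; the 3:1 ratio is the one used in the code above.

```python
def select_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Return a mask marking the negatives kept for the confidence loss."""
    # Positives are taken out of the ranking by zeroing their loss
    masked = [0.0 if pos else loss for loss, pos in zip(losses, is_positive)]
    # First sort: box indices ordered by descending loss
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # Second sort (done here by inverting the ordering):
    # rank[i] is the position of box i in that ordering
    rank = [0] * len(masked)
    for position, index in enumerate(order):
        rank[index] = position
    num_pos = sum(is_positive)
    num_neg = min(neg_pos_ratio * num_pos, len(masked) - 1)
    return [r < num_neg for r in rank]


losses = [0.2, 5.0, 0.9, 0.1, 3.0, 0.4, 0.05, 1.5]
is_positive = [False, True, False, False, False, False, False, False]
mask = select_hard_negatives(losses, is_positive)
print([i for i, keep in enumerate(mask) if keep])  # [2, 4, 7]
```

With one positive box, the three negatives with the largest loss (boxes 4, 7 and 2) are kept and the rest are ignored.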

The most involved part is the match function. Its code is:

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    """Compute the jaccard overlap between the default boxes and the ground truth,
    find the best ground truth for each box and the best box for each ground truth,
    then write the encoded targets into loc_t and conf_t.
    Args:
        threshold: (float) jaccard overlap threshold.
        truths: (tensor) ground truth boxes.
        priors: (tensor) default boxes.
        variances: (tensor) passed on to encode(). I am not yet sure what it means;
            in my tests, training also works without it.
        labels: (tensor) class label of each ground truth object in the image.
        loc_t: (tensor) output; receives the encoded loc targets of every box.
        conf_t: (tensor) output; receives the class label matched to every box.
        idx: (int) index of the current image in the batch.
    """
    # Compute the jaccard overlaps
    overlaps = jaccard(
        truths,
        # Convert the priors to (x_min, y_min, x_max, y_max)
        point_form(priors)
    )
    # [num_objects] for each ground truth object, the default box with the
    # largest overlap and its index (the best prior for each ground truth)
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [num_priors] for each default box, the ground truth object with the
    # largest overlap (the best ground truth for each prior)
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)
    best_truth_overlap.squeeze_(0)
    best_prior_idx.squeeze_(1)
    best_prior_overlap.squeeze_(1)
    # Set the overlap of each ground truth's best prior to 2 so that the
    # threshold below can never turn it into background
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)
    # Make sure each ground truth's best prior is matched to that ground truth
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]          # Shape: [num_priors,4]
    conf = labels[best_truth_idx] + 1         # Shape: [num_priors]
    conf[best_truth_overlap < threshold] = 0  # label as background
    # Encode the matched boxes relative to the priors; the formula is the one
    # used in the loc loss of the paper
    loc = encode(matches, priors, variances)
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior
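The helpers jaccard and point_form live in utils.py and are not listed in this post, so here is a rough pure-Python stand-in that walks through the same matching steps on plain tuples. It is a sketch under assumptions, not the project's code: ground truth boxes are taken to be (x_min, y_min, x_max, y_max), priors (cx, cy, w, h), and match_labels is a hypothetical name that returns only the class targets.

```python
def point_form(box):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def jaccard(a, b):
    """Intersection over union of two corner-form boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_labels(truths, labels, priors, threshold=0.5):
    """Class target per prior: label + 1 for a match, 0 for background."""
    corners = [point_form(p) for p in priors]
    overlaps = [[jaccard(t, c) for c in corners] for t in truths]
    # Best ground truth for each prior
    best_truth_idx, best_truth_overlap = [], []
    for p in range(len(priors)):
        t_best = max(range(len(truths)), key=lambda t: overlaps[t][p])
        best_truth_idx.append(t_best)
        best_truth_overlap.append(overlaps[t_best][p])
    # Best prior for each ground truth is always kept as a positive
    for t in range(len(truths)):
        p_best = max(range(len(priors)), key=lambda p: overlaps[t][p])
        best_truth_idx[p_best] = t
        best_truth_overlap[p_best] = 2.0
    return [
        labels[best_truth_idx[p]] + 1 if best_truth_overlap[p] >= threshold else 0
        for p in range(len(priors))
    ]


truths = [(0.0, 0.0, 0.4, 0.4)]
labels = [6]
priors = [(0.2, 0.2, 0.4, 0.4), (0.3, 0.3, 0.4, 0.4), (0.8, 0.8, 0.2, 0.2)]
print(match_labels(truths, labels, priors))  # [7, 0, 0]
```

The second prior overlaps the object with an IoU of about 0.39, below the 0.5 threshold, so it becomes background. If the first prior were removed, the second would be the object's best prior and would be kept as a positive regardless of the threshold.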

The code above already carries fairly detailed comments, so I will not explain it further.
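One piece that is not listed is encode, the last step of match, together with its variances argument. As a hedged illustration, the sketch below assumes that encode follows the reference ssd.pytorch implementation: the center offset is divided by the prior size and by variances[0], and the log of the size ratio is divided by variances[1]. The default (0.1, 0.2) is the value used in that reference configuration and is an assumption here, as is the decode helper added to show the inverse.

```python
from math import exp, log


def encode(matched, prior, variances=(0.1, 0.2)):
    """Encode a corner-form ground truth box relative to a (cx, cy, w, h) prior."""
    x_min, y_min, x_max, y_max = matched
    p_cx, p_cy, p_w, p_h = prior
    # Center offset, scaled by the prior size and the first variance
    g_cx = ((x_min + x_max) / 2 - p_cx) / (variances[0] * p_w)
    g_cy = ((y_min + y_max) / 2 - p_cy) / (variances[0] * p_h)
    # Log of the size ratio, scaled by the second variance
    g_w = log((x_max - x_min) / p_w) / variances[1]
    g_h = log((y_max - y_min) / p_h) / variances[1]
    return (g_cx, g_cy, g_w, g_h)


def decode(loc, prior, variances=(0.1, 0.2)):
    """Invert encode: turn predicted offsets back into a corner-form box."""
    p_cx, p_cy, p_w, p_h = prior
    cx = p_cx + loc[0] * variances[0] * p_w
    cy = p_cy + loc[1] * variances[0] * p_h
    w = p_w * exp(loc[2] * variances[1])
    h = p_h * exp(loc[3] * variances[1])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


prior = (0.5, 0.5, 0.2, 0.4)
truth = (0.35, 0.3, 0.65, 0.8)
offsets = encode(truth, prior)
print(offsets)
print(decode(offsets, prior))  # approximately the original truth box
```

Under this assumption the variances only rescale the regression targets, which would fit the observation that training also works without them, as long as encoding and decoding use the same values.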

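Personally, I find the pieces above to be the hardest parts of the code. I hope readers find the time to step through them in a debugger; together with the comments, that should be enough to understand what they do. The data augmentation code is not explained in detail here. I have not fully figured it out myself; I only know that it enlarges the dataset by, for example, expanding the image or cropping out part of it.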

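Note:

The code has a bug: during training, the loss occasionally becomes nan. In my experience, adjusting batch_size in Config.py helps: the larger the batch size, the less often it happens. My guess is that the features of some training samples are quite scattered, so the predicted scores differ widely, and the exponential (e raised to a power) in the loss computation then overflows. This is only my own view, and I am not sure it is correct.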
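Whether overflow really is the cause has not been verified, but the effect itself is easy to reproduce. The sketch below, in plain Python, shows how exponentiating large scores directly overflows and how subtracting the maximum score first avoids it. The function names are made up for this example, and utils.log_sum_exp is not listed in this post, so no claim is made about which form it uses.

```python
import math


def naive_log_sum_exp(scores):
    """log(sum(exp(s))) computed directly; overflows for large scores."""
    return math.log(sum(math.exp(s) for s in scores))


def stable_log_sum_exp(scores):
    """Same quantity, computed after subtracting the largest score."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


scores = [1000.0, 999.0, 0.0]
try:
    naive_log_sum_exp(scores)
except OverflowError:
    print("naive version overflowed")
print(stable_log_sum_exp(scores))
```

Plain Python raises an OverflowError here. In floating-point tensor arithmetic the same overflow produces inf instead, and an expression such as inf - inf then yields nan, which is one way a loss can end up as nan.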

That is my own understanding of the code. If it helped you, a star on GitHub would be appreciated. Thanks!
