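In-batch negative data augmentation

Introduction

The idea for this post comes from PaddleNLP's In-batch negative example (PaddleNLP -> semantic retrieval -> recall -> In-batch negative). It offers a rather distinctive form of data augmentation that helps improve model training.

Task description

This is a task for the recall stage of semantic retrieval: given two sentences s1 and s2, judge how similar they are.

Approach

0. Data format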

我手机丢了，我想换个手机     我想买个新手机，求推荐
求秋色之空漫画全集          求秋色之空全集漫画
学日语软件手机上的          手机学日语的软件
侠盗飞车罪恶都市怎样改车     侠盗飞车罪恶都市怎么改车

Note that the two sentences on each line are similar to each other, so every line is a positive pair.
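As a minimal sketch of how such a file could be read, the snippet below parses lines into query/title pairs. It assumes the two sentences are separated by a tab (the post does not state the delimiter), and the helper name read_text_pairs is hypothetical.

```python
def read_text_pairs(lines):
    """Return a list of {"query", "title"} dicts parsed from tab-separated lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip lines that do not have exactly two columns
        query, title = (field.strip() for field in fields)
        pairs.append({"query": query, "title": title})
    return pairs


# Two rows from the sample above plus one malformed line (assumed tab delimiter).
sample = [
    "学日语软件手机上的\t手机学日语的软件\n",
    "侠盗飞车罪恶都市怎样改车\t侠盗飞车罪恶都市怎么改车\n",
    "a line without a tab\n",
]
pairs = read_text_pairs(sample)
print(pairs)
```

There is no label column: every row is a positive pair, and the negatives are only produced later, inside each training batch.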

1. Model architecture

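import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class SemanticIndex(nn.Layer):
    # margin and scale are not defined in the original snippet;
    # the defaults below are assumed values, tune them for your data.
    def __init__(self, ptm, dropout=0.1, margin=0.3, scale=30, output_emb_size=256):
        super().__init__()
        self.ptm = ptm  # pretrained encoder with hidden size 768
        self.dropout = nn.Dropout(dropout)
        self.output_emb_size = output_emb_size
        if output_emb_size > 0:
            # Project 768 -> 256 dimensions to reduce computation
            self.emb_reduce_linear = nn.Linear(768, output_emb_size)
        self.margin = margin  # subtracted from the positive (diagonal) similarities
        self.scale = scale  # multiplies the similarities to ease convergence

    def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)

        if self.output_emb_size > 0:
            cls_embedding = self.emb_reduce_linear(cls_embedding)  # 768 dimensions is large, reduce to 256
        cls_embedding = self.dropout(cls_embedding)
        # L2-normalize so that a dot product equals cosine similarity
        cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)

        return cls_embedding

    def forward(
        self,
        query_input_ids,
        title_input_ids,
        query_token_type_ids=None,
        query_position_ids=None,
        query_attention_mask=None,
        title_token_type_ids=None,
        title_position_ids=None,
        title_attention_mask=None,
    ):

        query_cls_embedding = self.get_pooled_embedding(
            query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
        )

        title_cls_embedding = self.get_pooled_embedding(
            title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
        )

        # (batch, dim) x (dim, batch) -> (batch, batch) similarity matrix
        cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)

        # Subtract the margin from the cosine similarity of every positive pair (the diagonal)
        margin_diag = paddle.full(
            shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
        )

        cosine_sim = cosine_sim - paddle.diag(margin_diag)

        # Scale the cosine similarities to help training converge
        cosine_sim *= self.scale

        # Row i's positive title is column i, so the labels are 0..batch_size-1
        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
        labels = paddle.reshape(labels, shape=[-1, 1])

        loss = F.cross_entropy(input=cosine_sim, label=labels)

        return loss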

    # def cosine_sim(
    #     self,
    #     query_input_ids,
    #     title_input_ids,
    #     query_token_type_ids=None,
    #     query_position_ids=None,
    #     query_attention_mask=None,
    #     title_token_type_ids=None,
    #     title_position_ids=None,
    #     title_attention_mask=None,
    # ):

    #     query_cls_embedding = self.get_pooled_embedding(
    #         query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
    #     )

    #     title_cls_embedding = self.get_pooled_embedding(
    #         title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
    #     )

    #     cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1)
    #     return cosine_sim

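2. Loss

Suppose query_cls_embedding and title_cls_embedding both have shape (32, 256). After the paddle.matmul in forward, cosine_sim has shape (32, 32), where entry (i, j) is the cosine similarity between query i and title j. The author also subtracts a margin from the positives and scales the similarities to speed up convergence.

Now for the key question: what are the labels?

The labels are simply paddle.arange(0, 32)!

Why? Only the pairs on the diagonal are similar sentences. Every off-diagonal pair is treated as not similar.
This is what in-batch negative means: negative samples are constructed inside a batch.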
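The loss can be traced by hand on a tiny batch. The following is a framework-free sketch in plain Python, not the PaddleNLP implementation: the function name in_batch_negative_loss and the toy embeddings are made up for illustration, and margin=0.3 and scale=30.0 are assumed values.

```python
import math


def in_batch_negative_loss(query_embs, title_embs, margin=0.3, scale=30.0):
    """Mean cross-entropy where row i's positive is column i and all other columns are negatives."""
    n = len(query_embs)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            sim = sum(q * t for q, t in zip(query_embs[i], title_embs[j]))
            if i == j:
                sim -= margin  # subtract the margin from the positive pair only
            logits.append(sim * scale)
        # Numerically stable log-softmax, evaluated at label index i.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]
    return total / n


# Toy unit-length embeddings: a batch of 3 pairs in 2 dimensions.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_titles = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # title i matches query i
shuffled_titles = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]  # title i matches another query

loss_aligned = in_batch_negative_loss(queries, aligned_titles)
loss_shuffled = in_batch_negative_loss(queries, shuffled_titles)
print(loss_aligned, loss_shuffled)
```

When each query is closest to its own title the loss is close to zero. When the titles are shuffled, the diagonal no longer holds the most similar pair and the loss becomes large.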

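3. Inference

3.1 Building the knowledge-base index

Suppose there is a corpus file corpus.txt. It holds the entire knowledge base and therefore defines the search scope.
To build a semantic index over corpus.txt, run every entry through self.get_pooled_embedding to get its embedding, then store the embeddings in an ANN library such as faiss, milvus, or hnswlib.
Once the index is built, all that remains is retrieval.

3.2 Recall

hnswlib is used as the example here. Among the distance metrics it supports, choose ip.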

Distance	Parameter	Equation
Squared L2	'l2'	d = sum((Ai-Bi)^2)
Inner product	'ip'	d = 1.0 - sum(Ai*Bi)
Cosine similarity	'cosine'	d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))

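Once the query embedding is available, call hnswlib's knn_query function to search the index. Because get_pooled_embedding L2-normalizes its output, the inner product equals the cosine similarity, so the returned distance is d = 1 - cosine_sim.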
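To make the ip distance concrete, here is a brute-force stand-in written in plain Python. It is not hnswlib and does no approximate search. It only applies the ip formula from the table to L2-normalized vectors and keeps the k smallest distances. All names in it are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ip_distance(a, b):
    """Same formula as the 'ip' row in the table: d = 1.0 - sum(Ai*Bi)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def brute_force_knn(query, corpus, k):
    """Return the k (doc_id, distance) pairs with the smallest distance."""
    scored = [(doc_id, ip_distance(query, vec)) for doc_id, vec in enumerate(corpus)]
    return sorted(scored, key=lambda item: item[1])[:k]


# A toy corpus of three normalized document embeddings.
corpus = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [-3.0, -4.0])]
query = l2_normalize([6.0, 8.0])  # points in the same direction as document 0

results = brute_force_knn(query, corpus, k=2)
print(results)
```

Document 0 points in the same direction as the query, so its cosine similarity is 1 and its distance is approximately 0. A smaller d therefore means a more similar document.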

3.3 Evaluation metrics

Suppose dev.txt looks like the following.

The first column is the query and the second column is the doc.

1	我喜欢张三我李四喜欢张三

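For each query, call knn_query with the desired topK, for example 10, which means the 10 nearest candidates are retrieved.

If the doc is among these 10 candidates, recall@10 for that query is 1. Otherwise it is 0.

Likewise, if topK is 1 and the doc is the single returned result, recall@1 is 1.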
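The metric can be sketched in a few lines. This example assumes that each query has exactly one labeled doc and that the per-query 0/1 values are averaged over all queries. The function name, query ids, and doc ids are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose labeled doc appears among the top-k retrieved docs."""
    hits = sum(1 for query, doc in relevant.items() if doc in retrieved[query][:k])
    return hits / len(relevant)


# Hypothetical ranked results per query (best first) and the labeled doc for each query.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}
relevant = {"q1": "d1", "q2": "d2"}

print(recall_at_k(retrieved, relevant, k=1))  # 0.5: only q2 has its doc at rank 1
print(recall_at_k(retrieved, relevant, k=3))  # 1.0: both docs are within the top 3
```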

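Summary

Negative samples are usually constructed when the train and dev datasets are generated. The in-batch negative approach described here instead builds them inside each batch at training time, which is a very different angle.

This method has its limits, though, and may not suit every task. It relies on the samples in a batch being independent of and unrelated to one another. If two rows in the same batch happen to be similar, they are still treated as negatives.

This post also walks through how the recall stage is implemented, which I hope serves as a reference as you go further.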

BLCL的博客小馆