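iFLYTEK 2020 Event Extraction Competition, 1st Place: Subject and Object Extraction

Introduction

This is the second post in the series, because subject and object extraction depends on trigger identification. The previous post is "iFLYTEK 2020 Event Extraction Competition, 1st Place: Trigger Extraction".

1. Getting the code running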

args = TrainArgs().get_parser()
args.gpu_ids = '0'
args.mode = "train"
args.raw_data_dir = './data/final/raw_data'
args.mid_data_dir = './data/final/mid_data'
args.aux_data_dir = "./data/final/preliminary_clean"
args.bert_dir = '/home/yuzhang/PycharmProjects/xf_event_extraction2020Top1/bert/torch_roberta_wwm'
args.output_dir = './out'
args.bert_type = 'roberta_wwm'
args.task_type = 'role1'  # change this line
args.max_seq_len = 320
args.train_epochs = 6
args.train_batch_size = 3
args.lr = 2e-5
args.other_lr = 2e-4
args.attack_train = "pgd"
args.swa_start = 4
args.eval_model = True
args.enhance_data = True
args.use_trigger_distance = True
args.use_distant_trigger = True

2. Model structure

Role1Extractor(
  (bert_module): BertModel()
  (dropout_layer): Dropout(p=0.1, inplace=False)
  (conditional_layer_norm): ConditionalLayerNorm(
    (weight_dense): Linear(in_features=1536, out_features=768, bias=False)
    (bias_dense): Linear(in_features=1536, out_features=768, bias=False)
  )
  (trigger_distance_embedding): Embedding(512, 256)
  (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
  (mid_linear): Sequential(
    (0): Linear(in_features=1024, out_features=128, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
  )
  (obj_classifier): Linear(in_features=128, out_features=2, bias=True)
  (sub_classifier): Linear(in_features=128, out_features=2, bias=True)
  (activation): Sigmoid()
  (criterion): BCELoss()
)

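4. Points to note during training

1. Label construction

This task only extracts the subject and the object, so each label row has length 4.
The first two entries mark the start and end of the object, and the last two mark the start and end of the subject.
The row index corresponds to the token position in the sentence.
See the author's code for the details.

For example: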

label = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
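
The label layout described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the function name build_role_labels and the use of inclusive (start, end) token indices are assumptions made here.

```python
def build_role_labels(seq_len, obj_span=None, sub_span=None):
    """Build a (seq_len, 4) label matrix.

    Columns: [obj_start, obj_end, sub_start, sub_end].
    Spans are (start, end) token indices, both inclusive (assumption).
    """
    labels = [[0, 0, 0, 0] for _ in range(seq_len)]
    if obj_span is not None:
        start, end = obj_span
        labels[start][0] = 1  # object start
        labels[end][1] = 1    # object end
    if sub_span is not None:
        start, end = sub_span
        labels[start][2] = 1  # subject start
        labels[end][3] = 1    # subject end
    return labels


# Reproduces the example above: an object spanning tokens 3..6, no subject.
label = build_role_labels(8, obj_span=(3, 6))
```

A sentence with no subject simply leaves the last two columns all zero.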

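At this point it is clear that the overall approach is the same as for trigger extraction.

2. Relative position encoding

This introduces a new feature: a position encoding computed relative to the trigger. For example, if the trigger spans (32, 33), positions 32 and 33 get the value 0, and the value increases by one with each step away from the trigger on either side.

For example: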

distance_feature = [3, 2, 1, 0, 0, 1, 2, 3]
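
A small sketch of how such a feature could be computed. The function name trigger_distance and the inclusive trigger span are assumptions for illustration; the author's code may build the feature differently.

```python
def trigger_distance(seq_len, trigger_start, trigger_end):
    """Distance of each token to the trigger span (inclusive indices).

    Tokens inside the span get 0; the value grows by 1 per step
    away from the span on either side.
    """
    feature = []
    for i in range(seq_len):
        if i < trigger_start:
            feature.append(trigger_start - i)
        elif i > trigger_end:
            feature.append(i - trigger_end)
        else:
            feature.append(0)
    return feature


# Trigger at positions 3 and 4 in an 8-token sentence.
distance_feature = trigger_distance(8, 3, 4)
print(distance_feature)  # [3, 2, 1, 0, 0, 1, 2, 3]
```

These integer distances are what an embedding layer such as Embedding(512, 256) in the model printout would look up.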

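I am skeptical about how much this step actually improves results,
so I will not go into detail here. I will test it when I get the chance.

3. conditional layer norm

The author took this idea from Su Jianlin's Conditional Layer Normalization, but I could not find the original source code for it.

Purpose:
The author uses Conditional Layer Normalization to fuse an external condition (the trigger) with the BERT output, which plays a role loosely similar to attention.

Procedure:
1. Use the trigger index to gather the corresponding BERT output vectors (see the batch_gather function); call the result the trigger feature.
2. Fuse the trigger feature with the BERT output through conditional layer norm.

Conditional layer norm procedure:
1. Apply layer normalization to the BERT output; nothing special here.
2. Pass the trigger feature through the weight linear layer and the bias linear layer. This resembles a regular layer norm with elementwise_affine=True, which learns a fixed scale and shift (without elementwise_affine, layer norm has no trainable parameters); here the scale and shift are computed from the trigger feature instead.
3. Multiply the normalized BERT output by the resulting weight and add the resulting bias. This can be read as the trigger interacting with every other token, somewhat like attention.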
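
The three steps can be written compactly as below. This is a sketch read off the ConditionalLayerNorm code in this post; the symbols are notation chosen here, not the author's: x is one token's BERT output of size h, c is the flattened trigger feature, W_gamma and W_beta stand for weight_dense and bias_dense, and gamma and beta stand for self.weight and self.bias.

```latex
\mu = \frac{1}{h}\sum_{k=1}^{h} x_k, \qquad
\sigma^2 = \frac{1}{h}\sum_{k=1}^{h} (x_k - \mu)^2, \qquad
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

\mathrm{CLN}(x, c) = \hat{x} \odot (W_\gamma c + \gamma) + (W_\beta c + \beta)
```

Since W_gamma and W_beta are initialized to zero, gamma to one, and beta to zero, the layer starts out as a plain layer norm, and the condition only takes effect as training proceeds.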

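A worked example
In this task every trigger has length 2, so the input size of self.weight_dense and self.bias_dense is normalized_shape * 2. After batch_gather, the gathered trigger vectors have shape (32, 2, 768); this is the trigger feature. It is then reshaped to (32, 1, 1536), mapped by self.weight_dense and self.bias_dense to (32, 1, 768), and finally multiplied with bert_output, so that it interacts with every token.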

The author's code is as follows:

import torch
from torch import nn


class ConditionalLayerNorm(nn.Module):
    def __init__(self,
                 normalized_shape,
                 eps=1e-12):
        super().__init__()

        self.eps = eps

        self.weight = nn.Parameter(torch.Tensor(normalized_shape))
        self.bias = nn.Parameter(torch.Tensor(normalized_shape))

        self.weight_dense = nn.Linear(normalized_shape * 2, normalized_shape, bias=False)
        self.bias_dense = nn.Linear(normalized_shape * 2, normalized_shape, bias=False)

        self.reset_weight_and_bias()

    def reset_weight_and_bias(self):
        """
        With this initialization the layer behaves like a plain layer norm
        at the start of training, so the condition has no effect yet.
        """
        nn.init.ones_(self.weight)
        nn.init.zeros_(self.bias)

        nn.init.zeros_(self.weight_dense.weight)
        nn.init.zeros_(self.bias_dense.weight)

    def forward(self, inputs, cond=None):
        assert cond is not None, 'Conditional tensor need to input when use conditional layer norm'
        cond = torch.unsqueeze(cond, 1)  # (b, 1, h*2)

        weight = self.weight_dense(cond) + self.weight  # (b, 1, h)
        bias = self.bias_dense(cond) + self.bias  # (b, 1, h)

        mean = torch.mean(inputs, dim=-1, keepdim=True)  # (b, s, 1)
        outputs = inputs - mean  # (b, s, h)

        variance = torch.mean(outputs ** 2, dim=-1, keepdim=True)
        std = torch.sqrt(variance + self.eps)  # (b, s, 1)

        outputs = outputs / std  # (b, s, h)
        # the condition interacts with every token here (scale and shift)
        outputs = outputs * weight + bias

        return outputs

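The author reports that adding this layer gave a small improvement; it is worth trying in the future.

I would not call it a major innovation, though. My guess is that the final result would be about the same if the BERT output were not layer-normalized at all.

Note:

One open question: how would this work if the trigger length were variable? Presumably one would introduce a mask and pool over the trigger tokens.

Something to try later.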
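
One possible way to handle variable-length triggers, as a sketch under stated assumptions and not something taken from the author's code: average the BERT vectors over the trigger tokens using a mask and feed the pooled vector in as the condition. The function name pool_trigger and the trigger_mask tensor are hypothetical. With this pooling the condition has size h rather than h * 2, so weight_dense and bias_dense would need in_features=h.

```python
import torch


def pool_trigger(seq_out, trigger_mask):
    """Masked mean over trigger tokens.

    seq_out:      (b, s, h) BERT output
    trigger_mask: (b, s), 1 on trigger tokens and 0 elsewhere
    returns:      (b, h) one pooled trigger vector per sentence
    """
    mask = trigger_mask.unsqueeze(-1).to(seq_out.dtype)  # (b, s, 1)
    summed = (seq_out * mask).sum(dim=1)                 # (b, h)
    counts = mask.sum(dim=1).clamp(min=1.0)              # (b, 1), avoids division by zero
    return summed / counts


seq_out = torch.rand(2, 6, 4)
# First sentence: trigger of length 2; second sentence: trigger of length 3.
trigger_mask = torch.tensor([[0, 1, 1, 0, 0, 0],
                             [0, 0, 1, 1, 1, 0]])
cond = pool_trigger(seq_out, trigger_mask)  # (2, 4)
```

Mean pooling discards token order inside the trigger; whether that matters for this task is untested here.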

4. Layer norm over multiple features

The author uses this when the trigger relative position embedding is concatenated with the BERT output, as shown below.


if self.use_trigger_distance:
    assert trigger_distance is not None, \
        'When using trigger distance features, trigger distance should be implemented'

    trigger_distance_feature = self.trigger_distance_embedding(trigger_distance)
    seq_out = torch.cat([seq_out, trigger_distance_feature], dim=-1)
    seq_out = self.layer_norm(seq_out)

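This is a detail I had not paid attention to before.

Normally I would just concatenate the two vectors and move on. Here a layer norm is applied afterwards, which is a careful touch. Whether it actually improves results is not something I looked into.

5. Computing the loss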

if labels is not None:
    masks = torch.unsqueeze(attention_masks, -1)

    labels = labels.float()
    obj_loss = self.criterion(obj_logits * masks, labels[:, :, :2])
    sub_loss = self.criterion(sub_logits * masks, labels[:, :, 2:])

    loss = obj_loss + sub_loss

There is nothing special here; it mirrors the trigger model, except that the loss is split into two parts, a subject loss and an object loss.
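
One small effect of this masking style is worth seeing in numbers. In the sketch below (toy tensors, and the assumption that padded positions carry all-zero labels), multiplying the probabilities by the mask makes padded positions contribute zero loss, but BCELoss with its default mean reduction still divides by all positions, padding included.

```python
import torch
from torch import nn

criterion = nn.BCELoss()  # default reduction='mean'

# One sentence, 4 positions, 2 outputs (start, end); the last 2 positions are padding.
probs = torch.tensor([[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5]]])
labels = torch.tensor([[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]])
attention_masks = torch.tensor([[1.0, 1.0, 0.0, 0.0]])

masks = torch.unsqueeze(attention_masks, -1)  # (1, 4, 1)

# Same pattern as the snippet above: padded probabilities are zeroed out.
loss_padded = criterion(probs * masks, labels)

# For comparison: the loss over the two real tokens only.
loss_valid = criterion(probs[:, :2], labels[:, :2])

print(loss_padded.item(), loss_valid.item())
```

Here half of the positions are padding, so loss_padded is half of loss_valid. Sentences with more padding therefore get a smaller loss scale under this scheme.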

6. Decoding

I did not read this part closely; my guess is that it is similar to trigger decoding.
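
For orientation, here is how start/end pointer outputs of this kind are commonly decoded. This is a generic sketch, not the author's decoder: the function name decode_spans, the 0.5 threshold, and the rule of pairing each start with the nearest following end are all assumptions.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each start above the threshold with the nearest end at or after it.

    start_probs, end_probs: per-token probabilities for one role
    returns: list of (start, end) token index pairs, both inclusive
    """
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        for j in range(i, len(end_probs)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


# Toy probabilities for the object role in an 8-token sentence.
start_probs = [0.1, 0.0, 0.2, 0.9, 0.1, 0.0, 0.3, 0.0]
end_probs = [0.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.8, 0.1]
spans = decode_spans(start_probs, end_probs)
print(spans)  # [(3, 6)]
```

The same function would be applied once to the object columns and once to the subject columns.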

7. Notes

About conditional layer norm

# -*- coding: utf8 -*-
#

from torch import nn
import torch


class ConditionalLayerNorm(nn.Module):
    def __init__(self,
                 normalized_shape,
                 eps=1e-12):
        super().__init__()

        self.eps = eps

        self.weight = nn.Parameter(torch.Tensor(normalized_shape))
        self.bias = nn.Parameter(torch.Tensor(normalized_shape))

        self.weight_dense = nn.Linear(normalized_shape * 2, normalized_shape, bias=False)
        self.bias_dense = nn.Linear(normalized_shape * 2, normalized_shape, bias=False)

        self.reset_weight_and_bias()

    def reset_weight_and_bias(self):
        """
        With this initialization the layer behaves like a plain layer norm
        at the start of training, so the condition has no effect yet.
        """
        nn.init.ones_(self.weight)
        nn.init.zeros_(self.bias)

        nn.init.zeros_(self.weight_dense.weight)
        nn.init.zeros_(self.bias_dense.weight)

    def forward(self, inputs, cond=None):
        assert cond is not None, 'Conditional tensor need to input when use conditional layer norm'
        cond = torch.unsqueeze(cond, 1)  # (b, 1, h*2)

        weight = self.weight_dense(cond) + self.weight  # (b, 1, h)
        bias = self.bias_dense(cond) + self.bias  # (b, 1, h)

        mean = torch.mean(inputs, dim=-1, keepdim=True)  # (b, s, 1)
        outputs = inputs - mean  # (b, s, h)

        variance = torch.mean(outputs ** 2, dim=-1, keepdim=True)
        std = torch.sqrt(variance + self.eps)  # (b, s, 1)

        outputs = outputs / std  # (b, s, h)
        # the condition interacts with every token here (scale and shift)
        outputs = outputs * weight + bias

        return outputs


if __name__ == '__main__':

    bert_output = torch.rand(32, 128, 768)
    # In practice the trigger index differs from sentence to sentence.
    # Here we assume every sentence in the batch has the trigger span (56, 58).
    trigger_feature = bert_output[:, 56:58, :]

    trigger_feature = trigger_feature.view(32, 768 * 2)
    cln = ConditionalLayerNorm(768)
    output = cln(bert_output, trigger_feature)
    print(output)

BLCL的博客小馆