Introduction

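Continuing from earlier posts: "Fine-tuning LayoutLM on the FUNSD dataset" covered how LayoutLM and LayoutXLM perform named entity recognition, and "Multimodal: CLIP" and "Multimodal: caption generation" covered how the modalities are fused. This post stays with the LayoutLM family and steps through the Hugging Face document_question_answering tutorial in a debugger to see how it is implemented.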

Update: the training code for LayoutXLM on docvqa_zh is now available in document-qa.

Raw data

Up to this point, the tutorial is all about preparing the data, namely the following code:

from datasets import load_dataset

dataset = load_dataset("nielsr/docvqa_1200_examples")

updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
    lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)

updated_dataset = updated_dataset.remove_columns("words")
updated_dataset = updated_dataset.remove_columns("bounding_boxes")

updated_dataset['train'] = updated_dataset['train'].select(range(10))
updated_dataset['test'] = updated_dataset['test'].select(range(5))

>>> dataset['test'].select(range(1)).to_dict().keys()
dict_keys(['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'])

>>> dataset['test'].select(range(1)).to_dict()['query']
[{'de': 'Was ist die Standortadresse von NSDA?', 'en': 'What the location address of NSDA?', 'es': '¿Cuál es la dirección de ubicación del NSDA?', 'fr': "Quelle est l'adresse de la NSDA?", 'it': "Qual e' l'indirizzo della NSDA?"}]

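As shown, the default dataset has the fields above, and query carries the question in several languages (German, English, Spanish, French, Italian). updated_dataset then keeps only the English question and filters out examples whose OCR word count plus question word count is 512 or more. The remaining fields are: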

>>> updated_dataset['test']
Dataset({
    features: ['id', 'image', 'answer', 'question'],
    num_rows: 5
})

>>> updated_dataset['test'].select(range(1)).to_dict()['question']
['What the location address of NSDA?']
>>> updated_dataset['test'].select(range(1)).to_dict()['answer']
['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036']

The dataset has actually become simpler, so there is no need to pay special attention to it from here on.

One example looks like this:

>>> aaa = updated_dataset['test'].select(range(1)).to_dict()

>>> import io
>>> from PIL import Image
>>> Image.open(io.BytesIO(aaa['image'][0]['bytes'])).show()
>>> aaa['question']
['What the location address of NSDA?']
>>> aaa['answer']
['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036']

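Image processing

Note that the tutorial throws away the annotated words and bounding boxes and runs OCR itself. That is useful here, because it shows how the processing actually works.

This part corresponds to "Preprocessing document images", i.e. the following code.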

image_processor = processor.image_processor


def get_ocr_words_and_boxes(examples):
    images = [image.convert("RGB") for image in examples["image"]]
    encoded_inputs = image_processor(images)

    examples["image"] = encoded_inputs.pixel_values
    examples["words"] = encoded_inputs.words
    examples["boxes"] = encoded_inputs.boxes

    return examples

dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

Stepping into image_processor eventually leads to apply_tesseract, whose code is shown below:

def apply_tesseract(
    image: np.ndarray,
    lang: Optional[str],
    tesseract_config: Optional[str] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """Applies Tesseract OCR on a document image, and returns recognized words + normalized bounding boxes."""
    tesseract_config = tesseract_config if tesseract_config is not None else ""

    # apply OCR
    pil_image = to_pil_image(image, input_data_format=input_data_format)
    image_width, image_height = pil_image.size
    data = pytesseract.image_to_data(pil_image, lang=lang, output_type="dict", config=tesseract_config)
    words, left, top, width, height = data["text"], data["left"], data["top"], data["width"], data["height"]

    # filter empty words and corresponding coordinates
    irrelevant_indices = [idx for idx, word in enumerate(words) if not word.strip()]
    words = [word for idx, word in enumerate(words) if idx not in irrelevant_indices]
    left = [coord for idx, coord in enumerate(left) if idx not in irrelevant_indices]
    top = [coord for idx, coord in enumerate(top) if idx not in irrelevant_indices]
    width = [coord for idx, coord in enumerate(width) if idx not in irrelevant_indices]
    height = [coord for idx, coord in enumerate(height) if idx not in irrelevant_indices]

    # turn coordinates into (left, top, left+width, top+height) format
    actual_boxes = []
    for x, y, w, h in zip(left, top, width, height):
        actual_box = [x, y, x + w, y + h]
        actual_boxes.append(actual_box)

    # finally, normalize the bounding boxes
    normalized_boxes = []
    for box in actual_boxes:
        normalized_boxes.append(normalize_box(box, image_width, image_height))

    assert len(words) == len(normalized_boxes), "Not as many words as there are bounding boxes"

    return words, normalized_boxes
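
The normalization step at the end of apply_tesseract can be sketched as below. This is a minimal sketch, assuming normalize_box simply rescales pixel coordinates onto a 0-1000 grid (the range the bbox values printed later in this post fall into); the library's own helper may differ in detail, and the page size in the example is made up.

```python
def normalize_box(box, width, height):
    """Rescale a pixel box [x0, y0, x1, y1] to the 0-1000 range (assumed behavior)."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical word box on a 2000x1000 pixel page.
print(normalize_box([200, 100, 400, 150], 2000, 1000))  # [100, 100, 200, 150]
```

Normalizing this way makes the boxes independent of the scan resolution, which is why the image resizing mentioned below does not affect them.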

Let's look at what Tesseract recognized:

from PIL import ImageDraw
draw = ImageDraw.ImageDraw(pil_image)
import random
for b in actual_boxes:
    draw.rectangle(b, outline=(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)))

pil_image.show()

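The recognition result is as follows:

As shown, Tesseract returns the coordinates of every recognized word.

Note: operations on the image itself, such as resize and reshape, are ignored here.

A related pair:

Original image 1, OCR 1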

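Text processing

This part corresponds to "Preprocessing text data".

From the image above, the answer is T.F. Riehl. The subfinder function locates it in the OCR words at start_index=17 and end_index=18, and the OCR 1 image shows where that sits on the page.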
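
The sublist search that subfinder performs can be sketched as follows. This is a simplified illustration, not the tutorial's exact function: the name find_sublist, the lowercasing, and the toy word list are all assumptions made for the example.

```python
def find_sublist(words, answer_words):
    """Return (start_index, end_index) of the first exact match, or None."""
    n = len(answer_words)
    if n == 0:
        return None
    for start in range(len(words) - n + 1):
        # Compare a window of the same length as the answer.
        if words[start:start + n] == answer_words:
            return start, start + n - 1
    return None


# Hypothetical, lowercased OCR words (not the real document).
words = ["to:", "r.", "h.", "honeycutt", "ce:", "t.f.", "riehl", "from:"]
print(find_sublist(words, "T.F. Riehl".lower().split()))  # (5, 6)
print(find_sublist(words, ["t.", "f.", "riehl"]))          # None
```

The second call shows how fragile exact matching is: if OCR splits or misreads a word, the answer is not found even though it is visibly on the page.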

Next, the tokenizer is given the question, the words (the raw OCR output), and the boxes. Let's see how it handles them and what it is trying to achieve.

encoding = tokenizer(example["question"], example["words"], example["boxes"])
tokenizer.decode(encoding["input_ids"])

Before this point, what it does is simply encode the inputs to obtain input_ids, attention_mask, and token_type_ids. Decoded, the input_ids look like this:

>>> self.decode(sanitized_tokens['input_ids'][0])
'[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development internal correspondence to : r. h. honeycutt ce : t. f. riehl from :. c. j. cook date : may 8, 1995 subject : review of existing brainstorming ideas / 483 the major function of the product innovation graup is to develop marketable nove! products that would be profitable to manufacture and sell. novel is defined as : of a new kind, or different from anything seen or known before. innovation is defined as : something new or different introduced ; act of innovating ; introduction of new things or methods. the products may incorporate the latest technologies, materials and know - how available to give then a unique taste or look. the first task of the product innovation group was to assemble, review and categorize a list of existing brainstorming ideas. ideas were grouped into two major categories labeled appearance and taste / aroma. these categories are used for novel products that may differ from a visual and / or taste / aroma point of view compared to canventional cigarettes. other categories include a combination of the above, filters, packaging and brand extensions. appearance this category is used for novel cigarette constructions that yield visually different products with minimal changes in smoke chemistry two cigarettes in cne. emulti - plug te build yaur awn cigarette. eswitchable menthol or non menthol cigarette. * cigarettes with interspaced perforations to enable smoker to separate unburned section for future smoking. « short cigarette, tobacco section 30 mm. « extremely fast buming cigarette. « novel cigarette constructions that permit a significant reduction iretobacco weight while maintaining smoking mechanics and visual characteristics. higher basis weight paper : potential reduction in tobacco weight. « more rigid tobacco column ; stiffing agent for tobacco ; e. g. starch * colored tow and cigarette papers ; seasonal promotions, e. g. 
pastel colored cigarettes for easter or in an ebony and ivory brand containing a mixture of all black ( black paper and tow ) and ail white cigarettes. 499150498 [SEP]'

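From this point on, however, the code also shows how each bbox is aligned with the words.

The code is as follows:

The final result looks like this: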

>>> for input_id, box in zip(sanitized_tokens['input_ids'][0], sanitized_tokens['bbox'][0]):
...     print(self.decode(input_id),'\t', box)

[CLS] [0, 0, 0, 0]
who [0, 0, 0, 0]
is [0, 0, 0, 0]
in [0, 0, 0, 0]
cc [0, 0, 0, 0]
in [0, 0, 0, 0]
this [0, 0, 0, 0]
letter [0, 0, 0, 0]
? [0, 0, 0, 0]
[SEP] [1000, 1000, 1000, 1000]
wi [455, 66, 502, 91]
##e [455, 66, 502, 91]
ba [455, 93, 503, 103]
##w [455, 93, 503, 103]
brown [296, 116, 348, 133]
& [356, 121, 367, 129]
williamson [372, 120, 470, 129]
tobacco [475, 120, 547, 128]
corporation [552, 118, 661, 127]
research [372, 133, 452, 142]
& [457, 133, 468, 141]
development [473, 133, 585, 142]
internal [623, 158, 691, 165]
correspondence [694, 158, 823, 165]
to [143, 200, 168, 215]
: [143, 200, 168, 215]
r [239, 201, 253, 211]
. [239, 201, 253, 211]
h [259, 201, 273, 211]
. [259, 201, 273, 211]
honey [279, 201, 351, 212]
##cut [279, 201, 351, 212]
##t [279, 201, 351, 212]
ce [144, 224, 168, 245]
: [144, 224, 168, 245]
t [231, 224, 265, 244]
. [231, 224, 265, 244]
f [231, 224, 265, 244]
. [231, 224, 265, 244]
ri [267, 224, 307, 244]
##eh [267, 224, 307, 244]
##l [267, 224, 307, 244]
from [145, 259, 193, 269]
: [145, 259, 193, 269]
. [211, 268, 212, 269]
c [239, 259, 269, 268]
. [239, 259, 269, 268]
j [239, 259, 269, 268]
. [239, 259, 269, 268]
cook [276, 259, 313, 268]
date [145, 285, 189, 302]

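Look closely at the rows for t, ., f, ., ri, ##eh, ##l in the listing above (the answer "T.F. Riehl"): once the tokenizer splits a word into subwords, every subword receives the box of the original word. The same holds for the earlier LayoutXLM approach, whose code is here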
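
The alignment rule can be sketched in plain Python. This is a minimal sketch under stated assumptions: toy_wordpiece is a hypothetical stand-in that cuts words into fixed-size chunks (a real WordPiece tokenizer uses a vocabulary), and the box given to the trailing [SEP] is assumed to match the first one, since the listing above is cut off before it.

```python
def toy_wordpiece(word, size=3):
    """Hypothetical subword splitter: fixed-size chunks with '##' continuations."""
    if not word:
        return []
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_boxes(question_words, ocr_words, ocr_boxes):
    """Build parallel token and box lists in the layout shown in the listing."""
    tokens, boxes = ["[CLS]"], [[0, 0, 0, 0]]
    for word in question_words:
        for piece in toy_wordpiece(word):
            tokens.append(piece)
            boxes.append([0, 0, 0, 0])  # question tokens carry an all-zero box
    tokens.append("[SEP]")
    boxes.append([1000, 1000, 1000, 1000])
    for word, box in zip(ocr_words, ocr_boxes):
        for piece in toy_wordpiece(word):
            tokens.append(piece)
            boxes.append(list(box))  # every subword copies the word's box
    tokens.append("[SEP]")
    boxes.append([1000, 1000, 1000, 1000])  # assumed, not shown in the listing
    return tokens, boxes


tokens, boxes = align_boxes(["who"], ["riehl"], [[267, 224, 307, 244]])
for token, box in zip(tokens, boxes):
    print(token, "\t", box)
```

Here "riehl" becomes two pieces, and both rows print the same [267, 224, 307, 244] box.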

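What remains is the encode_dataset function. Besides aligning tokens with boxes, its other job is to use subfinder to find start_positions and end_positions, which serve as the labels.

At this point we have a rough understanding of how the text is processed and aligned with the boxes. Do pay attention to subfinder, though: if the answer cannot be found in words (the raw OCR output), that example is wasted.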
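
Because subfinder works on word indices while the labels must be token indices, some mapping between the two is needed. The sketch below shows one way to do it, assuming per-token word_ids and sequence_ids lists like those exposed by Hugging Face fast tokenizers; the function name and the (0, 0) fallback to [CLS] are assumptions for illustration, not the tutorial's exact code.

```python
def word_span_to_token_span(word_ids, sequence_ids, start_word, end_word):
    """Map a word-level answer span to token-level start/end positions."""
    token_start = token_end = None
    for idx, (wid, sid) in enumerate(zip(word_ids, sequence_ids)):
        if sid != 1:  # skip special tokens and the question (sequence 0)
            continue
        if wid == start_word and token_start is None:
            token_start = idx  # first subword of the first answer word
        if wid == end_word:
            token_end = idx    # keeps advancing to the last subword
    if token_start is None or token_end is None:
        return 0, 0  # assumed fallback: point both labels at [CLS]
    return token_start, token_end


# Tokens: [CLS] who [SEP] ce : t . f . ri ##eh ##l [SEP]
word_ids = [None, 0, None, 0, 0, 1, 1, 1, 1, 2, 2, 2, None]
sequence_ids = [None, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]
print(word_span_to_token_span(word_ids, sequence_ids, 1, 2))  # (5, 11)
```

The fallback is what makes an unmatched answer harmless at training time but also useless: the example contributes no real span to learn from.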

Model

In brief, the model looks like this:

self
LayoutLMv2ForQuestionAnswering(
(layoutlmv2): LayoutLMv2Model(
(embeddings): LayoutLMv2Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(x_position_embeddings): Embedding(1024, 128)
(y_position_embeddings): Embedding(1024, 128)
(h_position_embeddings): Embedding(1024, 128)
(w_position_embeddings): Embedding(1024, 128)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(visual): LayoutLMv2VisualBackbone(
(backbone): FPN(
(fpn_lateral2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(fpn_output2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fpn_lateral3): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(fpn_output3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fpn_lateral4): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(fpn_output4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fpn_lateral5): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(fpn_output5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(top_block): LastLevelMaxPool()
(bottom_up): ResNet(

)
(visual_proj): Linear(in_features=256, out_features=768, bias=True)
(visual_LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(visual_dropout): Dropout(p=0.1, inplace=False)
(encoder): LayoutLMv2Encoder(

)
(qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)


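I never did work out whether the visual backbone is a plain ResNet or a variant, so I will skip that here (LayoutLMv3 uses ViT instead).

That leaves one goal: seeing how the visual features and the text features are fused.

This part was actually rather hazy to me, for example why a visual_bbox is generated. The rest (building the embeddings, the image branch, the transformer) is routine, so I will skip it for now.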
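
One plausible reading of visual_bbox, offered as an assumption rather than a confirmed description of the library: the visual feature map is pooled to a small grid, each grid cell becomes one visual token, and each visual token needs a box so it can share the same 2D position embeddings as the text tokens. The sketch below builds such boxes for an assumed 7x7 grid on the 0-1000 scale; the function name is hypothetical.

```python
def visual_grid_boxes(rows=7, cols=7, scale=1000):
    """Return one [x0, y0, x1, y1] box per grid cell, in row-major order."""
    xs = [scale * i // cols for i in range(cols + 1)]  # column boundaries
    ys = [scale * j // rows for j in range(rows + 1)]  # row boundaries
    return [
        [xs[c], ys[r], xs[c + 1], ys[r + 1]]
        for r in range(rows)
        for c in range(cols)
    ]


grid = visual_grid_boxes()
print(len(grid))  # 49
print(grid[0])    # [0, 0, 142, 142]
print(grid[-1])   # [857, 857, 1000, 1000]
```

Under this reading, the 49 visual tokens are appended to the text tokens, and the transformer can relate a word to the image region it overlaps because both carry coordinates in the same space.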

More

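This two-pointer (start/end) approach can solve some document question answering problems, but consider tables and similar layouts, for example:

Q1: What is the name of the teacher of Class 1?
Q2: What are the names of the Chinese teacher and the math teacher of Class 1?

That is, when one table holds several answers, or one question asks about several things, this kind of model cannot handle it.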
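
The limitation follows directly from how a start/end head is decoded. The sketch below is a generic joint search over start and end scores, not the tutorial's inference code; the function name, the logits, and the tokens are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair with the highest summed score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Hypothetical tokens and scores.
tokens = ["ce", ":", "t.f.", "riehl", "from"]
start_logits = [0.1, 0.2, 3.0, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.5, 2.5, 0.2]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # t.f. riehl
```

Whatever the scores are, the output is exactly one contiguous span, so two teachers' names sitting in different table cells cannot both be returned.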