表格结构识别算法概览

2022年3月

Posted by franztao on April 3, 2022

简介背景

历史由来

从古老石块上,人类记录信息的形式就有表格的呈现形式 sumer 到现在为止Spreadsheets工具来电子化处理表格

excel

价值

表格是各类文档中常见的页面元素,随着各类文档的爆炸性增长,如何高效地从文档中找到表格并获取内容与结构信息即表格识别,成为了一个亟需解决的问题。对表格结构的还原和内容的识别,能帮助计算机更好的理解表格,在教学内容生产、智能解答等场景下,具有非常重要的应用价值。随着深度学习技术的飞速发展,目标检测、OCR和文档结构识别等技术也取得了许多新的进展,为表格识别提供了多种可能的解决方案。

学术上的价值

CV目标检测上的细分领域,当前还不太卷。表格中的文本解析,让这个任务又是极好的多模态任务。

工业上的价值

在提高生产效率、调整产业结构、提高产品服务质量、降低人工运营成本等战略目标的驱动下,我国各行各业都在从传统模式向数字化、网络化、智能化转变。

定义

表格结构识别是表格区域检测之后的任务,其目标是识别出表格的布局结构、层次结构等,将表格视觉信息转换成可重建表格的结构描述信息。这些表格结构描述信息包括:单元格的具体位置、单元格之间的关系、单元格的行列位置等。在当前的研究中,表格结构信息主要包括以下两类描述形式:1)单元格的列表(包含每个单元格的位置、单元格的行列信息、单元格的内容);2)HTML代码或Latex代码(包含单元格的位置信息,有些也会包含单元格的内容)。

表格数据呈现形式的本质

一句话概括就是:Meaning in Tables Meaning = Language + Layout img_1.png

表格数据中呈现的语义逻辑

img_4.png

表格数据存储格式

img_5.png

表格分类

存储格式
    pdf
    图片
展现形式
    垂直
    水平
    旋转X度
记录方式
    电子记录
    手写
    扫描
表格内容
    单表
    多表
格子形式
    Lattice : For tables formed with lines.
        border
    Stream : For tables formed with whitespaces.
        borderless
逻辑结构
    同一列里展示两列数据
    两列上面有共同的主列
图片质量
    整体:分辨率/清晰度
    局部:噪音程度/类型
文档类型
文档来源
研究领域

Document AI任务描述及相关子任务

表格结构识别是属于document AI任务的子任务,从整体document AI任务看表格结构识别在技术链上的定位

The three points of view

1. Document analysis and understanding

其中5个子问题如下

  1. Table detection

  2. Structure Recognition

  3. Functional Analysis

  4. Structural Analysis

  5. Interpretation

img_2.png

2. semantic web + Volume large database

其中3个子问题如下

  1. Entity Linking

  2. Column Type Identification

  3. Relation Extraction

img_3.png

3. semantic table interpretation

3 stages of the Table Understanding

评估指标 metric

1.格子bbox metric

  1. P R Recall

  2. BLEU

  3. map@IOU coco标准

  4. GriTS

    (“grits”) computes the grid table similarity (GriTS) metrics for table structure recognition. GriTS is a measure of table cell correctness and is defined as the average correctness of each cell averaged over all tables. GriTS can measure the correctness of predicted cells based on: 1. cell topology alone, 2. cell topology and the reported bounding box location of each cell, or 3. cell topology and the reported text content of each cell. For more details on GriTS Comparison of metrics proposed for table structure recognition

    1. STIOU

    1-2.png

  5. TEDS

2.格子中文本的评估指标

  1. 普通文本

    • one 全对准确率: 每张图片版面上有多个文本时候,每个文本都对的张数占总的张数的比例; 标签全对准确率:每张图片版面上有多个文本时候,文本对的个数占总的文本个数的比例;
    • 平均编辑距离:平均编辑距离越小说明识别率越高。平均编辑距离主要衡量整行或整篇文章的指标,可以同时反应识别错,漏识别和多识别的情况;
    • 字符识别准确率,即识别对的字符数占总识别出来字符数的比例,可以反应识别错和多识别的情况,但无法反应漏识别的情况;
    • 字符识别召回率,即识别对的字符数占实际字符数的比例,可以反应识别错和漏识别的情况,但是没办法反应多识别的情况,可以配套字符识别准确率一起使用;
    • 文本行定位为的准确率和召回率,同字符识别的准确率和召回率。主要反应文本行定位的指标,是ocr算法的重要指标; two 第一种是字符准确率,单字识别率,就是按单字算,一百个字里错5个字,识别率95%。
    • 第二种是字段准确率,整行识别率,一个字段算一个整体,假如100个字分为20个字段,里面错了5个字,分布在4个字段里,那么识别率是16/20=80%。
    • 第三种是整张准确率。通常在票据证件里面有这种计算方式,假设一张票据上有20字,4个字段,5张票上100个字,20字段,错了5个字,分布在4个字段里,分布在3张票据上。那么识别率只有2/5=40%。而且票据字段越多,容易出错的概率越高,整张识别率这个要求就越严苛。实测过程中也会有一些特别约定,说整张识别里错一两个字可以忽略的,这种再另说。 同样是100字错5个,用字符、字段、整张准确率来测算的结果是完全不同的,所以对比不同OCR算法时候一定要看清描述的是单字识别率、整行识别率还是整张识别率。一样的识别率99%,整张识别率可比单字识别率的含金量要大得多。
  2. latex文本

AA: Alpha-Numeric Characters Prediction Accuracy, LTA: LaTeX Token Accuracy, LSA: LaTeX Symbol Accuracy, SA: Non-LaTeX Symbol Prediction Accuracy, EM: Exact Match Accuracy, EM@95%: Exact Match Accuracy @95% similarity.

相关技术分类

method overall

表格识别各技术的进展情况

  1. 基于启发式规则的方法

  2. 基于CNN的方法

  3. 基于GCN的方法

  4. 基于End to End的方法

img_11.png 1)利用OCR检测文本,从文本框的空间排布信息推导出有哪些行、有哪些列、哪些单元格需合并,由此生成电子表格; 1)运用图像形态学变换、纹理提取、边缘检测等手段,提取表格线,再由表格线推导行、列、合并单元格的信息; 2,3)神经网络端到端学习,代表工作是TableBank,使用image to text技术,将表格图片转为某种结构化描述语言(比如html定义表格结构的标签)。 缺陷

思路1)极度依赖OCR检测结果和人工设计的规则,对于不同样式的表格,需做针对性开发,推广性差; 思路2)依赖传统图像处理算法,在鲁棒性方面较欠缺,并且对于没有可见线的表格,传统方法很吃力,很难把所有行/列间隙提取出来; 思路3)解决方案没有次第,一旦出现bad case,无法从中间步骤快速干预修复,只能重新调整模型(还不一定能调好),看似省事,实则不适合工程落地。

基于CNN的方法

Mask R-CNN architecture for table cell structure detection img.png  The architecture of Mask R-CNN (a) and Cascade Mask R-CNN (b). C, B and M denote class label, bounding box and mask predictions of the corresponding stage, respectively. Proposal head outputs are denoted with zero indexes.

Hybrid Task Cascade architecture. S component represents the Semantic Segmentation Branch.

disadvantage: 1)数据漂移性 2) 识别的框会依次影响

基于GCN的方法

img_14.png img_15.png

基于End to End的方法

img_16.png

识别效果如下

###

相关损失函数

从L1 loss到EIoU loss,目标检测边框回归的损失函数一览

当前TableMaster用L1 loss出现:损失函数会在稳定值附近波动,难以收敛到更高精度。

1. Papers

A curated list of resources dedicated to table recognition ,借鉴(Awesome-Table-Recognition)

  • *CODE means official code and CODE means not official code
Conf. Date Title Highlight code
CVPR 2022 TableFormer: Table Structure Understanding with Transformers. Sequence No
CVPR 2022 Neural Collaborative Graph Machines for Table Structure Recognition GNN No
CVPR 2022 PubTables-1M: Towards comprehensive table extraction from unstructured documents Dataset *CODE
arXiv 2021/5/23 Multi-Type-TD-TSR – Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: from OCR to Structured Table Representations Others *CODE
ACM-MM 2021 Show, Read and Reason: Table Structure Recognition with Flexible Context Aggregator GNN No
ICCV 2021 Parsing Table Structures in the Wild Dectction No
ICCV 2021 TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition GNN *CODE
ICDAR Competition 2021 ICDAR 2021 Competition on Scientific Literature Parsing Dataset *CODE
ICDAR Competition 2021 PingAn-VCGroup’s Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML Sequence *CODE
ICDAR Competition 2021 LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment Others *CODE
WACV 2021 Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context Others No
CVPR Workshop 2020 CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents Others *CODE
ECCV 2020 Image-based table recognition: data, model, and evaluation Dataset *CODE
ECCV 2020 Table structure recognition using top-down and bottom-up cues Others *CODE
LREC 2020 TableBank: A Benchmark Dataset for Table Detection and Recognition Dataset *CODE
arXiv 2019/8/28 Complicated table structure recognition Others *CODE
ICDAR 2019 Rethinking Table Recognition using Graph Neural Networks GNN *CODE
ICDAR 2019 Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images Others No
ICDAR 2019 Res2tim: Reconstruct syntactic structures from table images. Others *CODE
ICDAR 2017 Deepdesrt: Deep learning for detection and structure recognition of tables in document images Others No

数据集的汇总

Table datasets. TD denotes table detection, TSR is table structure recognition whereas TR is table recognition.

竞赛

ICDAR

year picture 备注 size metric
2013 表格检测,表格格子识别 150 IOU@AP
2019 The participating methods will be evaluated on a modern dataset and archival documents with printed and handwritten tables present. 表格格子关系预测 P,R,F1
2021 表格格子识别,表格结构识别 TEDS

2. Datasets

2.1 Introduction

Dataset Description dataset link
TableBank English TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet, contains 417K high-quality labeled tables.It only contain cell Topology groudtruth TableBank
SciTSR *English SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.It contain cell Topology, cell content groudtruth SciTSR
PubTabNet English PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables.It contain cell Topology, cell content and non-blank cell location groudtruth PubTabNet
FinTabNet English This dataset contains complex tables from the annual reports of S&P 500 companies with detailed table structure annotations to help train and test structure recognition. FinTabNet
PubTables-1M English A large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. PubTables-1M
WTW English and Chinese WTW-Dataset is the first wild table dataset for table detection and table structure recongnition tasks, which is constructed from photoing, scanning and web pages, covers 7 challenging cases like: (1)Inclined tables, (2) Curved tables, (3) Occluded tables or blurredtables (4) Extreme aspect ratio tables (5) Overlaid tables, (6) Multi-color tables and (7) Irregular tables in table structure recognition.It contain cell Topology, all cell location groudtruth WTW
TNCR English a new table dataset with varying image quality collected from open access websites.TNCR contains 9428 labeled tables with approximately 6621 images.their classification into 5 different classes(Full Lined,Merged Cells,No lines,Partial Lined,Partial Lined Merged Cells). TNCR
TAL_OCR_TABLE Chinese TAL_OCR_TABLE dataset come from TAL Form Recognition Technology Challenge.The data of comes from the real homework of students in the education scene and the scene of the test paper. It contain 16k train image and 4k test imageIt contain cell Topology, cell content and all cell location groudtruth TAL_OCR_TABLE

2.2 Comparison of datasets for table structure recognition.

Dataset Cell Topology Cell content Cell Location Table Location
TableBank
SciTSR
PubTabNet
FinTabNet
PubTables-1M
WTW
TNCR
TAL_OCR_TABLE

For these datasets, cell bounding boxes are given for non-blank cells only and exclude any non-text portion of a cell.

badcase数据集

图片

#

IOU框各场景分析

利用两个 IoU 阈值,前景阈值 (Tf) 和背景阈值 (Tb),我们可以定义以下错误类型(在 TIDE 论文的第 2.2 节中有更详细的解释):

  1. 分类错误 (CLS):IoU >= Tf 用于不正确类的目标(即,定位正确但分类错误)。

  2. 定位误差 (LOC):Tb <= IoU < Tf 用于正确类别的目标(即,分类正确但定位不正确)。

  3. Cls 和 Loc 错误 (CLS & LOC):Tb <= IoU < Tf 用于不正确类的目标(即,分类和定位不正确)。

  4. 重复检测错误 (DUP):IoU >= Tf 表示正确类别的目标,但另一个得分较高的检测已经与目标匹配(即,如果不是得分较高的检测,那将是正确的)。

  5. 背景误差 (BKG):所有目标的 IoU < Tb(即,检测到的背景为前景)。

  6. 丢失目标错误(MISS):分类或定位错误尚未涵盖的所有未检测到的目标(假阴性)。

图片

Resourse

github

https://github.com/shahrukhqasim/TIES-2.0 Code for: S.R. Qasim, H. Mahmood, and F. Shafait, Rethinking Table Recognition using Graph Neural Networks (2019) TIES was my undergraduate thesis, Table Information Extraction System. I picked the name from there and made it 2.0 from there.

https://github.com/Irene323/GFTE A GCN-based table structure recognition method, which integrates position feature, text feature and image feature together.

https://github.com/jainammm/TableNet About Unofficial implementation of “TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images”

https://github.com/eihli/image-table-ocr Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

https://github.com/HazyResearch/TreeStructure Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is the construction of the hierarchical tree of context objects such as text blocks, figures, tables, etc. The system currently uses PDF to HTML conversion provided by Adobe Acrobat converter. Adobe Acrobat converter is not an open source tool and this can be very inconvenient for Fonduer users. We therefore need to build our own module as replacement to Adobe Acrobat. Several open source tools are available for pdf to html conversion but these tools do not preserve the cell structure in a table. Our goal in this project is to develop a tool that extracts text, figures and tables in a pdf document and maintains the structure of the document using a tree data structure.

https://github.com/mawanda-jun/TableTrainNet Table recognition inside douments using neural networks

https://github.com/doc-analysis/TableBank TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet, contains 417K high-quality labeled tables.

https://github.com/cseas/ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

https://github.com/eihli/image-table-ocr Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

https://github.com/JiaquanYe/TableMASTER-mmocr 2nd solution of ICDAR 2021 Competition on Scientific Literature Parsing, Task B.

##

Other technical solutions

PRCV2021 Table Recognition Technology Challenge