Introduction and Background

History

Humans have recorded information in tabular form since the ancient stone tablets of Sumer. Today, spreadsheet tools process tables electronically.

Excel

Value

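Tables are a common page element in all kinds of documents. With the explosive growth of documents, efficiently locating tables and extracting their content and structure, that is, table recognition, has become an urgent problem. Recovering table structure and recognizing table content helps computers understand tables better, and is of great practical value in scenarios such as educational content production and intelligent question answering. With the rapid development of deep learning, object detection, OCR, and document structure recognition have also made considerable progress, offering several possible solutions for table recognition.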

Academic Value

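Table recognition is a niche within object detection in computer vision, and the field is not yet crowded. Because the text inside the table must also be parsed, it is also an excellent multimodal task.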

Industrial Value

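Driven by strategic goals such as raising production efficiency, adjusting industrial structure, improving product and service quality, and reducing manual operating costs, industries across China are shifting from traditional models toward digitalization, networking, and intelligence.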

Definition

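Table structure recognition is the task that follows table region detection. Its goal is to identify the table's layout structure, hierarchical structure, and so on, converting the visual information of the table into a structural description from which the table can be reconstructed. This description includes the exact position of each cell, the relationships between cells, and each cell's row and column position. In current research, table structure information mainly takes one of two forms: 1) a list of cells (containing each cell's position, row and column information, and content); 2) HTML or LaTeX code (containing cell position information, and in some cases cell content).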
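The two description forms can be related with a small sketch. The `Cell` class, its field names, and the `cells_to_html` helper below are hypothetical and chosen only for illustration; they do not follow the schema of any particular dataset or library.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    # Illustrative fields only (an assumption, not a dataset schema).
    row: int                    # first row the cell occupies
    col: int                    # first column the cell occupies
    rowspan: int = 1
    colspan: int = 1
    text: str = ""
    bbox: tuple = (0, 0, 0, 0)  # (x1, y1, x2, y2) in image pixels


def cells_to_html(cells):
    """Convert a cell list (form 1) into an HTML table string (form 2)."""
    n_rows = max(c.row + c.rowspan for c in cells)
    rows = [[] for _ in range(n_rows)]
    # Emit cells in reading order: top to bottom, then left to right.
    for c in sorted(cells, key=lambda c: (c.row, c.col)):
        attrs = ""
        if c.rowspan > 1:
            attrs += f' rowspan="{c.rowspan}"'
        if c.colspan > 1:
            attrs += f' colspan="{c.colspan}"'
        rows[c.row].append(f"<td{attrs}>{escape(c.text)}</td>")
    body = "".join("<tr>" + "".join(r) + "</tr>" for r in rows)
    return "<table>" + body + "</table>"


cells = [
    Cell(0, 0, colspan=2, text="Score"),  # header spanning two columns
    Cell(1, 0, text="Math"),
    Cell(1, 1, text="95"),
]
print(cells_to_html(cells))
```

In this sketch the HTML form keeps topology and content but drops the bounding boxes, which is the usual trade-off between the two forms.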

The Essence of Tabular Presentation

In one sentence, the meaning in tables is: Meaning = Language + Layout.

Semantic Logic Expressed in Tabular Data

Storage Formats of Tabular Data

Table Classification

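Storage format
    PDF
    Image
Presentation
    Vertical
    Horizontal
    Rotated by X degrees
Recording method
    Electronic
    Handwritten
    Scanned
Table content
    Single table
    Multiple tables
Cell style
    Lattice: for tables formed with lines.
        bordered
    Stream: for tables formed with whitespace.
        borderless
Logical structure
    Two columns of data shown within one column
    Two columns sharing a common parent column above them
Image quality
    Overall: resolution / sharpness
    Local: noise level / type
Document type
Document source
Research field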

Document AI Task Description and Related Subtasks

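Table structure recognition is a subtask of Document AI. This section looks at where it sits in the technical pipeline, from the perspective of the overall Document AI task.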

The three points of view

1. Document analysis and understanding

Its five subproblems are:

Table detection

Structure Recognition

Functional Analysis

Structural Analysis

Interpretation

2. Semantic web + large-volume databases

Its three subproblems are:

Entity Linking

Column Type Identification

Relation Extraction

3. Semantic table interpretation

3 stages of the Table Understanding

Evaluation Metrics

1. Cell bounding-box metrics

Precision (P) and Recall (R)
BLEU
mAP@IoU (COCO standard)
GriTS

The "grits" code computes the grid table similarity (GriTS) metrics for table structure recognition. GriTS is a measure of table cell correctness and is defined as the average correctness of each cell averaged over all tables. GriTS can measure the correctness of predicted cells based on: 1. cell topology alone, 2. cell topology and the reported bounding box location of each cell, or 3. cell topology and the reported text content of each cell. For more details on GriTS, see the GriTS paper.
STIOU
TEDS
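As a rough illustration of the box-based metrics listed above, the sketch below computes IoU between two cell boxes and then precision and recall at a single IoU threshold. The function names, the greedy matching strategy, and the `(x1, y1, x2, y2)` box format are assumptions made for this example. It is not the full COCO protocol, which ranks detections by score and averages AP over IoU thresholds from 0.50 to 0.95.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predicted cells to ground-truth cells."""
    matched = set()  # indices of ground-truth boxes already used
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (10, 0, 20, 10)]
preds = [(0, 0, 10, 10), (12, 0, 20, 10), (30, 30, 40, 40)]
print(precision_recall(preds, gts))  # two of three predictions match
```

Raising `thr` makes the metric stricter about localization: with `thr=0.9` the second prediction (IoU 0.8) would no longer count as a match.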

2. Metrics for text inside cells

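Plain text
One:
- All-correct accuracy: when each image contains several text items, the proportion of images in which every text item is correct, out of all images. Label-level all-correct accuracy: when each image contains several text items, the proportion of correct text items out of all text items.
- Average edit distance: the smaller the average edit distance, the higher the recognition rate. It mainly measures a whole line or a whole article, and reflects wrong, missed, and extra characters at the same time.
- Character precision: the proportion of correctly recognized characters out of all recognized characters. It reflects wrong and extra characters, but not missed characters.
- Character recall: the proportion of correctly recognized characters out of the actual characters. It reflects wrong and missed characters, but not extra characters, so it is best used together with character precision.
- Text-line localization precision and recall: defined in the same way as character precision and recall. They mainly measure text-line localization and are important metrics for OCR algorithms.
Two:
- The first is character accuracy (per-character recognition rate): counted per character, 5 wrong characters out of 100 give a recognition rate of 95%.
- The second is field accuracy (whole-line recognition rate), where each field counts as one unit. If 100 characters are divided into 20 fields and 5 characters are wrong, spread over 4 fields, the recognition rate is 16/20 = 80%.
- The third is whole-sheet accuracy, which is commonly used for receipts and ID documents. Suppose one receipt has 20 characters in 4 fields, so 5 receipts have 100 characters in 20 fields, and 5 characters are wrong, spread over 4 fields on 3 receipts. The recognition rate is then only 2/5 = 40%. The more fields a receipt has, the more likely an error becomes, so whole-sheet accuracy is the strictest requirement. In practice there may be special agreements, for example that one or two wrong characters per sheet can be ignored, which is a separate matter. With the same 5 errors in 100 characters, character, field, and whole-sheet accuracy give completely different results, so when comparing OCR algorithms, always check whether the reported figure is the character, whole-line, or whole-sheet recognition rate. At the same 99% rate, whole-sheet accuracy is worth far more than character accuracy.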
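The worked example above (100 characters, 20 fields, 5 receipts) can be reproduced with a short sketch. The `accuracy_levels` function and the toy data are hypothetical. The sketch also assumes every predicted field has the same length as its ground truth, so it only models substituted characters, not missed or extra ones.

```python
def accuracy_levels(gt_sheets, pred_sheets):
    """Return (character, field, whole-sheet) accuracy.

    Each sheet is a list of field strings; predictions are aligned with
    the ground truth field by field and character by character.
    """
    chars = chars_ok = fields = fields_ok = sheets_ok = 0
    for gt_fields, pred_fields in zip(gt_sheets, pred_sheets):
        sheet_ok = True
        for g, p in zip(gt_fields, pred_fields):
            chars += len(g)
            chars_ok += sum(a == b for a, b in zip(g, p))
            fields += 1
            if g == p:
                fields_ok += 1
            else:
                sheet_ok = False
        sheets_ok += sheet_ok
    return chars_ok / chars, fields_ok / fields, sheets_ok / len(gt_sheets)


# 5 receipts x 4 fields x 5 characters = 100 characters in 20 fields.
gt = [["abcde"] * 4 for _ in range(5)]
pred = [list(sheet) for sheet in gt]
# 5 wrong characters, spread over 4 fields on 3 receipts.
pred[0][0] = "xxcde"
pred[0][1] = "xbcde"
pred[1][0] = "xbcde"
pred[2][0] = "xbcde"

print(accuracy_levels(gt, pred))  # (0.95, 0.8, 0.4)
```

The same five errors yield 95%, 80%, and 40% depending on the unit of counting, which is why a reported rate is only comparable once the unit is known.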
LaTeX text

AA: Alpha-Numeric Characters Prediction Accuracy, LTA: LaTeX Token Accuracy, LSA: LaTeX Symbol Accuracy, SA: Non-LaTeX Symbol Prediction Accuracy, EM: Exact Match Accuracy, EM@95%: Exact Match Accuracy @95% similarity.

1. Papers

Conf.	Date	Title	Highlight	Code
CVPR	2022	TableFormer: Table Structure Understanding with Transformers.	Sequence	No
CVPR	2022	Neural Collaborative Graph Machines for Table Structure Recognition	GNN	No
CVPR	2022	PubTables-1M: Towards comprehensive table extraction from unstructured documents	Dataset	*CODE
arXiv	2021/5/23	Multi-Type-TD-TSR – Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: from OCR to Structured Table Representations	Others	*CODE
ACM-MM	2021	Show, Read and Reason: Table Structure Recognition with Flexible Context Aggregator	GNN	No
ICCV	2021	Parsing Table Structures in the Wild	Detection	No
ICCV	2021	TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition	GNN	*CODE
ICDAR Competition	2021	ICDAR 2021 Competition on Scientific Literature Parsing	Dataset	*CODE
ICDAR Competition	2021	PingAn-VCGroup’s Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML	Sequence	*CODE
ICDAR Competition	2021	LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment	Others	*CODE
WACV	2021	Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context	Others	No
CVPR Workshop	2020	CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents	Others	*CODE
ECCV	2020	Image-based table recognition: data, model, and evaluation	Dataset	*CODE
ECCV	2020	Table structure recognition using top-down and bottom-up cues	Others	*CODE
LREC	2020	TableBank: A Benchmark Dataset for Table Detection and Recognition	Dataset	*CODE
arXiv	2019/8/28	Complicated table structure recognition	Others	*CODE
ICDAR	2019	Rethinking Table Recognition using Graph Neural Networks	GNN	*CODE
ICDAR	2019	Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images	Others	No
ICDAR	2019	Res2tim: Reconstruct syntactic structures from table images.	Others	*CODE
ICDAR	2017	Deepdesrt: Deep learning for detection and structure recognition of tables in document images	Others	No

Dataset Summary

Table datasets. TD denotes table detection, TSR is table structure recognition whereas TR is table recognition.

Competitions

ICDAR

Year	Notes	Size	Metric
2013	Table detection, table cell recognition	150	IOU@AP
2019	The participating methods are evaluated on a modern dataset and archival documents containing printed and handwritten tables. Table cell relation prediction.		P, R, F1
2021	Table cell recognition, table structure recognition		TEDS

2. Datasets

2.1 Introduction

Dataset	Description	Dataset link
TableBank	English. TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet; it contains 417K high-quality labeled tables. It contains only cell topology ground truth.	TableBank
SciTSR	English. SciTSR is a large-scale table structure recognition dataset containing 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files. It contains cell topology and cell content ground truth.	SciTSR
PubTabNet	English. PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables. It contains cell topology, cell content, and non-blank cell location ground truth.	PubTabNet
FinTabNet	English. This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to help train and test structure recognition.	FinTabNet
PubTables-1M	English. A large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis.	PubTables-1M
WTW	English and Chinese. WTW-Dataset is the first wild table dataset for table detection and table structure recognition tasks. It is constructed from photos, scans, and web pages, and covers 7 challenging cases: (1) inclined tables, (2) curved tables, (3) occluded or blurred tables, (4) extreme aspect ratio tables, (5) overlaid tables, (6) multi-color tables, and (7) irregular tables. It contains cell topology and all-cell location ground truth.	WTW
TNCR	English. A new table dataset with varying image quality collected from open-access websites. TNCR contains 9428 labeled tables in approximately 6621 images, classified into 5 classes (Full Lined, Merged Cells, No Lines, Partial Lined, Partial Lined Merged Cells).	TNCR
TAL_OCR_TABLE	Chinese. The TAL_OCR_TABLE dataset comes from the TAL Form Recognition Technology Challenge. The data comes from real student homework and test papers in education scenarios. It contains 16k training images and 4k test images, with cell topology, cell content, and all-cell location ground truth.	TAL_OCR_TABLE

2.2 Comparison of datasets for table structure recognition.

Dataset	Cell Topology	Cell content	Cell Location	Table Location
TableBank	✓	✕	✕	✓
SciTSR	✓	✓	✕	✓
PubTabNet	✓	✓	✓^†	✓
FinTabNet	✓	✓	✓^†	✓
PubTables-1M	✓	✓	✓	✓
WTW	✓	✕	✓	✓
TNCR	✕	✕	✕	✓
TAL_OCR_TABLE	✓	✓	✓	✓

^† For these datasets, cell bounding boxes are given for non-blank cells only and exclude any non-text portion of a cell.

Bad-Case Datasets

Analysis of IoU Box Scenarios

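Using two IoU thresholds, a foreground threshold (Tf) and a background threshold (Tb), the following error types can be defined (explained in more detail in Section 2.2 of the TIDE paper):

Classification error (CLS): IoU >= Tf with a target of the incorrect class (localized correctly but classified incorrectly).
Localization error (LOC): Tb <= IoU < Tf with a target of the correct class (classified correctly but localized incorrectly).
Classification and localization error (CLS & LOC): Tb <= IoU < Tf with a target of the incorrect class (both classification and localization are incorrect).
Duplicate detection error (DUP): IoU >= Tf with a target of the correct class, but a higher-scoring detection has already matched that target (it would have been correct if not for the higher-scoring detection).
Background error (BKG): IoU < Tb for all targets (background detected as foreground).
Missed target error (MISS): all undetected targets (false negatives) not already covered by a classification or localization error.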
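The rules above can be sketched as a small decision function for a single detection. The `classify_detection` function is hypothetical and is not the TIDE toolkit's API. As a simplification, it looks only at the single best-overlapping target rather than tracking same-class and other-class overlaps separately. The defaults `tf=0.5` and `tb=0.1` are assumed values. MISS is not returned here because it is assigned to undetected targets, not to detections.

```python
def classify_detection(best_iou, same_class, gt_already_matched, tf=0.5, tb=0.1):
    """Label one detection using its best-overlapping ground-truth target.

    best_iou: IoU with that target (0.0 if there is none).
    same_class: whether the predicted class equals the target's class.
    gt_already_matched: whether a higher-scoring detection already
        matched that target.
    """
    if best_iou < tb:
        return "BKG"  # background detected as foreground
    if best_iou >= tf:
        if not same_class:
            return "CLS"  # localized correctly, classified incorrectly
        # Correct class and good overlap: true positive unless the target
        # was already claimed by a higher-scoring detection.
        return "DUP" if gt_already_matched else "TP"
    # Here tb <= best_iou < tf.
    return "LOC" if same_class else "CLS&LOC"


for args in [(0.7, True, False), (0.7, False, False), (0.3, True, False),
             (0.3, False, False), (0.7, True, True), (0.05, True, False)]:
    print(args, classify_detection(*args))
```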

Resources

GitHub

https://github.com/shahrukhqasim/TIES-2.0 Code for: S.R. Qasim, H. Mahmood, and F. Shafait, Rethinking Table Recognition using Graph Neural Networks (2019). From the repository README: "TIES was my undergraduate thesis, Table Information Extraction System. I picked the name from there and made it 2.0 from there."

https://github.com/Irene323/GFTE A GCN-based table structure recognition method, which integrates position feature, text feature and image feature together.

https://github.com/jainammm/TableNet Unofficial implementation of “TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images”

https://github.com/eihli/image-table-ocr Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

https://github.com/HazyResearch/TreeStructure Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is the construction of the hierarchical tree of context objects such as text blocks, figures, tables, etc. The system currently uses PDF to HTML conversion provided by Adobe Acrobat converter. Adobe Acrobat converter is not an open source tool and this can be very inconvenient for Fonduer users. We therefore need to build our own module as replacement to Adobe Acrobat. Several open source tools are available for pdf to html conversion but these tools do not preserve the cell structure in a table. Our goal in this project is to develop a tool that extracts text, figures and tables in a pdf document and maintains the structure of the document using a tree data structure.

https://github.com/mawanda-jun/TableTrainNet Table recognition inside documents using neural networks

https://github.com/doc-analysis/TableBank TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet, contains 417K high-quality labeled tables.

https://github.com/cseas/ocr-table This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

https://github.com/JiaquanYe/TableMASTER-mmocr 2nd solution of ICDAR 2021 Competition on Scientific Literature Parsing, Task B.
