business_cv_table extraction
陶恒franz
2022-11-13
business_cv_table
extraction
简介背景
任务描述
实验结果 针对各种不同方法,实验对比
实验分析 对人类表现、模型能力和任务进行分析
基线模型及运行 支持多种基线模型
测评及规则
数据集介绍 介绍数据集及示例
数据处理方法简介
学习资料 文章、PPT、分享视频及选手方案
贡献与参与 如何参与项目或反馈问题
tabula-py
https://github.com/chezou/tabula-py
Image-Based Table Recognition: Data, Model, and Evaluation
https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-
with-tabula-py
https://nbviewer.jupyter.org/github/chezou/tabula-py/blob/master/examples/
tabula_example.ipynb
Camelot: PDF Table Extraction for Humans
https://github.com/atlanhq/camelot
https://camelot-py.readthedocs.io/en/master/
pdfplumber
比较
https://github.com/atlanhq/camelot/wiki/Comparison-with-other-PDF-Table-
Extraction-libraries-and-tools
pdftables
pdf-table-extract
TrapRange: a Method to Extract Table Content in PDF Files
方案
DeepTable
pdf2table
CascadeTabNet
table-detection-dataset
TableBank
Marmot Dataset
https://www.icst.pku.edu.cn/cpdp/sjzy/index.htm
ICDAR2019_cTDaR
TableNet
https://blog.csdn.net/fangxiananvhai/article/details/110875261
pdf文档解析相关工具包
标注原则
Follow all guidelines and use consistent labeling on all documents
Do not label whitespace
Use the image labels on images and diagrams
Do not treat bold, italic, or underlined text differently. Label based on
the context, not the style.
When labeling a document, work from the first page to the last.
If you incorrectly label an item, choose another label for the item to
overwrite the first.
Pages can be submitted at any time. Ensure that all appropriate labeling
is complete before submitting.
Documents that appear to have text overlaying other text are considered
“double overlaid” and cannot be annotated. Report these documents to
your administrator.
Documents that contain multiple columns of text on a single page cannot be
annotated. Report these documents to your administrator.
Consider only labeling footnotes when they appear at the bottom of the
page and are referenced in the main body of text in the document.
Consider labeling notes that appear within sections or lists, for example
notes that are explicitly called out as “Notes,” as text.
When you annotate a table, make sure to select the entire table before you
apply the table label.
If the visual structure of one set of documents is drastically different
from another, group the documents that are structured similarly in
separate collections and then retrain each collection.
概览
GCN
TABLE2LATEX-450K
文档层面分为了训练集(约44.7万)、验证集(约0.9万)和测试集(约0.9万)
EnronCorpus
1165份文件的数据集
格式
URL
https://www.doc88.com/p-74961713831713.html?r=1
概念区分
表格检测(Table Detection)
表格结构分解:也称表格线检测(单元格检测)
表格结构识别(Table Structure Recognition)
逻辑子表
架构
Word detection
Text chunk detection
Merging “multi-line” text chunks
Page column detection
Line detection
Bullet detection
TIES-2.0
Tabulo
TableMASTER-mmocr
https://github.com/WZBSocialScienceCenter/pdftabextract/blob/master/
examples/catalogue_30s/catalog_30s_notebook.ipynb
GROBID
a machine learning library for extracting, parsing and re-structuring raw
documents such as PDF into structured XML/TEI encoded documents with a
particular focus on technical and scientific publications.
表格分类
表格分析
定义
Global Features
Table Structure
MAX_ROWS
MAX_COLS
MAX_CELL_LENGTH
AVG_ROWS
AVG_COLS
AVG_CELL_LENGTH
Consistency and Variation
STD_DEV_ROWS
STD_DEV_COLS
STD_DEV_CELL_LENGTH
Content Ratio
RATIO_IMG
RATIO_FORM
RATIO_HYPERLINK
RATIO_ALPHABETIC
RATIO_DIGIT
RATIO_OTHER
三篇论文,纵览深度学习在表格识别中的最新应用
https://zhuanlan.zhihu.com/p/188538404
子主题
http://cells.icc.ru/pdfte/
OCRmyPDF
https://github.com/ocrmypdf/OCRmyPDF
PINGAN-VCGROUP
PubTabNet
consists of 500,777 training
samples, 9,115 validation samples, 9,138 samples for the development
stage, and 9,064 samples for the final evaluation
stage.
four sub-tasks
table structure recognition
text line detection
text line recognition
box assignment
Center Point Rule
IOU Rule
Distance Rule
Experiment
8 Tesla V100 GPUs are used with the batch size 10 in each GPU.
Ablation Studies
Synchronized Batch Normalization (SyncBN)
Feature Concatenation of Layers in Transformer Decoder
Label Encoding in Structure Prediction
Ranger Optimizer
Data Augmentation
Multiple Resolutions
Synchronized Batch Normalization (SyncBN)
Feature Concatenation of Layers in Transformer Decoder (FeaC)
Model Ensemble
Table Content Reconstruction
Multiple Resolutions
Pre-train Model
Metric
Exact Match Accuracy
Exact Match Accuracy @95% similarity
Row Prediction Accuracy
Column Prediction Accuracy
Alpha-Numeric Characters Prediction Accuracy
LaTeX Token Accuracy
LaTeX Symbol Accuracy
Non-LaTeX Symbol Prediction Accuracy
LGPMA
TabStructNet
TGRNet
SEM
Cycle-CenterNet
简介
https://www.cnblogs.com/dan-baishucaizi/p/15540834.html
github
https://github.com/hikopensource/DAVAR-Lab-OCR
sem中的网格间的合并关系可以启发做split merger cell的问题
简介
https://baijiahao.baidu.com/s?id=1718411766247548021&wfr=spider&for=pc
TEDS(Tree-Edit-Distance Similarity)
disadvantage
1.只检查非空单元格之间的直接关系,而对于由空单元格和非直接关系单元格之间未
对齐引起的错误无法检测。
2.另一个问题是无法同时对的内容进行评测
将表格结构用树状结构表示,采用树的距离来进行两颗树之间相似度的度量
文本框对齐方案
中心点对齐(计算文本检测框的中心点是否落在单元格内),IOU最大(计算与该文
本框IOU最大的单元格),距离最近(计算与该文本框距离最近的单元格)。之后将
文本识别出来的字符串回填到单元格中完成。
GriTS
RAGE模型 modified method
业界方案
1)利用OCR检测文本,从文本框的空间排布信息推导出有哪些行、有哪些列、哪些单
元格需合并,由此生成电子表格;
2)运用图像形态学变换、纹理提取、边缘检测等手段,提取表格线,再由表格线推
导行、列、合并单元格的信息;
3)神经网络端到端学习,代表工作是TableBank,使用image to text技术,将表格
图片转为某种结构化描述语言(比如html定义表格结构的标签)。
缺陷
思路1)极度依赖OCR检测结果和人工设计的规则,对于不同样式的表格,需做针对性
开发,推广性差;
思路2)依赖传统图像处理算法,在鲁棒性方面较欠缺,并且对于没有可见线的表
格,传统方法很吃力,很难把所有行/列间隙提取出来;
思路3)解决方案没有次第,一旦出现bad case,无法从中间步骤快速干预修复,只
能重新调整模型(还不一定能调好),看似省事,实则不适合工程落地。
based on handcrafted features
and heuristic rules
some statistical machine learning based methods
deep learning
based approaches
row/column extraction based methods
image-to-markup generation based methods
bottom-up methods
tablemaster
LGPMA
? TSRFOMER
基本原理
1) A two-stage DETR [56] based separator regression module to
directly predict linear and curvilinear row/column separation lines
from input table images;
2) A relation network based cell merging
module to recover spanning cells by merging adjacent cells generated
by intersecting row and column separators.
? table-transfomer
pub-1m
DETR
CascadeTabNet
子主题
业界方案趋势
技术方案
表格数据
boundary extraction-based
generative modelbased
graph-based methods
子主题
an end-to-end network training
cell detection and structure recognition networks in a joint
manner.
目标检测 COCO 数据
{
"images": [
{
"id": 1,
"width": 800,
"height": 600,
"file_name": "test.jpg",
"..":"补充自有图片属性"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [0,0,10,10],
"segmentation": [
[0,0,10,0,10,10,0,10]
],
"area": 100,
"iscrowd": 0,
"NLP",{依据脑图树形层次写入自有标签,并且跟coco格式兼
容}
"CV",{依据脑图树形层次写入自有标签,并且跟coco格式兼
容}
}
]
}
Table structure type: CV(所见即所得)
展现形式
记录方式
表达形式
表格内容
格子形式
span格子形式
垂直
水平
竖表
横表(行名为化合物名称,列名为各药化属性)
同一列里展示两列数据
复合表(compound在表标题上)
单表
多表
电子记录
手写
扫描
有线格
无线格
部分有线格
表头唯水平线格
表头田字格
Table Ptragmatic type :NLP(有语义先验知识)
图片属性
宽度
高度
Lattice 有线表: For tables formed with lines.
Stream 无线表: For tables formed with whitespaces.
单个行span格子
逻辑结构
唯cell中多语义
单列中多语义
多个行span格子
单个列span格子
多个列span格子
表头header
表列stub
表体
混合行span
格子类型
表头呈现形式
表头是灰色像素头
普通字符
分子图
空白
跨页
表头行数
带有数学字符
依据表格分类,树形分类依次统计
存储大小
色彩分布空间
色偏分布
标注信息
标注框宽高比分布
标注框面积分布
Created With
MindMaster