business_cv_table extraction
陶恒franz
2022-11-13
business_cv_table extraction 简介背景 任务描述 实验结果 针对各种不同方法,实验对比 实验分析 对人类表现、模型能力和任务进行分析 基线模型及运行 支持多种基线模型 测评及规则 数据集介绍 介绍数据集及示例 数据处理方法简介 学习资料 文章、PPT、分享视频及选手方案 贡献与参与 如何参与项目或反馈问题 tabula-py https://github.com/chezou/tabula-py Image-Based Table Recognition: Data, Model, and Evaluation https://aegis4048.github.io/parse-pdf-files-while-retaining-structure- with-tabula-py https://nbviewer.jupyter.org/github/chezou/tabula-py/blob/master/examples/ tabula_example.ipynb Camelot: PDF Table Extraction for Humans https://github.com/atlanhq/camelot https://camelot-py.readthedocs.io/en/master/ pdfplumber 比较 https://github.com/atlanhq/camelot/wiki/Comparison-with-other-PDF-Table- Extraction-libraries-and-tools pdftables pdf-table-extract TrapRange: a Method to Extract Table Content in PDF Files 方案 DeepTable pdf2table CascadeTabNet table-detection-dataset TableBank Marmot Dataset https://www.icst.pku.edu.cn/cpdp/sjzy/index.htm ICDAR2019_cTDaR TableNet https://blog.csdn.net/fangxiananvhai/article/details/110875261 pdf文档解析相关工具包 标注原则 Follow all guidelines and use consistent labeling on all documents Do not label whitespace Use the image labels on images and diagrams Do not treat bold, italic, or underlined text differently. Label based on the context, not the style. When labeling a document, work from the first page to the last. If you incorrectly label an item, choose another label for the item to overwrite the first. Pages can be submitted at any time. Ensure that all appropriate labeling is complete before submitting. Documents that appear to have text overlaying other text are considered “double overlaid” and cannot be annotated. Report these documents to your administrator. Documents that contain multiple columns of text on a single page cannot be annotated. Report these documents to your administrator. Consider only labeling footnotes when they appear at the bottom of the page and are referenced in the main body of text in the document. Consider labeling notes that appear within sections or lists, for example notes that are explicitly called out as “Notes,” as text. When you annotate a table, make sure to select the entire table before you apply the table label. If the visual structure of one set of documents is drastically different from another, group the documents that are structured similarly in separate collections and then retrain each collection. 概览 GCN TABLE2LATEX-450K 文档层面分为了训练集(约44.7万)、验证集(约0.9万)和测试集(约0.9万) EnronCorpus 1165份文件的数据集 格式 URL https://www.doc88.com/p-74961713831713.html?r=1 概念区分 表格检测(Table Detection) 表格结构分解:也称表格线检测(单元格检测) 表格结构识别(Table Structure Recognition) 逻辑子表 架构 Word detection Text chunk detection Merging “multi-line” text chunks Page column detection Line detection Bullet detection TIES-2.0 Tabulo TableMASTER-mmocr https://github.com/WZBSocialScienceCenter/pdftabextract/blob/master/ examples/catalogue_30s/catalog_30s_notebook.ipynb GROBID a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. 表格分类 表格分析 定义 Global Features Table Structure MAX_ROWS MAX_COLS MAX_CELL_LENGTH AVG_ROWS AVG_COLS AVG_CELL_LENGTH Consistency and Variation STD_DEV_ROWS STD_DEV_COLS STD_DEV_CELL_LENGTH Content Ratio RATIO_IMG RATIO_FORM RATIO_HYPERLINK RATIO_ALPHABETIC RATIO_DIGIT RATIO_OTHER 三篇论文,纵览深度学习在表格识别中的最新应用 https://zhuanlan.zhihu.com/p/188538404 子主题 http://cells.icc.ru/pdfte/ OCRmyPDF https://github.com/ocrmypdf/OCRmyPDF PINGAN-VCGROUP PubTabNet consists of 500,777 training samples, 9,115 validation samples, 9,138 samples for the development stage, and 9,064 samples for the final evaluation stage. four sub-tasks table structure recognition text line detection text line recognition box assignment Center Point Rule IOU Rule Distance Rule Experiment 8 Tesla V100 GPUs are used with the batch size 10 in each GPU. Ablation Studies Synchronized Batch Normalization (SyncBN) Feature Concatenation of Layers in Transformer Decoder Label Encoding in Structure Prediction Ranger Optimizer Data Augmentation Multiple Resolutions Synchronized Batch Normalization (SyncBN) Feature Concatenation of Layers in Transformer Decoder (FeaC) Model Ensemble Table Content Reconstruction Multiple Resolutions Pre-train Model Metric Exact Match Accuracy Exact Match Accuracy @95% similarity Row Prediction Accuracy Column Prediction Accuracy Alpha-Numeric Characters Prediction Accuracy LaTeX Token Accuracy LaTeX Symbol Accuracy Non-LaTeX Symbol Prediction Accuracy LGPMA TabStructNet TGRNet SEM Cycle-CenterNet 简介 https://www.cnblogs.com/dan-baishucaizi/p/15540834.html github https://github.com/hikopensource/DAVAR-Lab-OCR sem中的网格间的合并关系可以启发做split merger cell的问题 简介 https://baijiahao.baidu.com/s?id=1718411766247548021&wfr=spider&for=pc TEDS(Tree-Edit-Distance Similarity) disadvantage 1.只检查非空单元格之间的直接关系,而对于由空单元格和非直接关系单元格之间未 对齐引起的错误无法检测。 2.另一个问题是无法同时对的内容进行评测 将表格结构用树状结构表示,采用树的距离来进行两颗树之间相似度的度量 文本框对齐方案 中心点对齐(计算文本检测框的中心点是否落在单元格内),IOU最大(计算与该文 本框IOU最大的单元格),距离最近(计算与该文本框距离最近的单元格)。之后将 文本识别出来的字符串回填到单元格中完成。 GriTS RAGE模型 modified method 业界方案 1)利用OCR检测文本,从文本框的空间排布信息推导出有哪些行、有哪些列、哪些单 元格需合并,由此生成电子表格; 2)运用图像形态学变换、纹理提取、边缘检测等手段,提取表格线,再由表格线推 导行、列、合并单元格的信息; 3)神经网络端到端学习,代表工作是TableBank,使用image to text技术,将表格 图片转为某种结构化描述语言(比如html定义表格结构的标签)。 缺陷 思路1)极度依赖OCR检测结果和人工设计的规则,对于不同样式的表格,需做针对性 开发,推广性差; 思路2)依赖传统图像处理算法,在鲁棒性方面较欠缺,并且对于没有可见线的表 格,传统方法很吃力,很难把所有行/列间隙提取出来; 思路3)解决方案没有次第,一旦出现bad case,无法从中间步骤快速干预修复,只 能重新调整模型(还不一定能调好),看似省事,实则不适合工程落地。 based on handcrafted features and heuristic rules some statistical machine learning based methods deep learning based approaches row/column extraction based methods image-to-markup generation based methods bottom-up methods tablemaster LGPMA ? TSRFOMER 基本原理 1) A two-stage DETR [56] based separator regression module to directly predict linear and curvilinear row/column separation lines from input table images; 2) A relation network based cell merging module to recover spanning cells by merging adjacent cells generated by intersecting row and column separators. ? table-transfomer pub-1m DETR CascadeTabNet 子主题 业界方案趋势 技术方案 表格数据 boundary extraction-based generative modelbased graph-based methods 子主题 an end-to-end network training cell detection and structure recognition networks in a joint manner. 目标检测 COCO 数据 { "images": [ { "id": 1, "width": 800, "height": 600, "file_name": "test.jpg", "..":"补充自有图片属性" } ], "annotations": [ { "id": 1, "image_id": 1, "category_id": 1, "bbox": [0,0,10,10], "segmentation": [ [0,0,10,0,10,10,0,10] ], "area": 100, "iscrowd": 0, "NLP",{依据脑图树形层次写入自有标签,并且跟coco格式兼 容} "CV",{依据脑图树形层次写入自有标签,并且跟coco格式兼 容} } ] } Table structure type: CV(所见即所得) 展现形式 记录方式 表达形式 表格内容 格子形式 span格子形式 垂直 水平 竖表 横表(行名为化合物名称,列名为各药化属性) 同一列里展示两列数据 复合表(compound在表标题上) 单表 多表 电子记录 手写 扫描 有线格 无线格 部分有线格 表头唯水平线格 表头田字格 Table Ptragmatic type :NLP(有语义先验知识) 图片属性 宽度 高度 Lattice 有线表: For tables formed with lines. Stream 无线表: For tables formed with whitespaces. 单个行span格子 逻辑结构 唯cell中多语义 单列中多语义 多个行span格子 单个列span格子 多个列span格子 表头header 表列stub 表体 混合行span 格子类型 表头呈现形式 表头是灰色像素头 普通字符 分子图 空白 跨页 表头行数 带有数学字符 依据表格分类,树形分类依次统计 存储大小 色彩分布空间 色偏分布 标注信息 标注框宽高比分布 标注框面积分布