### Document layout analysis for text extraction

I need to analyze the layout structure of different document types, such as pdf, doc, docx, odt, etc.

My task is: given a document, group the text into blocks and find the correct boundaries of each block.

I ran some tests with Apache Tika, which is a good and very useful extractor, but it often messes up the order of the blocks. Let me explain what I mean by "order".

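Apache Tika only extracts the text, so if my document has two columns, Tika extracts all the text of the first column and then the text of the second column. That is fine, but sometimes the text in the first column is related to the text in the second, like the cells of a table row.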

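So I have to take the position of each block into account, and the problems are:

  1. Defining the box boundaries, which is hard: I need to tell whether a sentence starts a new block.

  2. Defining the orientation: for example, given a table, a "sentence" should be a row, not a column.

So basically I have to deal with the layout structure to properly understand the block boundaries.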

Here is a visual example:

[image: example document]

A classic extractor returns:

201920182017201620152014Oregon Arts Commission Individual Artist Fellowship...

This is wrong in my case, because the dates are related to the text on their right.

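This task is preparation for further NLP analysis, so it matters a lot: for example, when I need to recognize the entities in the text (NER) and then the relations between them, working with the right context is very important.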

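How can I extract text from a document and group related pieces of text under the same block, taking the layout structure of the document into account?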


Answer:

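This is only a partial solution to your problem, but it may simplify the task at hand. This tool takes PDF files and converts them to text files. It runs very fast and can process files in batches.

It creates one output text file per PDF. Its advantage over other tools is that the output text is aligned according to its original layout.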

For example, here is a resume with a complex layout:

[image: the resume]

Its output is the following text file:

    Christopher                         Summary
                                        Senior Web Developer specializing in front end development.
    Morgan                              Experienced with all stages of the development cycle for
                                        dynamic web projects. Well-versed in numerous programming
                                        languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
                                        Strong background in project management and customer
                                        relations.
                                        Skill Highlights
                                            •   Project management          •   Creative design
                                            •   Strong decision maker       •   Innovative
                                            •   Complex problem             •   Service-focused
                                                solver
                                        Experience
    Contact
                                        Web Developer - 09/2015 to 05/2019
    Address:                            Luna Web Design, New York
    177 Great Portland Street, London      • Cooperate with designers to create clean interfaces and
    W5W 6PQ                                   simple, intuitive interactions and experiences.
                                           • Develop project concepts and maintain optimal
    Phone:                                    workflow.
    +44 (0)20 7666 8555
                                           • Work with senior developer to manage large, complex
                                              design projects for corporate clients.
                                           • Complete detailed programming and development tasks
    Email:
                                              for front end public and internal websites as well as
    [email protected]
                                              challenging back-end server code.
                                           • Carry out quality assurance tests to discover errors and
    LinkedIn:
                                              optimize usability.
    linkedin.com/christopher.morgan
    Languages                           Education
    Spanish – C2
                                        Bachelor of Science: Computer Information Systems - 2014
    Chinese – A1
                                        Columbia University, NY
    German – A2
    Hobbies                             Certifications
                                        PHP Framework (certificate): Zend, Codeigniter, Symfony.
       •   Writing
                                        Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
       •   Sketching
                                        SQL, MySQL.
       •   Photography
       •   Design
    -----------------------Page 1 End-----------------------

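Your task is now reduced to finding the blocks in the text file, using the spaces between words as alignment hints. As a start, I provide a script that finds the margin between the two columns of text and produces `rhs` and `lhs`, the text streams of the right and left columns respectively.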

    import re

    import matplotlib.pyplot as plt
    import numpy as np

    # `txt` holds the layout-preserving text of one page; the file name is a placeholder
    with open("resume.txt", encoding="utf-8") as f:
        txt = f.read()

    # keep only the lines before the end-of-page marker
    page_lines = []
    for line in txt.split("\n"):
        if "-----------------------Page" in line:  # reached end of page
            break
        page_lines.append(line)

    max_line_len = max(len(line) for line in page_lines)
    padded_txt_lines = [line.ljust(max_line_len) for line in page_lines]  # pad short lines with spaces

    # count the spaces in each character column over all lines
    space_idx_counters = np.zeros(max_line_len)
    for line in padded_txt_lines:
        space_idxs = [pos for pos, char in enumerate(line) if char == " "]
        space_idx_counters[space_idxs] += 1

    # plot histogram of spaces in each character column
    plt.bar(range(max_line_len), space_idx_counters)
    plt.title("Number of spaces in each column over all lines")
    plt.show()

    # the first column that is blank in the most lines is taken as the separator
    separator_idx = int(np.argmax(space_idx_counters))
    print(f"separator index: {separator_idx}")

    # separate the two columns of text
    left_lines = [line[:separator_idx] for line in padded_txt_lines]
    right_lines = [line[separator_idx:] for line in padded_txt_lines]

    # join each column into one stream of text, remove redundant spaces
    lhs = re.sub(r"\s{4,}", " ", " ".join(left_lines))
    rhs = re.sub(r"\s{4,}", " ", " ".join(right_lines))

    print("************ Left Hand Side ************")
    print(lhs)
    print("************ Right Hand Side ************")
    print(rhs)

Plot output:

[image: bar chart titled "Number of spaces in each column over all lines"]

Text output:

    separator index: 33
    ************ Left Hand Side ************
    Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: [email protected] LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies •   Writing •   Sketching •   Photography •   Design
    ************ Right Hand Side ************
       Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights •   Project management •   Creative design •   Strong decision maker •   Innovative •   Complex problem •   Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL.
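
If the two columns are related row by row, as with the dates in the question, the same separator column could be used to keep the left and right part of each line together instead of joining each column into one stream. The following is only a sketch of that idea: `pair_rows`, the sample lines and the separator index of 6 are made up for illustration and do not come from the tool or the script above.

```python
def pair_rows(lines, separator_idx):
    """Return one (left, right) pair per line, skipping lines that are blank on both sides."""
    pairs = []
    for line in lines:
        # str.split() with no argument also collapses runs of spaces
        left = " ".join(line[:separator_idx].split())
        right = " ".join(line[separator_idx:].split())
        if left or right:
            pairs.append((left, right))
    return pairs


# made-up two-column page: years on the left, entries on the right
lines = [
    "2019      First entry",
    "2018      Second entry",
    "          continued on the next line",
]
for left, right in pair_rows(lines, separator_idx=6):
    print(f"{left!r} -> {right!r}")
```

A line whose left part is empty, like the last one here, is probably a continuation of the previous row; merging such lines into the row above would be a natural follow-up.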

The next step would be to generalize this script to multi-page documents, remove redundant symbols, and so on.

Good luck!
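
As a rough starting point for the multi-page case, the page could be split on the end-of-page marker and the column split applied to each page separately. This is a minimal sketch, not a finished implementation: it assumes every page ends with a marker shaped like the `Page 1 End` line in the sample output, the helper names `split_pages` and `split_columns` and the sample text are hypothetical, it uses plain Python instead of NumPy, and it collapses every run of whitespace rather than only runs of four or more.

```python
import re

# assumed marker format, generalised from the "Page 1 End" line in the sample output
PAGE_END = re.compile(r"-{5,}Page \d+ End-{5,}")


def split_pages(txt):
    """Split the tool's output into pages, dropping the end-of-page marker lines."""
    pages = [page.strip("\n") for page in PAGE_END.split(txt)]
    return [page for page in pages if page.strip()]


def split_columns(page):
    """Split one page into (lhs, rhs) at the first character column that is blank in the most lines."""
    lines = page.split("\n")
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]  # pad short lines with spaces
    space_counts = [sum(line[col] == " " for line in padded) for col in range(width)]
    separator_idx = space_counts.index(max(space_counts))
    lhs = " ".join(line[:separator_idx] for line in padded)
    rhs = " ".join(line[separator_idx:] for line in padded)
    # collapse all whitespace runs and trim both streams
    return tuple(re.sub(r"\s+", " ", side).strip() for side in (lhs, rhs))


# made-up two-page output in the same style as the tool's text file
sample = (
    "Name      Summary\n"
    "          Web developer\n"
    "Contact   Experience\n"
    "-----------------------Page 1 End-----------------------\n"
    "Skills    Education\n"
    "Python    BSc\n"
    "-----------------------Page 2 End-----------------------\n"
)
for page_no, page in enumerate(split_pages(sample), start=1):
    lhs, rhs = split_columns(page)
    print(f"page {page_no} | left: {lhs} | right: {rhs}")
```

Like the original script, this assumes exactly two columns whose left column starts at the first character of the line; a single-column page would still be cut at some column, so a check on how dominant the separator column is would be needed before relying on it.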
