Research on structural knowledge extraction and organization for multi-modal governmental documents

doi:10.12305/j.issn.1001-506X.2022.07.20

Abstract

Triplet-based knowledge in large-scale knowledge graphs lacks structural logic and is difficult to organize into a knowledge system. To address this, the paper presents GovDoc-CN, a multi-modal dataset of governmental documents, and proposes a multi-modal model for extracting knowledge structure elements. The model draws on both the text modality and the image modality to extract elements such as titles, abstracts, authors, completion dates, and document numbers. A document structure tree (DST) model is designed to organize the extracted elements, and a structured graph network is built on the DST model to support their organization and management. Experiments show that the multi-modal extraction model significantly outperforms the single-modal extraction models. The DST model and the structured graph network built on it offer a new way to organize and manage document knowledge and have significant application value.

Key words: multi-modal, information extraction, knowledge organization, document structuring, governmental documents automation

CLC Number: TP391.1

Ruilin XU, Boying GENG, Shukan LIU. Research on structural knowledge extraction and organization for multi-modal governmental documents[J]. Systems Engineering and Electronics, 2022, 44(7): 2241-2250.

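Table 1
Outline level	Numbering method
Level-1 heading	Chinese cardinal numeral + enumeration comma, e.g. "一、"
Level-2 heading	Chinese cardinal numeral in full-width parentheses, e.g. "（一）"
Level-3 heading	Arabic numeral + half-width period, e.g. "1."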
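
The numbering rules in the table above translate naturally into pattern matching. The following is a minimal sketch, not the paper's extraction model or its DST implementation: the names `heading_level` and `build_tree`, the dictionary layout of the nodes, and the sample lines are all assumptions made for illustration. It classifies a line by its numbering prefix and then nests headings by level with a stack.

```python
import re

# Chinese cardinal numerals used in heading numbers (assumed sufficient for a sketch).
CN_NUM = "一二三四五六七八九十百"

# One pattern per outline level, following the three numbering rules in the table.
LEVEL_PATTERNS = [
    (1, re.compile(rf"^[{CN_NUM}]+、")),           # e.g. "一、"
    (2, re.compile(rf"^[（(][{CN_NUM}]+[）)]")),   # e.g. "（一）" or "(一)"
    (3, re.compile(r"^\d+\.(?!\d)")),              # e.g. "1." but not "1.5"
]


def heading_level(line):
    """Return 1, 2 or 3 for a numbered heading, or None for any other line."""
    text = line.strip()
    for level, pattern in LEVEL_PATTERNS:
        if pattern.match(text):
            return level
    return None


def build_tree(lines):
    """Nest headings by level; non-heading lines attach to the latest heading."""
    root = {"level": 0, "text": "ROOT", "children": []}
    stack = [root]  # path from the root to the most recent heading
    for line in lines:
        level = heading_level(line)
        node = {"level": level, "text": line.strip(), "children": []}
        if level is None:
            stack[-1]["children"].append(node)
            continue
        # Close headings at the same or a deeper level before attaching.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# Hypothetical sample lines, not taken from GovDoc-CN.
sample = ["一、总体要求", "（一）指导思想", "1.坚持改革", "正文内容", "二、主要任务"]
tree = build_tree(sample)
```

A rule-based pass like this only covers the text side; a line such as "1.5万人参加" is excluded by the lookahead, but other false positives remain possible.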

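Table 2
Numbering method	Example
Chinese cardinal numeral + "要"	一要
Chinese cardinal numeral + "是"	一是
"其" + Chinese cardinal numeral	其一
Chinese ordinal numeral	第一
Sequence words	首先, 其次, 再次, 最后 (first, next, then, finally)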
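
The in-text enumeration markers listed above can be recognized in the same rule-based way. This is a small sketch under assumptions: the function name `enumeration_marker`, the category labels, and the test sentences are hypothetical, and the paper does not state that it uses regular expressions for this step.

```python
import re

CN_NUM = "一二三四五六七八九十"

# Patterns are checked in insertion order; each key is an illustrative label.
ENUM_PATTERNS = {
    "cardinal+要": re.compile(rf"^[{CN_NUM}]+要"),
    "cardinal+是": re.compile(rf"^[{CN_NUM}]+是"),
    "其+cardinal": re.compile(rf"^其[{CN_NUM}]+"),
    "ordinal": re.compile(rf"^第[{CN_NUM}]+"),
    "sequence word": re.compile(r"^(首先|其次|再次|最后)"),
}


def enumeration_marker(sentence):
    """Return the label of the first marker that opens the sentence, else None."""
    text = sentence.strip()
    for label, pattern in ENUM_PATTERNS.items():
        if pattern.match(text):
            return label
    return None
```

The ordinal rule also fires on strings such as "第一章", so a real pipeline would need extra context to separate enumeration markers from chapter numbers.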

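Table 3
Element	Rules
Security classification (密级)	① Only four values are allowed: "秘密" (secret), "机密" (highly secret), "绝密" (top secret), and "绝密·核心" (top secret, core); ② in "秘密", "机密", and "绝密", the two characters are separated by one Chinese-character space; ③ usually flush left at the top of the document, below the copy number and above the urgency level.
Urgency level (紧急程度)	① Only two values are allowed: "特急" (extra urgent) and "加急" (urgent); ② in "特急" and "加急", the two characters are separated by one Chinese-character space; ③ usually flush left at the top of the document, below the copy number and security classification and above the issuing authority mark; ④ may instead appear flush left on the line below the issuing authority mark and above the body title.
Document number (发文字号)	① Composed of "issuing-agency code + 〔year of issue〕 + serial number"; ② the hexagonal brackets "〔〕", a symbol specific to Chinese encoding, are taken as the identifying feature of a document number.
Main recipient (主送机关)	Usually on the line below the copy number, security classification, urgency level, document number, etc., and above the body title.
Title (标题)	① Usually below the document number and above the main recipient; ② may occupy one line or several consecutive lines.
Copy recipients (抄送机关)	① Usually on a separate line in the imprint section, marked with "抄送:"; ② may share a line with the number of printed copies, flush left.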
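
Several of the rules above are concrete enough to check mechanically: the closed value sets for security classification and urgency, the character spacing, and the "〔〕" brackets that mark a document number. The sketch below is illustrative only. The function name `classify_header_line`, the optional trailing "号", and the sample strings are assumptions, and the positional rules (above or below other elements) are not modeled.

```python
import re

# Closed value sets taken from the rules table.
SECRECY_LEVELS = {"秘密", "机密", "绝密", "绝密·核心"}
URGENCY_LEVELS = {"特急", "加急"}

# Agency code + 〔year〕 + serial number; the trailing "号" is an assumption
# and is treated as optional because the rules table does not mention it.
DOC_NUMBER = re.compile(
    r"(?P<code>[\u4e00-\u9fff]+)〔(?P<year>\d{4})〕(?P<serial>\d+)号?"
)


def classify_header_line(line):
    """Label a header line as secrecy, urgency or doc_number, else None."""
    # Drop the one-character spacing ("机　密") before comparing values.
    compact = re.sub(r"\s+", "", line)
    if compact in SECRECY_LEVELS:
        return ("secrecy", compact)
    if compact in URGENCY_LEVELS:
        return ("urgency", compact)
    match = DOC_NUMBER.search(compact)
    if match:
        return ("doc_number", {
            "code": match.group("code"),
            "year": int(match.group("year")),
            "serial": int(match.group("serial")),
        })
    return None
```

Because the agency-code group is greedy, any Chinese text immediately before the code on the same line would be captured with it; the sketch assumes the document number stands alone on its line.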

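Table 4
Element entity	Count
Issuing authority mark	1,347
Document number	1,344
Body title	1,359
Main recipient	1,065
Level-1 heading	5,184
Level-2 heading	4,697
Level-3 heading	1,415
Issuing authority	2,073
Date of issue	1,178
Body text	10,280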

Table 5
Element type	Method 1^A (F1)	Method 2^B (F1)	Method 3^C (F1)
Issuing authority mark	0.7905	0.8296	0.9029
Document number	0.9712	0.9620	0.9877
Body title	0.5707	0.9620	0.9620
Main recipient	0.9734	0.8211	0.9989
Level-1 heading	0.9734	0.9076	0.9961
Level-2 heading	0.9555	0.7733	0.9302
Level-3 heading	0.6502	0.8190	0.9310
Issuing authority	0.7955	0.6115	0.8466
Date of issue	0.8204	0.9091	0.9530
Body text	0.8488	0.7522	0.9212
Average	0.8350	0.8347	0.9430
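
The average row is consistent with an unweighted (macro) mean over the ten element types, which can be checked directly from the per-element scores. In the sketch below, the English dictionary keys are glosses of the Chinese element names and `macro_average` is a hypothetical helper; the numbers are copied from the table.

```python
# Per-element F1 scores for Method 1, Method 2 and Method 3, from the table.
F1_SCORES = {
    "issuing authority mark": (0.7905, 0.8296, 0.9029),
    "document number":        (0.9712, 0.9620, 0.9877),
    "body title":             (0.5707, 0.9620, 0.9620),
    "main recipient":         (0.9734, 0.8211, 0.9989),
    "level-1 heading":        (0.9734, 0.9076, 0.9961),
    "level-2 heading":        (0.9555, 0.7733, 0.9302),
    "level-3 heading":        (0.6502, 0.8190, 0.9310),
    "issuing authority":      (0.7955, 0.6115, 0.8466),
    "date of issue":          (0.8204, 0.9091, 0.9530),
    "body text":              (0.8488, 0.7522, 0.9212),
}


def macro_average(scores):
    """Unweighted mean of each method's column, rounded to four decimals."""
    columns = zip(*scores.values())
    return [round(sum(col) / len(col), 4) for col in columns]


averages = macro_average(F1_SCORES)
print(averages)  # [0.835, 0.8347, 0.943]
```

A macro mean weights every element type equally, so frequent types such as body text (10,280 instances in the count table) influence the average no more than rarer ones.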
