Systems Engineering and Electronics

• Software, Algorithm and Simulation •

Computation of document similarity based on metadata and domain concept tree

ZHANG Peiyun (张佩云)1,2, CHEN Enhong (陈恩红)2, XIE Rongjian (谢荣见)3, GONG Xiuwen (宫秀文)1, HUANG Bo (黄波)4

  1. School of Mathematics and Computer Science, Anhui Normal University, Wuhu 241003, China;
  2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China;
  3. School of Management, University of Science and Technology of China, Hefei 230026, China;
  4. School of Computer Science & Technology, Nanjing University of Science and Technology, Nanjing 210094, China
  • Online: 2014-03-24 Published: 2010-01-03

Abstract:

With the rapid development of network and information technology, a large number of electronic documents have appeared online, and computing the similarity between documents is an important means of document processing. For large-scale document collections, the vector space model (VSM) is usually used for document representation, but it suffers from high vector dimensionality and from the difficulty of measuring semantic similarity between documents. An improved method for computing document similarity is proposed. Representative metadata feature vector elements are selected from the large feature space to reduce the dimension of the vector space. A domain concept tree is constructed and a document similarity algorithm based on it is designed, which handles the synonyms that are widespread among domain concepts in order to improve the performance of semantic similarity measurement between documents. The experimental results show that dimensionality reduction and concept similarity computation improve the performance of document similarity computation.
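The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea under stated assumptions. It restricts each document to terms from a few metadata fields (here assumed to be `title` and `keywords`), maps synonyms to one canonical concept, and compares documents through a small concept tree. The synonym table, the tree, the field names, and the Wu-Palmer-style depth measure are all illustrative choices, not the authors' method.

```python
import math
from collections import Counter

# Hypothetical synonym table: surface term -> canonical concept.
SYNONYMS = {"pc": "computer", "laptop": "notebook"}

# Hypothetical domain concept tree, stored as child -> parent (root has None).
PARENT = {
    "device": None,
    "computer": "device",
    "phone": "device",
    "notebook": "computer",
    "desktop": "computer",
}


def canonical(term):
    """Lower-case a term and replace it by its canonical concept, if any."""
    t = term.lower()
    return SYNONYMS.get(t, t)


def metadata_terms(doc, fields=("title", "keywords")):
    """Keep only terms from the chosen metadata fields (assumed field names)."""
    terms = []
    for field in fields:
        terms.extend(doc.get(field, "").split())
    return terms


def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT.get(concept)
    return path


def concept_similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lca) / (depth(a) + depth(b))."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return 1.0
    if a not in PARENT or b not in PARENT:
        return 0.0  # unknown concepts only match themselves
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    lca = next((c for c in path_a if c in ancestors_b), None)
    if lca is None:
        return 0.0
    # Depth counts nodes on the path to the root, so the root has depth 1.
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def cosine(terms_a, terms_b):
    """Plain VSM cosine over term counts, after synonym normalization."""
    va = Counter(map(canonical, terms_a))
    vb = Counter(map(canonical, terms_b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def tree_similarity(terms_a, terms_b):
    """Symmetric average of each term's best concept match in the other document."""
    if not terms_a or not terms_b:
        return 0.0

    def directed(xs, ys):
        return sum(max(concept_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(terms_a, terms_b) + directed(terms_b, terms_a)) / 2


if __name__ == "__main__":
    d1 = {"title": "laptop", "keywords": "PC", "body": "long body text is ignored"}
    d2 = {"title": "notebook", "keywords": "desktop", "body": "also ignored"}
    a, b = metadata_terms(d1), metadata_terms(d2)
    print("VSM cosine:", cosine(a, b))
    print("concept-tree similarity:", tree_similarity(a, b))
```

In this toy setup the tree-based score rates the two documents as closer than the plain cosine does, because related but non-identical concepts such as `computer` and `desktop` still receive partial credit.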