Journal of Systems Engineering and Electronics ›› 2010, Vol. 32 ›› Issue (5): 1088-1093. doi: 10.3969/j.issn.1001-506X.2010.05.044

• Software, Algorithms and Simulation •

Performance evaluation method for hierarchical text classification

宋胜利, 鲍亮, 陈平   

  1. (Institute of Software Engineering, Xidian University, Xi'an, Shaanxi 710071)
  • Publication date: 2010-05-24 Release date: 2010-01-03

Hierarchical text classification and evaluation

SONG Sheng-li, BAO Liang, CHEN Ping   

  1. (Software Engineering Inst., Xidian Univ., Xi’an 710071, China)
  • Online: 2010-05-24 Published: 2010-01-03

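Abstract:

To evaluate hierarchical text classification methods accurately and to address the limitations of applying traditional flat classification metrics to hierarchical classification, this work builds on a study of concept-tree-based hierarchical text classification. It makes effective use of the level relationships and the "closeness" relationships between categories in the hierarchy to propose a set of extended evaluation metrics that accurately describe hierarchical classification performance. An error classification concentration is defined from the distribution of misclassified samples; in addition to evaluating classification results, it can guide the selection of training samples so that they are more representative. Classification experiments on a Chinese news corpus show that the extended metrics evaluate hierarchical classification results more accurately and that the error classification concentration helps train a more accurate classification model.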

Abstract:

To evaluate hierarchical classification methods and to overcome the limitations of conventional flat classification measures when they are applied to hierarchical classification, a set of extended measures is proposed. Building on a study of the concept-tree-based hierarchical classification method, the measures use the level and the “affinity” among the categories in the hierarchical structure to describe classification performance accurately. In addition, an error classification concentration ratio (ECCR) is defined based on the distribution of misclassified samples. Besides evaluating the classification result, ECCR can guide the selection of training samples so that the training set is more representative. Experiments on Chinese news corpus classification show that the extended measures evaluate hierarchical classification results more accurately and that ECCR helps train a more accurate classification model.
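
The abstract does not give the formulas for the extended measures or for ECCR, so the following is only an illustrative sketch of the general idea, not the authors' definitions. It assumes a small hypothetical concept tree (`PARENT`), uses the widely known ancestor-overlap form of hierarchical precision and recall as a stand-in for a hierarchy-aware measure, and uses a simple "share of errors on the most frequent confusion pair" as a stand-in for an error concentration ratio. All names and category labels are invented for the example.

```python
from collections import Counter

# Hypothetical concept tree: child -> parent (None marks the root).
PARENT = {
    "news": None,
    "sports": "news",
    "finance": "news",
    "football": "sports",
    "basketball": "sports",
    "stocks": "finance",
}


def ancestors(label):
    """Return the label plus all of its ancestors, excluding the root."""
    path = set()
    while label is not None and PARENT[label] is not None:
        path.add(label)
        label = PARENT[label]
    return path


def hierarchical_prf(true_labels, predicted_labels):
    """Ancestor-overlap precision, recall and F1 (assumed stand-in measure)."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, predicted_labels):
        t_set, p_set = ancestors(t), ancestors(p)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def error_concentration(true_labels, predicted_labels):
    """Share of all errors that fall on the most frequent (true, predicted) pair.

    This is an assumed, simplified stand-in, not the paper's ECCR.
    """
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    total = sum(errors.values())
    return max(errors.values()) / total if total else 0.0


if __name__ == "__main__":
    truth = ["football", "football"]
    near_miss = ["basketball", "football"]  # wrong, but a sibling category
    far_miss = ["stocks", "football"]       # wrong, and in a distant branch
    print(hierarchical_prf(truth, near_miss))
    print(hierarchical_prf(truth, far_miss))
```

Both predictions have the same flat accuracy (one of two correct), yet the ancestor-overlap scores rank the sibling confusion above the cross-branch one, which is the kind of distinction a flat measure cannot express. A high concentration value would suggest that errors cluster on a few category pairs, which is where additional training samples might be most useful.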