Systems Engineering and Electronics (系统工程与电子技术)

• Software, Algorithms and Simulation •

面向大数据处理的划分聚类新方法

卢志茂1,2,冯进玫1,3,范冬梅2,杨朋1,田野1,4   

  1. 哈尔滨工程大学模式识别与自然计算研究室, 黑龙江 哈尔滨 150001;
    2. 大连理工大学计算机科学与技术学院, 辽宁 大连 116024;
    3. 黑龙江科技大学电子与信息工程学院, 黑龙江 哈尔滨 150022;
    4. 哈尔滨师范大学物理与电子工程学院, 黑龙江 哈尔滨 150025
  • 出版日期:2014-05-22 发布日期:2010-01-03

Novel partitional clustering algorithm for large data processing

LU Zhi-mao1,2, FENG Jin-mei1,3, FAN Dong-mei2, YANG Peng1, TIAN Ye1,4

  1. Pattern Recognition and Natural Computation Laboratory, Harbin Engineering University, Harbin 150001, China;
    2. School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China;
    3. College of Electronics and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150022, China;
    4. School of Physics & Electronic Engineering, Harbin Normal University, Harbin 150025, China
  • Online:2014-05-22 Published:2010-01-03

摘要:

大数据处理是物联网研究和应用上不可回避的难题之一,针对常用聚类方法在大数据处理上的不足,设计了一种划分聚类新方法。该方法采用了大数据集的抽样技术,对多次抽取的规模足够大的样本进行聚类以确定自然簇质心的初始位置,在此基础上采用抽样后剩余数据样本对质心的初始位置进行更新,以便校正偏离理想位置的初始质心。该划分聚类算法具有线性空间复杂度和时间复杂度。实验结果表明所提的新聚类算法不仅能得到比常用聚类算法更理想的结果,而且运行速度快,适合处理大规模数据的聚类任务。

Abstract:

Big data processing is an unavoidable challenge in Internet of Things research and applications. To address the shortcomings of common clustering methods on big data, a novel partitional clustering method is designed. The method repeatedly draws samples of sufficiently large size from the data set and clusters them to determine the initial positions of the natural cluster centroids. It then uses the data remaining after sampling to update these initial positions, correcting centroids that deviate from their ideal positions. The algorithm has linear space and time complexity. Experimental results show that the new algorithm not only yields better clustering results than common clustering algorithms but also runs fast, making it suitable for clustering large-scale data.
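The two-stage idea in the abstract (cluster repeated samples to place initial centroids, then correct them with the remaining data) can be illustrated with a minimal sketch. The abstract does not give the algorithmic details, so everything specific here is an assumption rather than the authors' method: plain Lloyd's k-means clusters each sample, the per-sample centroids are pooled and clustered again to obtain k initial centroids, and the remaining points are applied in one pass as a running-mean update. The names `kmeans`, `sample_then_refine`, `n_samples`, and `sample_size` are hypothetical.

```python
import numpy as np


def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's k-means on a float array X of shape (n, d); returns (k, d) centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids


def sample_then_refine(X, k, n_samples=5, sample_size=200, seed=0):
    """Stage 1: cluster repeated samples. Stage 2: refine with the unsampled data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    used = np.zeros(n, dtype=bool)
    pooled = []
    for _ in range(n_samples):
        idx = rng.choice(n, size=sample_size, replace=False)
        used[idx] = True
        pooled.append(kmeans(X[idx], k, rng))
    # Assumption: merge the per-sample centroids by clustering them again.
    centroids = kmeans(np.vstack(pooled), k, rng)
    # Assumption: each initial centroid starts with weight 1, then every
    # remaining point moves its nearest centroid by a running-mean step.
    counts = np.ones(k)
    for x in X[~used]:
        j = ((centroids - x) ** 2).sum(axis=1).argmin()
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Three synthetic 2-D blobs, used only to exercise the sketch.
    true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([c + rng.normal(size=(1000, 2)) for c in true_centers])
    print(sample_then_refine(X, k=3))
```

In this sketch the refinement pass visits each remaining point once and stores only the k centroids and their counts, so its cost grows linearly with the number of points for fixed k, in line with the linear complexity stated in the abstract. The quality of the result still depends on how well the sampled stage places the initial centroids.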