Parallel Implementation of the K-means Clustering Algorithm Based on MapReduce (Graduation Thesis)

2021-04-05 11:02:11

Abstract (Chinese version)

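With the development of information technology, the volume of data on the network has grown explosively. To extract the information contained in this data, data mining has attracted considerable attention. Clustering, an important topic in data mining, can discover implicit relationships among data objects through unsupervised learning. When the data volume becomes too large, however, traditional clustering algorithms run into two difficulties: (1) because they must load all of the data into a single machine's memory for processing, their hardware requirements become excessive for big data; (2) because they are limited by the CPU performance of a single machine, they are slow and inefficient on large data sets.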

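In recent years, advances in distributed computing have offered a feasible solution to this problem: implementing data mining algorithms in distributed form with parallel computing techniques. Taking the K-Means clustering algorithm as an example, this thesis uses the Hadoop distributed framework and the MapReduce programming model to parallelize K-Means on a distributed cluster. The distributed part of the parallel K-Means algorithm consists of three stages: the Map, Combine, and Reduce functions. In the Map stage, the program runs on each node, reads that node's data objects, and assigns each of them to a cluster by distance. Combine sits between Map and Reduce and uses the output of the Map function to compute the local cluster centers within a single node. Reduce receives the local cluster centers computed by each Combine and merges them into the global cluster centers. In this way, the iterative computation of K-Means is carried out in a distributed fashion.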
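The three stages described above can be sketched in plain Python, without Hadoop, to make the data flow concrete. This is a minimal single-process illustration, not the thesis's implementation: the function names (map_phase, combine_phase, reduce_phase, parallel_kmeans), the use of in-memory lists as stand-ins for per-node input splits, and the convergence tolerance are all assumptions made here. One further assumption: each Combine emits a per-cluster (coordinate sum, count) pair rather than a bare local center, so that Reduce can weight every node's contribution by the number of points behind it.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_phase(points, centers):
    """Map: emit (cluster_id, point) for the nearest current center."""
    for p in points:
        cid = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        yield cid, p


def combine_phase(mapped):
    """Combine: per node, accumulate a coordinate sum and a count per cluster."""
    partial = {}
    for cid, p in mapped:
        if cid not in partial:
            partial[cid] = ([0.0] * len(p), 0)
        s, n = partial[cid]
        partial[cid] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_phase(partials, centers):
    """Reduce: merge all nodes' partial sums into new global centers."""
    totals = {}
    for partial in partials:
        for cid, (s, n) in partial.items():
            if cid in totals:
                ts, tn = totals[cid]
                totals[cid] = ([a + b for a, b in zip(ts, s)], tn + n)
            else:
                totals[cid] = (list(s), n)
    # A cluster that received no points keeps its previous center.
    new_centers = [list(c) for c in centers]
    for cid, (s, n) in totals.items():
        new_centers[cid] = [x / n for x in s]
    return new_centers


def parallel_kmeans(splits, centers, max_iter=20, tol=1e-12):
    """Iterate Map -> Combine -> Reduce until the centers stop moving.

    `splits` is a list of point lists, one per simulated node.
    """
    centers = [[float(x) for x in c] for c in centers]
    for _ in range(max_iter):
        partials = [combine_phase(map_phase(split, centers)) for split in splits]
        new_centers = reduce_phase(partials, centers)
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    # Two simulated nodes, each holding part of the data set.
    splits = [
        [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
        [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)],
    ]
    print(parallel_kmeans(splits, centers=[(0.0, 0.0), (10.0, 10.0)]))
```

With the two simulated nodes in the example, the centers settle at roughly (0.33, 0.33) and (10.33, 10.33). In a real Hadoop job each iteration would be a separate MapReduce job, with the current centers distributed to every mapper; that orchestration is omitted here.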

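In the experiments, this thesis compares the running speed of parallel K-Means with that of traditional K-Means and shows that the parallel version performs well on large data sets. Tuning the Hadoop parameters then shows that the performance of parallel K-Means is constrained by the number of parallel Map tasks, and is best when the number of Map tasks reaches the cluster's upper limit on concurrent tasks.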

Keywords: cluster analysis; K-Means; MapReduce; Hadoop; parallel algorithm

Abstract

With the development of information technology, the scale of data on the network has grown explosively. To extract the information in this data, data mining technology has received great attention. As an important topic in data mining, clustering can discover implicit relationships among data through unsupervised learning. However, when the amount of data is too large, traditional clustering algorithms face the following problems: (1) since a clustering algorithm needs to read all the data into the computer's memory for processing, its hardware requirements become too high for big data; (2) traditional clustering algorithms are limited by the CPU performance of a single machine, so they are slow and inefficient when processing large data sets.

In recent years, the development of distributed computing technology has provided a feasible solution to this problem: using parallel computing to implement distributed data mining algorithms. Taking the K-Means clustering algorithm as an example, this thesis uses the Hadoop distributed framework and the MapReduce programming model to parallelize K-Means on a distributed cluster. The distributed part of the parallel K-Means algorithm is divided into three phases: the Map, Combine, and Reduce functions. In the Map phase, the program runs on each node, reads that node's data objects, and assigns each of them to a cluster by distance. Combine sits between Map and Reduce and is responsible for calculating the local cluster centers within a node from the output of the Map function. Reduce accepts the local cluster centers calculated by each Combine and merges them into the overall cluster centers. In this way, the iterative calculation of the K-Means algorithm is implemented as a distributed computation.

In the experimental part, this thesis compares the running speed of the parallel K-Means algorithm with that of the serial K-Means algorithm and shows that the parallel version performs well on large data sets. Then, by adjusting Hadoop parameters, it finds that the performance of parallel K-Means is constrained by the number of parallel Map tasks, and is best when the number of Map tasks reaches the cluster's upper limit on parallel tasks.

Keywords: Clustering; K-Means; MapReduce; Hadoop; Parallel algorithm

Introduction

Research Background and Significance

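Human activity on the Internet generates data at every moment. Yet, as the carrier of the complex information people produce, this data is often messy, heterogeneous, and hard to quantify. If the information contained in large volumes of data cannot be analyzed, extracted, and put to use effectively, the accumulated data offers no guidance for business, scientific research, or other activities, and may even become a burden. Data mining emerged to address this problem. Data mining is the science of extracting useful information from large data sets or databases; in practice, it applies automatic or semi-automatic analysis to large-scale data in order to extract previously unknown, potentially valuable information. If raw data is the ore produced by a mine, then the information in it is the refined metal, and data mining corresponds to the refining process. Data mining is now widely used in areas such as user behavior analysis, medical decision-making, and market trend forecasting.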

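Cluster analysis is an important topic in data mining. Clustering is the process of partitioning a physical or abstract data set into multiple groups according to the similarity between data objects. In the partitioned data set, objects in the same group are highly similar to one another, while objects in different groups have low similarity. Many clustering algorithms already exist, including the partition-based K-Means algorithm, the hierarchical CURE algorithm, the density-based DBSCAN algorithm, the grid-based STING method, and the model-based COBWEB method [1].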
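For the partition-based K-Means case, this notion of similarity is usually made concrete as a within-cluster sum of squared Euclidean distances. The notation below (clusters $C_1,\dots,C_k$, centers $\mu_j$, objective $J$) is introduced here for illustration and is not taken from the thesis:

```latex
J(C_1,\dots,C_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

A smaller $J$ means that points lie closer to the center of their own cluster, which corresponds to higher similarity within each group under Euclidean distance.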
