Graduation Thesis: Design of a Distributed Web Crawler Based on Python

2022-01-27 15:19:38

Total thesis length: 18,988 characters

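Abstract (translated from the Chinese)

Web crawlers are mainly used to collect resources of all kinds from the web. A crawler is a program or script that continuously fetches web page information and data according to preset logic and rules. As the technology has developed, crawlers have matured and found ever wider use. Search engines are one concrete application of crawlers: with any search engine, people can quickly obtain the data and information they need.

In a distributed crawler, one computer is responsible for fetching links while the other computers handle downloading and storage, so that crawling is efficient. A distributed crawler raises crawling efficiency, which lowers costs and increases returns. For individuals and organizations alike, it is an efficient means of gathering information.

In an era of data explosion, crawler design and development has strong prospects. To understand crawlers in more depth and to become familiar with their design and development, this project designs and develops a distributed crawler, drawing on existing crawler development experience and crawler architectures. Given practical constraints, the project uses Python to implement a relatively simple distributed crawler in master-slave mode.

Python is currently a language with considerable potential, and many crawlers are written in it. This project therefore uses Python to design the distributed web crawler.

Keywords: web crawler; distributed system; big data; search engine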

Design of a Distributed Web Crawler Based on Python

Abstract

A web crawler is mainly used to collect resources of all kinds from the Internet. It is a program or script that continuously crawls web page information and data according to preset logic and rules. As technology has developed, web crawlers have become more mature and more widely used. A search engine is one concrete application of crawlers: through any search engine, people can quickly get the data and information they need.

In a distributed crawler, one computer is responsible for fetching links, and the other computers are responsible for downloading and storing, which makes crawling efficient. Using a distributed crawler improves crawling efficiency, thereby reducing costs and increasing returns. For individuals and organizations alike, distributed crawlers are an efficient means of gathering information.

In an environment of data explosion, the design and development of crawlers is very promising. To understand crawlers more deeply and become familiar with their design and development, this project designs and develops a distributed crawler based on existing crawler development experience and crawler architectures. Given practical constraints, we use Python to implement a relatively simple distributed crawler in master-slave mode.

Python is currently a language with considerable potential, and many crawlers are written in Python. This project therefore uses Python to design the distributed web crawler.

Keywords: Web crawler; Distributed system; Big data; Search engine
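The division of labor described in the abstract, with one control node handing out links and several crawler nodes fetching pages, can be illustrated with a minimal sketch. It is not the thesis's implementation: threads stand in for separate machines, an in-memory dictionary (`FAKE_WEB`) stands in for the network, and every name here (`fake_download`, `worker`, `run_master`) is hypothetical. Keeping results on the control node follows the chapter outline, which lists the data store under the control node.

```python
import queue
import threading

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}


def fake_download(url):
    """Stand-in for an HTML downloader and parser: return a page's links."""
    return FAKE_WEB.get(url, [])


def worker(task_queue, result_queue):
    """Crawler node: take a URL, 'download' it, report the links it contains."""
    while True:
        url = task_queue.get()
        if url is None:  # sentinel from the control node: shut down
            break
        result_queue.put((url, fake_download(url)))


def run_master(seed, num_workers=2):
    """Control node: deduplicate URLs, hand out tasks, store the results."""
    task_queue, result_queue = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(task_queue, result_queue))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    seen = {seed}   # URL manager: every URL ever scheduled
    stored = {}     # data store: url -> links extracted from that page
    pending = 1     # tasks handed out but not yet reported back
    task_queue.put(seed)

    while pending:
        url, links = result_queue.get()
        pending -= 1
        stored[url] = links
        for link in links:
            if link not in seen:  # schedule each URL at most once
                seen.add(link)
                task_queue.put(link)
                pending += 1

    for _ in threads:  # one shutdown sentinel per crawler node
        task_queue.put(None)
    for t in threads:
        t.join()
    return stored


if __name__ == "__main__":
    pages = run_master("http://example.com/")
    print(sorted(pages))
```

The `seen` set is what keeps the crawl finite even though the sample pages link back to each other. In a real deployment the two queues would have to be shared across machines instead of across threads.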

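Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

1.1 Background and Significance

1.2 Web Crawlers

1.2.1 Crawler Classification

1.2.2 Main Search Strategies

1.2.3 Principles of Distributed Crawlers

1.3 Main Research Content

1.4 Chapter Summary

Chapter 2 Overview of the Development Language and Environment

2.1 The Python Language

2.1.1 Introduction to Python

2.1.2 Python Development Environment

2.2 Python, Eclipse, and PyDev

2.3 Chapter Summary

Chapter 3 System Requirements Analysis and Outline Design

3.1 System Feasibility Analysis

3.2 Requirements Analysis

3.2.1 System Requirements

3.2.2 Functional Requirements

3.3 Overall System Design

3.3.1 System Functional Goals

3.3.2 Model Analysis of the Distributed Web Crawler

3.3.3 Outline Design of the Distributed Web Crawler

Chapter 4 System Implementation

4.1 Control Node

4.1.1 URL Manager

4.1.2 Data Store

4.1.3 Control Scheduler

4.2 Crawler Node

4.2.1 HTML Downloader

4.2.2 HTML Parser

4.2.3 Crawler Scheduler

4.3 Program Execution and Results

Chapter 5 System Testing

5.1 Test Methods

5.2 Test Content

5.2.1 Functional Testing

5.2.2 Performance Testing

5.2.3 Interface Testing

5.2.4 Compatibility Testing

5.2.5 Regression Testing

5.3 Problems and Solutions

Chapter 6 Conclusion

References

Acknowledgments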

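Chapter 1 Introduction

1.1 Background and Significance

With the continuous development of science and technology, we have entered an era of information explosion. Information of all kinds keeps growing exponentially, and each of us lives in and contributes to it. Only by filtering out the useful information and putting it to use can we adapt well to this era. The Internet has made spreading and collecting information more convenient. At the same time, the speed and scale at which data is collected and spread keep growing, which makes analysis and processing correspondingly harder.

The information explosion has produced massive amounts of data, and that data is still expanding. It is characterized by its value, volume, variety, velocity, and veracity. The rapid growth of data has outstripped the ability of traditional data software to capture, store, manage, and analyze it. In the era of big data, retrieving data effectively and presenting it in an organized way is of great significance: whoever captures the data gains the first-mover advantage.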
