import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def bfs_crawler(seed_url, max_depth):
    visited = set()          # URLs that have already been crawled
    queue = [(seed_url, 0)]  # FIFO queue of (URL, depth) pairs waiting to be crawled
    while queue:
        url, depth = queue.pop(0)  # dequeue the next URL and its depth
        if depth > max_depth:
            continue
        if url in visited:
            continue
        try:
            # Fetch the page; a timeout keeps the crawler from hanging on slow hosts.
            response = requests.get(url, timeout=10)
            print('crawl ' + url)
            html_text = response.text
            visited.add(url)  # mark this URL as visited
            soup = BeautifulSoup(html_text, 'html.parser')
            # print(soup.prettify())  # dump the parsed page for inspection
            # TODO: save the extracted page content to a database
            #       (see the sketch after this function for one option)
            links = soup.find_all('a')  # collect every link on the page
            for link in links:
                absolute_url = urljoin(url, link.get('href'))  # resolve relative URLs against the current page
                if absolute_url.startswith('https') and absolute_url not in visited:
                    queue.append((absolute_url, depth + 1))  # enqueue the new URL one level deeper
        except requests.exceptions.RequestException:
            continue
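
# One possible way to fill the "save to a database" TODO above. This is only a
# minimal sketch: SQLite via the standard-library sqlite3 module is an assumed
# choice, not something the original code specifies, and the table layout
# (URL plus raw HTML) is likewise illustrative.
import sqlite3


def save_page(db_path, url, html_text):
    # Store the raw HTML of a crawled page, keyed by its URL.
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)'
    )
    conn.execute(
        'INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)',
        (url, html_text),
    )
    conn.commit()
    conn.close()

# Inside bfs_crawler this could be called right after parsing, e.g.:
#     save_page('crawl.db', url, html_text)
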
seed_url = 'https://www.oreilly.com/search/?q=python&type=*&rows=10'  # seed URL to start crawling from
max_depth = 2  # maximum crawl depth
bfs_crawler(seed_url, max_depth)  # run the breadth-first crawler
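
# Note on the queue: list.pop(0) is O(n) because every remaining element shifts
# left on each dequeue, so the cost grows as the crawl frontier gets large.
# collections.deque keeps the same FIFO (breadth-first) order with O(1)
# popleft(). The swap touches only two lines of bfs_crawler; a minimal sketch:
#
#     from collections import deque
#     queue = deque([(seed_url, 0)])   # instead of a plain list
#     url, depth = queue.popleft()     # instead of queue.pop(0)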