Comprehensive Exercise: Mini Crawler

46.9 Comprehensive Exercise: Mini Crawler

Crawl httpbin.org/html, extract the h1 heading and the paragraph text, and write the result to a JSON file.
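The extraction step can be tried offline before any request is sent. The sketch below runs the same two regular expressions on a small, made-up HTML string (`sample_html` is a hypothetical stand-in for a downloaded page, not the real content of httpbin.org/html). It assumes the tags carry no attributes, which is the only case these patterns match.

```python
import re

# Hypothetical sample markup standing in for a downloaded page
sample_html = """<html><body>
<h1>Sample Title</h1>
<p>
  First paragraph,
  spanning two lines.
</p>
<p>Second paragraph.</p>
</body></html>"""

# Non-greedy match between bare <h1> tags
h1 = re.findall(r'<h1>(.*?)</h1>', sample_html)

# re.S lets "." match newlines, so multi-line paragraphs are captured too
paragraphs = [p.strip() for p in re.findall(r'<p>(.*?)</p>', sample_html, re.S)]

print(h1)
print(paragraphs)
```

Without `re.S`, the first paragraph would be missed because its text spans several lines. For markup with attributes or nesting, an HTML parser is usually a safer choice than regular expressions.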

A complete small crawler

Extension: crawl a list of several URLs, sleeping for a short interval between requests.

# ========================================
# Example: mini crawler
# ========================================
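import requests
import re
import json
import time  # not used below; only needed for the multi-URL extension (sleep between requests)

def crawl(url):
    """Fetch a page and return its h1 headings and paragraph text."""
    # Send an identifying User-Agent and give up after 10 seconds
    r = requests.get(url, timeout=10, headers={'User-Agent': 'LearnBot/1.0'})
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    html = r.text
    # These patterns match only bare <h1>/<p> tags (no attributes)
    h1 = re.findall(r'<h1>(.*?)</h1>', html)
    ps = re.findall(r'<p>(.*?)</p>', html, re.S)  # re.S lets "." span newlines
    # Keep at most the first 80 characters of each paragraph
    return {'url': url, 'h1': h1, 'paragraphs': [p.strip()[:80] for p in ps]}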

data = crawl('https://httpbin.org/html')
with open('crawl_result.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
print('Crawl finished', data['h1'])
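One possible shape for the multi-URL extension is sketched here. The name `crawl_many` and its parameters (`fetch`, `delay`, `sleep`) are assumptions made for illustration and are not part of the original exercise. The page-fetching function is passed in as `fetch`, so the `crawl` function from the listing can be reused as-is, and the loop can be tried without network access.

```python
import time


def crawl_many(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing `delay` seconds between requests.

    `fetch` is any function that takes a URL and returns a result dict.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        try:
            results.append(fetch(url))
        except Exception as exc:  # deliberately broad for a sketch: record and continue
            results.append({'url': url, 'error': str(exc)})
    return results
```

With the listing's `crawl` in scope, a call such as `crawl_many(['https://httpbin.org/html'], fetch=crawl, delay=2)` would return a list of result dicts that can be passed to `json.dump` in the same way as `data`. A failing URL is recorded with an `error` key rather than stopping the whole run, which is one design choice among several.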