46.9 Comprehensive Exercise: Mini Crawler
Fetch httpbin.org/html, extract the h1 heading and the paragraph text, and write the result to a JSON file.
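Before making any network request, it can help to try the extraction step on a fixed string. The following is a minimal sketch, assuming a made-up sample page (SAMPLE_HTML) and a hypothetical helper named extract; neither comes from the exercise itself. It uses the same kind of regular expressions as the crawler and shows why the paragraph pattern needs re.S.

```python
import re

# Hypothetical sample page, standing in for a real response body.
SAMPLE_HTML = """<html><body>
<h1>Sample Title</h1>
<p>
  First paragraph,
  spanning two lines.
</p>
<p>Second paragraph.</p>
</body></html>"""


def extract(html):
    """Return (h1 list, paragraph list) found in an HTML string."""
    # Non-greedy (.*?) stops at the first closing tag instead of the last one.
    h1 = re.findall(r'<h1>(.*?)</h1>', html)
    # re.S lets '.' match newlines, so paragraphs that span lines are captured.
    ps = re.findall(r'<p>(.*?)</p>', html, re.S)
    return h1, [p.strip() for p in ps]


h1, paragraphs = extract(SAMPLE_HTML)
print(h1)
print(paragraphs)
```

These patterns only match bare tags such as <h1> and <p>; a tag with attributes (for example a class) would be missed, which is one reason an HTML parser is usually preferred for real pages.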
Complete mini crawler
Extension: accept a list of URLs and sleep between requests.
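# ========================================
# Example: mini crawler
# ========================================
import json
import re
import time  # not used below; needed for the sleep interval in the multi-URL extension

import requests


def crawl(url):
    """Fetch a page and extract its h1 headings and paragraph text."""
    r = requests.get(url, timeout=10, headers={'User-Agent': 'LearnBot/1.0'})
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    html = r.text
    # These patterns only match bare <h1> and <p> tags (no attributes).
    h1 = re.findall(r'<h1>(.*?)</h1>', html)
    ps = re.findall(r'<p>(.*?)</p>', html, re.S)  # re.S: '.' also matches newlines
    # Keep only the first 80 characters of each paragraph.
    return {'url': url, 'h1': h1, 'paragraphs': [p.strip()[:80] for p in ps]}


data = crawl('https://httpbin.org/html')
with open('crawl_result.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
print('Crawl finished', data['h1'])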
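
The extension mentioned earlier (a list of URLs with a sleep interval) could be sketched roughly as follows. The name crawl_many and its parameters (crawl_fn, delay, sleep) are assumptions made for illustration, not part of the exercise; crawl_fn is meant to be the crawl function defined above.

```python
import time


def crawl_many(urls, crawl_fn, delay=1.0, sleep=time.sleep):
    """Call crawl_fn on each URL in order, pausing `delay` seconds between requests.

    crawl_fn is expected to behave like crawl(): take a URL, return a dict.
    `sleep` is injectable only so the loop can be exercised without real waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        try:
            results.append(crawl_fn(url))
        except OSError as exc:
            # requests.RequestException derives from OSError, so network and
            # HTTP-status failures land here; record them and keep going.
            results.append({'url': url, 'error': str(exc)})
    return results
```

With the crawler above, a call might look like crawl_many(['https://httpbin.org/html'], crawl, delay=2), and the returned list can be passed to json.dump in the same way as the single result. Recording failures per URL instead of aborting is one possible design choice; stopping at the first error would be equally reasonable for a small exercise.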