46.4 Parsing HTML with Regular Expressions

For simple, predictable markup the re module is enough; for complex pages use BeautifulSoup (see the dedicated chapter).

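re.search returns the first match (or None if there is none), and re.findall returns every match as a list. Together they cover tasks such as extracting the page title or the href values of links.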
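To make the difference between the two functions concrete, here is a minimal sketch. The sample markup and variable names are made up for illustration only.

```python
# ========================================
# Example: re.search vs. re.findall
# ========================================
import re

html = '<p>first</p><p>second</p>'  # made-up sample markup

# re.search stops at the first hit and returns a match object, or None.
first = re.search(r'<p>(.*?)</p>', html)
first_text = first.group(1) if first else None

# re.findall returns a list; with one capture group, each item is that group's text.
all_text = re.findall(r'<p>(.*?)</p>', html)

print(first_text)  # first
print(all_text)    # ['first', 'second']
```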

Extracting the title

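# ========================================
# Example: extract the title with a regex
# ========================================
import re

html = '<html><head><title>Python Tutorial</title></head></html>'
# The non-greedy group captures the title text; re.I also matches <TITLE>.
m = re.search(r'<title>(.*?)</title>', html, re.I)
print(m.group(1) if m else 'none')  # Python Tutorial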
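Real pages often put the title on several lines, add attributes to the tag, or use entities such as &amp;. The following is one possible sketch that tolerates those cases; extract_title is a hypothetical helper name, not something from the original text, and the sample page is invented.

```python
# ========================================
# Example: a more tolerant title extractor
# ========================================
import re
from html import unescape

def extract_title(page):
    """Return the cleaned-up title text, or None if no <title> is found."""
    # re.S lets '.' match newlines; [^>]* allows attributes on the tag.
    m = re.search(r'<title\b[^>]*>(.*?)</title>', page, re.I | re.S)
    if m is None:
        return None
    # Decode entities, then collapse runs of whitespace into single spaces.
    return ' '.join(unescape(m.group(1)).split())

page = '<TITLE>\n  Tom &amp; Jerry\n</TITLE>'  # invented sample
print(extract_title(page))  # Tom & Jerry
```

This still treats HTML as flat text, so it is only suitable for the simple cases described above.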

Extracting all links

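# ========================================
# Example: extract href values
# ========================================
import re

html = '<a href="/page1">A</a><a href="https://x.com">B</a>'
# Capture whatever sits between the quotes after href= (single or double quotes).
links = re.findall(r'href=["\'](.*?)["\']', html)
print(links)  # ['/page1', 'https://x.com']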
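The pattern above accepts a value that opens with one kind of quote and closes with the other, and it returns relative paths as they appear in the page. A possible refinement, sketched below, uses a backreference so the closing quote must match the opening one, and resolves each link with urllib.parse.urljoin. BASE_URL and extract_links are assumptions introduced for illustration only.

```python
# ========================================
# Example: matched quotes and absolute URLs
# ========================================
import re
from urllib.parse import urljoin

BASE_URL = 'https://example.com/docs/'  # hypothetical address of the page

# Group 1 is the opening quote; \1 requires the same quote to close the value.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.I)

def extract_links(page, base=BASE_URL):
    """Return every href in page as an absolute URL resolved against base."""
    return [urljoin(base, m.group(2)) for m in HREF_RE.finditer(page)]

page = '<a href="/page1">A</a><a HREF=\'https://x.com\'>B</a><a href="sub/p2">C</a>'
print(extract_links(page))
```

Links that are already absolute pass through urljoin unchanged, while relative ones are resolved against the base.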