46.4 Parsing HTML with Regular Expressions
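For simply structured markup, the re module is enough; for complex pages, use BeautifulSoup (see the dedicated chapter).
re.search returns the first match and re.findall returns every match, which makes them suitable for pulling out the title, link href values, and similar fields.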
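To make the difference between the two functions concrete, here is a minimal sketch. The sample markup and variable names are assumptions chosen for illustration, not taken from a real page.

```python
import re

# Hypothetical sample markup, used only for illustration.
html = '<ul><li>one</li><li>two</li><li>three</li></ul>'

# re.search stops at the first match and returns a Match object (or None).
first = re.search(r'<li>(.*?)</li>', html)
first_item = first.group(1) if first else None

# re.findall scans the whole string; with a single capture group it
# returns a list of the captured strings.
all_items = re.findall(r'<li>(.*?)</li>', html)

print(first_item)  # one
print(all_items)   # ['one', 'two', 'three']
```

Use re.search when only one value is expected (such as the page title) and re.findall when collecting repeated elements (such as links).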
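Extracting the title
# ========================================
# Example: extract the title with a regex
# ========================================
import re

html = '<html><head><title>Python Tutorial</title></head></html>'
# (.*?) lazily captures the text between the tags; re.I ignores tag case
m = re.search(r'<title>(.*?)</title>', html, re.I)
print(m.group(1) if m else 'not found')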
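Real pages often put attributes on the title tag or break the title across lines, which a plain `<title>(.*?)</title>` pattern does not match. The following is a slightly more tolerant sketch; the helper name extract_title and the sample page are hypothetical, and it still assumes a single, well-formed title element.

```python
import re

def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    # [^>]* tolerates attributes on the opening tag, re.S lets "." match
    # newlines, and re.I makes the tag name case-insensitive.
    m = re.search(r'<title\b[^>]*>(.*?)</title>', html, re.I | re.S)
    return m.group(1).strip() if m else None

# Hypothetical page with an upper-case tag, an attribute, and line breaks.
page = '<HTML><HEAD><TITLE lang="en">\n  Python Tutorial\n</TITLE></HEAD></HTML>'
print(extract_title(page))              # Python Tutorial
print(extract_title('<p>no head</p>'))  # None
```

Returning None rather than a placeholder string leaves it to the caller to decide how a missing title should be reported.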
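Extracting all links
# ========================================
# Example: extract href values
# ========================================
import re

html = '<a href="/page1">A</a><a href="https://x.com">B</a>'
# Capture whatever sits between the quotes that follow href=
links = re.findall(r'href=["\'](.*?)["\']', html)
print(links)  # ['/page1', 'https://x.com']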
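The pattern above accepts any href attribute, including ones on tags other than anchors, and it lets a double quote be closed by a single quote. As a sketch of one possible tightening, the version below restricts matching to `<a>` tags, uses a backreference so the closing quote must equal the opening one, and also captures the link text. The sample markup and names are assumptions for illustration, and the pattern still assumes quoted attribute values and non-nested anchors.

```python
import re

# Hypothetical markup: two anchors with different quote styles, plus an
# <img> tag that should be ignored.
html = ('<a href="/page1">A</a>'
        '<a class="ext" href=\'https://x.com\'>B</a>'
        '<img src="x.png">')

# (["\']) captures the opening quote; \1 requires the same closing quote.
pattern = re.compile(
    r'<a\b[^>]*?\bhref=(["\'])(.*?)\1[^>]*>(.*?)</a>',
    re.I | re.S,
)

# Group 2 is the URL, group 3 is the link text.
links = [(m.group(2), m.group(3)) for m in pattern.finditer(html)]
print(links)  # [('/page1', 'A'), ('https://x.com', 'B')]
```

Once a pattern needs this many special cases, that is usually the signal to switch to a real parser such as BeautifulSoup.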