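Python has a rich ecosystem for network programming. From urllib in the standard library to the third-party requests library, and from synchronous to asynchronous code, it covers networking tasks from the simple to the complex. Whether you are scraping web pages, calling RESTful APIs, or building your own network services, Python is a strong choice.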
1. Using the requests Library in Depth
requests is one of the most popular HTTP libraries for Python and offers a clean, elegant API:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create a Session and configure the retry policy
session = requests.Session()
retry_strategy = Retry(
    total=3,  # retry at most 3 times
    backoff_factor=1,  # delays double on each retry; the first retry is typically immediate
    status_forcelist=[500, 502, 503, 504],  # status codes that trigger a retry
    # allowed_methods needs urllib3 >= 1.26; retry POST only if the endpoint is idempotent
    allowed_methods=["GET", "POST"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

# Set the timeout and request headers
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; CodingPlus/1.0)",
    "Accept": "application/json"
}
try:
    response = session.get(
        "https://api.example.com/data",
        headers=headers,
        timeout=(3.05, 10)  # (connect timeout, read timeout)
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    data = response.json()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
2. The urllib Standard Library
requests is friendlier, but urllib is part of the Python standard library and needs no third-party dependency:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
import json

# Build a POST request with a form-encoded body
params = urlencode({'key': 'value'}).encode()
req = Request(
    'https://api.example.com/post',
    data=params,
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
    method='POST'
)
with urlopen(req, timeout=10) as response:
    result = json.loads(response.read().decode('utf-8'))
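Many APIs expect a JSON body rather than form-encoded data. The following is a minimal sketch of how that could look with urllib alone; the helper name build_json_request and the sample payload are hypothetical, and the request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_json_request(url, payload):
    """Build (but do not send) a POST request carrying a JSON body.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


req = build_json_request("https://api.example.com/post", {"key": "value"})
print(req.get_method(), req.full_url)
```

To send it, pass req to urlopen(req, timeout=10). When handling failures, catch urllib.error.HTTPError before urllib.error.URLError, because HTTPError is a subclass of URLError.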
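3. Asynchronous Network Requests (aiohttp)
When you need to issue many requests at once (crawlers, batch API calls), asynchronous I/O can improve throughput substantially, because the program waits on all responses concurrently instead of one at a time: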
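import aiohttp
import asyncio

# aiohttp prefers a ClientTimeout object; a bare number is kept only for backward compatibility
REQUEST_TIMEOUT = aiohttp.ClientTimeout(total=10)

async def fetch_url(session, url):
    try:
        async with session.get(url, timeout=REQUEST_TIMEOUT) as response:
            return await response.json()
    except (aiohttp.ClientError, asyncio.TimeoutError, ValueError) as e:
        # Network errors, timeouts, and invalid JSON are reported per URL
        return {"error": str(e)}

async def main():
    urls = ["https://api.example.com/1", "https://api.example.com/2"]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        # All requests run concurrently, so total time is roughly that of the slowest request
        return results

if __name__ == '__main__':
    results = asyncio.run(main())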
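With a large URL list, launching every request at once can overwhelm the target server or exhaust local connections. One common way to cap the number of in-flight requests is an asyncio.Semaphore. The sketch below assumes hypothetical helper names (gather_limited, fetch_with_limit) and uses fake_fetch as a stand-in for a real aiohttp call so that it runs without network access.

```python
import asyncio


async def fetch_with_limit(semaphore, fetch, url):
    """Run fetch(url) while holding one slot in the semaphore."""
    async with semaphore:
        return await fetch(url)


async def gather_limited(fetch, urls, max_concurrency=5):
    """Run fetch over all urls with at most max_concurrency in flight."""
    # Created inside the running event loop so it binds to that loop
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_limit(semaphore, fetch, url) for url in urls]
    # gather returns results in the same order as the input
    return await asyncio.gather(*tasks)


async def fake_fetch(url):
    """Stand-in for a real HTTP call (assumption: no network is used here)."""
    await asyncio.sleep(0.01)
    return {"url": url}


if __name__ == "__main__":
    urls = [f"https://api.example.com/{i}" for i in range(10)]
    results = asyncio.run(gather_limited(fake_fetch, urls, max_concurrency=3))
    print(len(results))
```

In real use, fetch would be a small coroutine that closes over an aiohttp.ClientSession, for example lambda url: fetch_url(session, url).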
4. Data Parsing
JSON parsing:
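import json

# Read JSON from a file
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Walk deeply nested data; each key may be a dict key or a list index
def deep_get(d, keys, default=None):
    for key in keys:
        if isinstance(d, dict):
            if key not in d:
                return default
            d = d[key]
        elif isinstance(d, list) and isinstance(key, int) and -len(d) <= key < len(d):
            d = d[key]
        else:
            return default
    return d

value = deep_get(data, ['results', 0, 'name', 'full'], 'N/A')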
XML parsing:
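import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

# Find elements using the XPath subset that ElementTree supports
for item in root.findall('.//item[@type="article"]'):
    # findtext returns the default instead of raising when <title> is missing
    title = item.findtext('title', default='(untitled)')
    print(f"Article: {title}")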
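API responses usually arrive as strings rather than files, so the same query can be run on in-memory XML with ET.fromstring. This is a small self-contained sketch; the sample document and the helper name article_titles are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two articles (one without a title) and one video
SAMPLE_XML = """
<feed>
  <item type="article"><title>First post</title></item>
  <item type="video"><title>Clip</title></item>
  <item type="article"></item>
</feed>
"""


def article_titles(xml_text, missing="(untitled)"):
    """Return the titles of all <item type="article"> elements."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext("title", default=missing)
        for item in root.findall('.//item[@type="article"]')
    ]


print(article_titles(SAMPLE_XML))  # ['First post', '(untitled)']
```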
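5. API Call Best Practices
- Rate limiting: use time.sleep() or a token bucket algorithm to control request frequency
- Exponential backoff: on 429 (Too Many Requests) or 5xx errors, progressively increase the retry interval
- Cache responses: use functools.lru_cache or Redis to cache data that changes infrequently
- Pagination: when iterating over paginated API results, check the next or has_more field in the response
- Error handling: distinguish network errors (connection failures), HTTP errors (4xx/5xx), and application errors (error codes returned by the API)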
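Three of the practices above (rate limiting, exponential backoff, and pagination) can be sketched with the standard library alone. Everything named here is hypothetical: the TokenBucket class, the backoff_delay and iter_items helpers, the default values, and the "items" field are assumptions for illustration, and fetch_page stands for whatever function performs the actual HTTP call.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return True on success."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill according to elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=True):
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped at `cap`."""
    delay = min(cap, base * (2 ** attempt))
    # Random jitter keeps many clients from retrying in lockstep
    return random.uniform(0, delay) if jitter else delay


def iter_items(fetch_page, first_url):
    """Yield items from every page, following `next` until it is empty."""
    url = first_url
    while url:
        page = fetch_page(url)  # expected to return a dict
        yield from page.get("items", [])  # "items" is an assumed field name
        url = page.get("next")


bucket = TokenBucket(rate=2, capacity=2)
print([bucket.try_acquire() for _ in range(3)])
print([backoff_delay(i, jitter=False) for i in range(4)])
```

A caller would check bucket.try_acquire() before each request and sleep briefly when it returns False, and would sleep for backoff_delay(attempt) after a 429 or 5xx response before trying again.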
6. In Practice: A Simple Web Scraper
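import requests
from bs4 import BeautifulSoup

# Defined here so the snippet runs on its own (same User-Agent as in section 1)
headers = {"User-Agent": "Mozilla/5.0 (compatible; CodingPlus/1.0)"}

def scrape_articles(url):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []
    for article in soup.select('article.post'):
        link = article.select_one('h2 a')
        summary = article.select_one('p.summary')
        if link is None:
            continue  # skip entries without a title link
        articles.append({
            'title': link.get_text(strip=True),
            'link': link.get('href'),
            'summary': summary.get_text(strip=True) if summary else ''
        })
    return articles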
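The href values a scraper collects are often relative paths. A minimal sketch of resolving them against the page URL with urllib.parse.urljoin follows; the helper name absolutize_links and the sample data are hypothetical.

```python
from urllib.parse import urljoin


def absolutize_links(page_url, articles):
    """Return copies of the article dicts with 'link' resolved against page_url."""
    return [
        {**article, "link": urljoin(page_url, article["link"])}
        for article in articles
    ]


# Hypothetical scraped results: one relative link, one already absolute
sample = [
    {"title": "A", "link": "/posts/a"},
    {"title": "B", "link": "https://other.example.com/b"},
]
for entry in absolutize_links("https://blog.example.com/archive/", sample):
    print(entry["link"])
```

Links that are already absolute pass through unchanged, so the helper is safe to apply to every result.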
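7. Common Pitfalls
- Forgetting to set a timeout, which can leave a thread hanging indefinitely
- Not handling connection pool exhaustion, which causes requests to block
- Skipping SSL certificate verification (never set verify=False in production)
- Not detecting the response encoding, which leads to garbled text
- Crawling too fast and triggering an IP ban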
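The encoding pitfall can be illustrated without any HTTP library. The sketch below tries the charset declared in the Content-Type header first and then a short fallback list; the helper name decode_body and the choice of fallbacks (utf-8, then gb18030) are assumptions, not a general-purpose detector.

```python
import re


def decode_body(raw, content_type=None, fallbacks=("utf-8", "gb18030")):
    """Decode response bytes, trying the declared charset before the fallbacks.

    Hypothetical helper; the fallback list is an assumption to adjust per site.
    """
    candidates = []
    match = re.search(r'charset="?([\w-]+)', content_type or "", re.IGNORECASE)
    if match:
        candidates.append(match.group(1))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset; try the next candidate
    # Last resort: do not raise, but mark undecodable bytes visibly
    return raw.decode("utf-8", errors="replace")


body = "你好".encode("gbk")
print(decode_body(body, "text/html; charset=gbk"))
```

With requests, a common alternative is to set response.encoding = response.apparent_encoding before reading response.text, which lets the library guess the encoding from the body.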