How can Python get past Investing.com's anti-scraping mechanisms to fetch news data?

Source: IPIPP.com | Author: 陈平安 | Title: Full-stack engineer


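Investing.com is a well-known financial information site, and it applies strict anti-scraping restrictions to its news pages. Beginners who scrape it with bare requests are easily blocked. The sections below walk through the corresponding workarounds step by step.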

Common anti-scraping mechanisms on Investing.com

Before writing any code, we need to understand the target site's anti-scraping rules so that each one can be handled specifically:

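Request header validation: the site checks fields such as User-Agent and Referer; requests with missing or mismatched values are rejected outright.
Rate limiting: sending many requests from the same IP in a short period triggers throttling, and the site returns 403 or a CAPTCHA page.
Dynamic parameter validation: some endpoints carry a dynamically generated token or signature parameter, so requests with fixed parameters fail.
Cookie validation: some pages only return their content when the request carries a valid login state or session cookie.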

Basic request disguise

Start with the most basic problem, header validation. We can use the requests library and send the same headers a real browser would:

import requests

# Mimic the request headers of a desktop Chrome browser.
# HTTP header names use hyphens ("User-Agent"), not underscores.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.investing.com/news/",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Send the request; the timeout keeps the script from hanging indefinitely
url = "https://www.investing.com/news/stock-market-news"
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
print(response.text[:500])  # Print the first 500 characters to inspect the response
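
The manual check above (status code plus the first 500 characters) can also be done in code. The following is a minimal sketch, not a description of how the site actually responds: the function name `looks_blocked` and the marker strings are assumptions chosen for illustration, and a real block page may use different wording.

```python
# Hypothetical marker strings; inspect a real blocked response and adjust.
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response looks like a block or CAPTCHA page."""
    # 403 is the status mentioned for throttled requests; 429 is the
    # standard "Too Many Requests" code and is included as an assumption.
    if status_code in (403, 429):
        return True
    # Only the start of the body is scanned, mirroring response.text[:500].
    head = body[:500].lower()
    return any(marker in head for marker in BLOCK_MARKERS)


print(looks_blocked(200, "<html><title>Stock Market News</title></html>"))
print(looks_blocked(403, ""))
print(looks_blocked(200, "<html>Please complete the CAPTCHA</html>"))
```

With requests, it would be called as `looks_blocked(response.status_code, response.text)`.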

Handling rate limits

For larger crawls, throttle the request rate. You can also route requests through a proxy IP pool to avoid having your IP banned:

import requests
import time
import random

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.investing.com/news/",
}

# Simple proxy example; in practice, replace it with a paid proxy pool
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

news_urls = [
    "https://www.investing.com/news/stock-market-news",
    "https://www.investing.com/news/economic-indicators",
    "https://www.investing.com/news/commodities-news",
]

for url in news_urls:
    try:
        # Sleep a random 1-3 seconds to mimic the gaps between human visits
        time.sleep(random.uniform(1, 3))
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        if response.status_code == 200:
            print(f"{url} request succeeded")
        else:
            print(f"{url} request failed, status code: {response.status_code}")
    except requests.RequestException as e:
        # Covers connection errors, timeouts, and proxy failures
        print(f"Request to {url} raised an exception: {e}")
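
A fixed random sleep does not react when the site starts throttling. One common complement is to retry with exponential backoff. The sketch below is an illustration under stated assumptions, not part of the original approach: `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` is any callable returning a `(status_code, body)` pair, so the demo can run against a fake fetcher without touching the network.

```python
import random
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns status 200 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay * 2^attempt, plus up to 1 s of jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    return None


# Demo with a fake fetcher (no network): two 403 responses, then success.
responses = iter([(403, ""), (403, ""), (200, "<html>news</html>")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result)
print(len(waits))
```

In real use, `fetch` would wrap `requests.get(...)` and return `(response.status_code, response.text)`; the injected `sleep` parameter exists only so the waiting can be observed in a test.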

Handling dynamic parameters and cookies

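When dynamic parameters are required, send a preliminary request first to obtain the necessary cookies and parameters, then build the follow-up request from them: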

import requests

# A Session keeps cookies between requests
session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.investing.com/",
}

# Visit the home page first to obtain the base cookies
session.get("https://www.investing.com/", headers=headers, timeout=10)

# Then request the news page; the session automatically sends the cookies obtained above
news_url = "https://www.investing.com/news/stock-market-news"
response = session.get(news_url, headers=headers, timeout=10)
print(response.status_code)
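
The session example above covers cookies but not the dynamic parameter itself. If a token happens to be embedded in the preliminary page as a hidden form field, it could be pulled out as sketched below. This is only an illustration: the field name `csrf_token`, the sample HTML, and the helper names are all hypothetical, and the source does not say where or how Investing.com embeds such values.

```python
from html.parser import HTMLParser
from typing import Optional


class HiddenInputFinder(HTMLParser):
    """Collect the value of the first <input> with a given name."""

    def __init__(self, field_name: str) -> None:
        super().__init__()
        self.field_name = field_name
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Stop looking once a value has been found
        if tag != "input" or self.value is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == self.field_name:
            self.value = attr_map.get("value")


def extract_token(html: str, field_name: str) -> Optional[str]:
    """Return the field's value, or None if the field is absent."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


# Hypothetical page fragment; the real site may embed its parameters differently.
sample_html = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
token = extract_token(sample_html, "csrf_token")
print(token)
```

The extracted value would then be attached to the follow-up request, for example through the `params` or `headers` argument of `session.get`, depending on what the endpoint expects.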

Compliance notes

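Before scraping, always check the site's robots.txt, keep the request rate low so you do not put excessive load on the server, and use the scraped data only for personal study and analysis, not for commercial purposes, to avoid legal risk.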
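
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt so it runs offline; the rules, the `/private/` path, and the `my-crawler` user agent are assumptions for illustration and do not reflect Investing.com's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
sample_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# can_fetch() matches on the URL path; "my-crawler" is a placeholder user agent
allowed = parser.can_fetch("my-crawler", "https://www.investing.com/news/stock-market-news")
blocked = parser.can_fetch("my-crawler", "https://www.investing.com/private/page")
delay = parser.crawl_delay("my-crawler")
print(allowed, blocked, delay)
```

Against the live site, `parser.set_url("https://www.investing.com/robots.txt")` followed by `parser.read()` would replace the `parse()` call, and any returned crawl delay could feed the sleep interval between requests.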

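For more complex anti-scraping mechanisms, such as content rendered dynamically by JavaScript, you can use a tool like selenium or playwright to drive a real browser, execute the JS, and read the rendered page. This approach is comparatively slow, so choose based on your actual needs.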
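
As a rough sketch of the browser-based route, the function below uses playwright's synchronous API to load a page in headless Chromium and return the rendered HTML. It assumes playwright is installed (`pip install playwright`, then `playwright install chromium`); the function name `fetch_rendered_html` and the timeout value are illustrative choices, and whether this is enough to get past the site's checks is not something the source confirms.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30000) -> str:
    """Load url in headless Chromium and return the rendered HTML."""
    # Imported inside the function so this module still loads
    # on machines where playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until the DOM is parsed; content injected later by JS
            # may need an extra page.wait_for_selector(...) call.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.content()
        finally:
            # Always release the browser process, even if goto() fails
            browser.close()
```

Calling `fetch_rendered_html("https://www.investing.com/news/stock-market-news")` would start a full browser for each call, which is where the lower efficiency mentioned above comes from; reusing one browser across several pages is the usual way to reduce that cost.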

Tags: Python, anti-scraping, data scraping, requests, User-Agent
Last modified: 2026-05-28 20:56:58
