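Some websites sit behind a web application firewall, which helps protect the site against DDoS attacks, malicious traffic, crawler abuse, and similar threats.

Cloudflare WAF and the "five-second shield" (五秒盾) are both kinds of web application firewall (WAF). Cloudflare WAF is a network security service provided by Cloudflare; the five-second shield is a web application firewall solution from Alibaba Cloud, offered as an add-on to Alibaba Cloud CDN. Their technical architectures differ, but the workarounds are similar.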

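If a site shows a page like this when you open it:

Checking your browser before accessing website.com.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds…

then the site is probably behind this kind of firewall, and the browser has to pass an environment check first. Tackling the check head-on, through the form and the JavaScript detection, means setting breakpoints, working through the obfuscation, and decrypting the scripts.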
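Before choosing a workaround, it helps to recognise the interstitial automatically. The sketch below is only a heuristic: it looks for the phrases quoted from the page above, and it assumes (this is not stated in the post) that the challenge is served with a 403 or 503 status. The function name is hypothetical.

```python
# Marker phrases copied from the interstitial page quoted above.
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "Please allow up to 5 seconds",
)


def looks_like_challenge_page(status_code: int, html: str) -> bool:
    """Heuristic: True if a response resembles the browser-check page."""
    # Assumption: the challenge page comes back with a 403 or 503 status.
    if status_code not in (403, 503):
        return False
    # Any one of the marker phrases is treated as enough evidence.
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

A scraper could call this on every response and switch to one of the methods below only when it returns True.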

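Here are a few simpler ways to deal with this anti-scraping measure:

  1. The Python library cloudscraper
  2. selenium + undetected_chromedriver, to simulate a real browser
  3. Third-party services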

cloudscraper

cloudscraper is an open-source solution that can get past the free tier of Cloudflare and supports plugging in CAPTCHA solvers.
GitHub: VeNoMouS/cloudscraper (A Python module to bypass Cloudflare's anti-bot page)

pip install cloudscraper
import requests
import cloudscraper

url = "https://www.thisisfresh.com/london/spring-mews/flexfloorplancomparision/0/I"
scraper = cloudscraper.create_scraper()
html = scraper.get(url).text
print(html)

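For some sites, requests made through the cloudscraper module still come back as 403, with the exception cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
cloudscraper has detected a Cloudflare v2 CAPTCHA, which the free open-source version does not support. Handling it requires the author's paid version; details are available through the author's Discord.
There are two common ways around this:

  1. Use undetected_chromedriver directly, although it is cumbersome and memory-hungry.
  2. Use FlareSolverr, which wraps undetected_chromedriver and works as a proxy, returning the cookies and the raw page data of the normally loaded page. It is simple to use and effective. It is an open-source project, so the source on GitHub is there for anyone who wants to study it.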

FlareSolverr

GitHub: FlareSolverr/FlareSolverr (Proxy server to bypass Cloudflare protection)
This section draws on an online FlareSolverr article:
FlareSolverr Tutorial: Scrape Cloudflare Sites - ZenRows

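FlareSolverr is a proxy server for getting past Cloudflare's site protection.
While idle it uses very few resources and simply waits for requests. When a request arrives, it uses Selenium with undetected-chromedriver to launch a Chrome browser and opens the URL with the parameters supplied. FlareSolverr waits until the Cloudflare challenge is solved or the timeout expires, then sends the HTML and the cookies back to the caller. Those cookies can be reused in other HTTP clients to get past Cloudflare. The browser does consume a lot of memory, though: every request starts a new browser, so on a machine with little RAM avoid issuing many requests at once. FlareSolverr also supports persistent sessions, but a session should be closed promptly once it is no longer needed.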
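Because each request starts its own browser, one way to keep memory in check is to cap how many requests are in flight at once. The sketch below uses a thread pool for that. The limit of 2 is an arbitrary assumption, and fetch_one is a hypothetical callable that sends a single request.get command to FlareSolverr and returns the HTML.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Assumption: two simultaneous browsers is a safe ceiling for a small machine;
# tune this to the memory actually available.
MAX_CONCURRENT_BROWSERS = 2


def fetch_all(urls: Iterable[str],
              fetch_one: Callable[[str], str],
              max_workers: int = MAX_CONCURRENT_BROWSERS) -> List[str]:
    """Run fetch_one over urls with at most max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(fetch_one, urls))
```

The pool size is the only knob: raising it trades memory for throughput.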

Installing with Docker is recommended. You can create the container from the docker-compose.yml file in the GitHub repository, or run the docker CLI command directly:

docker run -d \
--name=flaresolverr \
-p 8191:8191 \
-e LOG_LEVEL=info \
--restart unless-stopped \
ghcr.io/flaresolverr/flaresolverr:latest

Once the container is running, open the port in your firewall. Visiting ip:port directly returns:

{"msg": "FlareSolverr is ready!", "version": "3.1.2", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
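The readiness message above can also be checked from code before any scraping starts. This is a minimal sketch that only parses the response body; how the body is fetched is left to the caller, and the function name is hypothetical.

```python
import json


def is_flaresolverr_ready(body: str) -> bool:
    """Return True if the index response reports that FlareSolverr is ready."""
    try:
        info = json.loads(body)
    except json.JSONDecodeError:
        # Anything that is not JSON (for example a proxy error page) is not ready.
        return False
    return isinstance(info, dict) and info.get("msg") == "FlareSolverr is ready!"
```

The check compares against the exact msg string shown in the sample response, so it would need adjusting if a different FlareSolverr version words the message differently.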

GET requests

import requests

api_url = "http://localhost:8191/v1"
headers = {"Content-Type": "application/json"}

data = {
    "cmd": "request.get",
    "url": "https://nowsecure.nl/",
    "maxTimeout": 60000
}

response = requests.post(api_url, headers=headers, json=data)
# The API exposed by this Docker image returns JSON; the page source is in .solution.response
print(response.content)
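Since the page source is nested inside the JSON reply, a small helper can pull it out and surface failures. The sketch below relies only on fields that appear in this post (status, message, solution.response); the function name is hypothetical.

```python
from typing import Any, Dict


def extract_html(result: Dict[str, Any]) -> str:
    """Return the page source from a decoded FlareSolverr reply.

    Raises RuntimeError when FlareSolverr reports a failure.
    """
    if result.get("status") != "ok":
        # On failure the reply carries a human-readable "message" field.
        raise RuntimeError(result.get("message", "FlareSolverr request failed"))
    return result["solution"]["response"]
```

With the example above, extract_html(response.json()) would return the HTML instead of the raw JSON bytes.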

Getting cookies from FlareSolverr and sending a new request with them

import requests
import json

url = "https://nowsecure.nl/"
api_url = "http://localhost:8191/v1"
headers = {"Content-Type": "application/json"}

data = {
"cmd": "request.get",
"url": url,
"maxTimeout": 60000
}

response = requests.post(api_url, headers=headers, json=data)

# retrieve the entire JSON response from FlareSolverr
response_data = json.loads(response.content)

# Extract the cookies from the FlareSolverr response
cookies = response_data["solution"]["cookies"]

# Clean the cookies
cookies = {cookie["name"]: cookie["value"] for cookie in cookies}

# Extract the user agent from the FlareSolverr response
user_agent = response_data["solution"]["userAgent"]

response = requests.get(url, cookies=cookies, headers={"User-Agent": user_agent})
print(response.content)
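The same cookies and User-Agent can be handed to HTTP clients other than requests. As a sketch, the helper below builds a standard-library urllib request from the solution object; the function name is hypothetical, and it assumes the solution has the cookies and userAgent fields used above.

```python
from typing import Any, Dict
from urllib.request import Request


def build_followup_request(url: str, solution: Dict[str, Any]) -> Request:
    """Build a urllib Request that reuses FlareSolverr's cookies and User-Agent."""
    # Join the name/value pairs into a single Cookie header value.
    cookie_header = "; ".join(
        f"{cookie['name']}={cookie['value']}" for cookie in solution["cookies"]
    )
    return Request(url, headers={
        "Cookie": cookie_header,
        # Send the same User-Agent that FlareSolverr's browser used.
        "User-Agent": solution["userAgent"],
    })
```

The returned object can be passed to urllib.request.urlopen; no network call happens until then.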

POST requests

postData must be an application/x-www-form-urlencoded string, for example a=b&c=d

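import requests

api_url = 'http://localhost:8191/v1'
headers = {'Content-Type': 'application/json'}

# Form body as an application/x-www-form-urlencoded string
POST_DATA = "a=b&c=d"

data = {
    "cmd": "request.post",
    "url": "https://www.example.com/POST",
    "postData": POST_DATA,
    "maxTimeout": 60000
}

response = requests.post(api_url, headers=headers, json=data)

# requests.Response has no .data attribute; use .text (or .json())
print(response.text)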
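The postData string does not have to be written by hand. A minimal sketch using the standard library's urllib.parse.urlencode, with made-up form fields:

```python
from typing import Mapping
from urllib.parse import urlencode


def build_post_data(fields: Mapping[str, str]) -> str:
    """Encode form fields as an application/x-www-form-urlencoded string."""
    # urlencode also escapes special characters, e.g. spaces become "+".
    return urlencode(fields)


print(build_post_data({"a": "b", "c": "d"}))  # a=b&c=d
```

Using urlencode rather than string concatenation keeps values containing characters such as & or = from corrupting the body.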

Creating a session

Reference: FlareSolverr/README.md at master · FlareSolverr/FlareSolverr · GitHub

sessions.create is the FlareSolverr command that creates a new session. To use it, send an HTTP POST request to the same API endpoint as every other command, http://<flareSolverr_host>:<flareSolverr_port>/v1, with cmd set to sessions.create.

The request body is JSON, for example:

{
    "cmd": "sessions.create",
    "session": "test_session",
    "session_ttl_minutes": 5
}

The body parameters are:

  • cmd: the command to run, here sessions.create.
  • session: the ID to assign to the new session (optional; if omitted, FlareSolverr generates one).
  • session_ttl_minutes: the session lifetime in minutes; once it has passed, FlareSolverr deletes the session.

Options such as maxTimeout and cookies are sent with the request.get and request.post commands that use the session, not with sessions.create.

Adjust the parameters to your needs and send the POST request to create a new FlareSolverr session. On success, the response contains the session ID, which can then be passed as session in later commands such as request.get, request.post, and sessions.destroy.
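Since a session keeps a browser alive, it is worth making sure it is always destroyed, even when the scraping code raises. The context manager below is a sketch of that idea. It uses the sessions.create and sessions.destroy commands from the wrapper class in this post; send is a hypothetical callable that posts a payload to the /v1 endpoint and returns the decoded JSON.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: takes a command payload, returns the decoded reply.
Send = Callable[[Dict[str, Any]], Dict[str, Any]]


@contextmanager
def flaresolverr_session(send: Send, session_id: str) -> Iterator[str]:
    """Create a session, yield its ID, and always destroy it afterwards."""
    send({"cmd": "sessions.create", "session": session_id})
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions, so the browser is released.
        send({"cmd": "sessions.destroy", "session": session_id})
```

Inside a with flaresolverr_session(send, "test_session") as sid: block, each request.get payload would carry "session": sid.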

My FlareSolverr wrapper

# -*- coding: utf-8 -*-
# @Time : 2023/5/9 19:26
# @Author : flyrr
# @File : common/CloudflareSolver.py
# @IDE : pycharm
import json
import requests


class FlareSolverr:
    """Returns the solved response from a Cloudflare challenge."""

    def __init__(self):
        # Replace <ip-address> with the host that runs FlareSolverr
        self.flaresolverr_url = "http://<ip-address>:8191/v1"
        self.headers = {
            'Content-Type': 'application/json'
        }

    def get_cookies(self, url):
        """
        Only retrieves cookies; a new request sent with these cookies
        still fails the environment check.
        """
        payload = json.dumps({
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000
        })
        response = requests.post(self.flaresolverr_url, headers=self.headers, data=payload)

        cookies = response.json()['solution']['cookies']
        # Clean the cookies
        cookies = {cookie["name"]: cookie["value"] for cookie in cookies}

        # Extract the user agent from the FlareSolverr response
        user_agent = response.json()["solution"]["userAgent"]

        # Request the same URL again with the solved cookies and user agent
        return requests.get(url, cookies=cookies, headers={"User-Agent": user_agent})

    def check_session_list(self):
        payload = {
            "cmd": "sessions.list"
        }

        response = requests.post(self.flaresolverr_url, headers=self.headers, json=payload)
        print(f"=====================\n checkSessionList: {response.text} \n=====================")

    def create_session(self, session_id=None, ttl_min=5):
        payload = {
            "cmd": "sessions.create",
            "session": session_id,
            "session_ttl_minutes": int(ttl_min),
            # "proxy": {}
        }

        response = requests.post(self.flaresolverr_url, headers=self.headers, json=payload)
        print(f"=====================\n createSession: {response.text} \n=====================")

    def destroy_session(self, session_id):
        payload = {
            "cmd": "sessions.destroy",
            "session": session_id
        }

        response = requests.post(self.flaresolverr_url, headers=self.headers, json=payload)
        print(f"=====================\n destroySession: {response.text} \n=====================")

    def test_session(self, session_id):
        payload = json.dumps({
            "cmd": "request.get",
            # "url": "https://nowsecure.nl/",
            "url": "https://www.thisisfresh.com/london/glassyard-building/flexfloorplancomparision/0/I",
            "maxTimeout": 60000,
            "session": session_id
        })
        response = requests.post(self.flaresolverr_url, headers=self.headers, data=payload)
        # print(response.text)
        if response.json()['status'] == 'ok':
            print("Challenge Solved!")
            # The API exposed by this Docker image returns JSON;
            # the page source is in .solution.response
            return response.json()['solution']['response']

    def session_get(self, session_id, url):
        payload = json.dumps({
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000,
            "session": session_id
        })
        response = requests.post(self.flaresolverr_url, headers=self.headers, data=payload)
        # print(response.text)
        return response.json()['solution']['response']

    def get(self, url):
        payload = {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000
        }

        response = requests.post(self.flaresolverr_url, headers=self.headers, json=payload)
        # The API exposed by this Docker image returns JSON;
        # the page source is in .solution.response
        print(response.json()['solution']['response'])
        return response.json()['solution']['response']

    def post(self, url, post_data):
        """Untested.
        `postData` must be an `application/x-www-form-urlencoded` string, e.g. `a=b&c=d`
        """
        payload = {
            "cmd": "request.post",
            "url": url,
            "postData": post_data,
            "maxTimeout": 60000
        }

        response = requests.post(self.flaresolverr_url, headers=self.headers, json=payload)
        print(response.json())
        return response.json()['solution']['response']


if __name__ == '__main__':
    # Instantiate the class
    fs = FlareSolverr()
    # Create, test, and destroy a session
    fs.create_session('test_session', 5)
    fs.check_session_list()
    fs.test_session('test_session')
    fs.destroy_session(session_id='test_session')
    fs.check_session_list()

    # Without a session
    # res = fs.get("https://www.thisisfresh.com/london/glassyard-building/flexfloorplancomparision/0/I")
    # post_res = fs.post("", post_data='')

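Cookies returned by FlareSolverr sometimes do not work when reused. This happens when FlareSolverr (running in Docker) and the client that reuses the cookies reach the site from different IP addresses, which makes the cookies mismatch. In other words, it happens when the two run on different networks.

This is common when a proxy or VPN is in use, because FlareSolverr does not currently support them. To fix it, try disabling the proxy or VPN. If that is not possible, see the related issue.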

Third-party services

Advanced Cloudflare protection that FlareSolverr cannot solve:

import requests

url = "http://localhost:8191/v1"
headers = {"Content-Type": "application/json"}

data = {
"cmd": "request.get",
"url": "https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.11,17.htm",
"maxTimeout": 60000
}

response = requests.post (url, headers=headers, json=data)

print(response.content)

Response:

b'{"status": "error", "message": "Error: Error solving the challenge. Timeout after 60.0 seconds.", "startTimestamp": 1681908319571, "endTimestamp": 1681908380332, "version": "3.1.2"}'
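When FlareSolverr gives up, as in the timeout above, the caller can move on to the next method. The sketch below tries a list of fetchers in order. Each fetcher is a hypothetical callable that takes a URL and either returns HTML or raises, for example one wrapping FlareSolverr and one wrapping a third-party API.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: a fetcher returns the page HTML or raises on failure.
Fetcher = Callable[[str], str]


def fetch_with_fallback(url: str, fetchers: Sequence[Fetcher]) -> str:
    """Try each fetcher in order and return the first successful result."""
    errors: List[str] = []
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as exc:
            # Record why this method failed, then try the next one.
            errors.append(f"{getattr(fetch, '__name__', repr(fetch))}: {exc}")
    raise RuntimeError("all fetchers failed: " + "; ".join(errors))
```

Ordering the list from cheapest to most expensive means a paid service is only called when the free methods have failed.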

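In that case, a third-party service is worth trying:

ScrapingAnt: the free plan includes 10,000 credits per month, and each request costs 10 credits. ScrapingAnt - Web Scraping API | Proxy API

ZenRows (paid): sign up to get an API key; trial credits are included.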

pip install zenrows
from zenrows import ZenRowsClient

# Create a new ZenRowsClient instance
client = ZenRowsClient("Your_API_Key")

url = "https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.11,17.htm"
# Define the necessary parameters
params = {"js_render": "true", "antibot": "true", "premium_proxy": "true"}

# Make a GET request
response = client.get(url, params=params)

print(response.text)