Several Ways to Quickly Improve Crawler Performance

達坂城大豆 2018-01-29


Author: 孟慶健
Source: http://www.cnblogs.com/mengqingjian/p/8329651.html

I. Background

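At its core, a crawler is a socket client communicating with a server. If we have several URLs to crawl and use a single thread that handles them serially, each download must finish before the next one can start, which is very inefficient. Note, however, that running N tasks serially in one thread is not inherently inefficient: if all N tasks were pure computation, the thread would still keep the CPU busy. Serial crawling in a single thread is slow because crawling is clearly an IO-bound workload, so the thread spends most of its time waiting on the network.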
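The difference between the two kinds of task can be made visible by comparing wall-clock time with CPU time. The sketch below is only an illustration: io_task and cpu_task are hypothetical stand-ins (time.sleep plays the part of waiting on a socket), and the exact numbers depend on the machine.

```python
import time

def io_task():
    # Stand-in for a network request: the thread waits and uses almost no CPU.
    time.sleep(0.05)

def cpu_task():
    # Pure computation: the thread keeps the CPU busy the whole time.
    sum(i * i for i in range(200000))

def measure(task, repeat=3):
    """Run task serially `repeat` times; return (wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(repeat):
        task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

io_wall, io_cpu = measure(io_task)
cpu_wall, cpu_cpu = measure(cpu_task)
print('IO-bound : wall %.3fs, cpu %.3fs' % (io_wall, io_cpu))
print('CPU-bound: wall %.3fs, cpu %.3fs' % (cpu_wall, cpu_cpu))
```

For the IO-bound task the CPU time should be a small fraction of the wall-clock time, which is the idle time a crawler can put to use.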

For details on IO models, see: http://www.cnblogs.com/linhaifeng/articles/7454717.html

So how can crawl performance be improved? Consider the following concepts.

II. Synchronous Calls, Asynchronous Calls, and Callbacks

1. Synchronous call: after submitting a task, the caller waits in place until the task finishes, and only moves on to the next line of code once it has the result. This is inefficient.

import requests

def parse_page(res):
    print('parsed %s' % len(res))

def get_page(url):
    print('downloading %s' % url)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.']
for url in urls:
    res = get_page(url)  # blocks here until the task finishes and returns its result
    if res is not None:  # get_page returns None for a non-200 response
        parse_page(res)

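2. A simple solution: multithreading or multiprocessing

Use multiple threads (or processes), the same technique servers use to handle many connections.

The goal is to give every connection its own thread (or process), so that one blocked connection does not hold up the others.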

# IO-bound programs should use threads
import requests
from threading import Thread, current_thread

def parse_page(res):
    print('%s parsed %s' % (current_thread().name, len(res)))

def get_page(url, callback=parse_page):
    print('%s downloading %s' % (current_thread().name, url))
    response = requests.get(url)
    if response.status_code == 200:
        callback(response.text)

if __name__ == '__main__':
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.']
    for url in urls:
        t = Thread(target=get_page, args=(url,))
        t.start()

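The problem with this approach:

We cannot start an unlimited number of processes or threads. When hundreds or thousands of connections must be handled at the same time, threads and processes alike consume a large share of system resources and make the system less responsive, and the threads and processes themselves become more likely to hang.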

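3. An improved approach:

Thread pool or process pool plus asynchronous calls: after submitting a task, the caller does not wait for it to finish and simply continues with the next line of code.

Many programmers would consider a 'thread pool' or a 'connection pool'. A thread pool reduces how often threads are created and destroyed: it keeps a reasonable number of threads alive and hands new tasks to idle ones. A connection pool keeps a cache of open connections, reusing existing connections where possible and reducing how often connections are opened and closed. Both techniques cut system overhead considerably and are widely used in large systems such as WebSphere, Tomcat, and various databases.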

# IO-bound programs should use threads, so here we use a thread pool
import requests
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def parse_page(future):
    res = future.result()  # the callback receives the finished Future, not the text
    if res is None:  # get_page returns None for a non-200 response
        return
    print('%s parsed %s' % (current_thread().name, len(res)))

def get_page(url):
    print('%s downloading %s' % (current_thread().name, url))
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

if __name__ == '__main__':
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.']
    pool = ThreadPoolExecutor(50)
    # pool = ProcessPoolExecutor(50)
    for url in urls:
        pool.submit(get_page, url).add_done_callback(parse_page)
    pool.shutdown(wait=True)

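The improved approach still has problems:

Thread pools and connection pools only partly relieve the resource usage caused by frequent IO calls. Moreover, a pool always has an upper limit; when the number of requests far exceeds it, a pooled system responds to the outside world little better than one with no pool. A pool must therefore be sized for the load it is expected to face, and resized as that load changes.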
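The upper limit is easy to observe. In the following sketch, fake_download, POOL_SIZE, and the example.com URLs are all hypothetical; time.sleep stands in for the network wait, and a lock-protected counter records how many downloads are ever in flight at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2  # assumed pool size, deliberately smaller than the number of URLs
lock = threading.Lock()
running = 0  # downloads currently in flight
peak = 0     # highest value `running` ever reached

def fake_download(url):
    """Pretend to download `url` while tracking concurrency."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.02)  # stand-in for waiting on the network
    with lock:
        running -= 1
    return url

urls = ['http://example.com/%d' % i for i in range(6)]  # hypothetical URLs
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = list(pool.map(fake_download, urls))
print('peak concurrent downloads:', peak)
```

With six tasks and two workers, the remaining tasks queue up behind the pool, so total time grows with the ratio of requests to pool size.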

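For the thousands or even tens of thousands of simultaneous requests that the example above might face, a thread pool or connection pool may relieve some of the pressure, but it cannot solve every problem.

In short, the multithreaded model handles small-scale workloads conveniently and efficiently, but it hits a bottleneck at large scale. Non-blocking interfaces are one way to attack that problem.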

III. High Performance

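None of the approaches above addresses one performance problem: blocking IO. Whether we use processes or threads, whenever one of them blocks on IO the operating system takes the CPU away from it, and the program's efficiency drops accordingly.

The key is to detect blocking IO ourselves, at the application level, and switch to another task within our own program. This keeps the time our program spends blocked on IO to a minimum, so it is in the ready state more often. From the operating system's point of view the program then looks like one that does little IO, so it is given as much CPU time as possible, and the program runs more efficiently.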
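One way to picture application-level IO detection is with the standard selectors module, which asks the operating system which sockets are ready instead of blocking on any single one. This is only a sketch under stated assumptions: socket.socketpair() stands in for real network connections, and run_event_loop and the task names are hypothetical.

```python
import selectors
import socket

def run_event_loop(pairs):
    """Wait on several sockets at once and read from whichever is ready."""
    sel = selectors.DefaultSelector()
    for name, (reader, _writer) in pairs.items():
        reader.setblocking(False)  # a read must never block the whole loop
        sel.register(reader, selectors.EVENT_READ, data=name)
    received = {}
    while len(received) < len(pairs):
        # select() returns only the sockets that have data waiting
        for key, _events in sel.select(timeout=1):
            received[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
    sel.close()
    return received

# Local socket pairs stand in for three server connections.
pairs = {name: socket.socketpair() for name in ('task1', 'task2', 'task3')}
for name, (_reader, writer) in pairs.items():
    writer.sendall(('response for %s' % name).encode('utf-8'))

results = run_event_loop(pairs)
for reader, writer in pairs.values():
    reader.close()
    writer.close()
print(results)
```

A single thread services all three connections here; the modules discussed in this section build the same idea into a friendlier API.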

1. After Python 3.3 (that is, from 3.4 on), the standard library includes the asyncio module, which detects IO for us (network IO only) and switches between tasks at the application level.

import asyncio

# async def/await replaces @asyncio.coroutine/yield from, which Python 3.11 removed
async def task(task_id, seconds):
    print('%s is start' % task_id)
    await asyncio.sleep(seconds)  # only network IO is detected; on IO, switch to another task
    print('%s is end' % task_id)

async def main():
    await asyncio.gather(
        task(task_id='task 1', seconds=3),
        task('task 2', 2),
        task(task_id='task 3', seconds=1),
    )

asyncio.run(main())

2. However, asyncio itself only sends TCP-level requests and does not speak HTTP, so when we need to send an HTTP request we have to build the HTTP headers ourselves.

import asyncio
import uuid

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

def parse_page(host, res):
    print('%s parsed result %s' % (host, len(res)))
    with open('%s.html' % uuid.uuid1(), 'wb') as f:
        f.write(res)

async def get_page(host, port=80, url='/', callback=parse_page, ssl=False):
    print('downloading http://%s:%s%s' % (host, port, url))
    # Step 1 (blocking IO): open the TCP connection; this blocks, so it must be awaited
    if ssl:
        port = 443
    recv, send = await asyncio.open_connection(host=host, port=port, ssl=ssl)
    # Step 2: build the HTTP request; asyncio only sends TCP data, so we format the HTTP message ourselves
    request_headers = 'GET %s HTTP/1.0\r\nHost: %s\r\nUser-agent: %s\r\n\r\n' % (url, host, user_agent)
    # request_headers = 'POST %s HTTP/1.0\r\nHost: %s\r\n\r\nname=egon&password=123' % (url, host)
    request_headers = request_headers.encode('utf-8')
    # Step 3 (blocking IO): send the HTTP request
    send.write(request_headers)
    await send.drain()
    # Step 4 (blocking IO): read the response headers
    while True:
        line = await recv.readline()
        if line in (b'\r\n', b''):  # a blank line ends the headers; b'' means the peer closed early
            break
        print('%s Response headers: %s' % (host, line))
    # Step 5 (blocking IO): read the response body
    text = await recv.read()
    # Step 6: run the callback
    callback(host, text)
    # Step 7: close the socket
    send.close()  # there is no recv.close(): the connection is torn down by the four-way handshake, and once one end finishes sending and calls send.close(), the other end is closed passively

async def main():
    await asyncio.gather(
        get_page('www.baidu.com', url='/s?wd=美女', ssl=True),
        get_page('www.cnblogs.com', url='/', ssl=True),
    )

if __name__ == '__main__':
    asyncio.run(main())
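Step two above, building the request by hand, is the part most easily broken by a missing \r\n. The helper below isolates it as a rough sketch: build_get_request and its default user agent are hypothetical, and percent-encoding the path is an addition not present in the example above, included because a request line is normally ASCII-only.

```python
from urllib.parse import quote

def build_get_request(host, path='/', user_agent='example-crawler/0.1'):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    # Percent-encode non-ASCII characters but keep the URL delimiters intact.
    encoded_path = quote(path, safe='/?=&')
    lines = [
        'GET %s HTTP/1.0' % encoded_path,
        'Host: %s' % host,
        'User-Agent: %s' % user_agent,
        '',  # the empty line that terminates the header section
        '',
    ]
    return '\r\n'.join(lines).encode('ascii')

request = build_get_request('www.baidu.com', '/s?wd=美女')
print(request)
```

The resulting bytes could be passed to send.write() in place of the hand-formatted string.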

3. Building HTTP headers by hand is somewhat tedious, which is why the aiohttp module exists: it builds the HTTP messages for us, while asyncio still detects IO and switches between tasks.

import aiohttp
import asyncio

async def get_page(session, url):
    print('GET:%s' % url)
    async with session.get(url) as response:  # the response is released when the block exits
        data = await response.read()
    print(url, data)
    return 1

async def main():
    # current aiohttp versions issue requests through a ClientSession
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            get_page(session, 'https://www./doc'),
            get_page(session, 'https://www.cnblogs.com/linhaifeng'),
            get_page(session, 'https://www.'),
        )

results = asyncio.run(main())
print('=====>', results)  # [1, 1, 1]


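4. Alternatively, the requests.get function can be handed to asyncio, which runs it in an executor so that its blocking IO no longer stalls the event loop.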

import requests
import asyncio

async def get_page(func, *args):
    print('GET:%s' % args[0])
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, func, *args)  # None selects the loop's default thread pool
    response = await future
    print(response.url, len(response.text))
    return 1

async def main():
    return await asyncio.gather(
        get_page(requests.get, 'https://www./doc'),
        get_page(requests.get, 'https://www.cnblogs.com/linhaifeng'),
        get_page(requests.get, 'https://www.'),
    )

results = asyncio.run(main())
print('=====>', results)  # [1, 1, 1]
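The same pattern can be tried without any network access. In this sketch blocking_fetch is a hypothetical stand-in for requests.get (it just sleeps), and the example.com URLs are placeholders; the point is only that run_in_executor moves a blocking call onto a worker thread so the event loop stays free.

```python
import asyncio
import time

def blocking_fetch(url):
    """Hypothetical blocking call standing in for requests.get."""
    time.sleep(0.05)
    return 'body of %s' % url

async def get_page(url):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_fetch, url)

async def main(urls):
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(get_page(url) for url in urls))

urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
results = asyncio.run(main(urls))
print(results)
```

Note that the blocking calls still occupy threads, so this approach inherits the limits of a thread pool rather than removing them.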

5. There is also the gevent module, introduced earlier in the discussion of coroutines.

from gevent import monkey
monkey.patch_all()  # must run before requests is imported so its sockets are patched
import gevent
import requests

def get_page(url):
    print('GET:%s' % url)
    response = requests.get(url)
    print(url, len(response.text))
    return 1

# g1=gevent.spawn(get_page,'https://www./doc')
# g2=gevent.spawn(get_page,'https://www.cnblogs.com/linhaifeng')
# g3=gevent.spawn(get_page,'https://www.')
# gevent.joinall([g1,g2,g3,])
# print(g1.value,g2.value,g3.value)  # get the return values

# coroutine pool
from gevent.pool import Pool
pool = Pool(2)
g1 = pool.spawn(get_page, 'https://www./doc')
g2 = pool.spawn(get_page, 'https://www.cnblogs.com/linhaifeng')
g3 = pool.spawn(get_page, 'https://www.')
gevent.joinall([g1, g2, g3])
print(g1.value, g2.value, g3.value)  # get the return values

6. The grequests module, which wraps gevent + requests

# pip3 install grequests
import grequests

request_list = [
    grequests.get('https://wwww./doc1'),
    grequests.get('https://www.cnblogs.com/linhaifeng'),
    grequests.get('https://www.')
]

##### Run the requests and get the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

##### Run the requests and get the list of responses (with exception handling) #####
def exception_handler(request, exception):
    # print(request, exception)
    print('%s Request failed' % request.url)

response_list = grequests.map(request_list, exception_handler=exception_handler)
print(response_list)

7. twisted: a networking framework; one of its features is sending asynchronous requests, detecting IO and switching automatically.

'''
# Problem 1: error: Microsoft Visual C++ 14.0 is required. Get it with 'Microsoft Visual C++ Build Tools': http://landinghub./visual-cpp-build-tools
https://www.lfd./~gohlke/pythonlibs/#twisted
pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install twisted
# Problem 2: ModuleNotFoundError: No module named 'win32api'
https:///projects/pywin32/files/pywin32/
# Problem 3: openssl
pip3 install pyopenssl
'''
# Basic twisted usage (getPage is deprecated in newer Twisted releases)
from twisted.web.client import getPage
from twisted.internet import defer, reactor

def all_done(arg):
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list = []
urls = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.',
]
for url in urls:
    obj = getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)
# stop the reactor once every request has either succeeded or failed
defer.DeferredList(defer_list).addBoth(all_done)
reactor.run()

# Detailed usage of twisted's getPage (run as a separate script: a reactor cannot be restarted)
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse

def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig./login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)
reactor.run()

8. tornado

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

def handle_response(response):
    '''
    Handle the response (a counter is needed to stop the IO loop by calling ioloop.IOLoop.current().stop())
    :param response:
    :return:
    '''
    if response.error:
        print('Error:', response.error)
    else:
        print(response.body)

def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        # passing a callback to fetch() requires Tornado < 6.0; later versions return a Future instead
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

The example above does not exit even after every task has finished. To fix that, let's add a counter.

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

count = 0  # number of requests still in flight

def handle_response(response):
    '''
    Handle the response (a counter is needed to stop the IO loop by calling ioloop.IOLoop.current().stop())
    :param response:
    :return:
    '''
    global count
    if response.error:
        print('Error:', response.error)
    else:
        print(len(response.body))
    count -= 1  # one callback finished, so decrement the counter
    if count == 0:
        ioloop.IOLoop.current().stop()

def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    global count
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        # passing a callback to fetch() requires Tornado < 6.0; later versions return a Future instead
        http_client.fetch(HTTPRequest(url), handle_response)
        count += 1  # one more request in flight

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
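The counter idea is not tied to tornado. As a rough sketch, the same pattern can be written against asyncio's event loop; fake_fetch, crawl_all, and the example.com URLs are hypothetical, and asyncio.sleep stands in for the network wait.

```python
import asyncio

async def fake_fetch(url):
    """Hypothetical stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(0.01)
    return url

def crawl_all(urls):
    """Run the loop forever and stop it once every callback has fired."""
    loop = asyncio.new_event_loop()
    pending = 0
    results = []

    def on_done(task):
        nonlocal pending
        results.append(task.result())
        pending -= 1  # one request finished
        if pending == 0:
            loop.stop()  # nothing left in flight, so end run_forever()

    for url in urls:
        task = loop.create_task(fake_fetch(url))
        task.add_done_callback(on_done)
        pending += 1  # one more request in flight
    loop.run_forever()
    loop.close()
    return results

urls = ['http://example.com/a', 'http://example.com/b']
results = crawl_all(urls)
print(results)
```

In everyday asyncio code, asyncio.gather() or asyncio.run() does this bookkeeping itself; the explicit counter is shown only to mirror the tornado example.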
