How Easy Is a Python Crawler? A Hands-On Guide to Scraping the Douban Movie Top 250

yaohbsg 2020-07-25


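Once you are familiar with Python's requests library and the re module, you can try building a simple crawler. We use Douban as the target because its page structure is fairly stable and a small crawl will not put much load on its servers. The goal is to scrape the title, cover image, and other details of every movie in the Douban Top 250.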

I. Page Analysis

1. Page overview

First, open the crawl target, the Douban Movie Top 250 list, in a browser at https://movie.douban.com/top250?start=225&filter=.

Checking the robots protocol (robots.txt) of the Douban Movie site shows that this page is not listed under Disallow, which indicates that the site does not restrict crawling it.
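The robots check can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content in SAMPLE_ROBOTS and the helper name is_allowed are made up for illustration and are not Douban's actual rules; for a real check you would load the site's live robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not Douban's real file).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /subject_search
Disallow: /search
"""


def is_allowed(robots_text, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# A path that no Disallow rule covers is allowed; a covered one is not.
print(is_allowed(SAMPLE_ROBOTS, "https://movie.douban.com/top250"))
print(is_allowed(SAMPLE_ROBOTS, "https://movie.douban.com/search?q=x"))
```

Parsing the rules from a string keeps the sketch runnable offline. RobotFileParser also offers set_url and read for fetching the file over the network.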

2. Match analysis

Next, press F12 to open Chrome DevTools. The entire entry for the first movie (The Shawshank Redemption) sits inside a <div> tag whose class is 'item', and every movie after it uses the same structure.


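We then go through the page source step by step to work out the match for each field. The div tag with class 'pic' holds the movie's rank and its image URL.

We can therefore use non-greedy matching in a regular expression to pick out the rank and image URL of each movie entry in turn. The match must be non-greedy, written (.*?), because the page source contains many movies and a greedy match would run past the current entry into later ones. In the expression below, the first (.*?) captures the rank and the second (.*?) captures the image URL.

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?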
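The difference between greedy and non-greedy matching is easy to see on a small input. The sketch below uses a made-up fragment (the html string is an assumption, not real Douban source) containing two rank tags.

```python
import re

# Hypothetical fragment holding the rank tags of two movie entries.
html = '<em class="">1</em><em class="">2</em>'

# Greedy: .* grabs as much as possible, swallowing both entries.
greedy = re.findall(r'<em class="">(.*)</em>', html)

# Non-greedy: .*? stops at the first closing tag, one match per entry.
lazy = re.findall(r'<em class="">(.*?)</em>', html)

print(greedy)
print(lazy)
```

The greedy pattern returns a single over-long match that spans both entries, while the non-greedy pattern returns the two ranks separately.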

By the same logic, the title and the alternative titles are in the tags with class 'title' and class 'other' respectively. Adding them to the expression above gives:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)

Next come the tag positions and the expression for the director and cast, and for the year, country, and genre:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?

Then come the rating and the number of ratings, with the expression extended accordingly:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?
class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?
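To see what each capture group picks up, the expression built so far can be run against a single entry. This is a sketch only: SAMPLE is a hand-written fragment that imitates the structure described above (its values and the example.com URL are invented), and the pattern is written with double quotes to match the sample's attribute quoting.

```python
import re

# Hypothetical single-movie fragment modelled on the structure described above.
SAMPLE = '''
<div class="item">
  <div class="pic">
    <em class="">1</em>
    <a href="#"><img width="100" alt="Example" src="https://example.com/p1.jpg" class=""></a>
  </div>
  <div class="info">
    <div class="hd">
      <span class="title">Example Movie</span>
      <span class="other">&nbsp;/&nbsp;Alt Title</span>
    </div>
    <div class="bd">
      <p class="">Director: A. Person<br>1994&nbsp;/&nbsp;US&nbsp;/&nbsp;Drama</p>
      <div class="star">
        <span class="rating5-t"></span>
        <span class="rating_num" property="v:average">9.7</span>
        <span>100 ratings</span>
      </div>
    </div>
  </div>
</div>
'''

# The expression up to the rating fields, split over adjacent string literals.
PATTERN = (
    r'<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?'
    r'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?'
    r'<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?'
    r'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?'
    r'average">(.*?)</span>.*?<span>(.*?)</span>'
)

# re.S lets "." match newlines, so one pattern can span the whole entry.
matches = re.findall(PATTERN, SAMPLE, re.S)
for field in matches[0]:
    print(repr(field))
```

With one entry in the input, findall returns one tuple of nine fields: rank, image URL, title, alternative titles, director line, year/country/genre line, star class, score, and rating count.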

Finally, we extract the one-line quote stored in the tag with class 'inq':

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?
class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?
span class="inq"?>(.*?)</span>

II. Writing the Crawler

1. Fetching the page

The match analysis above gives us the core of this article, the regular expression. Now we write the code that fetches the page. Import the libraries, store the Douban Top 250 address in the url variable, define a browser-style request header, and call the get method of requests to retrieve the page source.

import os
import re
import json

import requests

url = 'https://movie.douban.com/top250?start=0&filter='
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
response = requests.get(url, headers=headers)
text = response.text

2. Extracting the information

Next, store the regular expression above as a string and call the findall function of re to collect every substring that matches.

# The pattern is split across adjacent string literals for readability
regix = ('<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?'
         'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)'
         '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?'
         'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?'
         'span class="inq"?>(.*?)</span>')
# re.S lets "." match newlines, so one match can span a whole entry
res = re.findall(regix, text, re.S)
print(res)

The output shows that each movie's rank, cover URL, title, director and cast, rating, number of ratings, and quote are returned as a list of tuples, one tuple per movie.

[('1', 'https://img3./view/photo/s_ratio_poster/public/p480747492.jpg', '肖申克的救贖', ' / 月黑高飛(港) / 刺激1995(臺)', '\n 導演: 弗蘭克·德拉邦特 Frank Darabont 主演: 蒂姆·羅賓斯 Tim Robbins /...', '\n 1994 / 美國 / 犯罪劇情\n ', 'rating5-t', '9.7', '1893209人評價', '希望讓人自由。'),

Each image file needs its own request, so we define a separate download function that uses Python's built-in open to write the response content to a .jpg file.

# Download one cover image and save it under the movie's title
def down_image(url, name, headers):
    r = requests.get(url, headers=headers)
    # File name as it appears in the URL (not used below)
    filename = re.search('/public/(.*?)$', url, re.S).group(1)
    # Make sure the target folder exists before writing
    os.makedirs('film_pic', exist_ok=True)
    with open('film_pic/' + name.split('/')[0] + '.jpg', 'wb') as f:
        f.write(r.content)

Building on this, we combine the code above into one page-parsing function that fetches a page, extracts the fields, cleans them up, and outputs them. Using yield turns the function into a generator that hands back one record at a time, so the caller can process each movie as soon as it is parsed, which a single return cannot do.

# Parse one list page and yield one dict per movie
def parse_html(url):
    response = requests.get(url, headers=headers)
    text = response.text
    # Groups: [1: rank, 2: image] [3: title, 4: alternative titles]
    # [5: director/cast, 6: year/country/genre]
    # [7: stars, 8: score, 9: number of ratings] [10: quote]
    regix = ('<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?'
             'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)'
             '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?'
             'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?'
             'span class="inq"?>(.*?)</span>')
    # Find every entry on the page
    res = re.findall(regix, text, re.S)
    for item in res:
        rank = item[0]
        down_image(item[1], item[2], headers=headers)
        # The raw source pads the separators with &nbsp; entities; strip them
        name = item[2] + ' ' + re.sub('&nbsp;', '', item[3])
        actor = re.sub('&nbsp;', '', item[4].strip())
        year = item[5].split('/')[0].strip('&nbsp;').strip()
        country = item[5].split('/')[1].strip('&nbsp;').strip()
        tp = item[5].split('/')[2].strip('&nbsp;').strip()
        # 'rating5-t' has one digit (5 stars); 'rating45-t' has two (4.5 stars)
        tmp = [i for i in item[6] if i.isnumeric()]
        if len(tmp) == 1:
            score = tmp[0] + '星/' + item[7] + '分'
        else:
            score = tmp[0] + '星半/' + item[7] + '分'
        # Drop the trailing three-character '人評價' suffix, keeping the number
        rev_num = item[8][:-3]
        inq = item[9]
        # Build the record
        yield {
            '電影名稱': name,
            '導演和演員': actor,
            '類型': tp,
            '年份': year,
            '國家': country,
            '評分': score,
            '排名': rank,
            '評價人數': rev_num,
            '評價': inq
        }

3. Saving the data

The function above returns dictionaries, so we call the dumps method of json to encode each dictionary as JSON and append it to the text file top250_douban_film.txt.

# Append one movie record to the output file as a JSON line
def write_movies_file(item):
    with open('top250_douban_film.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

4. Looping over pages

So far we have scraped only one page of 25 entries. Clicking "next page" and comparing the addresses shows that the page URLs differ only in the value after start=, which is always a multiple of 25.

Given this, a loop plus string concatenation is enough to crawl every page:

# Main function: walk through all ten pages
def main():
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            print(item)
            write_movies_file(item)

if __name__ == '__main__':
    main()
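As an alternative to manual concatenation, the page addresses can be built with urllib.parse.urlencode. This is a sketch rather than part of the original crawler; the helper name page_urls and its parameters are made up for illustration.

```python
from urllib.parse import urlencode

BASE = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return the list-page URL for every offset 0, 25, 50, ... below total."""
    return [
        BASE + "?" + urlencode({"start": offset, "filter": ""})
        for offset in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Generating the URLs in a pure function makes the pagination logic easy to check without sending any requests.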

After the run, the cover images are saved in the film_pic folder and the movie records in top250_douban_film.txt.
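Because each record is written as one JSON object per line, the saved file can be loaded back for inspection line by line. The sketch below demonstrates the round trip on a temporary file with made-up records; write_record and read_records are hypothetical helper names, with write_record mirroring what write_movies_file does.

```python
import json
import os
import tempfile


def write_record(path, record):
    """Append one record to path as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Load every non-empty JSON line in path back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a temporary file; the two records are invented sample data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_film.txt")
write_record(demo_path, {"排名": "1", "評分": "5星/9.7分"})
write_record(demo_path, {"排名": "2", "評分": "4星半/9.6分"})

records = read_records(demo_path)
print(len(records), records[0]["排名"])
```

Passing ensure_ascii=False keeps the Chinese text readable in the file, and opening the file with encoding='utf-8' on both sides keeps the round trip lossless.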

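III. Crawler Summary

That wraps up the hands-on Douban Movie Top 250 crawl. The complete crawler code is available by private message or through the extension link below. Of course, one crawler is far from enough to make the author or the reader fluent. To apply the approach to other sites, go back over the analysis and the specific solutions in this example until they become familiar.

To summarize the crawler: first analyze the page structure and the robots protocol and derive the regular expression; then use requests to send a request to the target page; then use re with that expression to extract the information from the page source; then use json and the open function to store the images and the extracted records; and finally use a loop and string concatenation to crawl page after page.

However, for some sites such as Taobao or JD, the information we want cannot be found in the page source. The reason is that these sites load their content dynamically, whereas Douban Movie serves static pages. Later articles will cover the techniques for dynamic pages. For the background knowledge used above, see the following articles: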

The Basics Every Crawler Writer Needs, All in One Article! Python Web Crawler in Practice Series

A Deep Dive into Python's Crawler Libraries: Never Worry About Data Again