Taking the Toutiao video page URL http://www.toutiao.com/a6296462662335201793/ as an example, this post walks through how to obtain a video's real address.
Open the link above in Chrome and inspect the element in the player area. It looks like this:
<video id="vjs_video_3_html5_api" class="vjs-tech" preload="auto" autoplay="" src="http://v6./video/c/c62f4d4320ea43469b490e54240653ab/?Signature=D2cYsGzKaEXraZQnOf72xgJ94%2Bs%3D&Expires=1469172376&KSSAccessKeyId=qh0h9TdcEMrm1VlR2ad/"><source type="video/mp4" src="http://v6./video/c/c62f4d4320ea43469b490e54240653ab/?Signature=D2cYsGzKaEXraZQnOf72xgJ94%2Bs%3D&Expires=1469172376&KSSAccessKeyId=qh0h9TdcEMrm1VlR2ad/"></video>
So the page uses the HTML5 video tag, and the tag's src attribute is the real address of the video. Simple, right? But once we try to write a script that resolves the real address automatically, things turn out differently.
Note: all code snippets below are written in Python.
import requests
from pyquery import PyQuery as pq

r = requests.get('http://www.toutiao.com/a6296462662335201793/')
d = pq(r.content)
d('video')   # no video element exists
d('#video')  # an element whose id is "video" does exist
When we download the page and try to extract the video element, we find that the downloaded HTML contains no video element at all. This suggests the video element is generated dynamically by a JavaScript script, so we need another approach.
Watching the network requests made while the page loads, we find the following relevant requests:
http://v7.pstatp.com/b97adb57aaa351e485ed69c5e4852211/5791c279/video/c/c62f4d4320ea43469b490e54240653ab/
http://i.snssdk.com/video/urls/v/1/toutiao/mp4/9583cca5fceb4c6b9ca749c214fd1f90?r=18723666135963302&s=3807690062&callback=tt_playerzfndr
The first request is the real video address. The second returns a JSON string with the following content:
{
"code": 0,
"message": "success",
"total": 3,
"data": {
"status": 10,
"video_duration": 0,
"video_id": "9583cca5fceb4c6b9ca749c214fd1f90",
"user_id": "toutiao",
"video_list": {
"video_3": {
"definition": "720p",
"vtype": "mp4",
"main_url": "aHR0cDovL3Y3LnBzdGF0cC5jb20vZmJiZmE2Yjc4ZjM4MThhM2M0OTVhMmRkYjAyOWY5NTAvNTc5\nMWMzODAvdmlkZW8vYy8zNDMwNzcxZjMyNmY0ZDUxOTRiNTYyMzdhNmEyMzFmYy8=\n",
"vwidth": 720,
"backup_url_1": "aHR0cDovL3Y2LnBzdGF0cC5jb20vdmlkZW8vYy8zNDMwNzcxZjMyNmY0ZDUxOTRiNTYyMzdhNmEy\nMzFmYy8/U2lnbmF0dXJlPTMwd25YNHVBYzJ1JTJGdSUyRlNvNjhDM010U1VRVW8lM0QmRXhwaXJl\ncz0xNDY5MTc0MTYwJktTU0FjY2Vzc0tleUlkPXFoMGg5VGRjRU1ybTFWbFIyYWQv\n",
"bitrate": 0,
"vheight": 576,
"size": 0
},
"video_2": {
"definition": "480p",
"vtype": "mp4",
"main_url": "aHR0cDovL3Y0LnBzdGF0cC5jb20vM2ZiYTI0YzVhYzE1NGVlNmIxMGQ4ZTAyZThhNGQxZDMvNTc5\nMWMzODAvdmlkZW8vYy9jNjJmNGQ0MzIwZWE0MzQ2OWI0OTBlNTQyNDA2NTNhYi8=\n",
"vwidth": 600,
"backup_url_1": "aHR0cDovL3Y0LnBzdGF0cC5jb20vM2ZiYTI0YzVhYzE1NGVlNmIxMGQ4ZTAyZThhNGQxZDMvNTc5\nMWMzODAvdmlkZW8vYy9jNjJmNGQ0MzIwZWE0MzQ2OWI0OTBlNTQyNDA2NTNhYi8=\n",
"bitrate": 0,
"vheight": 480,
"size": 0
},
"video_1": {
"definition": "360p",
"vtype": "mp4",
"main_url": "aHR0cDovL3Y2LnBzdGF0cC5jb20vdmlkZW8vYy9iODgwZmI1YzM1NjE0NzJlOThlNGU0Y2U5N2My\nYzg5ZS8/U2lnbmF0dXJlPXBlTWhoNFdLcyUyRkNmRW9pYm4wTVNKUU5tR1lnJTNEJkV4cGlyZXM9\nMTQ2OTE3NDE2MCZLU1NBY2Nlc3NLZXlJZD1xaDBoOVRkY0VNcm0xVmxSMmFkLw==\n",
"vwidth": 450,
"backup_url_1": "aHR0cDovL3Y3LnBzdGF0cC5jb20vNjFhYTJlN2RlN2YxZTgzNGJiNjg3ZDZmMDZjZGFmNzMvNTc5\nMWMzODAvdmlkZW8vYy9iODgwZmI1YzM1NjE0NzJlOThlNGU0Y2U5N2MyYzg5ZS8=\n",
"bitrate": 0,
"vheight": 360,
"size": 0
}
}
}}
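Looking at the JSON, there are three definitions of the video: super HD, HD, and SD (720p, 480p, and 360p in the definition field). The definition field gives the definition, and main_url should be the real video address. The main_url value looks Base64-encoded; decoding it with Base64 yields the real video address.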
import base64

main_url = "aHR0cDovL3Y3LnBzdGF0cC5jb20vZmJiZmE2Yjc4ZjM4MThhM2M0OTVhMmRkYjAyOWY5NTAvNTc5\nMWMzODAvdmlkZW8vYy8zNDMwNzcxZjMyNmY0ZDUxOTRiNTYyMzdhNmEyMzFmYy8=\n"
base64.standard_b64decode(main_url)  # output: http://v7./fbbfa6b78f3818a3c495a2ddb029f950/5791c380/video/c/3430771f326f4d5194b56237a6a231fc/
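The same decoding can be applied to every entry in video_list. The following is a minimal sketch, assuming the response has already been parsed into a dict shaped like the "data" object above; the helper name decode_video_list and the trimmed sample are illustrative, not part of the original script.

```python
import base64


def decode_video_list(data):
    """Map each definition (e.g. '720p') to its decoded main_url.

    Assumes `data` is the parsed "data" object from the JSON shown above.
    """
    urls = {}
    for entry in data.get("video_list", {}).values():
        # b64decode discards the embedded newlines by default
        raw = base64.b64decode(entry["main_url"])
        urls[entry["definition"]] = raw.decode("utf-8")
    return urls


# Trimmed sample that reuses the 720p main_url value shown above
sample = {
    "video_list": {
        "video_3": {
            "definition": "720p",
            "main_url": "aHR0cDovL3Y3LnBzdGF0cC5jb20vZmJiZmE2Yjc4ZjM4MThhM2M0OTVhMmRkYjAyOWY5NTAvNTc5\nMWMzODAvdmlkZW8vYy8zNDMwNzcxZjMyNmY0ZDUxOTRiNTYyMzdhNmEyMzFmYy8=\n",
        }
    }
}
decoded = decode_video_list(sample)
print(decoded["720p"])
```

Keying the result by definition makes it easy to pick a preferred quality and fall back to another one if it is missing.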
The next question is how the second request above, http://i./video/urls/v/1/toutiao/mp4/9583cca5fceb4c6b9ca749c214fd1f90?r=18723666135963302&s=3807690062&callback=tt_playerzfndr, is constructed.
Monitoring network requests in Chrome's developer tools shows that the request is issued by a JavaScript script, namely http://s3./tt_player/player/tt2-player.js?r=customer1
Download that script, prettify it, and read through it in your favorite editor to see what it actually does.
Studying the script reveals the meaning of several parts of the request http://i./video/urls/v/1/toutiao/mp4/9583cca5fceb4c6b9ca749c214fd1f90?r=18723666135963302&s=3807690062&callback=tt_playerzfndr:
9583cca5fceb4c6b9ca749c214fd1f90: the unique ID of the video
18723666135963302: a random number
3807690062: a CRC32 checksum, unsigned-right-shifted by 0 bits
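Why shift by zero bits? In JavaScript, `>>> 0` reinterprets a signed 32-bit integer as an unsigned one, which is presumably why the player script applies it to the checksum. Below is a minimal sketch of the equivalent operation in Python, assuming the CRC32 value may arrive as a negative number (as Python 2's binascii.crc32 can return). The name to_uint32 and the sample input string are illustrative.

```python
import binascii


def to_uint32(val):
    """Emulate JavaScript's `val >>> 0` for a 32-bit integer."""
    return val & 0xFFFFFFFF


# A negative 32-bit value maps to its unsigned counterpart
print(to_uint32(-1))          # 4294967295
# Values already in the unsigned 32-bit range are unchanged
print(to_uint32(3807690062))  # 3807690062

# Python 3's crc32 is already unsigned; masking keeps Python 2 results consistent
checksum = to_uint32(binascii.crc32(b"/video/urls/v/1/toutiao/mp4/example?r=1"))
```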
The video's unique ID can be found in the page's HTML source: it is the value of the tt-videoid attribute on the element whose id is video.
import requests
from pyquery import PyQuery as pq

r = requests.get('http://www.toutiao.com/a6296462662335201793/')
d = pq(r.content)
vid = d('#video').attr('tt-videoid')
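If pyquery is not available, the same lookup can be sketched with the standard library alone. This is only an illustration under stated assumptions: it targets Python 3's html.parser, the class and function names are made up for this example, and the sample markup is hypothetical (the real page may wrap the element differently).

```python
from html.parser import HTMLParser


class VideoIdParser(HTMLParser):
    """Record the tt-videoid attribute of the first element with id="video"."""

    def __init__(self):
        super().__init__()
        self.video_id = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.video_id is None and attr_map.get("id") == "video":
            self.video_id = attr_map.get("tt-videoid")


def extract_video_id(html):
    """Return the tt-videoid value, or None if no matching element exists."""
    parser = VideoIdParser()
    parser.feed(html)
    return parser.video_id


# Hypothetical markup; only the id and tt-videoid attributes come from the text above
sample_html = '<div id="video" tt-videoid="9583cca5fceb4c6b9ca749c214fd1f90"></div>'
print(extract_video_id(sample_html))
```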
The parameter r is constructed as follows:
import random
r = str(random.random())[2:]
The parameter s is constructed as follows:
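import binascii
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def right_shift(val, n):
    # Emulate an unsigned right shift on a 32-bit value
    return val >> n if val >= 0 else (val + 0x100000000) >> n

url = 'http://i./video/urls/v/1/toutiao/mp4/%s' % vid
# The checksum covers the URL path plus the r query parameter
n = urlparse(url).path + '?r=' + r
c = binascii.crc32(n.encode('utf-8'))
s = right_shift(c, 0)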
We can ignore the callback parameter. With that, fetching the JSON content is straightforward:
resp = requests.get(url + '?r=%s&s=%s' % (r, s))
print(resp.json())
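Putting the pieces together, the request URL can be built in one helper. This is a sketch under assumptions rather than a definitive implementation: the function name build_video_api_url and the API_HOST constant are introduced here for illustration, and the host is copied from the captured request listed earlier. No network request is made.

```python
import binascii
import random

# Host as it appears in the captured request earlier in this post (assumption:
# it is the same for every video)
API_HOST = "http://i.snssdk.com"


def build_video_api_url(vid, r=None):
    """Build the request URL (with r and s parameters) for a video ID."""
    if r is None:
        # Same construction as the r parameter above: digits of a random float
        r = str(random.random())[2:]
    path = "/video/urls/v/1/toutiao/mp4/%s" % vid
    signed = path + "?r=" + r
    # CRC32 of path + r, masked to an unsigned 32-bit value
    s = binascii.crc32(signed.encode("utf-8")) & 0xFFFFFFFF
    return "%s%s&s=%d" % (API_HOST, signed, s)


url = build_video_api_url("9583cca5fceb4c6b9ca749c214fd1f90", r="18723666135963302")
print(url)
```

Passing r explicitly makes the result reproducible, which is handy when comparing the computed s against a value captured in the browser.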