I've been working on a web crawler lately and heard that WebKit is the ultimate tool for scraping, so I started looking into pywebkit. There are very few resources on it in Chinese, and even after searching for a long time I couldn't find anything good in English either. The official site doesn't seem to have real documentation, and the bundled examples didn't cover what I needed. After piecing things together from various places, I found a class that returns the page's HTML:
class WebView(webkit.WebView):
    def get_html(self):
        # Save the real title, then copy the serialized DOM into document.title
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        # The title is readable from Python, so the HTML comes out this way
        html = self.get_main_frame().get_title()
        # Restore the original title
        self.execute_script('document.title=oldtitle;')
        return html
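The trick is that execute_script() cannot return a value to Python, so the HTML is smuggled out through document.title, which Python *can* read via get_main_frame().get_title(). A minimal pure-Python mock of that round trip (no WebKit needed; the FakeFrame/FakeView names are mine, purely illustrative):

```python
class FakeFrame(object):
    """Stand-in for a WebKit frame: only holds and exposes a title."""
    def __init__(self):
        self.title = 'Original Title'

    def get_title(self):
        return self.title


class FakeView(object):
    """Stand-in for webkit.WebView: simulates the title-smuggling trick."""
    def __init__(self, html):
        self._frame = FakeFrame()
        self._html = html        # what document.documentElement.innerHTML would hold
        self._saved_title = None

    def get_main_frame(self):
        return self._frame

    def execute_script(self, script):
        # Mimic the two scripts get_html() runs, without a real JS engine
        if 'innerHTML' in script:
            self._saved_title = self._frame.title  # oldtitle = document.title
            self._frame.title = self._html         # document.title = ...innerHTML
        else:
            self._frame.title = self._saved_title  # document.title = oldtitle

    def get_html(self):
        # Same sequence as the real WebView subclass above
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


view = FakeView('<html><body>hello</body></html>')
print(view.get_html())                    # the captured HTML
print(view.get_main_frame().get_title())  # the title, restored afterwards
```

The same dance works against the real WebView because document.title is the one piece of page state that both JavaScript and the Python bindings can touch.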
I then created a WebView object, open()ed a URL, and called get_html(), and got nothing at all. The catch is that open() only schedules the load; until the GTK main loop runs, the page never actually loads, so there is no DOM to serialize:

web = WebView()
web.open(url)
html = web.get_html()  # empty: the page has not actually loaded yet

print html
print str(html)
After some more googling, I finally found how this class is meant to be driven: connect to the 'load-finished' signal and run the GTK main loop:
#!/usr/bin/env python
import sys
import gtk
import webkit
import warnings
from time import sleep
from optparse import OptionParser

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init()  # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file

    def crawl(self):
        view = WebView()
        view.open(self._url)
        # Only grab the HTML once the page has fully loaded
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()


def main():
    options = get_cmd_options()
    crawler = Crawler(options.url, options.file)
    crawler.crawl()


def get_cmd_options():
    """
    gets and validates the input from the command line
    """
    usage = "usage: %prog [options] args"
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest='url', help='URL to fetch data from')
    parser.add_option('-f', '--file', dest='file', help='Local file path to save data to')

    (options, args) = parser.parse_args()

    if not options.url:
        print 'You must specify an URL.', sys.argv[0], '--help for more details'
        exit(1)
    if not options.file:
        print 'You must specify a destination file.', sys.argv[0], '--help for more details'
        exit(1)

    return options


if __name__ == '__main__':
    main()
OK, the HTML is in hand. Save the script as, say, crawler.py and run `python crawler.py -u <url> -f <file>` to dump the fully rendered page to a file.