I've been working on a web crawler lately and heard that WebKit is the ultimate tool for scraping, so I started looking into pywebkit. There are very few resources on it in Chinese, and even after searching for a long time I couldn't find anything good in English either. The official site doesn't seem to have real documentation, and the bundled examples didn't cover what I needed. After piecing things together from various places, I found a class that returns the page's HTML:
class WebView(webkit.WebView):
    def get_html(self):
        # Save the real title, then copy the serialized DOM into document.title
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        # The title is readable from Python, so the HTML comes out this way
        html = self.get_main_frame().get_title()
        # Restore the original title
        self.execute_script('document.title=oldtitle;')
        return html
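The trick is that execute_script() cannot return a value to Python, so the HTML is smuggled out through document.title, which Python *can* read via get_main_frame().get_title(). A minimal pure-Python mock of that round trip (no WebKit needed; the FakeFrame/FakeView names are mine, purely illustrative):

```python
class FakeFrame(object):
    """Stand-in for a WebKit frame: only holds and exposes a title."""
    def __init__(self):
        self.title = 'Original Title'

    def get_title(self):
        return self.title


class FakeView(object):
    """Stand-in for webkit.WebView: simulates the title-smuggling trick."""
    def __init__(self, html):
        self._frame = FakeFrame()
        self._html = html        # what document.documentElement.innerHTML would hold
        self._saved_title = None

    def get_main_frame(self):
        return self._frame

    def execute_script(self, script):
        # Mimic the two scripts get_html() runs, without a real JS engine
        if 'innerHTML' in script:
            self._saved_title = self._frame.title  # oldtitle = document.title
            self._frame.title = self._html         # document.title = ...innerHTML
        else:
            self._frame.title = self._saved_title  # document.title = oldtitle

    def get_html(self):
        # Same sequence as the real WebView subclass above
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


view = FakeView('<html><body>hello</body></html>')
print(view.get_html())                    # the captured HTML
print(view.get_main_frame().get_title())  # the title, restored afterwards
```

The same dance works against the real WebView because document.title is the one piece of page state that both JavaScript and the Python bindings can touch.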
I then created a WebView object, open()ed a URL, and called get_html(), and got nothing at all. The catch is that open() only schedules the load; until the GTK main loop runs, the page never actually loads, so there is no DOM to serialize:

web = WebView()
web.open(url)
html = web.get_html()  # empty: the page has not actually loaded yet

print html
print str(html)
After some more googling, I finally found how this class is meant to be driven: connect to the 'load-finished' signal and run the GTK main loop:
#!/usr/bin/env python
import sys
import gtk
import webkit
import warnings
from time import sleep
from optparse import OptionParser

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init()  # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file

    def crawl(self):
        view = WebView()
        view.open(self._url)
        # Only grab the HTML once the page has fully loaded
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()


def main():
    options = get_cmd_options()
    crawler = Crawler(options.url, options.file)
    crawler.crawl()


def get_cmd_options():
    """
    gets and validates the input from the command line
    """
    usage = "usage: %prog [options] args"
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest='url', help='URL to fetch data from')
    parser.add_option('-f', '--file', dest='file', help='Local file path to save data to')

    (options, args) = parser.parse_args()

    if not options.url:
        print 'You must specify an URL.', sys.argv[0], '--help for more details'
        exit(1)
    if not options.file:
        print 'You must specify a destination file.', sys.argv[0], '--help for more details'
        exit(1)

    return options


if __name__ == '__main__':
    main()
OK, the HTML is in hand. Save the script as, say, crawler.py and run `python crawler.py -u <url> -f <file>` to dump the fully rendered page to a file.