

      Scraping the HSK Dynamic Composition Corpus with a Python crawler

       岳士君 2015-12-15

      # Place this script and the keyword file on the desktop, then run it.
      # Note: the keyword text file must contain one word per line.

      import requests
      import bs4
      import re

      base_url = 'http://202.112.195.192:8060/hsk/'

      # Fetch the result page for a keyword (first page if page_num is None)
      def get_page(keyword, page_num=None):
          headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36',}
          cookies = dict(ASPSESSIONIDSQASCRST='JKENNPPCMNBHKOGDCCLIIHPA', zwyl='rights=2&utime=2015%2F12%2F14+9%3A12%3A21&username=rebellion51')
          params = dict(keyword=keyword.encode('gb2312'),  kind='ci', radiobutton='all', page=page_num)
          response = requests.get(base_url+'googlecom.asp', headers=headers, cookies=cookies, params=params)
          html = response.content.decode('gbk')
          soup = bs4.BeautifulSoup(html, 'html.parser')
          return soup

      # Parse the first result page to get the total number of pages
      def get_total_page_num(soup):
          # [页頁] matches either character variant of "page", since archived
          # copies of this listing show converted characters
          total_page_num = re.findall(r'共\s*(\d+)\s*[页頁]', soup.find('div', id='Layer1').find('div').text.strip())
          return int(total_page_num[0])

      # Parse the result page and extract the sentences
      def get_sentences(soup):
          tds = soup.find('div', id='Layer1').find('table').find_all('td')[1::2]
          # strip the "raw corpus" label, in either character variant
          sentences = [td.text.replace('原始语料', '').replace('原始語料', '').strip() for td in tds]
          return sentences

      # Append all sentences for a keyword to a text file
      def save_sentences(sentences, keyword):
          with open(keyword+'.txt', 'at', encoding='utf-8') as f:
              for sentence in sentences:
                  f.write(sentence+'\n')

      # Read the keyword list from the keyword file (assumed UTF-8)
      with open('keyword.txt', 'rt', encoding='utf-8') as f:
          keywords = [key.strip() for key in f.readlines()]

      # For each keyword, fetch every result page, collect the sentences, and save them
      for keyword in keywords:
          total_page_num = get_total_page_num(get_page(keyword))
          print('Word [{}]: {} pages in total'.format(keyword, total_page_num))
          sentences = []
          for page_num in range(1, total_page_num+1):
              print('----------------- fetching page {} ------------'.format(page_num))
              sentences.extend(get_sentences(get_page(keyword, page_num)))

          save_sentences(sentences, keyword)
          print('Word [{}]: done'.format(keyword))
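The most fragile step in the script above is parsing the total page count out of the search results, since the corpus pages are served as GB-encoded Chinese text and archived copies of this script show the string literals converted between simplified and traditional characters. Here is a small standalone sketch of that regex logic that can be checked offline. The `parse_total_pages` helper name and the variant-tolerant character class are illustrative additions, not part of the original script.

```python
import re

# Standalone sketch of the page-count parsing used in get_total_page_num:
# the search results report "共 N 页" ("N pages in total"), and the crawler
# extracts N with a regular expression. The character class [页頁] accepts
# both the simplified and traditional forms of "page" (an assumption: the
# archived listing shows traditional characters, but the mainland-hosted
# corpus likely serves simplified text).
def parse_total_pages(text):
    match = re.search(r'共\s*(\d+)\s*[页頁]', text)
    if match is None:
        raise ValueError('page count not found in: ' + text)
    return int(match.group(1))

print(parse_total_pages('第 1 页 / 共 12 页'))  # prints 12
print(parse_total_pages('共3頁'))               # prints 3
```

If the regex finds no match, raising early with the offending text makes a site-side markup change obvious, rather than failing later with an `IndexError` on an empty `findall` result.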
        
