Reading text from a PDF with Python

老三的休閑書屋 2021-04-10

Processing PDF files is a common task in Python. For Python 3, pdfminer3k is a very good tool for it.

pip install pdfminer3k

First, to cover what most people need, here is a fairly general script that reads the text in a PDF:


from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device that writes the extracted text into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the extracted text into lines
    lines = str(content).split('\n')
    return lines


if __name__ == '__main__':
    with open('t1.pdf', 'rb') as my_pdf:
        print(read_pdf(my_pdf))

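My main goal was to pull certain key information out of the PDF, so I needed to find what those pieces of information have in common. Fortunately, every line that carries this key information contains '//', so I only had to find the lines containing '//'. That led to the script below.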

It can be used as-is. Let's look at the script first:


from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device that writes the extracted text into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the extracted text into lines
    lines = str(content).split('\n')
    # units to keep, counted in order of appearance (1-based)
    units = [1, 2, 3, 5, 7, 8, 9, 11, 12, 13]
    # '\x0c' is a form feed, so this matches a unit heading at the top of a page
    header = '\x0cUNIT '
    # print(lines[0:100])
    count = 0
    flag = False
    with open('words.txt', 'w', encoding='utf-8') as text:
        for line in lines:
            if line.startswith(header):
                flag = False
                count += 1
                if count in units:
                    flag = True
                    print(line)
                    text.write(line + '\n')
            if '//' in line and flag:
                # keep the part before '//' and after the last '. '
                text_line = line.split('//')[0].split('. ')[-1]
                print(text_line)
                text.write(text_line + '\n')


def _main():
    with open('t1.pdf', 'rb') as my_pdf:
        read_pdf(my_pdf)


if __name__ == '__main__':
    _main()

In fact, the line lines = str(content).split('\n') is the only one you really need to understand. Print lines and you can see everything that was extracted from the PDF.
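
To illustrate what the lines list looks like, here is a small sketch that does not need a PDF at all. The content string is a made-up stand-in for whatever retstr.getvalue() returns on a real file. The one detail borrowed from the script is that a line at the top of a new page starts with a form feed character ('\x0c'), which is why the header there is written as '\x0cUNIT '. Printing each line with repr() makes such invisible characters visible.

```python
# Hypothetical extracted text; a real PDF will produce different content.
content = "UNIT 1\n1. apple // a fruit\n\x0cUNIT 2\n2. bridge // a structure\n"

lines = str(content).split('\n')
for line in lines:
    # repr() shows control characters such as '\x0c' explicitly
    print(repr(line))
```

Note that the final newline leaves an empty string as the last element of lines.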

This turns PDF processing into plain string processing, so the rest of the script needs little further explanation.
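
For readers who want to try the string-processing part on its own, the following is a minimal sketch of the same filtering logic pulled out into a function that returns a list instead of writing to words.txt. The function name extract_entries and the sample lines are invented for illustration and are not part of the original script.

```python
def extract_entries(lines, units, header='\x0cUNIT '):
    """Collect headings of the selected units and the text before '//' on their lines."""
    results = []
    count = 0        # how many unit headings have been seen so far
    wanted = False   # whether the current unit is one of the selected ones
    for line in lines:
        if line.startswith(header):
            count += 1
            wanted = count in units
            if wanted:
                results.append(line)
        if '//' in line and wanted:
            # keep the part before '//' and after the last '. '
            results.append(line.split('//')[0].split('. ')[-1])
    return results


# Made-up sample lines standing in for text extracted from a PDF
sample = [
    '\x0cUNIT 1',
    '1. apple // a fruit',
    '\x0cUNIT 2',
    '2. bridge // a structure',
    '\x0cUNIT 3',
    'no marker here',
    '3. cloud // in the sky',
]
print(extract_entries(sample, units=[1, 3]))
```

As in the original script, the extracted text keeps any space that precedes '//'. Call .strip() on each entry if that is not wanted.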

