How to Read PDF in Python
- Use the PyPDF2 Module to Read a PDF in Python
- Use the pdfplumber Module to Read a PDF in Python
- Use the textract Module to Read a PDF in Python
- Use the pdfminer.six Module to Read a PDF in Python
A PDF document is not designed to be edited easily, but it can be shared easily and reliably. It can contain many kinds of elements, such as text, links, images, tables, and forms.
In this tutorial, we will read a PDF file in Python.
Use the PyPDF2 Module to Read a PDF in Python
PyPDF2 is a Python module that we can use to extract a PDF document's information, merge documents, split a document, crop pages, encrypt or decrypt a PDF file, and more.
We open the PDF document in read binary mode using open('document_path.pdf', 'rb'). PdfFileReader() creates a PDF reader object for the document. We can extract text from a page of the document using the getPage() and extractText() methods. To get the number of pages in the document, we use the .numPages attribute. These are the names used by PyPDF2 releases before 3.0; from 3.0 onward they are replaced by PdfReader, .pages, and extract_text().
For example,
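from PyPDF2 import PdfFileReader  # legacy name; PyPDF2 3.0+ uses PdfReader

# Keep the file open while reading; the with block closes it afterwards
with open("document_path.pdf", "rb") as temp:
    pdf_read = PdfFileReader(temp)
    first_page = pdf_read.getPage(0)  # pages are zero-indexed
    print(first_page.extractText())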
The above code will print the text on the first page of the provided PDF document.
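The example above reads a single page with the older PyPDF2 names. As a minimal sketch, assuming PyPDF2 3.0 or later is installed (where PdfReader, .pages, and extract_text() take the place of the legacy names), the whole document could be read page by page as shown here. The helper names collect_page_text and read_pdf_text are hypothetical and chosen only for illustration.

```python
def collect_page_text(reader):
    """Return a list with the extracted text of every page in `reader`.

    `reader` is any object with a `.pages` sequence whose items offer
    `extract_text()`, such as a PyPDF2 `PdfReader`.
    """
    texts = []
    for page in reader.pages:
        # Treat a missing result as an empty string so callers always get str
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path):
    """Open the PDF at `path` and return its text, one string per page."""
    # Imported here so the helper above can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return collect_page_text(reader)
```

Calling read_pdf_text("document_path.pdf") would return one string per page, so the length of that list is the page count that the legacy .numPages attribute reports.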
Use the pdfplumber Module to Read a PDF in Python
pdfplumber is a Python module that we can use to read and extract text, tables, and other details from a PDF document. It exposes more layout information than PyPDF2, such as the position of individual characters and lines. Here, we open the PDF file with the pdfplumber.open() function.
For example,
import pdfplumber

# The with block closes the file when it ends
with pdfplumber.open("document_path.pdf") as temp:
    first_page = temp.pages[0]  # pages are zero-indexed
    print(first_page.extract_text())
The above code will print the text from the first page of the provided PDF document.
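Beyond plain text, pdfplumber pages also offer extract_tables(), which returns each detected table as a list of rows. A minimal sketch of gathering both per page might look like the following. The helper names summarize_pages and read_with_pdfplumber are hypothetical, and how well tables are detected depends on the layout of the document.

```python
def summarize_pages(pdf):
    """Return one dict per page with its number, text, and tables.

    `pdf` is any object with a `.pages` sequence whose items provide
    `extract_text()` and `extract_tables()`, such as the object
    returned by `pdfplumber.open()`.
    """
    summary = []
    for number, page in enumerate(pdf.pages, start=1):
        summary.append({
            "page": number,  # 1-based, as a reader would count pages
            "text": page.extract_text() or "",
            "tables": page.extract_tables(),
        })
    return summary


def read_with_pdfplumber(path):
    """Open the PDF at `path` and summarize every page."""
    # Imported here so summarize_pages() works without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return summarize_pages(pdf)
```

Each entry in the returned list holds the text of one page and a list of its tables, where every table is a list of rows and every row is a list of cell values.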
Use the textract Module to Read a PDF in Python
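We can use the textract.process() function from the textract module to read a PDF document. It returns the extracted text as a bytes object, and its method argument selects the extraction backend.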
For example,
import textract
# process() returns bytes, so decode them before printing
pdf_read = textract.process("document_path.pdf", method="pdfminer")
print(pdf_read.decode("utf-8"))
Use the pdfminer.six Module to Read a PDF in Python
pdfminer.six is a Python module that we can use to read and extract text from a PDF document. We will use the extract_text() function from its pdfminer.high_level module to read the text from a PDF.
For example,
from pdfminer.high_level import extract_text
# extract_text() returns the text of the whole document as one string
pdf_read = extract_text("document_path.pdf")
print(pdf_read)
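extract_text() also accepts a page_numbers argument, a list of zero-based page indexes, for cases where only part of the document is needed. The sketch below wraps that call. first_pages_text is a hypothetical helper, and its extract parameter is an assumption added only so the function can be exercised without a real PDF.

```python
def first_pages_text(path, count, extract=None):
    """Return the text of the first `count` pages of the PDF at `path`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if extract is None:
        # Default to pdfminer.six; imported lazily so a stand-in can be passed
        from pdfminer.high_level import extract_text as extract
    # page_numbers uses zero-based indexes: [0, 1, ..., count - 1]
    return extract(path, page_numbers=list(range(count)))
```

For instance, first_pages_text("document_path.pdf", 2) would ask pdfminer.six for pages 0 and 1 only, which can save time on long documents.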