The Pdfminer Package in Python
PDF (Portable Document Format) is one of the most widely used document formats.
Python can read and process many types of files, and several packages are available for working with PDF files.
The pdfminer
is one such package. It has different functionalities to work with PDF files and read text data from such files.
We will discuss some basics of this package below.
Installing the pdfminer Package in Python
The original pdfminer
package does not support Python 3. For Python 3, we can use the community-maintained fork of this package called pdfminer.six
.
We can install this using the following pip
command from the command prompt.
pip install pdfminer.six
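After running the command, it can be useful to confirm that the package is visible to the interpreter. The snippet below is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name installed_version is hypothetical and not part of pdfminer.six.

```python
from importlib import metadata


def installed_version(dist_name="pdfminer.six"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("pdfminer.six is not installed")
else:
    print(f"pdfminer.six {version} is installed")
```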
Using the pdfminer Package in Python
We can use the extract_text()
function to extract text from a PDF saved on the device. The path of the file is passed to the function as an argument.
See the following example.
from pdfminer.high_level import extract_text
s = extract_text("sample.pdf")
print(s)
Output:
Sample PDF from device
We can use the same function in different ways.
We can open a PDF file using the open()
function, create a file object, and use this file object to read the data. For this, we need to open the file in the rb
mode.
For example,
from pdfminer.high_level import extract_text
with open("sample.pdf", "rb") as f:
    s = extract_text(f)
print(s)
Output:
Sample PDF from device
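The extract_text() function also accepts optional keyword arguments, such as maxpages, which limits how many pages are parsed. The sketch below wraps this in a small helper; the name extract_first_pages is hypothetical, and the source argument is assumed to be a file path or a binary file object as in the examples above.

```python
def extract_first_pages(source, count=1):
    """Return the text of the first `count` pages of a PDF.

    `source` may be a file path or a file object opened in "rb" mode.
    """
    # Imported inside the function so the helper can be defined
    # even before pdfminer.six is installed.
    from pdfminer.high_level import extract_text

    # maxpages limits the number of pages parsed; 0 (the default) means no limit.
    return extract_text(source, maxpages=count)
```

For instance, extract_first_pages("sample.pdf", count=2) would return only the text of the first two pages, which can save time on long documents.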
We can read a file from the web and extract its content using this function.
First, we will download the file from the given URL using the requests.get()
function. The raw bytes of the response are available through its content
attribute.
We will then load these bytes into memory using the io.BytesIO()
class, and extract the text using the extract_text()
function.
Check the syntax below.
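import io
import requests
from pdfminer.high_level import extract_text

url = "https://example.com/sample.pdf"  # placeholder URL; replace with the address of a real PDF
r = requests.get(url)
s = extract_text(io.BytesIO(r.content))
print(s)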
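A download can fail or return something other than a PDF, for example an HTML error page. The following sketch adds a basic sanity check before the bytes are handed to extract_text(). It assumes that the file begins with the "%PDF-" marker, which is normally the case; the helper name pdf_stream_from_bytes is hypothetical.

```python
import io


def pdf_stream_from_bytes(data):
    """Wrap downloaded bytes in a file-like object after a basic sanity check."""
    # PDF files normally begin with the marker "%PDF-"; an HTML error page would not.
    if not data.startswith(b"%PDF-"):
        raise ValueError("the downloaded content does not look like a PDF")
    return io.BytesIO(data)
```

The returned object can be passed directly to extract_text(), for instance by giving the helper the content attribute of the response returned by requests.get(). Calling the raise_for_status() method of the response beforehand would additionally turn HTTP errors into exceptions.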
The original pdfminer
package was widely used with Python 2.7 but lost popularity because of its compatibility issues with Python 3.
Other packages for working with PDF files in Python have also emerged. PyPDF2
is one such alternative.
Manav is an IT professional who has a lot of experience as a core developer in many live projects. He is an avid learner who enjoys learning new things and sharing his findings whenever possible.