How to Extract Images From PDF Files Using Python
Python can perform many operations on external files. One of them is extracting images from PDF files, which is especially useful when a PDF is too long to go through by hand.
This guide shows you how to extract images from PDF files in Python.
Install the PyMuPDF Library in Python
To perform this operation, you must install the PyMuPDF library. This library works with files in PDF, XPS, FB2, OpenXPS, and EPUB formats, and it is known for its high performance and rendering quality. However, it doesn’t come pre-installed with Python. To install it, together with Pillow for handling image data, run the following command.
pip install PyMuPDF Pillow
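To confirm that the installation worked, you can check that both packages are importable. The snippet below is only a minimal sketch of such a check; the helper name is_importable is made up for this example. Note that PyMuPDF is imported under the name fitz and Pillow under the name PIL, so the import names differ from the names given to pip.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# PyMuPDF is imported as "fitz" and Pillow as "PIL".
for name in ("fitz", "PIL"):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the pip command in the same environment that runs your script.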
Extract Images From a PDF File in Python
To extract images from a PDF file, follow these steps:
- First, all the necessary libraries are imported.
import fitz
import io
from PIL import Image
- Then, the path to the file from which the images have to be extracted is defined. The file is opened using the open() function from the fitz module.
file_path = "randomfile.pdf"
open_file = fitz.open(file_path)
- After that, the program iterates over every page of the PDF file and checks whether the page contains any images.
for page_number in range(len(open_file)):
    page = open_file[page_number]
    # List of tuples, one per image referenced by this page
    list_image = page.get_images()
    if list_image:
        print(f"{len(list_image)} images found on page {page_number}")
    else:
        print("No images found on page", page_number)
In this step, the get_images() function (named getImageList() in older PyMuPDF releases) returns a list of tuples, one per image on the page. Each tuple describes an image rather than holding the image data itself.
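- Then, the image data and extra information about the image, like the image size and the image extension, are returned by the extract_image() function (named extractImage() in older PyMuPDF releases). It takes the image's xref, which is the first item of each tuple. This step is carried out as a loop nested inside the page loop.
# Runs inside the "for page_number" loop
for image_number, img in enumerate(page.get_images(), start=1):
    xref = img[0]  # cross-reference number of the image
    image_base = open_file.extract_image(xref)
    bytes_image = image_base["image"]  # raw image bytes
    ext_image = image_base["ext"]  # file extension, such as "png"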
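To make the tuples from the page-scanning step more concrete, the sketch below unpacks the leading fields of one entry. It assumes the layout described in the PyMuPDF documentation: the xref first, then the xref of any soft mask (zero when there is none), then the width and height. The helper name summarize_entry and the sample tuple are invented for illustration and do not come from a real PDF.

```python
def summarize_entry(entry):
    """Pick out the leading fields of one image tuple (assumed layout, see above)."""
    xref, smask, width, height = entry[:4]
    return {
        "xref": xref,             # number to pass on when extracting the image
        "has_mask": smask > 0,    # a non-zero smask points at a transparency mask
        "size": (width, height),  # pixel dimensions
    }


# Made-up sample shaped like one list item; real values come from the PDF.
sample = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")
print(summarize_entry(sample))
```

Only the first four fields are read here, so the helper does not depend on how many trailing fields a given PyMuPDF version returns.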
After combining all these steps into one single program, you can easily extract all the images from a PDF file.
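One possible combined version is sketched below. It is not the only way to assemble the steps, and it makes a few assumptions: it uses the newer snake_case PyMuPDF method names (get_images() and extract_image()), it writes the extracted bytes straight to disk instead of passing them through Pillow, and the function names, the output folder name, and the file naming scheme are all invented for this example.

```python
from pathlib import Path


def save_images(doc, out_dir):
    """Write every image referenced by `doc` into `out_dir`; return the saved paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page_number in range(len(doc)):
        page = doc[page_number]
        for image_number, img in enumerate(page.get_images(), start=1):
            xref = img[0]  # first tuple item is the image's xref
            base_image = doc.extract_image(xref)
            # Hypothetical naming scheme: zero-based page index, then image index.
            name = f"page{page_number}_image{image_number}.{base_image['ext']}"
            target = out_dir / name
            target.write_bytes(base_image["image"])
            saved.append(target)
    return saved


def extract_from_file(pdf_path, out_dir="extracted_images"):
    """Open the PDF with PyMuPDF and save its images into `out_dir`."""
    # Imported here so save_images() itself only needs a document-like object.
    import fitz

    with fitz.open(pdf_path) as doc:
        return save_images(doc, out_dir)
```

Calling extract_from_file("randomfile.pdf") would then return the list of files written. An image that is reused on several pages is saved once per page in this sketch; collecting the xref values in a set first would avoid the duplicates.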
Now, suppose randomfile.pdf has 5 pages and only the last page contains an image. Page numbers start at 0, so the last page is reported as page 4, and the output will look like this.
No images found on page 0
No images found on page 1
No images found on page 2
No images found on page 3
1 images found on page 4
Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++), and he is deeply interested in data science and machine learning.