How to Extract Images From PDF Files Using Python

  1. Method 1: Using PyMuPDF
  2. Method 2: Using pdf2image
  3. Method 3: Using pdfplumber
  4. Conclusion
  5. FAQ

Extracting images from PDF files can look daunting if you have not worked with PDF tooling before. Fortunately, Python has several libraries that make the process straightforward.

In this tutorial, we will explore three ways to get images out of PDF files using Python: PyMuPDF, pdf2image, and pdfplumber. Each section includes installation steps, a code example, and an explanation. Keep in mind that the methods do different things: PyMuPDF extracts the image data embedded in the PDF, while pdf2image and pdfplumber render pages (or regions of pages) into new images. Let’s dive in!

Method 1: Using PyMuPDF

One of the most popular libraries for handling PDF files in Python is PyMuPDF. This library allows you to work with PDF documents easily, including extracting images. To get started, you first need to install PyMuPDF. You can do this using pip:

pip install PyMuPDF

Once you have the library installed, you can use the following code to extract images from a PDF file:

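import fitz  # PyMuPDF is imported under the name "fitz"

def extract_images_from_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        # Each entry describes one image used on the page; item 0 is its xref.
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's real format (png, jpeg, ...) instead of assuming PNG.
            image_ext = base_image["ext"]
            image_filename = f"image_page_{page_number + 1}_{img_index + 1}.{image_ext}"
            with open(image_filename, "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()

extract_images_from_pdf("sample.pdf")

Output:

Images extracted from sample.pdf and saved as image_page_1_1.<ext>, image_page_1_2.<ext>, etc., where <ext> is each image's own format (for example png or jpeg).

In this code, we start by importing the fitz module from the PyMuPDF library. We open the PDF file and loop through each page. For each page, we retrieve a list of images and extract each one using its xref, the number that identifies the image object inside the PDF. extract_image() returns the image bytes together with their format in the "ext" key, so we use that value as the file extension rather than assuming PNG; writing JPEG bytes into a .png file would produce a mislabeled file. The files are named according to their page and index. This method is efficient and works well for most PDF files.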
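A PDF can reference the same embedded image from several pages (a logo in a header, for instance), in which case the loop above writes it once per page. The following is a minimal sketch of one way to save each image only once by tracking xrefs. The helper names `collect_unique_xrefs` and `extract_unique_images` and the `extracted_images` output folder are assumptions made for illustration, not part of PyMuPDF.

```python
import os


def collect_unique_xrefs(doc):
    """Map each embedded image xref to the first page (1-based) that uses it."""
    first_page_by_xref = {}
    for page_number, page in enumerate(doc, start=1):
        for img in page.get_images(full=True):
            xref = img[0]  # first tuple item is the image's xref
            # Keep only the first page on which this xref appears.
            first_page_by_xref.setdefault(xref, page_number)
    return sorted(first_page_by_xref.items())


def extract_unique_images(pdf_path, out_dir="extracted_images"):
    """Write every distinct embedded image once and return the saved paths."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for xref, page_number in collect_unique_xrefs(doc):
            info = doc.extract_image(xref)
            # Hypothetical naming scheme: first page of use, xref, real extension.
            name = f"page{page_number}_xref{xref}.{info['ext']}"
            path = os.path.join(out_dir, name)
            with open(path, "wb") as fh:
                fh.write(info["image"])
            saved.append(path)
    return saved
```

Calling `extract_unique_images("sample.pdf")` would return the list of files written. Keying on the xref is a simple choice; it does not detect two separately stored copies of identical image data.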

Method 2: Using pdf2image

Another option is the pdf2image library. Unlike PyMuPDF, it does not pull out the image objects embedded in the PDF; it renders each page to a PIL image, which is what you want when you need a picture of the whole page. First, make sure to install the library:

pip install pdf2image

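pdf2image relies on Poppler, which is installed separately from the Python package. On Windows and macOS you need to install it yourself and make sure its binaries are on your PATH; most Linux distributions provide it through the system package manager. Once everything is set up, you can use the following code: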

from pdf2image import convert_from_path

def extract_images_from_pdf(pdf_path):
    images = convert_from_path(pdf_path)
    for i, image in enumerate(images):
        image.save(f'pdf_page_{i + 1}.png', 'PNG')

extract_images_from_pdf("sample.pdf")

Output:

PDF pages converted and saved as pdf_page_1.png, pdf_page_2.png, etc.

In this example, we use the convert_from_path function to convert each page of the PDF into an image. We then loop through the resulting images and save them as PNG files. This method is particularly useful if you want to convert entire pages into images rather than just extracting embedded images. It provides a simple way to visualize the content of each page in the PDF.
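Because `convert_from_path` returns every rendered page as an in-memory image, a long PDF can use a lot of memory. The sketch below renders the document in batches using the `first_page`, `last_page`, and `dpi` parameters of `convert_from_path`, and reads the page count with `pdfinfo_from_path`. The helper names `page_ranges` and `render_pages_in_batches` and the default batch size are assumptions chosen for illustration.

```python
def page_ranges(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def render_pages_in_batches(pdf_path, dpi=200, batch_size=10):
    """Render a PDF to PNG files a few pages at a time to limit memory use."""
    # Imported lazily so page_ranges() can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, batch_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, image in enumerate(images):
            image.save(f"pdf_page_{first + offset}.png", "PNG")
```

A higher `dpi` gives sharper output at the cost of larger files and slower rendering, so the value of 200 here is only a starting point.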

Method 3: Using pdfplumber

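pdfplumber is another powerful library for working with PDF files in Python. It is designed mainly for extracting text and tables, but it also reports where each image sits on a page and can render pages, or parts of pages, to image files. To start using pdfplumber, you need to install it first: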

pip install pdfplumber

After installation, you can use the following code to extract images:

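import pdfplumber

def extract_images_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            px0, ptop, px1, pbottom = page.bbox
            for img_index, img in enumerate(page.images, start=1):
                # Clamp the image's bounding box to the page before cropping.
                bbox = (max(img["x0"], px0), max(img["top"], ptop),
                        min(img["x1"], px1), min(img["bottom"], pbottom))
                # Render only the region of the page covered by this image.
                cropped = page.crop(bbox).to_image(resolution=150)
                cropped.save(f"image_page_{page_number}_{img_index}.png", format="PNG")

extract_images_from_pdf("sample.pdf")

Output:

Image regions from sample.pdf rendered and saved as image_page_1_1.png, image_page_1_2.png, etc.

In this approach, we open the PDF file using pdfplumber and iterate through each page. page.images lists the images on the page together with their bounding boxes. For each image, we crop the page to its bounding box (clamped to the page edges, because crop() by default rejects boxes that extend outside the page) and render that region with to_image(). Each file name includes both the page number and the image index, so several images on one page do not overwrite each other. Note that this produces a new rendering at the chosen resolution rather than the original embedded image data; use PyMuPDF when you need the original bytes. This method is most useful when you are already using pdfplumber for text or tables and also want the images that accompany them.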
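Before rendering anything, it can help to list where the images are and how large they appear on the page. The sketch below builds such an inventory from the `x0`, `top`, `x1`, and `bottom` keys of pdfplumber's image entries. The function names `summarize_images` and `list_pdf_images` and the shape of the returned dictionaries are assumptions made for this example.

```python
def summarize_images(page_images, page_number):
    """Return one summary dict per image entry on a page (sizes in PDF points)."""
    summaries = []
    for index, img in enumerate(page_images, start=1):
        summaries.append({
            "page": page_number,
            "index": index,
            "bbox": (img["x0"], img["top"], img["x1"], img["bottom"]),
            "width_pt": img["x1"] - img["x0"],
            "height_pt": img["bottom"] - img["top"],
        })
    return summaries


def list_pdf_images(pdf_path):
    """Collect image summaries for every page of a PDF."""
    import pdfplumber  # imported lazily so summarize_images() works on plain dicts

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            results.extend(summarize_images(page.images, page_number))
    return results
```

An inventory like this makes it easy to skip tiny decorative images, for example by ignoring entries whose width or height is below a threshold you choose.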

Conclusion

Extracting images from PDF files using Python can be accomplished with several libraries, and the right one depends on what you need. PyMuPDF extracts the embedded image data in its stored format, pdf2image renders whole pages to images, and pdfplumber reports where images sit on a page and can render those regions. By following the examples provided, you can extract individual images or convert entire pages to images. Happy coding!

FAQ

  1. What libraries are best for extracting images from PDF files in Python?
    PyMuPDF, pdf2image, and pdfplumber are among the best libraries for extracting images from PDF files in Python.

  2. Do I need to install any additional software to use pdf2image?
    Yes, pdf2image requires Poppler. On Windows and macOS you install it separately; on most Linux distributions it is available through the system package manager.

  3. Can I extract images from scanned PDFs?
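    Yes. A scanned PDF usually stores each page as a single embedded image, so the extracted images are the page scans themselves. Optical Character Recognition (OCR) tools are only needed if you also want the text contained in those scans.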

  4. Is it possible to extract images from encrypted PDF files?
    Yes, but you may need to provide the correct password to open and extract images from encrypted PDF files.

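  5. Are the images extracted in their original quality?
    It depends on the method. PyMuPDF's extract_image() returns the image data as it is stored in the PDF. pdf2image and pdfplumber render pages, so the output quality depends on the resolution you choose.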
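
As an illustration of the answer about encrypted files, here is a minimal sketch of opening a password-protected PDF with PyMuPDF before extracting images. It relies on the document's `needs_pass` attribute and `authenticate()` method; the helper names `unlock` and `open_pdf_with_password` are assumptions made for this example.

```python
def unlock(doc, password):
    """Return True if the document is readable, authenticating when required."""
    if not doc.needs_pass:
        return True  # not encrypted, or already unlocked
    # authenticate() returns 0 (falsy) when the password is rejected.
    return bool(doc.authenticate(password))


def open_pdf_with_password(pdf_path, password):
    """Open a PDF with PyMuPDF and raise ValueError if the password is wrong."""
    import fitz  # PyMuPDF; imported lazily so unlock() has no dependency

    doc = fitz.open(pdf_path)
    if not unlock(doc, password):
        doc.close()
        raise ValueError("Incorrect password for encrypted PDF")
    return doc
```

The returned document can then be passed through the same page and image loops shown in Method 1.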
Author: Lakshay Kapoor

Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.