How to Vectorize a Function in Pandas

Vectorization means rewriting an operation so that it is applied to a whole array of values at once instead of to one element at a time in an explicit Python loop. Because the per-element work then runs in optimized, compiled code, it usually makes data processing in Python much faster.

The Pandas library is a popular Python tool for data analysis and manipulation. Vectorization is commonly used in Pandas, and in numerical computing in general, to improve code performance.

A Pandas data frame is a two-dimensional, labeled data structure built on top of NumPy. It combines features of R data frames and Python dictionaries: columns are looked up by name, like dictionary keys, while the data is laid out in rows and columns, like an Excel table or a database table.
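As a minimal sketch of that dictionary-like behavior, the example below builds a small data frame from a Python dictionary. The names df_demo and total_pay and all of the values are made up for illustration; only the TotalPay column name matches the data used later in this article.

```python
import pandas as pd

# Hypothetical sample data; the names and amounts are invented for illustration.
data = {
    "EmployeeName": ["A", "B", "C"],
    "TotalPay": [1000.0, 2500.0, 4000.0],
}
df_demo = pd.DataFrame(data)

# A column is looked up by name, much like a dictionary key.
total_pay = df_demo["TotalPay"]

print(df_demo.shape)       # (number of rows, number of columns)
print(total_pay.tolist())  # the column values as a plain Python list
```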

Vectorize a Function in Pandas

Let’s install the pandas library so that we can work with data frames.

PS C:\> pip install pandas

To perform vectorization on a data frame, we first load the data with pandas. The code below reads Salaries.csv into a data frame and enlarges it by concatenating 100 copies of it.

Example Code (saved in demo.py):

import pandas as pd

small_df = pd.read_csv("Salaries.csv")
df = pd.concat([small_df] * 100, ignore_index=True)

Now run the code below to print the total number of rows in the data frame.

Example Code (saved in demo.py):

print(f"No of rows: {len(df)}")

OUTPUT (printed on console):

No of rows: 14865400
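If Salaries.csv is not available, a synthetic data frame can stand in for it. The sketch below assumes that only a numeric TotalPay column is needed; the random values, the seed, and the size of 1,000 rows are made up for illustration, so the row count will differ from the one printed above.

```python
import numpy as np
import pandas as pd

# Assumption: only a numeric TotalPay column is required.
# The values are random numbers, not real salary data.
rng = np.random.default_rng(seed=0)
small_df = pd.DataFrame({"TotalPay": rng.uniform(0, 200_000, size=1_000)})

# Same enlargement step as in the article: stack 100 copies of the frame.
df = pd.concat([small_df] * 100, ignore_index=True)

print(f"No of rows: {len(df)}")
```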

Let’s measure how long an operation on the data frame takes without vectorization by running the code below.

Example Code (saved in demo.py):

import time
import numpy as np

start_time = time.process_time()
pay_with_tax = np.zeros(len(df))
for idx, pay in enumerate(df.TotalPay.values):
    pay_with_tax[idx] = pay * 1.05 + 1
end_time = time.process_time()

print("Without using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")

The function np.zeros() takes len(df) as the size and creates an array of zeros of that length. The for loop uses enumerate() to walk through the TotalPay column of the data frame, yielding each position as idx and each value as pay.

For each pay, it calculates the pay with tax (pay * 1.05 + 1) and stores the result at the same position in pay_with_tax.
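To make the loop concrete, here is a small trace of the same steps on three made-up pay values. The array pays is hypothetical; the formula is the one used in the code above.

```python
import numpy as np

# Hypothetical pay values, chosen only to keep the arithmetic easy to follow.
pays = np.array([100.0, 200.0, 300.0])

# Preallocate the result array, one slot per input value.
pay_with_tax = np.zeros(len(pays))

# enumerate() yields the position (idx) together with the value (pay).
for idx, pay in enumerate(pays):
    pay_with_tax[idx] = pay * 1.05 + 1
    print(idx, pay, pay_with_tax[idx])
```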

OUTPUT (printed on console):

[Screenshot: console output of the non-vectorized run]

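Vectorization expresses the computation as a single operation on whole arrays. The element-wise work then runs in compiled code, which can take advantage of SIMD (Single Instruction, Multiple Data) instructions on the CPU. In Pandas, operations on whole columns act as a batch API that speeds up the work without explicit loops.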

Let’s run the code below, which uses vectorization, to measure the time taken to calculate pay_with_tax.

Example Code (saved in demo.py):

start_time = time.process_time()
pay_with_tax = df.TotalPay.values * 1.05 + 1
end_time = time.process_time()

print("Using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")

OUTPUT (printed on console):

[Screenshot: console output of the vectorized run]
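The two approaches can also be compared side by side. The sketch below is a self-contained version that runs on made-up data (the frame df_demo and its 200,000 rows are assumptions), checks that both approaches give the same result, and times each one. It uses time.perf_counter() instead of time.process_time(), which is a substitution on our part; the actual timings depend on the machine, so none are shown here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data standing in for the TotalPay column of Salaries.csv.
df_demo = pd.DataFrame({"TotalPay": np.arange(200_000, dtype=float)})
values = df_demo["TotalPay"].values

# Non-vectorized: one Python-level iteration per row.
start = time.perf_counter()
looped = np.zeros(len(values))
for idx, pay in enumerate(values):
    looped[idx] = pay * 1.05 + 1
loop_seconds = time.perf_counter() - start

# Vectorized: a single expression over the whole array.
start = time.perf_counter()
vectorized = values * 1.05 + 1
vector_seconds = time.perf_counter() - start

print(f"Same result: {np.allclose(looped, vectorized)}")
print(f"Loop:       {loop_seconds * 1000:.2f} ms")
print(f"Vectorized: {vector_seconds * 1000:.2f} ms")
```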

You can also apply NumPy functions such as mean or sqrt to a whole column at once by making small changes to the above code.

Example Code (saved in demo.py):

import numpy as np

# non-vectorized: add up the values one by one in a Python loop
total = 0.0
for pay in df.TotalPay.values:
    total += pay
mean_pay = total / len(df)

# vectorized: NumPy reduces the whole array in a single call
mean_pay = np.mean(df.TotalPay.values)
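A function written with ordinary arithmetic operators can often be used in vectorized form simply by passing it a whole column instead of one value at a time. The sketch below illustrates this with a hypothetical helper, add_tax, and a made-up three-row frame; neither comes from the original code. It also shows the row-by-row .apply() call for comparison, which gives the same values but calls the function once per element.

```python
import numpy as np
import pandas as pd


def add_tax(pay):
    """Hypothetical helper: accepts a scalar, a NumPy array, or a pandas Series."""
    return pay * 1.05 + 1


# Made-up data for illustration.
df_demo = pd.DataFrame({"TotalPay": [100.0, 200.0, 300.0]})

# Vectorized: the whole column is passed in and processed in one step.
df_demo["PayWithTax"] = add_tax(df_demo["TotalPay"])

# Row by row: .apply() calls add_tax once for every element.
row_by_row = df_demo["TotalPay"].apply(add_tax)

print(df_demo)
print(np.allclose(df_demo["PayWithTax"], row_by_row))
```

Writing the function so that it only uses operations that work on arrays keeps it usable in both styles.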

Comparing the two computation times shows the difference that vectorization makes. Real-world datasets can easily contain many millions of rows.

Computing this data with a non-vectorized approach is time-consuming. Thus, vectorized operations on Pandas data frames help make data analysis and manipulation fast.
