How to Vectorize a Function in Pandas
Vectorization means rewriting an operation so that it is applied to a whole array at once instead of to one element at a time. In Python, this speeds up data processing because the explicit loop is replaced by array operations that run in optimized, compiled code.
The Pandas library is a popular Python tool for data analysis and manipulation. Vectorization is commonly used with Pandas in numerical computing to improve code performance.
A Pandas data frame is a two-dimensional, labeled data structure built on top of NumPy arrays, offering functionality similar to both R data frames and Python dictionaries. It works like a Python dictionary of columns but adds data analysis and manipulation capabilities, with rows and columns like an Excel table or a database table.
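To make the dictionary comparison concrete, here is a minimal sketch. The column names and values are made up for illustration and are not taken from Salaries.csv.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column
data = {"Name": ["A", "B"], "TotalPay": [1000.0, 2000.0]}
df_example = pd.DataFrame(data)

# A column is looked up by its label, much like a dictionary key
total_pay = df_example["TotalPay"]

print(df_example.shape)
print(list(df_example.columns))
```

The data frame keeps the dictionary-style access by column label while also exposing rows and columns as a table.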
Vectorize a Function in Pandas
Let’s install the Python library pandas to work with data frames.
PS C:\> pip install pandas
To perform vectorization on a data frame, we first load it using the Python library pandas. Let’s run the code below to read a data frame from a CSV file and make it larger through concatenation.
Example Code (saved in demo.py):
import pandas as pd
small_df = pd.read_csv("Salaries.csv")
df = pd.concat([small_df] * 100, ignore_index=True)
Now run the code below to print the total number of rows in the data frame.
Example Code (saved in demo.py):
print(f"No of rows: {len(df)}")
OUTPUT (printed on console):
No of rows: 14865400
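If Salaries.csv is not at hand, the effect of the concatenation step can be checked on a small stand-in frame. This is only a sketch: the three-row small_df below is hypothetical and replaces the frame read from the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the frame loaded from Salaries.csv
small_df = pd.DataFrame({"TotalPay": [1000.0, 2000.0, 3000.0]})

# Repeat the frame 100 times; ignore_index=True renumbers the rows from 0
df = pd.concat([small_df] * 100, ignore_index=True)

print(f"No of rows: {len(df)}")
```

The row count grows by a factor of 100, and the index runs continuously instead of repeating 0, 1, 2.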
Let’s measure the time taken by an operation performed on the data frame without vectorization by running the code below.
Example Code (saved in demo.py):
import time
import numpy as np

start_time = time.process_time()
# Preallocate the result array, then fill it one element at a time
pay_with_tax = np.zeros(len(df))
for idx, pay in enumerate(df.TotalPay.values):
    pay_with_tax[idx] = pay * 1.05 + 1
end_time = time.process_time()
print("Without using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {1000 * (end_time - start_time)}ms")
The function np.zeros() takes len(df) as the size and creates an array of zeros of that size. The for loop iterates over the TotalPay column of the data frame, with idx as the position and pay as the value at that position.
It calculates the pay with tax for each pay and stores it at the matching position in pay_with_tax.
OUTPUT (printed on console):
Vectorization speeds up operations by processing many values per instruction, using SIMD (Single Instruction, Multiple Data) approaches where the hardware supports them. In Pandas, a batch API applies an operation to a whole column at once, which speeds it up without using loops.
Let’s run the code below, which uses vectorization, to measure the time taken to calculate pay_with_tax.
Example Code (saved in demo.py):
start_time = time.process_time()
pay_with_tax = df.TotalPay.values * 1.05 + 1
end_time = time.process_time()
print("Using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")
OUTPUT (printed on console):
You can also apply statistical and mathematical functions from the numpy library, such as mean and sqrt, with small changes to the above code.
Example Code (saved in demo.py):
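import numpy as np
# Non-vectorized: accumulate the sum in a Python loop, then divide
total = 0.0
for pay in df.TotalPay.values:
    total += pay
mean_pay = total / len(df)
# Vectorized: NumPy computes the mean over the whole array in one call
mean_pay = np.mean(df["TotalPay"].values)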
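For an element-wise function such as sqrt, the following minimal sketch may help. It assumes a small hypothetical Series named pay in place of the TotalPay column, and contrasts a NumPy call on the whole column with a per-element Python function passed to apply.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the TotalPay column
pay = pd.Series([100.0, 400.0, 900.0], name="TotalPay")

# Vectorized: the NumPy function receives the whole column at once
roots = np.sqrt(pay)

# Not vectorized: apply calls the Python lambda once per element
roots_per_element = pay.apply(lambda value: value ** 0.5)

# A reduction such as the mean is available directly on the Series
average = pay.mean()

print(roots.tolist())
print(average)
```

Both approaches give the same values here; the difference lies in how many Python-level function calls are made, which matters as the number of rows grows.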
Comparing the two computation times shows the difference in time consumption with and without vectorization. Industries deal with big data ranging from millions to trillions of rows.
Computing this data with a non-vectorized approach is time-consuming. Thus, vectorization in Pandas data frames helps in fast data analysis and manipulation.
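To reproduce the comparison without Salaries.csv, the sketch below runs both approaches on synthetic data and checks that they agree. The names demo_df, with_loop and vectorized are hypothetical, the random values only stand in for the TotalPay column, and the measured times will vary from machine to machine.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the TotalPay column (assumed data, not the real file)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"TotalPay": rng.uniform(0, 200_000, size=100_000)})


def with_loop(values):
    """Compute pay with tax one element at a time."""
    out = np.zeros(len(values))
    for idx, pay in enumerate(values):
        out[idx] = pay * 1.05 + 1
    return out


def vectorized(values):
    """Compute pay with tax on the whole array at once."""
    return values * 1.05 + 1


values = demo_df["TotalPay"].to_numpy()

t0 = time.perf_counter()
loop_result = with_loop(values)
t1 = time.perf_counter()
vec_result = vectorized(values)
t2 = time.perf_counter()

# Both approaches should produce the same numbers
same = np.allclose(loop_result, vec_result)

print(f"Loop:       {1000 * (t1 - t0):.2f}ms")
print(f"Vectorized: {1000 * (t2 - t1):.2f}ms")
print(f"Results match: {same}")
```

Here time.perf_counter() is used instead of time.process_time() because it measures elapsed wall-clock time at high resolution; either works for a rough comparison.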