Python Apriori Algorithm
- Explanation of the Apriori Algorithm
- Apriori Algorithm in Python
- Implement the Apriori Algorithm in Python
This tutorial will discuss the implementation of the apriori algorithm in Python.
Explanation of the Apriori Algorithm
The Apriori Algorithm is widely used for market basket analysis, i.e., to analyze which items are sold together with which other items. This is a useful algorithm for shop owners who want to increase their sales by placing items that are sold together close to each other or by offering discounts on them.
The algorithm relies on the property that if an itemset is frequent, all of its non-empty subsets must also be frequent. Let’s look at a small example to help illustrate this notion.
Let’s say that in our store, milk, butter, and bread are frequently sold together. This implies that the pairs (milk, butter), (milk, bread), and (butter, bread) are also frequently sold together, and so is each item on its own.
The Apriori Algorithm also states that the frequency of an itemset can never exceed the frequency of its non-empty subsets. We can further illustrate this by expanding a little more on our previous example.
In our store, milk, butter, and bread are sold together 3 times. This implies that each of its non-empty subsets, such as (milk, butter), (milk, bread), and (butter, bread), is sold together at least 3 times.
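The property described above can be checked on a handful of transactions. The following is a minimal sketch; the transactions and the helper name count() are made up for illustration and are not part of the dataset used later.

```python
from itertools import combinations

# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "bread"},
    {"butter", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


full = ("bread", "butter", "milk")
full_count = count(full, transactions)

# Count every non-empty proper subset of the full itemset.
subset_counts = {}
for size in range(1, len(full)):
    for subset in combinations(full, size):
        subset_counts[subset] = count(subset, transactions)

print(full, full_count)
for subset, n in subset_counts.items():
    print(subset, n)
```

Every subset count printed here is at least as large as the count of the full itemset, which is what the algorithm exploits to discard candidates early.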
Apriori Algorithm in Python
Before implementing this algorithm, we need to understand how the apriori algorithm works.
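At the start of the algorithm, we specify the support threshold. The support of an item is the fraction of transactions that contain it, i.e., the probability of the item occurring in a transaction. The support threshold is the minimum support an itemset needs to be considered frequent.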
$$
\text{Support}(A) = \frac{\text{Number of transactions containing item } A}{\text{Total number of transactions}}
$$
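The support formula translates directly into code. This is a minimal sketch in plain Python; the function name support() and the sample baskets are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical baskets, made up for illustration only.
baskets = [
    ["milk", "bread"],
    ["milk"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

milk_support = support(["milk"], baskets)
milk_bread_support = support(["milk", "bread"], baskets)
print(milk_support, milk_bread_support)
```

Here milk appears in 3 of 4 baskets and the pair (milk, bread) in 2 of 4, so with a support threshold of 0.6 only milk on its own would count as frequent.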
Apart from support, there are other measures like confidence and lift, but we don’t need to worry about those in this tutorial.
The steps we need to follow to implement the apriori algorithm are listed below.
- Our algorithm starts with just a 1-itemset. Here, 1 means the number of items in our itemset.
- It removes all the itemsets from our data that do not meet the minimum support requirement.
- Now, our algorithm increases the number of items (k) in our itemset and repeats steps 1 and 2 until the specified k is reached or there are no itemsets that meet the minimum support requirements.
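The steps above can be sketched in plain Python. This is an illustrative, unoptimized version and not how the apyori module is implemented; the function name frequent_itemsets(), its max_k parameter, and the sample transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_k=None):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Step 1: start with all 1-itemsets.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates and (max_k is None or k <= max_k):
        # Step 2: keep only the candidates that meet the minimum support.
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)

        # Step 3: join the surviving k-itemsets into (k+1)-itemsets and
        # prune any candidate that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy transactions, made up for illustration only.
sample = [
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "bread"},
    {"butter", "eggs"},
]
result = frequent_itemsets(sample, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.5, eggs is dropped in the first pass, so no larger itemset containing eggs is ever generated or counted.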
Implement the Apriori Algorithm in Python
To implement the Apriori Algorithm, we will be using the apyori module of Python. It is an external module, and hence we need to install it separately.
The pip command to install the apyori module is below.
pip install apyori
We’ll be using the Market Basket Optimization dataset from Kaggle.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
We have imported all the libraries required for our operations in the code given above. Now, we need to read the dataset using pandas.
This has been implemented in the following code snippet.
market_data = pd.read_csv("Market_Basket_Optimisation.csv", header=None)
Now, let’s check the total number of transactions in our dataset.
len(market_data)
Output:
7501
The output shows that we have 7501 records in our dataset. There is just one small problem with this data: the transactions are of variable length.
Given the real-world scenarios, this makes a lot of sense.
Because pandas reads the file into a rectangular table, shorter transactions are padded with missing values up to 20 columns. We convert each row of this table into a list of 20 strings before passing it to the algorithm. This has been implemented in the following code snippet.
transacts = []
for i in range(0, len(market_data)):
    # Convert every cell to a string; missing cells become the string "nan".
    transacts.append([str(market_data.values[i, j]) for j in range(0, 20)])
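In the above code, we initialized the list transacts and stored our transactions of length 20 in it. The issue here is that transactions with fewer than 20 items are padded with missing values, which str() turns into the string "nan".
The apyori module does not remove these placeholders; it counts "nan" like any other item. Because "nan" appears in almost every transaction, rules that involve it have a lift close to 1 and do not pass the lift threshold we set when generating the rules.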
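If you would rather not pass the padding to the algorithm at all, the placeholders can be filtered out first. The following is a minimal sketch, assuming that missing cells show up as the string "nan" after the str() conversion; the helper name drop_missing() and the sample rows are made up for illustration.

```python
def drop_missing(rows, placeholders=("nan", "")):
    """Return each transaction without its padding placeholders."""
    return [[item for item in row if item not in placeholders] for row in rows]


# Hypothetical rows shaped like the padded transacts list, shortened to 4 columns.
padded = [
    ["milk", "bread", "nan", "nan"],
    ["pasta", "escalope", "shrimp", "nan"],
    ["honey", "nan", "nan", "nan"],
]
cleaned = drop_missing(padded)
print(cleaned)
```

The cleaned transactions have different lengths again, which should not be a problem, since the algorithm only needs to know which items each transaction contains.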
We now generate association rules from our data with the apriori() function. This is demonstrated in the following code block.
rules = apriori(
    transactions=transacts,
    min_support=0.003,
    min_confidence=0.2,
    min_lift=3,
    min_length=2,
    max_length=2,
)
We passed our thresholds for minimum support, confidence, and lift to the function. The min_length and max_length arguments are both set to 2 because we only want pairs of items that were frequently sold together.
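Since confidence and lift are used as thresholds here, it helps to see how they relate to support. For a rule A → B, confidence is Support(A and B) / Support(A), and lift is that confidence divided by Support(B). The sketch below computes all three by hand; the function name rule_metrics() and the sample baskets are assumptions made for illustration.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    if count_a == 0 or count_b == 0:
        raise ValueError("both sides of the rule must occur at least once")
    support_ab = count_ab / n
    confidence = count_ab / count_a
    lift = confidence / (count_b / n)
    return support_ab, confidence, lift


# Hypothetical baskets, made up for illustration only.
baskets = [
    ["pasta", "escalope"],
    ["pasta"],
    ["milk"],
    ["milk"],
]
metrics = rule_metrics(["pasta"], ["escalope"], baskets)
print(metrics)
```

A lift above 1 means the two items appear together more often than they would if they were sold independently, which is why a high min_lift keeps only the strongest pairings.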
The association rules found by the apriori algorithm are produced by the rules generator object. We now need a mechanism to convert rules into a pandas dataframe.
The following code snippet shows a function inspect() that takes the list of results produced by the rules generator, which our apriori() call returned, and turns it into rows for a pandas dataframe.
def inspect(output):
    Left_Hand_Side = [tuple(result[2][0][0])[0] for result in output]
    support = [result[1] for result in output]
    confidence = [result[2][0][2] for result in output]
    lift = [result[2][0][3] for result in output]
    Right_Hand_Side = [tuple(result[2][0][1])[0] for result in output]
    return list(zip(Left_Hand_Side, support, confidence, lift, Right_Hand_Side))


output = list(rules)
output_data = pd.DataFrame(
    inspect(output),
    columns=["Left_Hand_Side", "Support", "Confidence", "Lift", "Right_Hand_Side"],
)
print(output_data)
Output:
         Left_Hand_Side   Support  Confidence      Lift  Right_Hand_Side
0           light cream  0.004533    0.290598  4.843951          chicken
1  mushroom cream sauce  0.005733    0.300699  3.790833         escalope
2                 pasta  0.005866    0.372881  4.700812         escalope
3         fromage blanc  0.003333    0.245098  5.164271            honey
4         herb & pepper  0.015998    0.323450  3.291994      ground beef
5          tomato sauce  0.005333    0.377358  3.840659      ground beef
6           light cream  0.003200    0.205128  3.114710        olive oil
7     whole wheat pasta  0.007999    0.271493  4.122410        olive oil
8                 pasta  0.005066    0.322034  4.506672           shrimp
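The inspect() function reads each result by position and keeps only the first rule and the first item on each side. A variation that reads the fields by name and keeps every rule could look like the sketch below. It assumes the field names of the apyori result records (items, support, ordered_statistics, and items_base, items_add, confidence, lift); the two namedtuples are stand-ins that mirror those names so the sketch runs without the library, and rules_to_rows() is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins mirroring the assumed field names of apyori's result records.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)


def rules_to_rows(records):
    """Flatten every rule of every record into one dictionary per rule."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append(
                {
                    "Left_Hand_Side": ", ".join(sorted(stat.items_base)),
                    "Support": record.support,
                    "Confidence": stat.confidence,
                    "Lift": stat.lift,
                    "Right_Hand_Side": ", ".join(sorted(stat.items_add)),
                }
            )
    return rows


# One record built from the first row of the output above.
sample = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004533,
        ordered_statistics=[
            OrderedStatistic(
                frozenset({"light cream"}), frozenset({"chicken"}), 0.290598, 4.843951
            )
        ],
    )
]
rows = rules_to_rows(sample)
print(rows)
```

A list of dictionaries like this can be passed straight to pd.DataFrame(), which uses the keys as column names.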
We can now sort the output_data dataframe by lift and display the top 5 records with the following code.
print(output_data.nlargest(n=5, columns="Lift"))
Output:
      Left_Hand_Side   Support  Confidence      Lift  Right_Hand_Side
3      fromage blanc  0.003333    0.245098  5.164271            honey
0        light cream  0.004533    0.290598  4.843951          chicken
2              pasta  0.005866    0.372881  4.700812         escalope
8              pasta  0.005066    0.322034  4.506672           shrimp
7  whole wheat pasta  0.007999    0.271493  4.122410        olive oil
Apriori is a very basic and simple algorithm for market basket analysis. It can provide helpful insights to increase sales of items in a market or a store.
The main disadvantage of this algorithm is that it takes a lot of memory for large datasets. This is because it generates a large number of candidate itemsets from the frequent items.
We also experienced this limitation as this tutorial was meant to work with the UCI online retail data set, but due to memory limitations, we had to change our dataset to the Market Basket Optimization dataset.
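To get a feel for why memory becomes a problem, we can count how many k-itemsets can be formed from a given number of distinct items. This is an upper bound before any pruning takes place; the figure of 100 items is a hypothetical value chosen for illustration, and candidate_count() is a made-up helper name.

```python
from math import comb


def candidate_count(n_items, k):
    """Number of distinct k-itemsets that can be formed from n_items items."""
    return comb(n_items, k)


# Hypothetical store with 100 distinct items.
counts = {k: candidate_count(100, k) for k in range(1, 5)}
for k, n in counts.items():
    print(k, n)
```

The count grows from 100 single items to almost four million 4-itemsets, which is why pruning by minimum support and limiting the itemset length matter so much in practice.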
Maisam is a highly skilled and motivated Data Scientist. He has over 4 years of experience with Python programming language. He loves solving complex problems and sharing his results on the internet.