How to Perform Stratified Sampling in Pandas
This tutorial shows how to perform stratified sampling on a data frame in pandas.
Stratified Sampling in Statistics
Stratified sampling is a strategy for obtaining samples representative of the population. Separating the population into homogeneous groupings called strata and randomly sampling data from each stratum decreases bias in sample selection.
In statistics, stratified sampling is used when the mean values of the strata differ. In machine learning, it is frequently used to construct test datasets for evaluating models, particularly when a dataset is large and imbalanced.
Perform Stratified Sampling in Pandas
The first step in performing stratified sampling is to import the Pandas library.
import pandas as pd
Let us now learn the steps involved in stratified sampling.
- Separate the population into strata. In this stage, the population is divided into strata based on shared traits, and each individual must belong to exactly one stratum.
- Determine the sample size. At this stage, we decide how many observations the overall sample should contain.
- Randomly sample each stratum. Random samples are drawn from each stratum using either disproportionate sampling, in which every stratum contributes the same number of observations regardless of its population size, or proportionate sampling, in which each stratum's sample size is proportional to its population size.
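The two allocation schemes in the last step can be sketched numerically. The stratum sizes and the total sample size below are hypothetical values chosen only for illustration, and rounding is just one simple way (an assumption here, not the only option) to turn proportional shares into whole counts.

```python
import pandas as pd

# Hypothetical stratum sizes and overall sample size (illustrative assumptions).
stratum_sizes = pd.Series({"A": 5, "B": 3, "C": 2})
total_sample = 6

# Disproportionate allocation: every stratum gets the same number of draws.
per_stratum = total_sample // len(stratum_sizes)
equal_alloc = pd.Series(per_stratum, index=stratum_sizes.index)

# Proportionate allocation: each stratum's population share times the sample
# size, rounded to whole observations. Rounded counts will not always add up
# to total_sample, so the sum is worth checking in practice.
shares = stratum_sizes / stratum_sizes.sum()
proportional_alloc = (shares * total_sample).round().astype(int)

print(equal_alloc.to_dict())         # {'A': 2, 'B': 2, 'C': 2}
print(proportional_alloc.to_dict())  # {'A': 3, 'B': 2, 'C': 1}
```

With these sizes, the larger stratum contributes more rows under proportionate allocation, while disproportionate allocation treats all strata alike.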
We will now work through an example and perform both disproportionate and proportionate stratified sampling. Out of 10 students, we will sample 6 based on their grades.
Let us first create a sample data frame to work on. It has 4 columns: Name, ID, Grade, and Category.
We will create this data frame using the code below.
students = {
    "Name": [
        "sanay",
        "shivesh",
        "rutwik",
        "preet",
        "yash",
        "mann",
        "pritesh",
        "hritesh",
        "raj",
        "tarun",
    ],
    "ID": ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
    "Category": [2, 3, 1, 3, 2, 3, 3, 1, 2, 1],
}
df = pd.DataFrame(students)
print(df)
Output:
      Name   ID Grade  Category
0    sanay  001     A         2
1  shivesh  002     A         3
2   rutwik  003     C         1
3    preet  004     B         3
4     yash  005     B         2
5     mann  006     B         3
6  pritesh  007     C         3
7  hritesh  008     A         1
8      raj  009     A         2
9    tarun  010     A         1
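Before sampling, it can help to check how the students are distributed across grades, since the grade groups are the strata here. The following is a minimal sketch that rebuilds only the Grade column of the data frame above and computes each grade's share with value_counts; the variable names grades and grade_shares are illustrative.

```python
import pandas as pd

# Grade column copied from the students data frame above.
grades = pd.Series(["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"], name="Grade")

# normalize=True returns each grade's fraction of the rows instead of raw counts.
grade_shares = grades.value_counts(normalize=True).sort_index()
print(grade_shares)
```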
Note that 50 percent of the students are in grade A, 30 percent are in grade B, and 20 percent are in grade C. We will now perform disproportionate sampling, creating a sample of 6 students.
For disproportionate sampling, group the students by grade (A, B, or C), then use the sample function to randomly draw 2 students from each group. The code below does this. Because the sampling is random, the selected rows will vary from run to run.
df.groupby("Grade", group_keys=False).apply(lambda x: x.sample(2))
Output:
      Name   ID Grade  Category
0    sanay  001     A         2
7  hritesh  008     A         1
5     mann  006     B         3
4     yash  005     B         2
2   rutwik  003     C         1
6  pritesh  007     C         3
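As an alternative sketch, pandas 1.1 and later also provide a sample method directly on grouped data, which avoids the apply call. The example below assumes such a version is installed. It rebuilds a reduced copy of the data frame (Name and Grade only) so that it runs on its own, and the random_state value of 42 is an arbitrary choice that makes the draw repeatable.

```python
import pandas as pd

# Reduced copy of the students data frame (Name and Grade only) for a standalone run.
df = pd.DataFrame({
    "Name": ["sanay", "shivesh", "rutwik", "preet", "yash",
             "mann", "pritesh", "hritesh", "raj", "tarun"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
})

# Draw exactly 2 rows from every grade; random_state fixes the random draw.
equal_sample = df.groupby("Grade").sample(n=2, random_state=42)
print(equal_sample)
```

Passing the same random_state again should return the same 6 rows, which is useful when a sample needs to be reproduced later.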
For proportionate sampling, group the students by grade (A, B, or C) using Pandas groupby(), then take a random sample from each group in proportion to its size. The overall sample size is 60% of the population (frac=0.6), so each group contributes 60% of its members, rounded to the nearest whole number.
We perform this using the code below.
df.groupby("Grade", group_keys=False).apply(lambda x: x.sample(frac=0.6))
Output:
      Name   ID Grade  Category
7  hritesh  008     A         1
9    tarun  010     A         1
0    sanay  001     A         2
3    preet  004     B         3
5     mann  006     B         3
6  pritesh  007     C         3
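The per-grade counts of a proportionate sample can be checked with a short sketch. It assumes pandas 1.1 or later, where grouped data has its own sample method that accepts frac. It uses a reduced copy of the data frame (Name and Grade only) and an arbitrary random_state of 42 so the result is repeatable.

```python
import pandas as pd

# Reduced copy of the students data frame (Name and Grade only) for a standalone run.
df = pd.DataFrame({
    "Name": ["sanay", "shivesh", "rutwik", "preet", "yash",
             "mann", "pritesh", "hritesh", "raj", "tarun"],
    "Grade": ["A", "A", "C", "B", "B", "B", "C", "A", "A", "A"],
})

# Take 60% of each grade group; group sizes 5, 3, and 2 give 3, 2, and 1 rows.
prop_sample = df.groupby("Grade").sample(frac=0.6, random_state=42)

# Count how many sampled students fall in each grade.
counts = prop_sample["Grade"].value_counts().sort_index()
print(counts.to_dict())  # {'A': 3, 'B': 2, 'C': 1}
```

The counts of 3, 2, and 1 mirror the 50/30/20 percent split of grades in the full data frame.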
Using the approaches above, we can perform both proportionate and disproportionate stratified sampling on a data frame in Pandas.