Pandas cut() vs qcut() Functions
Binning continuous numeric data into buckets is frequently useful as a first step for further analysis. Binning is also called bucketing, discrete binning, discretization, or quantization.
Pandas cut() Function
The Pandas cut() function sorts the elements of an array into bins. It is mainly used to turn a continuous numeric variable into a categorical one for statistical analysis.
Syntax:
pandas.cut(
    x,
    bins,
    right=True,
    labels=None,
    retbins=False,
    precision=3,
    include_lowest=False,
    duplicates="raise",
)
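As a minimal sketch of how the less obvious parameters behave, the snippet below passes labels to name the bins and retbins=True to get the edges back. The ages, bin edges, and label names are made up for illustration and are not part of the example used later in this article.

```python
import pandas as pd

# Hypothetical data: five ages, chosen only for illustration
ages = [5, 17, 30, 65, 80]

# Three bins need four edges; labels must have one entry per bin
age_groups, edges = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
    retbins=True,  # also return the bin edges that were used
)

# 65 lands in "adult" because right edges are inclusive by default
print(list(age_groups))
print(edges)
```

With labels=None, the bins would instead be shown as intervals such as (0, 18].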
Pandas qcut() Function
According to the Pandas documentation, qcut() is a quantile-based discretization function. In other words, qcut() tries to create bins that hold roughly equal numbers of values. Rather than taking fixed numerical edges, it derives the bin edges from quantiles, so the edges depend on how the data is distributed.
Syntax:
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates="raise")
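The following sketch shows two of these parameters in use, on made-up data that is not taken from this article's example: labels names the quantile bins, and duplicates="drop" handles the case where repeated values make two quantile edges identical.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 8
values = pd.Series(range(1, 9))

# q=4 asks for quartiles; labels names the four resulting bins
quartiles = pd.qcut(values, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles.tolist())

# Heavily repeated values can make several quantile edges identical.
# duplicates="raise" (the default) then fails with a ValueError,
# while duplicates="drop" keeps only the unique edges.
repeated = pd.Series([1, 1, 1, 1, 2, 3])
merged = pd.qcut(repeated, q=4, duplicates="drop")
merged_counts = merged.value_counts().sort_index()
print(merged_counts)
```

Note that dropping duplicate edges leaves fewer bins than requested, so the bins are no longer equally filled.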
Difference Between cut() and qcut() Functions
The key distinction between cut() and qcut() is, in short, what you control. Use qcut() when you want the items distributed roughly equally across your bins, and use cut() when you want to define your own numeric bin ranges.
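As a quick illustration of this distinction, the sketch below bins the same skewed data both ways. The data is hypothetical and chosen so that one outlier stretches the value range.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one large outlier
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): two bins of equal width, so the outlier sits alone in the upper bin
equal_width = pd.cut(skewed, bins=2).value_counts().sort_index()

# qcut(): two bins split at the median, so each bin gets half of the values
equal_count = pd.qcut(skewed, q=2).value_counts().sort_index()

print(equal_width.tolist())
print(equal_count.tolist())
```

Equal-width bins leave the counts lopsided when the data is skewed, while quantile bins balance the counts at the cost of unequal widths.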
We are going to learn this difference in the example given below.
Code Example:
# import libraries
import numpy as np
import pandas as pd
# create a data frame
df = pd.DataFrame(
    {
        "column_x": np.random.randint(1, 50, size=50),
        "column_y": np.random.randint(20, 100, size=50),
        "column_z": np.random.random(size=50).round(2),
    }
)
df.head()
Output:
   column_x  column_y  column_z
0         6        68      0.70
1        30        83      0.50
2        35        64      0.41
3        28        98      0.73
4         5        24      0.79
The first 2 columns hold integers in the ranges 1 to 49 and 20 to 99, respectively, because the upper bound of np.random.randint is exclusive. The third column holds floats between 0 and 1. All values are randomly generated with numpy routines, so your output will differ from run to run.
When bins is an integer, the cut() function divides the entire value range into that many bins, and each bin covers a range of the same width. The first column (column_x) holds integers between 1 and 49. Let's check this column's highest and lowest values first.
Code Example:
df.column_x.max(), df.column_x.min()
Output:
(49, 3)
If we divide this column's range into 5 equal parts, for instance, each bin will be 9.2 wide:
$$
(49 - 3) / 5 = 9.2
$$
This binning process is carried out by the cut() function, which places each value in the appropriate bin.
Code Example:
df["column_x_binned"] = pd.cut(df.column_x, bins=5)
df.column_x_binned.value_counts()
Output:
(21.4, 30.6] 16
(39.8, 49.0] 14
(12.2, 21.4] 8
(30.6, 39.8] 6
(2.954, 12.2] 6
As you can see, every bin is 9.2 wide, except for the lowest one. Because the left edge of each interval is exclusive, the lowest bin's lower bound must be slightly less than the minimum value, 3, for that value to be included. Pandas therefore extends the range by 0.1% on that side, which gives 2.954.
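The edge arithmetic can be checked with a short sketch. The column here is a hypothetical stand-in: only its minimum (3) and maximum (49) match the run shown above, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in column: only the minimum (3) and maximum (49) match the article's run
column_x = pd.Series([3, 10, 17, 25, 33, 41, 49])

# Equal-width edges computed by hand: 5 bins need 6 edges
width = (column_x.max() - column_x.min()) / 5
manual_edges = np.linspace(column_x.min(), column_x.max(), 6)

# Edges that pd.cut actually used, returned because retbins=True
_, cut_edges = pd.cut(column_x, bins=5, retbins=True)

print(width)
print(manual_edges)
print(cut_edges)  # same as manual_edges, except for a slightly lower first edge
```

Only the first edge differs between the two arrays, by 0.1% of the range (0.046 here).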
You can also choose the bin ranges yourself by specifying the bin boundaries manually. The bins argument accepts the edge values as a list.
Code Example:
pd.cut(df.column_x, bins=[0, 10, 40, 50]).value_counts()
Output:
(10, 40] 33
(40, 50] 13
(0, 10] 4
By default, the right edges are inclusive. However, this can be modified.
Code Example:
pd.cut(df.column_x, bins=[0, 10, 40, 50], right=False).value_counts()
Output:
[10, 40) 33
[40, 50) 13
[0, 10) 4
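The effect of right is easiest to see with values that sit exactly on a bin edge. The sketch below uses two such hypothetical values with the same edge list as above.

```python
import pandas as pd

# Hypothetical values that sit exactly on the bin edges
on_edges = pd.Series([10, 40])
edges = [0, 10, 40, 50]

# Default: right edges are inclusive, giving intervals like (0, 10]
right_closed = pd.cut(on_edges, bins=edges)

# right=False: left edges are inclusive, giving intervals like [0, 10)
left_closed = pd.cut(on_edges, bins=edges, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

Each edge value moves up by one bin when right=False, since it now belongs to the interval that starts at it rather than the one that ends at it.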
With the cut() function, you have no control over how many values fall into each bin. You can only define the bin edges.
This is where the qcut() function comes in. It divides the values into buckets so that each bucket holds roughly the same number of values.
Code Example:
pd.qcut(df.column_x, q=4).value_counts()
Output:
(40.75, 49.0] 13
(19.5, 25.0] 13
(2.999, 19.5] 13
(25.0, 40.75] 11
Each of our 4 buckets holds approximately the same number of values. When there are four buckets, they are also known as quartiles.
The first quartile contains one-fourth of all the values, the first two quartiles together contain fifty percent, and so on.
We do not control the bin edges with the qcut() function. They are automatically calculated.
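To see where the automatically calculated edges come from, the sketch below compares the edges returned by qcut() with quantiles computed directly from the data. The values are hypothetical, and the comparison assumes the default linear interpolation of Series.quantile.

```python
import numpy as np
import pandas as pd

# Hypothetical data; any numeric Series without heavy repetition would do
values = pd.Series([3, 7, 8, 12, 15, 21, 22, 30, 41, 49])

# Ask qcut for quartiles and get back the edges it chose
_, qcut_edges = pd.qcut(values, q=4, retbins=True)

# The same edges, computed directly as quantiles of the data
quantile_edges = values.quantile([0, 0.25, 0.5, 0.75, 1]).to_numpy()

print(qcut_edges)
print(quantile_edges)
```

The printed intervals show a first edge slightly below the minimum so that the smallest value is included, but the returned edges are the quantiles themselves.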
Consider a column that contains 40 values (40 rows), and we wish to have 4 buckets. The upper edge of the first bucket will be chosen so that it contains 10 values, starting from the smallest value.
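This 40-row case can be sketched directly. The column below is hypothetical: it simply holds the 40 distinct integers from 1 to 40.

```python
import numpy as np
import pandas as pd

# Hypothetical column with 40 distinct values: 1, 2, ..., 40
forty = pd.Series(np.arange(1, 41))

buckets, edges = pd.qcut(forty, q=4, retbins=True)
counts = buckets.value_counts().sort_index()

print(counts.tolist())  # number of values per bucket, lowest bucket first
print(edges[1])         # upper edge of the first bucket
```

The upper edge of the first bucket falls between the 10th and 11th smallest values, which is what puts exactly 10 values into that bucket.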
Conclusion
A set of continuous values can be transformed into a discrete or categorical variable using either the cut() or qcut() function.
The cut() function is concerned with the value range of the bins. The whole range is the difference between the smallest and largest values.
This range is then divided into the desired number of bins. By default, every bin has the same width, which is the distance between its lower and upper edges.
The qcut() function focuses on the number of values in each bin. The values are sorted from smallest to largest, and the bin edges are chosen so that each bin holds roughly the same number of values.
Zeeshan is a detail-oriented software engineer who helps companies and individuals make their lives easier with software solutions.