Bin Data Using SciPy, NumPy and Pandas in Python
With the exponential growth of data and use cases, data binning or categorizing becomes necessary to make sense of this data.
Binning sits alongside other techniques for summarizing data, such as clustering or more classical statistical techniques like regression analysis.
We will see why you need data binning and which technique is best suited for which context.
Binning in Python
Binning is a simple and widely used technique for exploring how variables relate to one another.
Binning is a non-parametric and highly flexible technique where the values of a variable are categorized into different sets to reveal patterns and trends. It is applicable to a wide variety of data sets, including small samples, as long as each bin still holds enough points to be meaningful.
Binning is a process of grouping data into bins. It can be done for various purposes, such as to group data points by range, group data points by density, or group data points by similarity.
There are various ways to bin data in Python, such as the numpy.digitize() function, the pandas.cut() function, and the scipy.stats.binned_statistic() function.
Every method has pros and cons, so choosing the suitable method for the task is essential.
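As a quick orientation, here is a minimal sketch that applies all three functions to the same small array. The sample values and bin edges are made up for illustration and are not part of the examples later in this article.

```python
import numpy as np
import pandas as pd
from scipy import stats

values = np.array([1.0, 2.5, 4.0, 6.5, 9.0])  # hypothetical sample data
edges = np.array([0.0, 5.0, 10.0])            # three edges, so two bins split at 5

# numpy.digitize returns the 1-based index of the bin each value falls into
idx = np.digitize(values, edges)

# pandas.cut returns a Categorical of intervals; right=False makes them left-closed
cats = pd.cut(values, bins=edges, right=False)

# scipy.stats.binned_statistic computes a statistic (here the mean) per bin
result = stats.binned_statistic(values, values, statistic="mean", bins=edges)

print(idx)               # bin index per value
print(cats.codes)        # 0-based bin code per value
print(result.statistic)  # mean of the values in each bin
```

Roughly speaking, digitize is the lowest-level option (it returns only indices), cut attaches readable interval labels, and binned_statistic bins and aggregates in one call.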
Importance of Data Binning
Data binning is a simple concept: classifying data for more straightforward analysis. For example, you might have several large data tables in a CSV, and you want to break the data into smaller chunks.
Data binning allows you to put the data into different groups so you can better analyze it, and we can also use it to create pretty visualizations.
So, why is data binning necessary? First, data binning is essential because it helps you analyze your data better. For example, you can split an entire data table into smaller chunks that are easier to understand or visualize.
Data binning can help you find patterns in the data and make it easier to identify outliers. It allows you to take a massive data set and make it more manageable to get to the meat of the problem.
Data binning is a process of subdividing a continuous variable into discrete bins. As a rough example, if you have a patient’s temperature variable, you can bin the temperature into five bins (say, < 36.5, 36.5–37.5, 37.5–38.5, 38.5–39.5 and > 39.5).
The advantage is that you can visualize the variable in a histogram or box plot using the bin ranges.
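The temperature example above can be sketched with pandas.cut. The readings below are hypothetical, and the choice to put a value that sits exactly on an edge into the higher bin (right=False) is an assumption, since the text does not say how boundaries are handled.

```python
import numpy as np
import pandas as pd

temps = pd.Series([36.2, 36.9, 37.8, 39.0, 40.1])  # hypothetical readings
edges = [-np.inf, 36.5, 37.5, 38.5, 39.5, np.inf]  # open-ended first and last bins
labels = ["< 36.5", "36.5–37.5", "37.5–38.5", "38.5–39.5", "> 39.5"]

# right=False: each bin includes its lower edge and excludes its upper edge
binned = pd.cut(temps, bins=edges, labels=labels, right=False)

# count how many readings fall into each bin, in bin order
print(binned.value_counts(sort=False))
```

Using -np.inf and np.inf as the outer edges keeps readings outside the expected range from being silently turned into missing values.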
Different Ways to Bin Data in Python
There are several ways to bin data in Python, but using the SciPy and NumPy libraries is arguably the most efficient.
Use SciPy and NumPy to Bin Data in Python
To start with SciPy and NumPy, let’s say you have a list of data points you want to bin. The first step is to import the SciPy and NumPy libraries:
import numpy as np
import scipy as sp
Next, you’ll need to define the edges of the bins. It can be done using the linspace function (num_bins bins require num_bins + 1 edges):
bin_edges = np.linspace(start, stop, num=num_bins + 1)
Where start and stop are the minimum and maximum values of the data, respectively, and num_bins is the number of bins you want to create. Finally, you can use the histogram function to bin the data. It lives in NumPy; the top-level sp.histogram was only a deprecated alias for it:
binned_data = np.histogram(data, bin_edges)
The binned_data variable will now contain a tuple with two elements. The first element is an array with the number of data points in each bin, and the second is an array of the bin edges.
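Put together, the steps above might look like the following minimal sketch. The data values are hypothetical, and the sketch calls NumPy's np.histogram directly and passes num_bins + 1 to linspace, because N bins need N + 1 edges.

```python
import numpy as np

data = np.array([0.5, 1.5, 1.7, 2.2, 3.9, 4.0])  # hypothetical data points
num_bins = 4

# num_bins bins need num_bins + 1 evenly spaced edges
bin_edges = np.linspace(data.min(), data.max(), num=num_bins + 1)

# np.histogram returns (counts per bin, bin edges)
counts, edges = np.histogram(data, bins=bin_edges)

print(counts)  # number of data points in each bin
print(edges)   # the edges that were passed in
```

Every bin is half-open on the right except the last one, which also includes its upper edge, so the maximum value is counted.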
Use NumPy to Bin Data in Python
Code Example:
# import the NumPy library
import numpy
# generate 100 random values in [0, 1) and define 10 bin edges (9 bins)
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
# assign each value to a bin, then compute the mean of each bin
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
# display the per-bin means (as the last expression in a notebook or REPL)
bin_means
Output:
[0.05308461260140375,
0.16559348769870028,
0.28950800899648155,
0.3874228665181473,
0.5046647094141071,
0.6254841134474202,
0.7216935463408317,
0.8374773268113803,
0.9421576008815353]
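To make the indices returned by numpy.digitize more concrete, here is a small sketch with a few hand-picked values (hypothetical, not taken from the random data above). It uses the same ten edges and the default right=False.

```python
import numpy as np

bins = np.linspace(0, 1, 10)                      # 10 edges, so 9 bins
samples = np.array([0.0, 0.05, 0.5, 0.999, 1.0])  # hypothetical values

# With the default right=False, index i means bins[i-1] <= x < bins[i]
idx = np.digitize(samples, bins)
print(idx)

# A value equal to the last edge gets index len(bins), one past the last bin
print(idx[-1] == len(bins))
```

Because numpy.random.random draws from [0, 1), the value 1.0 never occurs in the example above, which is why looping over range(1, len(bins)) covers every data point there.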
Use Pandas to Bin Data in Python
Code Example:
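# import libraries
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100) + 10})
# bin the data frame by "a" using 10 evenly spaced edges (9 bins)
# note: intervals are right-closed, so the row holding the minimum of "a"
# falls outside the first bin and is left out of the groups
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins), observed=False)
# get the mean of "b" for the rows that fall into each bin
print(groups.mean().b)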
Output:
a
(0.00762, 0.117] 10.576639
(0.117, 0.226] 10.319629
(0.226, 0.335] 10.633805
(0.335, 0.444] 10.404979
(0.444, 0.553] 10.551616
(0.553, 0.662] 10.420306
(0.662, 0.771] 10.434091
(0.771, 0.88] 10.402038
(0.88, 0.989] 10.537547
Name: b, dtype: float64
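One detail worth checking when the edges run from the minimum to the maximum of the data: pandas.cut builds right-closed intervals by default, so the smallest value is not assigned to any bin. A minimal sketch with hypothetical values, using the include_lowest parameter of pandas.cut:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.4, 0.6, 0.9])      # hypothetical values
bins = np.linspace(s.min(), s.max(), 3)  # 3 edges, so 2 bins

# Default: intervals are (left, right], so the minimum is left unbinned
default_cut = pd.cut(s, bins)

# include_lowest=True makes the first interval keep its left edge as well
inclusive_cut = pd.cut(s, bins, include_lowest=True)

print(default_cut.isna().sum())    # values that fell outside every bin
print(inclusive_cut.isna().sum())
```

Whether dropping that single row matters depends on the analysis, but it is easy to miss because groupby skips missing group keys without raising an error.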
Use SciPy to Bin Data in Python
Code Example:
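# import libraries
import numpy as np
from scipy import stats
# define the array of positions that decides which bin each point falls into
arr = [20, 2, 7, 1, 34]
print("Array =", arr)
# split the range of arr into 4 equal-width bins and take the median of the
# values np.arange(5) = [0, 1, 2, 3, 4] that land in each bin
print("Binned statistics for median")
print(
    stats.binned_statistic(arr, np.arange(5), statistic="median", bins=4),
)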
Output:
Array = [20, 2, 7, 1, 34]
Binned statistics for median
BinnedStatisticResult(statistic=array([ 2., nan, 0., 4.]), bin_edges=array([ 1. , 9.25, 17.5 , 25.75, 34. ]), binnumber=array([3, 1, 1, 1, 4], dtype=int64))
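To read the result above, it may help to recompute the medians by hand from the binnumber field. This is a sketch for illustration only; the names x, values, members and manual are introduced here and are not part of the SciPy API.

```python
import numpy as np
from scipy import stats

x = [20, 2, 7, 1, 34]  # positions that decide the bin (same as arr above)
values = np.arange(5)  # values that get summarized: [0, 1, 2, 3, 4]

result = stats.binned_statistic(x, values, statistic="median", bins=4)

# binnumber holds the 1-based bin of each x; recompute each median manually
for b in range(1, 5):
    members = values[result.binnumber == b]
    manual = np.median(members) if members.size else np.nan
    print(b, members, manual)
```

Bin 1 collects the values 1, 2 and 3 (for x = 2, 7 and 1), so its median is 2. Bin 2 is empty and yields nan, and bins 3 and 4 each hold a single value.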
Zeeshan is a detail-oriented software engineer who helps companies and individuals make their lives easier with software solutions.