SciPy stats.zscore Function
- The scipy.stats.zscore Function
- Calculating the z-score for a One-Dimensional Array in Python
- Calculating the z-score for a Multi-Dimensional Array in Python
- Calculating the z-score for a Pandas DataFrame in Python
The z-score is a statistical measure that tells how many standard deviations a particular value lies from the mean. It is calculated with the following formula.
z = (X – μ) / σ
In which,
- X is a particular value from the data
- μ is the mean value
- σ is the standard deviation
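As a quick check of the formula, the sketch below applies it by hand with NumPy and compares the result with scipy.stats.zscore. The sample values and the variable names (values, mu, sigma, manual_z) are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
import scipy.stats as stats

# Illustrative sample data (an assumption, chosen so the numbers are easy to follow)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()    # mean of the data
sigma = values.std()  # population standard deviation (NumPy's default, ddof=0)

# z = (X - mu) / sigma, applied to every element at once
manual_z = (values - mu) / sigma

# scipy.stats.zscore uses ddof=0 by default, so the two results should agree
scipy_z = stats.zscore(values)

print(manual_z)
print(np.allclose(manual_z, scipy_z))
```

Here the mean is 5 and the standard deviation is 2, so the value 9 sits two standard deviations above the mean.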
This tutorial will show how to calculate the z-score of any data in Python using the SciPy library.
The scipy.stats.zscore Function
The scipy.stats.zscore function of the SciPy library computes the z-score of each value in the input data, relative to the data's mean and standard deviation. Its signature is scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate').
The parameters of the scipy.stats.zscore function are as follows.
Parameter | Description
a (array) | An array-like object containing the raw input data.
axis (int or None) | The axis along which the function computes the z-score. The default is 0, meaning the first axis (each column of a two-dimensional array). If None, the function computes over the whole array.
ddof (int) | The degrees-of-freedom correction used when computing the standard deviation. The default is 0.
nan_policy | Decides how NaN values in the input are handled. It takes one of three string values: 'propagate' (the default) returns NaN, 'raise' throws an error, and 'omit' ignores the NaN values when computing the mean and standard deviation. With 'omit', NaN inputs still appear as NaN in the output, but they do not affect the z-scores calculated for the other values.
All parameters except a are optional, so you do not need to pass them every time you call the scipy.stats.zscore function.
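To make the optional parameters more concrete, here is a minimal sketch of how ddof and nan_policy change the result. The arrays sample and with_nan are made-up illustration data, and the variable names are assumptions rather than anything from the original tutorial.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data (an assumption, not from the tutorial)
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# ddof=0 (the default) divides by N when computing the standard deviation;
# ddof=1 divides by N - 1, which gives slightly smaller z-scores in magnitude.
z_population = stats.zscore(sample)
z_sample = stats.zscore(sample, ddof=1)
print(z_population)
print(z_sample)

with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# 'propagate' (the default): the NaN contaminates the mean and standard
# deviation, so every z-score comes back as NaN.
z_propagate = stats.zscore(with_nan, nan_policy="propagate")
print(z_propagate)

# 'omit': the NaN is skipped when computing the mean and standard deviation;
# it stays NaN in the output, but the other values get valid z-scores.
z_omit = stats.zscore(with_nan, nan_policy="omit")
print(z_omit)

# 'raise': a ValueError is raised as soon as a NaN is found.
try:
    stats.zscore(with_nan, nan_policy="raise")
except ValueError as err:
    print("nan_policy='raise' failed with:", err)
```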
Now, let us use the scipy.stats.zscore function on a one-dimensional array, a multi-dimensional array, and a Pandas DataFrame.
Calculating the z-score for a One-Dimensional Array in Python
import numpy as np
import scipy.stats as stats
input_data = np.array([5, 10, 20, 35, 25, 22, 19, 19, 50, 45, 62])
stats.zscore(input_data)
Output:
array([-1.3916106 , -1.09379511, -0.49816411, 0.39528239, -0.20034861,
-0.37903791, -0.55772721, -0.55772721, 1.28872889, 0.99091339,
2.00348608])
Each z-score tells how many standard deviations its corresponding value lies from the mean. A negative sign means the value is that many standard deviations below the mean, and a positive sign means it is that many standard deviations above the mean. A z-score of 0 means the value is equal to the mean.
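The sign interpretation can be checked directly in code. The sketch below reuses the one-dimensional array from this section and splits it by the sign of each z-score; the names below_mean, above_mean, and recovered are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

input_data = np.array([5, 10, 20, 35, 25, 22, 19, 19, 50, 45, 62])
z = stats.zscore(input_data)

# Negative z-scores mark values below the mean, positive ones values above it
below_mean = input_data[z < 0]
above_mean = input_data[z > 0]
print("mean:", input_data.mean())
print("below the mean:", below_mean)
print("above the mean:", above_mean)

# Inverting the formula (X = z * sigma + mu) recovers the original values
recovered = z * input_data.std() + input_data.mean()
print(np.allclose(recovered, input_data))
```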
Calculating the z-score for a Multi-Dimensional Array in Python
import numpy as np
import scipy.stats as stats
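input_data = np.array(
    [[5, 10, 20, 35], [25, 22, 19, 19], [50, 45, 62, 28], [24, 45, 15, 30]]
)
stats.zscore(input_data)
The result is a 4x4 array with the same shape as the input. Because axis defaults to 0, each column is standardized independently using its own mean and standard deviation. For example, the first column [5, 25, 50, 24] has a mean of 26 and a standard deviation of about 15.98, so its z-scores are approximately -1.31, -0.06, 1.50, and -0.13.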
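For a two-dimensional array, the axis parameter decides which values share a mean and standard deviation. The following sketch compares the three options on a 4x4 array; the names matrix, z_cols, z_rows, and z_all are assumptions introduced for illustration.

```python
import numpy as np
import scipy.stats as stats

matrix = np.array(
    [[5, 10, 20, 35], [25, 22, 19, 19], [50, 45, 62, 28], [24, 45, 15, 30]]
)

# axis=0 (the default): each column is standardized on its own
z_cols = stats.zscore(matrix, axis=0)

# axis=1: each row is standardized on its own
z_rows = stats.zscore(matrix, axis=1)

# axis=None: one mean and one standard deviation for the whole array
z_all = stats.zscore(matrix, axis=None)

# After standardizing, the mean along the chosen axis is (numerically) zero
print(np.allclose(z_cols.mean(axis=0), 0))
print(np.allclose(z_rows.mean(axis=1), 0))
print(np.allclose(z_all.mean(), 0))
print(z_cols.shape, z_rows.shape, z_all.shape)
```

In every case the output keeps the shape of the input; only the grouping used for the mean and standard deviation changes.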
Calculating the z-score for a Pandas DataFrame in Python
In this example, we use the randint() function from NumPy's random module (np.random.randint) to generate an array of random integers, and then wrap that array in a Pandas DataFrame. Because the values are random, the numbers you get will differ from those shown here.
import pandas as pd
import numpy as np
import scipy.stats as stats
input_data = pd.DataFrame(
np.random.randint(0, 30, size=(4, 4)), columns=["W", "X", "Y", "Z"]
)
print(input_data)
Output:
    W   X   Y   Z
0   7   9   2  15
1  11  23  15  28
2  28  11  25   2
3  11  19  14  15
input_data.apply(stats.zscore)
Output:
          W         X         Y         Z
0 -0.894534 -1.135815 -1.471534  0.000000
1 -0.400998  1.310556  0.122628  1.414214
2  1.696529 -0.786334  1.348907 -1.414214
3 -0.400998  0.611593  0.000000  0.000000
Note that the apply() function of the Pandas library is used here to calculate the z-scores for the DataFrame. It calls the function passed as its argument, here stats.zscore, once per column by default, so each value is standardized using the mean and standard deviation of its own column.
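To confirm that the standardization happens column by column, the sketch below rebuilds the DataFrame printed above with fixed values and compares apply(stats.zscore) against the formula written out with Pandas operations. The names df, z_apply, and z_manual are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Fixed copy of the random DataFrame shown above, so the result is reproducible
df = pd.DataFrame(
    {
        "W": [7, 11, 28, 11],
        "X": [9, 23, 11, 19],
        "Y": [2, 15, 25, 14],
        "Z": [15, 28, 2, 15],
    }
)

# apply() passes each column to stats.zscore in turn
z_apply = df.apply(stats.zscore)

# The same calculation by hand. Pandas' std() defaults to ddof=1,
# so ddof=0 is passed explicitly to match the default of stats.zscore.
z_manual = (df - df.mean()) / df.std(ddof=0)

print(z_apply)
print(np.allclose(z_apply.to_numpy(), z_manual.to_numpy()))
```

The ddof difference is worth keeping in mind: calling df.std() without arguments would give slightly different numbers from stats.zscore.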
Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.