SciPy scipy.stats.pearsonr() Method
- Syntax of scipy.stats.pearsonr()
- Example Codes: scipy.stats.pearsonr() Method to Find Correlation Coefficient
- Example Codes: Using scipy.stats.pearsonr() Method to Find Correlation Between Variables within a CSV File
The Python SciPy scipy.stats.pearsonr() method computes the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables. It also returns the p-value for testing non-correlation.
The value of the Pearson correlation coefficient ranges from -1 to +1. If it is near -1, there is a strong negative linear relationship between the variables. If it is 0, there is no linear relationship, and if it is near +1, there is a strong positive linear relationship between the variables.
A positive relationship indicates that as one variable's value increases, the other's value also tends to increase. A negative relationship indicates that as one variable's value increases, the other's value tends to decrease.
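To make the range of the coefficient concrete, here is a minimal sketch that computes it directly from its textbook definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The helper name pearson_r and the sample lists are hypothetical and only for illustration; this is not how SciPy implements the method internally.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each observation from its mean
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        # The coefficient is undefined when one input is constant
        raise ValueError("correlation is undefined for constant input")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(ss_x * ss_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # y rises with x along a straight line
print(pearson_r(x, [10, 8, 6, 4, 2]))  # y falls with x along a straight line
print(pearson_r(x, [2, 1, 3, 1, 2]))   # no linear trend in y
```

The three calls give values at (or, up to floating-point rounding, extremely close to) +1, -1, and 0 respectively, matching the three cases described above.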
Syntax of scipy.stats.pearsonr()
scipy.stats.pearsonr(x, y)
Parameters
x | The input array holding the values of the first variable or attribute. |
y | The input array holding the values of the second variable or attribute. Its length must be equal to that of x. |
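The equal-length requirement can be checked with a short sketch. The sample lists are made up for illustration, and the exact wording of the error message is not shown because it may differ between SciPy versions.

```python
from scipy import stats

x = [1, 2, 3, 4]
y_short = [2, 4, 6]  # one element fewer than x

try:
    stats.pearsonr(x, y_short)
    mismatch_rejected = False
except ValueError as exc:
    # pearsonr() refuses inputs whose lengths differ
    mismatch_rejected = True
    print("pearsonr rejected the input:", exc)
```

Because each element of x is paired with the element of y at the same position, a mismatch usually means some observations are missing and should be handled before calling the method.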
Return
It returns a tuple of two values:
- r: The Pearson correlation coefficient. It shows the strength and direction of the linear relationship between x and y.
- p-value: The probability of observing a correlation at least as extreme as r if the null hypothesis were true. It is used to decide whether or not to reject the null hypothesis.
The null hypothesis states that there is no linear relationship between the variables under consideration.
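As a small sketch of how the returned pair can be consumed, the result may be unpacked into two names or indexed by position. The sample data is hypothetical. Recent SciPy releases also expose the two values as named attributes on the result object, but since that depends on the installed version, only unpacking and indexing are used here.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

result = stats.pearsonr(x, y)

# Unpack the pair into the coefficient and the p-value
r, p = result

# The same values are also available by position
r_by_index = result[0]
p_by_index = result[1]

print("r:", r)
print("p-value:", p)
```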
Example Codes: scipy.stats.pearsonr() Method to Find Correlation Coefficient
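# Import the stats submodule, which provides pearsonr()
from scipy import stats

# Two samples with the same number of elements
arr1 = [3, 6, 9, 12]
arr2 = [12, 10, 11, 11]

# pearsonr() returns the correlation coefficient and the p-value
r, p = stats.pearsonr(arr1, arr2)
print("The pearson correlation coefficient is:", r)
print("The p-value is:", p)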
Output:
The pearson correlation coefficient is: -0.31622776601683794
The p-value is: 0.683772233983162
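Here, two arrays with an equal number of elements are passed as arguments to the pearsonr function. The correlation coefficient is negative because the values in the second array tend to decrease slightly while the values in the first array increase linearly. Its magnitude (about 0.32) indicates only a weak linear relationship.
Since the p-value (0.683772233983162) is greater than the commonly used significance level of 0.05, we fail to reject the null hypothesis. This does not prove that the null hypothesis is true; it only means that these four observations give no significant evidence of a linear relationship.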
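The decision step can be written out explicitly. This is a minimal sketch that reuses the two arrays from the example; the threshold alpha = 0.05 is a conventional choice assumed here, not something the method prescribes.

```python
from scipy import stats

arr1 = [3, 6, 9, 12]
arr2 = [12, 10, 11, 11]
alpha = 0.05  # assumed significance level; choose it to suit the study

r, p = stats.pearsonr(arr1, arr2)

# Compare the p-value with the chosen significance level
if p < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"r = {r:.4f}, p = {p:.4f}: {decision}")
# prints: r = -0.3162, p = 0.6838: fail to reject the null hypothesis
```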
Example Codes: Using scipy.stats.pearsonr() Method to Find Correlation Between Variables within a CSV File
import pandas as pd
from scipy import stats

# Load the car data into a pandas DataFrame
data = pd.read_csv("dataset.csv")

# Keep only the two columns of interest and drop rows with missing values
newdata = data[["price", "mileage"]].dropna()

r, p = stats.pearsonr(newdata["price"], newdata["mileage"])
print("The pearson correlation coefficient between price and mileage is:", r)
print("The p-value is:", p)
Output:
The pearson correlation coefficient between price and mileage is: -0.4008381863293672
The p-value is: 4.251481046096957e-97
Here, we use the pandas library to read the dataset.csv file into a pandas DataFrame. The file contains car data with the columns name, price, mileage, brand, and year of manufacture. We then select only the price and mileage columns and use the dropna() method to remove any rows with missing values, so that we can check the strength of the relationship between these two variables.
On analyzing the output value, we can see that the Pearson correlation coefficient is negative (about -0.40), meaning price and mileage have a moderate negative linear relationship. Cars with a lower price tend to have higher mileage, and as the price increases, the mileage tends to decrease.
Since the p-value is extremely small (approximately 0), the null hypothesis of no correlation is rejected, and the correlation is statistically significant.
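Since dataset.csv is not reproduced here, the following sketch illustrates the column selection and dropna() steps on a small, made-up DataFrame. All names and values in it are hypothetical stand-ins and are not taken from the original file.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for dataset.csv, with some missing values
cars = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E", "F"],
        "price": [5000, 7000, np.nan, 12000, 15000, 20000],
        "mileage": [22.0, 20.5, 19.0, np.nan, 16.0, 14.5],
    }
)

# Selecting columns keeps only price and mileage
subset = cars[["price", "mileage"]]

# dropna() removes the rows in which either value is missing
clean = subset.dropna()
print(len(subset), "rows before dropna(),", len(clean), "rows after")

r, p = stats.pearsonr(clean["price"], clean["mileage"])
print("r:", r)
print("p-value:", p)
```

Dropping incomplete rows before the call keeps the two columns aligned, so every price is still paired with the mileage of the same car.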