Pandas DataFrame.sample() Function
- Syntax of pandas.DataFrame.sample()
- Example Codes: DataFrame.sample()
- Example Codes: DataFrame.sample() to Extract the Columns
- Example Codes: DataFrame.sample() to Generate a Fraction of Data
- Example Codes: DataFrame.sample() to Oversample the DataFrame
- Example Codes: DataFrame.sample() With weights
The Python Pandas DataFrame.sample() function returns a random sample of rows or columns from a DataFrame. The sample can contain one or more rows or columns.
Syntax of pandas.DataFrame.sample()
DataFrame.sample(
n=None, frac=None, replace=False, weights=None, random_state=None, axis=None
)
Parameters
n | It is an integer. It represents the number of rows or columns to be selected at random from the DataFrame. It cannot be used together with frac.
frac | It is a float value. It specifies the fraction of rows or columns to be extracted at random from the DataFrame. For example, frac=0.45 means that the selected rows or columns will make up 45% of the original data.
replace | It is a boolean value. If it is set to True, the function samples with replacement, so the same row or column can be selected more than once.
weights | It is a string or an array-like structure. When the function is called on a DataFrame and the axis is 0, it accepts the name of a column. Rows with greater values in the weights column are more likely to be returned in the sample.
random_state | It is an integer or a numpy.random.RandomState object. If it is an integer, it seeds the random number generator, so the function returns the same sample on every run. If it is a numpy.random.RandomState object, that object is used as the random number generator.
axis | It is an integer or a string. It specifies the target axis, either rows or columns. It can be 0 or index for rows, and 1 or columns for columns.
Return
It returns a Series or a DataFrame. The returned object has the same type as the caller and contains n items selected randomly from the original object.
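To illustrate the return type, here is a minimal sketch that samples both a DataFrame and one of its columns. The data mirrors the examples in this article; the variable names row_sample and name_sample are only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Sampling a DataFrame gives back a DataFrame.
row_sample = dataframe.sample(n=2)
print(type(row_sample).__name__)

# Sampling a single column (a Series) gives back a Series.
name_sample = dataframe["Name"].sample(n=2)
print(type(name_sample).__name__)
```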
Example Codes: DataFrame.sample()
By default, the function returns a sample containing rows, i.e., axis=0.
import pandas as pd
dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
print(dataframe)
Our DataFrame is as below.
Attendance Name Obtained Marks
0 60 Olivia 56
1 100 John 75
2 80 Laura 82
3 75 Ben 64
4 95 Kevin 67
All the parameters of this function are optional. If we execute this function without passing any parameter, it returns a single random row as an output.
import pandas as pd
dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample()
print(dataframe1)
Output1:
Attendance Name Obtained Marks
3 75 Ben 64
Output2:
Attendance Name Obtained Marks
4 95 Kevin 67
Output1 and Output2 show the results of running the same program twice. Each run generates a new random sample of rows from the given DataFrame.
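When a repeatable result is needed, the random_state parameter can fix the sample. The sketch below assumes an arbitrary seed of 42; any integer would work the same way.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# The same integer seed produces the same sample on every call.
first = dataframe.sample(n=2, random_state=42)
second = dataframe.sample(n=2, random_state=42)

print(first.equals(second))  # True
```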
Example Codes: DataFrame.sample() to Extract the Columns
To sample columns instead of rows, we set the axis parameter to 1.
import pandas as pd
dataframe = pd.DataFrame(
{
"Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
"Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
"Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
}
)
dataframe1 = dataframe.sample(n=1, axis=1)
print(dataframe1)
Output:
Name
0 Olivia
1 John
2 Laura
3 Ben
4 Kevin
The function has generated a sample of a single column as an output. The number of columns was set by the parameter n=1.
Example Codes: DataFrame.sample() to Generate a Fraction of Data
import pandas as pd
dataframe = pd.DataFrame(
{
"Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
"Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
"Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
}
)
dataframe1 = dataframe.sample(frac=0.5)
print(dataframe1)
Output:
Attendance Name Obtained Marks
3 75 Ben 64
4 95 Kevin 67
1 100 John 75
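The returned sample is roughly 50% of the original data. The number of rows returned is frac multiplied by the number of rows in the DataFrame, rounded to a whole number.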
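A common use of frac is shuffling: with frac=1 every row is returned once, in random order. The following is a small sketch of that idea, not the only way to shuffle; the seed 0 and the call to reset_index() are illustrative choices.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# frac=0.4 on 5 rows selects 0.4 * 5 = 2 rows.
subset = dataframe.sample(frac=0.4)
print(len(subset))  # 2

# frac=1 returns every row exactly once in random order (a shuffle).
shuffled = dataframe.sample(frac=1, random_state=0)

# Optionally discard the old index labels after shuffling.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```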
Example Codes: DataFrame.sample() to Oversample the DataFrame
If frac>1, the parameter replace must be True so that the same row can be sampled more than once; otherwise, the function raises a ValueError.
import pandas as pd
dataframe = pd.DataFrame(
{
"Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
"Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
"Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
}
)
dataframe1 = dataframe.sample(frac=1.5, replace=True)
print(dataframe1)
Output:
Attendance Name Obtained Marks
3 75 Ben 64
0 60 Olivia 56
1 100 John 75
2 80 Laura 82
1 100 John 75
2 80 Laura 82
0 60 Olivia 56
4 95 Kevin 67
If replace is set to False while frac is larger than 1, the function raises a ValueError.
import pandas as pd
dataframe = pd.DataFrame(
{
"Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
"Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
"Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
}
)
dataframe1 = dataframe.sample(frac=1.5, replace=False)
print(dataframe1)
Output:
Traceback (most recent call last):
File "..\test.py", line 6, in <module>
dataframe1 = dataframe.sample(frac=1.5, replace=False)
File "..\lib\site-packages\pandas\core\generic.py", line 5044, in sample
raise ValueError(
ValueError: Replace has to be set to `True` when upsampling the population `frac` > 1.
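If the value of frac is not known in advance, the error can be caught and handled. The sketch below is one possible approach, assuming that falling back to sampling with replacement is acceptable for the use case.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

try:
    # frac > 1 without replacement is not possible, so this raises.
    oversampled = dataframe.sample(frac=1.5, replace=False)
except ValueError as error:
    print(f"Sampling failed: {error}")
    # Fall back to sampling with replacement (an assumed policy).
    oversampled = dataframe.sample(frac=1.5, replace=True)

print(len(oversampled) > len(dataframe))  # True
```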
Example Codes: DataFrame.sample() With weights
import pandas as pd
dataframe = pd.DataFrame(
{
"Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
"Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
"Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
}
)
dataframe1 = dataframe.sample(n=2, weights="Attendance")
print(dataframe1)
Output:
Attendance Name Obtained Marks
1 100 John 75
4 95 Kevin 67
Here, the rows with greater values in the Attendance column are more likely to be selected in the returned sample. Rows with smaller weights can still appear, only less often.
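The effect of weights is easier to see over many draws. The sketch below draws a large sample with replacement and counts how often each name appears; the sample size of 10000 and the seed 0 are arbitrary choices for illustration.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Draw many rows with replacement, weighted by the Attendance column.
draws = dataframe.sample(
    n=10000, replace=True, weights="Attendance", random_state=0
)

# Count how often each name was drawn; higher attendance should
# correspond to a higher count, though exact numbers depend on the seed.
counts = draws["Name"].value_counts()
print(counts)
```

In this setup, John (weight 100) should be drawn noticeably more often than Olivia (weight 60), while every name still appears.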