Pandas DataFrame DataFrame.sample() Function

  1. Syntax of pandas.DataFrame.sample()
  2. Example Codes: DataFrame.sample()
  3. Example Codes: DataFrame.sample() to Extract the Columns
  4. Example Codes: DataFrame.sample() to Generate a Fraction of Data
  5. Example Codes: DataFrame.sample() to Oversample the DataFrame
  6. Example Codes: DataFrame.sample() With weights
The Python Pandas DataFrame.sample() function returns a random sample of rows or columns from a DataFrame. The sample can contain one or more rows or columns.

Syntax of pandas.DataFrame.sample()

DataFrame.sample(
    n=None, frac=None, replace=False, weights=None, random_state=None, axis=None
)

Parameters

n It is an integer. It is the number of rows or columns to select at random from the DataFrame. It defaults to 1 when frac is not given.
frac It is a float value. It specifies the fraction of rows or columns to extract at random from the DataFrame. For example, frac=0.45 means that the selected rows or columns make up about 45% of the original data. It cannot be used together with n.
replace It is a boolean value. If it is set to True, sampling is done with replacement, so the same row or column can be selected more than once.
weights It is a string or an array-like structure. When the function is called on a DataFrame with axis=0, it can be the name of a column. Rows with larger values in the weights column are more likely to be returned in the sample. By default, all rows have the same probability.
random_state It is an integer or a numpy.random.RandomState object. It seeds the random number generator. If the same integer is passed, the function returns the same sample on every run.
axis It is an integer or a string. It specifies the axis to sample from: 0 or index for rows, and 1 or columns for columns.
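
The effect of random_state is easiest to see by calling the function twice with the same seed. The snippet below is a minimal sketch; the seed value 42 and the variable names first and second are chosen only for illustration.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Passing the same integer seed makes both calls draw the same rows.
first = dataframe.sample(n=2, random_state=42)
second = dataframe.sample(n=2, random_state=42)

print(first.equals(second))  # True
```

Without random_state, two calls are free to return different rows.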

Return

It returns a Series or a DataFrame. The returned object has the same type as the caller and contains n items selected randomly from the original object.
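
As a quick check of the return type, the sketch below samples once from a DataFrame and once from a single column, which is a Series. The variable names rows and marks are illustrative only.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Sampling a DataFrame gives a DataFrame.
rows = dataframe.sample(n=2)
# Sampling a single column (a Series) gives a Series.
marks = dataframe["Obtained Marks"].sample(n=2)

print(type(rows).__name__)   # DataFrame
print(type(marks).__name__)  # Series
```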

Example Codes: DataFrame.sample()

By default, the function returns a sample containing rows, i.e. axis=0.

import pandas as pd

dataframe = pd.DataFrame({'Attendance': {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
                          'Name': {0: 'Olivia', 1: 'John', 2: 'Laura', 3: 'Ben', 4: 'Kevin'},
                          'Obtained Marks': {0: 56, 1: 75, 2: 82, 3: 64, 4: 67}})
print(dataframe)

Our DataFrame is as below.

   Attendance    Name  Obtained Marks
0          60  Olivia              56
1         100    John              75
2          80   Laura              82
3          75     Ben              64
4          95   Kevin              67

All the parameters of this function are optional. If we execute this function without passing any parameter, it returns a single random row as an output.

import pandas as pd

dataframe = pd.DataFrame({'Attendance': {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
                          'Name': {0: 'Olivia', 1: 'John', 2: 'Laura', 3: 'Ben', 4: 'Kevin'},
                          'Obtained Marks': {0: 56, 1: 75, 2: 82, 3: 64, 4: 67}})
dataframe1 = dataframe.sample()
print(dataframe1)

Output1:

   Attendance Name  Obtained Marks
3          75  Ben              64

Output2:

   Attendance   Name  Obtained Marks
4          95  Kevin              67

Output1 and Output2 show the results of running the same program twice. Each run draws a new random sample, so the returned row can differ between runs.
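
To draw more than one row, pass the parameter n. The following is a minimal sketch on the same data; the value n=3 is chosen only for illustration. Because replace is False by default, no row appears twice in the result.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Draw three distinct rows; which rows come back changes from run to run.
dataframe1 = dataframe.sample(n=3)
print(dataframe1)
print(dataframe1.index.is_unique)  # True
```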

Example Codes: DataFrame.sample() to Extract the Columns

To sample columns instead of rows, we simply set axis to 1.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(n=1, axis=1)
print(dataframe1)

Output:

     Name
0  Olivia
1    John
2   Laura
3     Ben
4   Kevin

The function has generated a sample of a single column as an output. The number of columns was set by the parameter n=1.
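
The axis can also be given by name. The sketch below assumes we want two random columns and uses axis="columns", which is equivalent to axis=1.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Pick two of the three columns at random; every row is kept.
dataframe1 = dataframe.sample(n=2, axis="columns")
print(dataframe1.shape)  # (5, 2)
```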

Example Codes: DataFrame.sample() to Generate a Fraction of Data

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=0.5)
print(dataframe1)

Output:

   Attendance   Name  Obtained Marks
3          75    Ben              64
4          95  Kevin              67
1         100   John              75

The returned sample is about 50% of the original data. Since the DataFrame has 5 rows, the number of sampled rows is rounded to a whole number.
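
Two further uses of frac are sketched below, under the assumption that the same five-row data is used: frac=0.4 selects two of the five rows, and frac=1 returns every row in random order, which is a common way to shuffle a DataFrame. The names subset and shuffled are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# 40% of 5 rows is exactly 2 rows.
subset = dataframe.sample(frac=0.4)
print(len(subset))  # 2

# frac=1 keeps all rows but returns them in a random order.
shuffled = dataframe.sample(frac=1)
print(len(shuffled))  # 5
```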

Example Codes: DataFrame.sample() to Oversample the DataFrame

If frac>1, the parameter replace must be True so that the same row can be sampled more than once; otherwise, the function raises a ValueError.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=1.5, replace=True)
print(dataframe1)

Output:

   Attendance    Name  Obtained Marks
3          75     Ben              64
0          60  Olivia              56
1         100    John              75
2          80   Laura              82
1         100    John              75
2          80   Laura              82
0          60  Olivia              56
4          95   Kevin              67

If replace is set to False while frac is larger than 1, then the function raises a ValueError.

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(frac=1.5, replace=False)
print(dataframe1)

Output:

Traceback (most recent call last):
  File "..\test.py", line 6, in <module>
    dataframe1 = dataframe.sample(frac=1.5, replace=False)
  File "..\lib\site-packages\pandas\core\generic.py", line 5044, in sample
    raise ValueError(
ValueError: Replace has to be set to `True` when upsampling the population `frac` > 1.
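
If the value of frac is not known in advance, the error can be caught instead of letting the program stop. This is only a sketch; the variable error_message is a hypothetical name used to show that the exception was raised.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

error_message = None
try:
    # Asking for more rows than exist without replacement is not allowed.
    dataframe1 = dataframe.sample(frac=1.5, replace=False)
except ValueError as error:
    error_message = str(error)

print(error_message)
```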

Example Codes: DataFrame.sample() With weights

import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": {0: 60, 1: 100, 2: 80, 3: 75, 4: 95},
        "Name": {0: "Olivia", 1: "John", 2: "Laura", 3: "Ben", 4: "Kevin"},
        "Obtained Marks": {0: 56, 1: 75, 2: 82, 3: 64, 4: 67},
    }
)
dataframe1 = dataframe.sample(n=2, weights="Attendance")
print(dataframe1)

Output:

   Attendance   Name  Obtained Marks
1         100   John              75
4          95  Kevin              67

Here, rows with greater values in the Attendance column are more likely to be selected in the returned sample. The selection is still random, so other rows can also appear.
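
The weights can also be passed as a Series that shares the DataFrame's index instead of a column name. The sketch below uses a made-up Series named row_weights in which three rows have weight 0; rows with zero weight are never drawn, so the sample always consists of rows 1 and 4.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "Attendance": [60, 100, 80, 75, 95],
        "Name": ["Olivia", "John", "Laura", "Ben", "Kevin"],
        "Obtained Marks": [56, 75, 82, 64, 67],
    }
)

# Illustrative weights: only rows 1 and 4 have a non-zero chance.
row_weights = pd.Series([0, 1, 0, 0, 1], index=dataframe.index)

dataframe1 = dataframe.sample(n=2, weights=row_weights)
print(sorted(dataframe1.index))  # [1, 4]
```

The weights do not need to sum to 1; they are normalized before sampling.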
