How to Mask in Pandas
Pandas is a data analysis library for Python. Many companies and organizations that need high-quality data analysis use it on a large scale.
A data analyst must decide whether to use Pandas based on the kind of data at hand. Pandas is highly recommended when the data comes from a SQL table or a spreadsheet, or has heterogeneously typed columns.
The data can be ordered or unordered, and time-series data is also supported. In this tutorial, let us understand how to mask data in Pandas.
Masking is essentially a way to filter data based on one or more conditions. The mask itself is a Boolean object, typically a Series, that holds True or False for each row depending on whether the row satisfies the condition.
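To make the idea of a Boolean mask concrete, here is a minimal sketch. The Series named scores and its values are hypothetical and used only for illustration; they are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical toy Series, used only for illustration
scores = pd.Series([10, 25, 40], index=["a", "b", "c"])

# Comparing a Series to a scalar yields a Boolean Series: the mask
mask = scores > 20

# Indexing with the mask keeps only the rows where the mask is True
selected = scores[mask]

print(mask.tolist())      # [False, True, True]
print(selected.tolist())  # [25, 40]
```

The mask has the same index as the data it was built from, which is how Pandas knows which rows to keep.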
Use dates_data to Create a Dummy Data Frame in Pandas
Masking can be understood as an advanced if-else scheme for a data frame: a row is kept if it satisfies the condition and left out otherwise. Before applying a mask, we will first create a dummy data frame named dates_data with a date-based index.
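import pandas as pd
# Build an index of 100 timestamps spaced 30 minutes apart
index = pd.date_range("2013-1-1", periods=100, freq="30Min")
dates_data = pd.DataFrame(data=list(range(100)), columns=["value"], index=index)
dates_data["value2"] = "Alpha"
# Label the first 10 rows as Beta; a single .loc call avoids chained assignment
dates_data.loc[dates_data.index[:10], "value2"] = "Beta"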
The code block creates a data frame whose rows are indexed by dates, with two columns named value and value2. To view the entries in the data, we use the following code:
print(dates_data)
Output:
value value2
2013-01-01 00:00:00 0 Beta
2013-01-01 00:30:00 1 Beta
2013-01-01 01:00:00 2 Beta
2013-01-01 01:30:00 3 Beta
2013-01-01 02:00:00 4 Beta
... ... ...
2013-01-02 23:30:00 95 Alpha
2013-01-03 00:00:00 96 Alpha
2013-01-03 00:30:00 97 Alpha
2013-01-03 01:00:00 98 Alpha
2013-01-03 01:30:00 99 Alpha
As we can see, we have 100 entries spaced at equal intervals of 30 minutes each.
The data frame has two columns: value holds numbers, and value2 holds either Alpha or Beta.
Use Masking to Filter Data in Pandas
Masking is a technique in Pandas where the analyst filters data based on a particular condition.
It is possible to filter the data based on one or more conditions. We will explore each of these cases in detail here.
Let us begin by filtering the data so that we fetch only the entries of our data frame dates_data whose value2 column equals Beta.
mask = dates_data["value2"] == "Beta"
print(dates_data[mask])
Output:
value value2
2013-01-01 00:00:00 0 Beta
2013-01-01 00:30:00 1 Beta
2013-01-01 01:00:00 2 Beta
2013-01-01 01:30:00 3 Beta
2013-01-01 02:00:00 4 Beta
2013-01-01 02:30:00 5 Beta
2013-01-01 03:00:00 6 Beta
2013-01-01 03:30:00 7 Beta
2013-01-01 04:00:00 8 Beta
2013-01-01 04:30:00 9 Beta
We now have only the entries whose value2 column in the dates_data data frame is Beta.
In this way, we can create a mask and then superimpose that mask on our data to filter data. This mask can also be understood as a stencil to filter out certain data.
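The stencil idea can be sketched as follows, assuming the same dates_data data frame as above (rebuilt here so the example runs on its own). The names beta_rows and alpha_rows are illustrative. The sketch applies the mask through .loc and then inverts it with the ~ operator to select the complementary rows.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this example is self-contained
index = pd.date_range("2013-1-1", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": list(range(100)), "value2": "Alpha"}, index=index)
dates_data.loc[dates_data.index[:10], "value2"] = "Beta"

mask = dates_data["value2"] == "Beta"

# .loc with a Boolean mask keeps the rows where the mask is True
beta_rows = dates_data.loc[mask]

# ~ flips every True/False, so the inverted mask selects everything else
alpha_rows = dates_data.loc[~mask]

print(len(beta_rows), len(alpha_rows))  # 10 90
```

Writing dates_data.loc[mask] is equivalent to dates_data[mask] for row selection; the .loc form is often preferred when the result will later be assigned to.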
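Next, we will filter on two conditions at once: a certain range of values from the value column and only the Beta value from the value2 column in the dates_data data frame. The conditions are combined with the & operator, and each one must be wrapped in parentheses.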
mask = (dates_data["value2"] == "Beta") & (dates_data["value"] > 3)
print(dates_data[mask])
Output:
value value2
2013-01-01 02:00:00 4 Beta
2013-01-01 02:30:00 5 Beta
2013-01-01 03:00:00 6 Beta
2013-01-01 03:30:00 7 Beta
2013-01-01 04:00:00 8 Beta
2013-01-01 04:30:00 9 Beta
As we can see in the output above, we have successfully filtered the data so that only rows with a value greater than 3 in the value column and Beta in the value2 column remain.
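Conditions can also be combined in other ways. The sketch below, which again assumes the dates_data data frame from this tutorial, uses | for "or" and ~ for "not". The variable names mask_or and mask_not are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this example is self-contained
index = pd.date_range("2013-1-1", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": list(range(100)), "value2": "Alpha"}, index=index)
dates_data.loc[dates_data.index[:10], "value2"] = "Beta"

# OR: rows that are Beta, or whose value is at least 98
mask_or = (dates_data["value2"] == "Beta") | (dates_data["value"] >= 98)

# NOT combined with AND: rows that are not Beta and whose value is below 12
mask_not = ~(dates_data["value2"] == "Beta") & (dates_data["value"] < 12)

print(len(dates_data[mask_or]))              # 12
print(dates_data[mask_not]["value"].tolist())  # [10, 11]
```

The parentheses matter because &, | and ~ bind more tightly than comparison operators such as == and > in Python; without them the expression is evaluated in the wrong order.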
Therefore, with the help of the masking technique in Pandas, we can efficiently filter data according to our requirements, based on one or more conditions.
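The if-else analogy mentioned earlier also applies to replacing values rather than dropping rows. As a hedged sketch, Pandas provides the Series.mask and Series.where methods for this; the Series named values below is a hypothetical example and not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical toy Series, used only for illustration
values = pd.Series([1, 5, 8])

# mask: replace entries where the condition is True, keep the rest
replaced = values.mask(values > 4, 0)

# where: keep entries where the condition is True, replace the rest
kept = values.where(values > 4, 0)

print(replaced.tolist())  # [1, 0, 0]
print(kept.tolist())      # [0, 5, 8]
```

Unlike Boolean indexing, both methods return an object with the same shape as the input, which is useful when the row count must be preserved.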