How to Reshape Pandas DataFrame From Wide to Long
- Wide Data Format vs. Long Data Format in Pandas
- Uses of the Wide Data Format
- Uses of the Long Data Format
-
Use
pd.melt()
to Reshape Pandas DataFrame From Wide to Long Format -
Use
pd.unstack()
to Reshape Pandas DataFrame From Wide to Long Format -
Use
pd.wide_to_long()
to Reshape Pandas DataFrame From Wide to Long Format
Reshaping the Pandas dataframe is one of the most used data-wrangling tasks in data analysis. It is also addressed as transporting, unpivoting/pivoting a table from wide to long.
In this tutorial, we will learn the difference between wide and long data formats, which will lead to their uses, followed by different code examples demonstrating how to reshape the Pandas dataframe from wide to long.
Wide Data Format vs. Long Data Format in Pandas
We can have a dataset in two formats - either it would be wide or long. The primary difference between wide and long data formats is given below.
- Wide Data Format - The values in the first column do not repeat.
- Long Data Format - The values in the first column do repeat.
Let’s take the example data frames containing lab
and theory
exam marks for groups A
, B
, C
, and D
to understand both formats clearly.
As we can see, both data frames have the same information but in different formats.
When to use which data format to represent the dataset? It depends on what we want to do with the data.
Uses of the Wide Data Format
The wide data format is used to record real-world data because it is easy to understand for our brains. We also use this data format if we are analyzing data.
Let’s take the previous example of having marks for lab
and theory
exams.
If we want to calculate the average of lab
and theory
exams per group, it would be easier to keep data in the wide format because it would be easy to read lab
and theory
values for each group in the same record (row).
See the following screenshot.
Uses of the Long Data Format
We primarily use long data format when we visualize multiple variables in one plot using some statistical tool, for instance, R programming language.
We must have to convert wide to long data format to let the software create the plot, for instance, plot multiple columns, create heatmap, etc. Sometimes, we also need to reshape datasets for data wrangling using Python.
Use pd.melt()
to Reshape Pandas DataFrame From Wide to Long Format
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"Groups": ["A", "B", "C", "D"],
"lab": [25, 21, 14, 22],
"theory": [60, 55, 49, 69],
}
)
print("Wide Data Frame:\n")
print(df)
df = pd.melt(df, id_vars="Groups", value_vars=["lab", "theory"])
print("\n\nLong Data Frame:\n")
print(df)
OUTPUT:
Wide Data Frame:
Groups lab theory
0 A 25 60
1 B 21 55
2 C 14 49
3 D 22 69
Long Data Frame:
Groups variable value
0 A lab 25
1 B lab 21
2 C lab 14
3 D lab 22
4 A theory 60
5 B theory 55
6 C theory 49
7 D theory 69
Here, we have a data frame with three columns, Groups
, lab
, and theory
in a wide data format which we convert to a long data format using the pd.melt()
function.
pd.melt()
is used to unpivot a dataframe from wide to long data format (optionally, leaving the identifiers set). It transforms a dataframe into a format where one or multiple columns/fields are identifier variables (id_vars
).
Excluding the id_vars
, all other columns are considered measured variables (value_vars
). These are unpivoted to the row axis, leaving two non-identifier columns (variable
and value
).
Use pd.unstack()
to Reshape Pandas DataFrame From Wide to Long Format
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"Groups": ["A", "B", "C", "D"],
"lab": [25, 21, 14, 22],
"theory": [60, 55, 49, 69],
}
)
print("Wide Data Frame:\n")
print(df)
df = df.unstack()
print("\n\nLong Data Frame:\n")
print(df)
OUTPUT:
Wide Data Frame:
Groups lab theory
0 A 25 60
1 B 21 55
2 C 14 49
3 D 22 69
Long Data Frame:
Groups 0 A
1 B
2 C
3 D
lab 0 25
1 21
2 14
3 22
theory 0 60
1 55
2 49
3 69
dtype: object
This example uses the same data frame having three columns, Groups
, lab
, and theory,
but here, we are using DataFrame.unstack
to reshape the data frame from wide to long data format.
It pivots the level of (necessarily hierarchical) index labels and returns a dataframe containing a new level of field/column labels whose inner-most level has the pivoted index labels.
Remember, we will get a series as an output if an index is not the MultiIndex
. Also, if flexible and fast solution is required for a particular problem then, prefer pd.melt()
instead of DataFrame.unstack()
.
Use pd.wide_to_long()
to Reshape Pandas DataFrame From Wide to Long Format
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"Groups": ["A", "B", "C", "D"],
"lab1": [25, 21, 14, 22],
"lab2": [25, 21, 14, 22],
"theory1": [60, 55, 49, 69],
"theory2": [60, 55, 49, 69],
}
)
print("Wide Data Frame:\n")
print(df)
df = pd.wide_to_long(df, stubnames=["lab", "theory"], i="Groups", j="Exams")
print("\n\nLong Data Frame:\n")
print(df)
OUTPUT:
Wide Data Frame:
Groups lab1 lab2 theory1 theory2
0 A 25 25 60 60
1 B 21 21 55 55
2 C 14 14 49 49
3 D 22 22 69 69
Long Data Frame:
Groups Exams lab theory
A 1 25 60
B 1 21 55
C 1 14 49
D 1 22 69
A 2 25 60
B 2 21 55
C 2 14 49
D 2 22 69
Here, we have a dataframe with lab1
, lab2
, theory1
, and theory2
marks for four groups: A
, B
, C
, and D
. Before understanding the pd.wide_to_long()
function, let’s look at the following visual to understand how the wide data format is reshaped into the long data format.
The pd.wide_to_long()
functions in a very particular manner, it actually uses pd.melt()
function under the hood. It accepts four mandatory parameters we will understand below, but the essential thing is how the column names are formed and displayed.
See the following column format for wide_to_long()
function.
We are passing four parameters to the pd.wide_to_long()
function to reshape the data frame from wide to long data format. The four parameters that we used are listed below:
df
- This is the dataframe we want to reshape.stubnames
- We can also say group names (prefixes) that need to be grouped. In our case, these arelab
andtheory
.i
- It is the identifier variable(s) that is not supposed to be stacked.j
- The column’s name that contains the suffixes, or we can say the labels of columns.
The optional parameters are sep
(separator) and suffix
. You can read more about pd.melt()
, pd.wide_to_long()
, and DataFrame.unstack()
here.