How to Split a Pandas DataFrame Column by a Delimiter
The ability to easily parse the provided data and create new columns containing more specific information is valuable. One such scenario is when we split a Pandas dataframe column by a delimiter and make multiple columns from it.
For example, we have a column containing values separated by a delimiter like A, D, G
, and we want to split these values into multiple columns holding A
, D
, and G
separately.
This tutorial will teach us how to use the str.split()
method to split a Pandas dataframe column by a delimiter.
Split a Pandas DataFrame Column by a Delimiter
Before moving towards various examples demonstrating the use of the str.split()
method, it is important to understand its syntax and possible parameters.
Syntax:
Series.str.split(pat=None, n=-1, expand=False, regex=None)
Following is a brief explanation of the parameters.
Parameter | Explanation |
---|---|
pat |
It must be a string-type value or a regular expression that will be used as a separator to split the string. Whitespace will be used as a delimiter/separator if it is not specified. |
n |
It denotes the maximum number of splits in the output. The n=-1 , n=0 , and n=None means return all splits; by default, n is set to -1 . |
expand |
It takes a Boolean value, and if True , it results in a dataframe holding different values in different columns; otherwise, a series with a list of strings. By default, expand is set to False . |
regex (New in version 1.4.0) |
It accepts a Boolean value that tells if the specified pattern is a regex or not; by default, regex is set to None . |
True means the specified pattern is a regular expression, while False means a string literal. It treats pat as a string literal if set to None and pat length is 1 . On the other hand, if regex=None and pat length is not 1 , it treats pat as a regular expression. |
We will get a ValueError
if pat
is the compiled regular expression and regex
is set to False
. You may read more on that here.
Now, let’s jump into code examples to practice it.
Split the full_name
Column Into Multiple Columns (first_name
& last_name
)
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"full_name": [
"Mehvish Ashiq",
"Hania Arif",
"Zobia Shakeel",
"Tahir Mukhtar",
"Mazhar Ali",
]
}
)
print("Before Splitting:\n")
print(df)
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)
print("\nAfter Splitting:\n")
print(df)
OUTPUT:
Before Splitting:
full_name
0 Mehvish Ashiq
1 Hania Arif
2 Zobia Shakeel
3 Tahir Mukhtar
4 Mazhar Ali
After Splitting:
full_name first_name last_name
0 Mehvish Ashiq Mehvish Ashiq
1 Hania Arif Hania Arif
2 Zobia Shakeel Zobia Shakeel
3 Tahir Mukhtar Tahir Mukhtar
4 Mazhar Ali Mazhar Ali
We can see that a dataframe in the above code fence contains a column (full_name
) with random first names and last names separated by a single space.
We split full_name
into two columns (first_name
and last_name
) containing the first name and last name separately. To do this, we used vectorized str.split()
method with expand=True
.
Split the full_name
Column Into a New Column Having a List of Split Values
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"full_name": [
"Mehvish Ashiq",
"Hania Arif",
"Zobia Shakeel",
"Tahir Mukhtar",
"Mazhar Ali",
]
}
)
print("Before Splitting:\n")
print(df)
df["[first_name,last_name]"] = df["full_name"].str.split(" ", expand=False)
print("\nAfter Splitting:\n")
print(df)
OUTPUT:
Before Splitting:
full_name
0 Mehvish Ashiq
1 Hania Arif
2 Zobia Shakeel
3 Tahir Mukhtar
4 Mazhar Ali
After Splitting:
full_name [first_name,last_name]
0 Mehvish Ashiq [Mehvish, Ashiq]
1 Hania Arif [Hania, Arif]
2 Zobia Shakeel [Zobia, Shakeel]
3 Tahir Mukhtar [Tahir, Mukhtar]
4 Mazhar Ali [Mazhar, Ali]
This code fence is similar to the previous one except for one difference. Here, we set the expand
to False
to create a column containing a list of split values.
Split the full_name
Column by Specifying the Maximum Number of Splits
Example Code:
import pandas as pd
df = pd.DataFrame(
{
"full_name": [
"Mehvish Ashiq Hussain",
"Hania Arif Ali",
"Zobia Shakeel Amjad",
"Tahir Mukhtar Ahmed",
"Mazhar Ali Azhar",
]
}
)
print("Before Splitting:\n")
print(df)
print("\nAfter Splitting:\n")
print(df["full_name"].str.split(" ", n=1, expand=True))
OUTPUT:
Before Splitting:
full_name
0 Mehvish Ashiq Hussain
1 Hania Arif Ali
2 Zobia Shakeel Amjad
3 Tahir Mukhtar Ahmed
4 Mazhar Ali Azhar
After Splitting:
0 1
0 Mehvish Ashiq Hussain
1 Hania Arif Ali
2 Zobia Shakeel Amjad
3 Tahir Mukhtar Ahmed
4 Mazhar Ali Azhar
Here, we set n
to 1
to get a maximum of one split. In this example, we are focused on getting the first names only.