How to Convert Pandas DataFrame to Spark DataFrame
-
Use the
createDataFrame()
Function to Convert Pandas DataFrame to Spark DataFrame -
Use the
createDataFrame()
Withschema
Function to Convert Pandas DataFrame to Spark DataFrame -
Use the
createDataFrame()
Function Withapache arrow
Enabled to Convert Pandas DataFrame to Spark DataFrame
The DataFrame is two dimensional and mutable data structure. The data in DataFrame is stored in labeled axes called rows and columns.
Both Pandas and Spark have DataFrames. This tutorial will discuss different methods to convert Pandas dataframe to Spark dataframe.
Use the createDataFrame()
Function to Convert Pandas DataFrame to Spark DataFrame
The createDataFrame()
function is used to create a Spark DataFrame from an RDD or a pandas.DataFrame
. The createDataFrame()
takes the data and scheme as arguments.
We will discuss the schema more shortly.
Syntax of createDataFrame()
:
createDataFrame(data, schema=None)
Parameters:
data
= The dataframe to be passedschema
=str
or list, optional
Returns: DataFrame
Approach:
- Import the
pandas
library and create a Pandas Dataframe using theDataFrame()
method. - Create a
spark
session by importing theSparkSession
from thepyspark
library. - Pass the Pandas dataframe to the
createDataFrame()
method of theSparkSession
object. - Print the DataFrame.
The following code uses the createDataFrame()
function to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import sql
from pyspark import sql
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
# Converting the pandas dataframe in to spark dataframe
spark_DataFrame = spark_session.createDataFrame(data)
# printing the dataframe
spark_DataFrame.show()
Output:
+----------+---------+------+
| Course| Mentor|price$|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+
Use the createDataFrame()
With schema
Function to Convert Pandas DataFrame to Spark DataFrame
We discussed the createDataFrame()
method in the previous example. Now we will see how to change the schema while converting the DataFrame.
This example will use the schema to change the column names, Course
to Technology
, Mentor
to developer
and price
to Salary
.
Schema:
The schema defines the field names and their data types. In Spark, the schema is the structure of the DataFrame, the schema of DataFrame can be defined using the StructType
class, which is a collection of StructField
.
The StructField
takes a field or column’s name, data type and nullable. The nullable parameter defines if that field can be null or not.
Approach:
- Import the
pandas
library and create a Pandas Dataframe using theDataFrame()
method. - Create a
Spark
session by importing theSparkSession
from thepyspark
library. - Create schema by passing the collection of
StructField
to theStructType
class; theStructField
object is created by passing the name, data type, and nullable of a field. - Pass the Pandas dataframe and the schema to the
createDataFrame()
method of theSparkSession
object. - Print the DataFrame.
The following code uses the createDataFrame()
and schema
to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import *
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
# Creating/Changing schema
dfSchema = StructType(
[
StructField("Technology", StringType(), True),
StructField("developer", StringType(), True),
StructField("Salary", IntegerType(), True),
]
)
# Converting the pandas dataframe in to spark dataframe
spark_DataFrame = spark_session.createDataFrame(data, schema=dfSchema)
# printing the dataframe
spark_DataFrame.show()
Output:
+----------+---------+------+
|Technology|developer|Salary|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+
Use the createDataFrame()
Function With apache arrow
Enabled to Convert Pandas DataFrame to Spark DataFrame
The Apache Arrow is a language-independent columnar memory format for flat and hierarchical data or any structured data format. Apache Arrow improves the efficiency of data analytics by creating a standard columnar memory format.
The apache arrow is disabled by default; we can enable it explicitly using the following code.
SparkSession.conf.set("spark.sql.execution.arrow.enabled", "true")
Approach:
- Import the
pandas
library and create a Pandas Dataframe using theDataFrame()
method. - Create a
spark
session by importing theSparkSession
from thepyspark
library. - Enable the apache arrow using the
conf
property. - Pass the Pandas dataframe to the
createDataFrame()
method of theSparkSession
object. - Print the DataFrame.
The following code uses the createDataFrame()
function by enabling the apache arrow to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import sql
from pyspark import sql
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
spark_session.conf.set("spark.sql.execution.arrow.enabled", "true")
# Converting the pandas dataframe in to spark dataframe
sprak_arrow = spark_session.createDataFrame(data)
# printing the dataframe
sprak_arrow.show()
Output:
+----------+---------+------+
| Course| Mentor|price$|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+