How to Import Multiple CSV Files Into Pandas and Concatenate Into One DataFrame
- What is Pandas
- How to Read Single .csv File Using Pandas
- Read Multiple CSV Files in Python
- Concatenate Multiple DataFrames in Python
This tutorial explains how to read multiple .csv files and concatenate all DataFrames into one.
This tutorial will use Pandas to read the data files and create and combine the DataFrames.
What is Pandas
Pandas is a Python package with a wide array of functions for reading a variety of data files and for manipulating data.
To install the pandas package on your machine, open the Command Prompt/Terminal and run pip install pandas.
How to Read Single .csv File Using Pandas
The pandas package provides a function to read a .csv file.
>>> import pandas as pd
>>> df = pd.read_csv(filepath_or_buffer)
Given the file path, the pandas function read_csv() reads the data file and returns a DataFrame object.
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
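As a minimal sketch of this call, the snippet below reads CSV text from an in-memory buffer instead of a file on disk, so it runs without any data files. The column names and values are made up for illustration, and io.StringIO only stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file path
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like buffer
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2): two data rows, two columns
print(list(df.columns))  # ['name', 'score']
```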
Read Multiple CSV Files in Python
Pandas has no single function that reads several .csv files at once. However, we can combine a few standard tools to do it.
First, we need the paths of all the data files. This is easiest when all the files are located in one folder.
We start by creating a list that stores the paths of all the files.
>>> import pandas as pd
>>> import glob
>>> import os
>>> # This is a raw string containing the path of files
>>> path = r'D:\csv files'
>>> all_files = glob.glob(os.path.join(path, '*.csv'))
>>> all_files
['D:\\csv files\\FILE_1.csv', 'D:\\csv files\\FILE_2.csv']
In the above code, a list is created containing the path of each .csv file in the folder.
glob Module
Use the glob module to find files or pathnames matching a pattern. The glob module follows standard Unix path expansion rules to match patterns.
There’s no need to install this module separately because it is part of the Python standard library.
To retrieve paths recursively from directories and their subdirectories, we can use the glob module’s functions glob.glob() and glob.iglob() with recursive=True and the ** pattern.
Syntax:
glob.glob(pathname, *, recursive=False)
glob.iglob(pathname, *, recursive=False)
glob.glob() returns a list containing the paths of all the matching files, while glob.iglob() returns an iterator that yields the same paths one at a time.
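The difference between a plain and a recursive search can be sketched as follows. The example builds a throwaway directory tree with tempfile so it does not depend on any real folder; the file names are hypothetical.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree so the example is self-contained
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for rel in ("a.csv", os.path.join("nested", "b.csv"), "notes.txt"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x\n1\n")

    # Non-recursive: only CSV files directly inside root
    top_level = glob.glob(os.path.join(root, "*.csv"))

    # Recursive: "**" matches root itself and any depth of subdirectories
    pattern = os.path.join(root, "**", "*.csv")
    everything = glob.glob(pattern, recursive=True)

    # iglob yields the same paths lazily instead of building a list
    lazy = sorted(glob.iglob(pattern, recursive=True))

    names = sorted(os.path.basename(p) for p in everything)
    print(len(top_level), names)
```

Without recursive=True, the ** pattern behaves like a single *, so the nested file would not be found.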
For example, to retrieve all file names from a given path, put the asterisk symbol * at the end of the path and pass it as a string to the glob.glob() function.
>>> for files in glob.glob(r'D:\csv files\*'):
...     print(files)
D:\csv files\FILE_1.csv
D:\csv files\FILE_2.csv
D:\csv files\textFile1.txt
D:\csv files\textFile2.txt
Moreover, specify the file extension after the asterisk symbol to perform a more focused search.
>>> for files in glob.glob(r'D:\csv files\*.csv'):
...     print(files)
D:\csv files\FILE_1.csv
D:\csv files\FILE_2.csv
What are Raw Strings
In Python, a raw string is formed by prefixing a string literal with r or R. In a raw string, the backslash (\) is a literal character.
This is useful when we want a string that contains a backslash but don’t want the backslash to start an escape sequence.
In a normal string, we use the backslash (\) to start an escape sequence that represents a special character such as a tab or a newline. For instance:
>>> print("This\tis\nnormal\tstring")
This is
normal string
However, raw strings treat the backslash (\) as a literal character. For example:
>>> print(r"This\tis\nnormal\tstring")
This\tis\nnormal\tstring
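This matters for Windows-style paths, where a folder name can happen to start with n or t. The sketch below uses a made-up path to show how the same typed characters produce different strings with and without the r prefix.

```python
# Hypothetical Windows-style path, typed three ways
normal = "C:\new\table.csv"      # \n and \t are turned into newline and tab
raw = r"C:\new\table.csv"        # backslashes are kept as typed
escaped = "C:\\new\\table.csv"   # doubling each backslash also works

print(raw == escaped)  # True
print(raw == normal)   # False
print(len(raw), len(normal))
```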
os Module
Python’s os module contains functions for interacting with the operating system. os is one of Python’s standard utility modules.
This module offers a portable way of using functionality that depends on the operating system. Python’s os.path module, a sub-module of the os module, is used to manipulate common pathnames.
Python’s os.path.join() function intelligently joins one or more path components. It concatenates the components by placing exactly one directory separator (os.sep, which is "/" on Unix-like systems and "\" on Windows) after each non-empty component except the last.
If the last component is empty, the result ends with a directory separator.
If a component is an absolute path, all previous components are discarded and joining continues from the absolute component.
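These three rules can be sketched as below. The component names are placeholders, and os.path.abspath() is used only to obtain an absolute path that is valid on whichever platform runs the code.

```python
import os

# Components are joined with the platform separator (os.sep)
joined = os.path.join("Users", "Desktop", "data.csv")

# An empty last component leaves a trailing separator
with_trailing = os.path.join("Users", "Desktop", "")

# An absolute component discards everything before it
absolute = os.path.abspath("data")
restarted = os.path.join("Users", absolute, "file.csv")

print(joined)
print(with_trailing)
print(restarted)
```

The printed separators differ between Windows and Unix-like systems, which is why the paths are built with os.path.join() rather than typed by hand.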
Syntax:
os.path.join(path, *paths)
To merge different path components, use the os.path.join() function.
import os
path = "Users"
os.path.join(path, "Desktop", "data.csv")
Output (on Windows):
"Users\\Desktop\\data.csv"
Concatenate Multiple DataFrames in Python
Next, use the paths returned from the glob.glob() function to read the data and create DataFrames. We will append each Pandas DataFrame object to a list.
Code:
dataframes = list()
for dfs in all_files:
    data = pd.read_csv(dfs)
    dataframes.append(data)
A list of dataframes is created.
>>> dataframes
[dataframe1, dataframe2]
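Next, we concatenate the dataframes.
Note: Before concatenating the dataframes, make sure they all have the same columns. Otherwise, pd.concat() keeps the union of the columns by default and fills the missing values with NaN.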
pd.concat(dataframes, ignore_index=True)
The pandas.concat() function concatenates Pandas objects along a particular axis, with optional set logic (union or intersection) of the indexes on the other axis. Passing ignore_index=True discards the original row indexes and numbers the rows of the result from 0.
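To illustrate how ignore_index and the column set logic behave, here is a small sketch with hand-built DataFrames. The frames and their column names are invented for the example.

```python
import pandas as pd

# Two small hypothetical frames with the same columns
first = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
second = pd.DataFrame({"id": [3], "value": [30]})

# Without ignore_index the original row labels are kept: 0, 1, 0
kept = pd.concat([first, second])
# With ignore_index=True the rows are renumbered: 0, 1, 2
renumbered = pd.concat([first, second], ignore_index=True)

# A frame whose columns only partly overlap with the first one
third = pd.DataFrame({"id": [4], "extra": ["x"]})
# Default join="outer": union of columns, gaps filled with NaN
union = pd.concat([first, third], ignore_index=True)
# join="inner": only the columns shared by every frame
common = pd.concat([first, third], ignore_index=True, join="inner")

print(list(kept.index), list(renumbered.index))
print(list(union.columns), list(common.columns))
```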
Full code:
# importing the required modules
import pandas as pd
import os
import glob
# Path of the files
path = r"D:\csv files"
# joining the path and creating list of paths
all_files = glob.glob(os.path.join(path, "*.csv"))
dataframes = list()
# reading the data and appending the dataframe
for dfs in all_files:
    data = pd.read_csv(dfs)
    dataframes.append(data)
# Concatenating the dataframes
df = pd.concat(dataframes, ignore_index=True)
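One possible variation wraps the same steps in a function. This is only a sketch: the name read_csv_folder and the source_file column are hypothetical additions, not part of pandas. It sorts the paths for a reproducible row order, records which file each row came from, and raises a clearer error when the folder has no .csv files, since pd.concat() raises a ValueError on an empty list.

```python
import glob
import os

import pandas as pd


def read_csv_folder(folder):
    """Read every .csv file in `folder` into one DataFrame (hypothetical helper)."""
    # sorted() makes the row order reproducible; glob's order is not guaranteed
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        # pd.concat raises ValueError on an empty list, so fail earlier
        # with a message that names the folder
        raise FileNotFoundError(f"No .csv files found in {folder!r}")

    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # Record which file each row came from
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)

    return pd.concat(frames, ignore_index=True)
```

With the folder used in this tutorial, the call would be df = read_csv_folder(r"D:\csv files").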