How to Get Pandas DataFrame Column Headers as a List
Pandas is an open-source package for data analysis in Python. pandas.DataFrame is the primary Pandas data structure. It is a two-dimensional tabular data structure with labeled axes (rows and columns).
A widespread use case is to get a list of column headers from a DataFrame object.
We will reuse the DataFrame object defined below in all other code examples of this tutorial.
>>> import pandas
>>> cities = {
... 'name': ['New York', 'Los Angeles', 'Chicago'],
... 'population': [8601186, 4057841, 2679044],
... 'state': ['NY', 'CA', 'IL'],
... }
>>> data_frame = pandas.DataFrame(cities)
One way to get the DataFrame column names is to iterate over the DataFrame object itself. The DataFrame iterator returns column names in the order of definition.
>>> for column in data_frame:
... print(column)
...
name
population
state
When you need to convert an iterable into a list, you can call Python's built-in list function on it.
>>> list(data_frame)
['name', 'population', 'state']
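However, this method is comparatively slow. The timeit call below runs the statement one million times by default, so the result is the total time in seconds for all of those calls.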
>>> from timeit import timeit
>>> timeit(lambda: list(data_frame))
7.818843764999997
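To make the timing figure easier to interpret, the following minimal sketch passes the repetition count explicitly and converts the total into an average cost per call. The names repetitions, total_seconds, and microseconds_per_call are illustrative, and the printed number will differ between machines and pandas versions.

```python
from timeit import timeit

import pandas

data_frame = pandas.DataFrame({
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
})

# number is timeit's repetition count; it defaults to 1,000,000.
# A smaller value keeps this sketch quick to run.
repetitions = 100_000
total_seconds = timeit(lambda: list(data_frame), number=repetitions)

# Convert the total into an average cost per call, in microseconds.
microseconds_per_call = total_seconds / repetitions * 1_000_000
print(f'{microseconds_per_call:.2f} microseconds per call')
```

Because the article's measurements use the default of one million repetitions, a total in seconds there reads directly as microseconds per call.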
We can also go one level deeper into the DataFrame object and read the column names from its DataFrame.columns property.
>>> list(data_frame.columns)
['name', 'population', 'state']
Alternatively, we can call the DataFrame.columns.tolist() method to achieve the same thing.
>>> data_frame.columns.tolist()
['name', 'population', 'state']
The performance of both of these methods is not much better.
>>> timeit(lambda: list(data_frame.columns))
7.143133517000024
>>> timeit(lambda: data_frame.columns.tolist())
6.064925153999866
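Both calls above operate on the object stored in DataFrame.columns. As a small check, the sketch below confirms that this object is a pandas Index and that the two approaches produce the same list. The variable name columns is only illustrative.

```python
import pandas

data_frame = pandas.DataFrame({
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
})

columns = data_frame.columns

# DataFrame.columns holds a pandas Index (or one of its subclasses).
print(isinstance(columns, pandas.Index))   # True

# Calling list() on the Index and calling its tolist() method
# yield the same column names in the same order.
print(list(columns) == columns.tolist())   # True
```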
Things change a lot when we go one level further, to the DataFrame.columns.values property. As with the DataFrame object and the DataFrame.columns property, we can use it to get a sequence of DataFrame column names.
>>> list(data_frame.columns.values)
['name', 'population', 'state']
This approach is roughly five to six times faster than the previous methods.
>>> timeit(lambda: list(data_frame.columns.values))
1.301724927000123
Still, the best runtime is achieved by calling tolist() directly on that value, as DataFrame.columns.values.tolist().
>>> data_frame.columns.values.tolist()
['name', 'population', 'state']
>>> timeit(lambda: data_frame.columns.values.tolist())
0.6860591469999235
As we can see, this approach is more than ten times faster than iterating directly over the DataFrame object. Most engineers will be curious about the reasons behind such a discrepancy in performance.
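To reproduce the comparison in one run, the five approaches can be timed side by side. This is a minimal sketch rather than a rigorous benchmark: the candidates dictionary, the repetitions value, and the output layout are illustrative choices, and the absolute numbers and ratios depend on the machine and the installed pandas version.

```python
from timeit import timeit

import pandas

data_frame = pandas.DataFrame({
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
})

# Each entry maps a label to a zero-argument callable to be timed.
candidates = {
    'list(data_frame)': lambda: list(data_frame),
    'list(data_frame.columns)': lambda: list(data_frame.columns),
    'data_frame.columns.tolist()': lambda: data_frame.columns.tolist(),
    'list(data_frame.columns.values)': lambda: list(data_frame.columns.values),
    'data_frame.columns.values.tolist()': lambda: data_frame.columns.values.tolist(),
}

# Fewer repetitions than timeit's default keep the sketch quick.
repetitions = 10_000
timings = {
    label: timeit(func, number=repetitions)
    for label, func in candidates.items()
}

# Report the fastest first, with the speedup relative to list(data_frame).
baseline = timings['list(data_frame)']
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{label:40} {seconds:8.4f} s  {baseline / seconds:5.1f}x')
```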
The answer lies in the data type of the DataFrame.columns.values property. It is a NumPy array. NumPy is a Python package for scientific computing, and its maintainers optimize it heavily for performance.
Pandas is built on top of NumPy and provides convenient high-level abstractions. Thus, performing direct operations on lower-level NumPy data structures will almost always be faster than performing similar operations on higher-level Pandas data structures.
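The type claim is easy to inspect on your own installation. The sketch below prints the type of DataFrame.columns.values without assuming the result. With the pandas versions this article appears to target it is a NumPy array, but that is worth verifying, since the underlying storage for string labels may differ between pandas releases.

```python
import pandas

data_frame = pandas.DataFrame({
    'name': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8601186, 4057841, 2679044],
    'state': ['NY', 'CA', 'IL'],
})

values = data_frame.columns.values

# Show which concrete type backs the column labels on this installation.
print(type(values))

# Whatever the backing type, tolist() returns a plain Python list.
print(values.tolist())
```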