Create pandas DataFrame in a loop
pandas + namedtuple = ❤️
Pandas has many ways to read data: from CSVs, JSON, databases and so on. However, every so often we need to create a new DataFrame row by row in a loop, for example when iterating over the responses of some web API. What is the best way to create such a DataFrame? There are many ways that work, but I found one to be the cleanest and least error-prone: using a namedtuple.
Have a look at this code:
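A minimal sketch of the pattern. The field names (user_id, name, score) and the api_responses list are made up for illustration and stand in for whatever your API actually returns:

```python
from collections import namedtuple

import pandas as pd

# Hypothetical record layout; the field names are illustrative only.
UserRecord = namedtuple("UserRecord", ["user_id", "name", "score"])

# Stand-in for responses fetched from a web API (made-up data).
api_responses = [
    {"id": 1, "username": "alice", "points": 9.5},
    {"id": 2, "username": "bob", "points": 7.0},
]

rows = []
for response in api_responses:
    # Keyword arguments mean the order here does not have to match
    # the order in which the fields were declared.
    rows.append(
        UserRecord(
            name=response["username"],
            score=response["points"],
            user_id=response["id"],
        )
    )

# The namedtuple field names become the column names.
df = pd.DataFrame(rows)
print(df)
```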
First I define a namedtuple. It's a form of tuple in which every element has its own unique name. Essentially, it is an ad hoc structure, or a class without methods. Then, inside the loop, when iterating over the API responses, I create instances of this tuple using the field names as keyword arguments, which avoids any order-based errors. Finally, I feed the list of those instances to the DataFrame constructor, which parses it into a DataFrame using the field names as column names. Simple!
If you want to, you can take this trick to another level by adding types to your namedtuple. Here's how:
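A sketch using typing.NamedTuple from the standard library. The record type and its fields are again hypothetical, chosen only to illustrate the idea:

```python
from typing import NamedTuple

import pandas as pd


class UserRecord(NamedTuple):
    """Hypothetical typed record; the fields are illustrative only."""

    user_id: int
    name: str
    score: float


# Made-up rows standing in for data collected inside a loop.
rows = [
    UserRecord(user_id=1, name="alice", score=9.5),
    UserRecord(user_id=2, name="bob", score=7.0),
]

# As with a plain namedtuple, field names become column names.
df = pd.DataFrame(rows)
print(df.dtypes)
```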
Now your code is not only protected from the order in which you define fields, but also declares their expected types, allowing you to catch mistakes faster.
Hope this helps!
Update: As kind people of the internet pointed out, static type checking is currently not built into the Python interpreter. However, if you use PyCharm you will get notifications about incompatible variables. You can also use pytype or mypy to check your files explicitly.
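To make the update concrete, here is a small sketch with a hypothetical record type. The annotations are not enforced when the code runs, so a wrongly typed value is accepted silently and only a static checker or an IDE would flag it:

```python
from typing import NamedTuple


class UserRecord(NamedTuple):
    """Hypothetical typed record used only for this illustration."""

    user_id: int
    name: str


# This runs without error even though user_id is annotated as int;
# a static checker such as mypy is expected to report the mismatch.
bad = UserRecord(user_id="not-a-number", name="alice")
print(type(bad.user_id).__name__)  # prints: str
```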