JSON object inside Pandas DataFrame

Home / Uncategorized / JSON object inside Pandas DataFrame

I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I’ve written some code, shown below, which does what I want. The column names are built from two components, the first being the keys in the dictionaries, and the second being a substring from a key value in the dictionary.

This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I’m still a relative beginner. Thanks heaps.
# Import libraries
import pandas as pd
from IPython.display import display # Used to display df’s nicely in jupyter notebook.
import json

# Set some display options

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({‘ColA’: {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
‘ColB’: {0: ‘[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]’,
1: ‘[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]’,
2: ‘[{"key":"keyValue=4","valA":"48","valC":"58"}]’,
3: ‘[]’,
4: None}})

# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):# Check whether record is null, or doesn’t contain any real data
if pd.notnull(df.iloc[i,df.columns.get_loc("ColB")]) and len(df.iloc[i,df.columns.get_loc("ColB")]) > 2:# Convert the json structure into a dataframe, one cell at a time in the relevant column
x = pd.read_json(df.iloc[i,df.columns.get_loc("ColB")])
# The last bit of this string (after the last =) will be used as a key for the column labels
x[‘key’] = x[‘key’].apply(lambda x: x.split("=")[-1])
# Set this new key to be the index
y = x.set_index(‘key’)
# Stack the rows up via a multi-level column index
y = y.stack().to_frame().T
# Flatten out the multi-level column index
y.columns = [‘{1}_{0}’.format(*c) for c in y.columns]
# Give the single record the same index number as the parent dataframe (for the merge to work)
y.index = [df.index[i]]
# Append this dataframe on sequentially for each row as we go through the loop
dfTemp = dfTemp.append(y)

# Merge the new dataframe back onto the original one as extra columns, with index mataching original dataframe
df = pd.merge(df,dfTemp, how = ‘left’, left_index = True, right_index = True)

print("Processed df:")


Read more

Leave a Reply

Your email address will not be published. Required fields are marked *