Melt and Merge on Substring – Python & Pandas

Home / Uncategorized / Melt and Merge on Substring – Python & Pandas

Question:
I have data which has data like
id name model_# ms bp1 cd1 sf1 sa1 rq1 bp2 cd2 sf2 sa2 rq2 …
1 John 23984 1 23 234 124 25 252 252 62 194 234 234 …
2 John 23984 2 234 234 242 62 262 622 262 622 26 262 …
for hundreds of models with up to 10 ms and variables counting up to 21.

I have usually used pd.melt for doing my analysis where i look at bp1:bp21 or whatever. I currently have a need to create a melt where I look at bp1 values along with rq 1 values.

I am looking to effectively create something like this:
id model_# ms variable_x value_x variable_y value_y
0 113 77515 1 bp1 23 rq1 252
1 113 77515 1 bp2 252 rq2 262
2 113 77515 1 bp3 26 rq3 311
Right now the best I have been able to do is:id model_# ms variable_x value_x variable_y value_y
0 113 77515 1 bp1 23 rq1 252
1 113 77515 1 bp1 23 rq2 262
2 113 77515 1 bp1 23 rq3 311
3 113 77515 1 bp1 23 rq4 246
via:
df = pd.melt(dat, id_vars=[‘id’, ‘mod_req’, ‘ms’], value_vars=bp)
df1 = pd.melt(dat, id_vars=[‘id’, ‘mod_req’, ‘ms’], value_vars=rq)
df2 = pd.merge(df,df1, on=[‘id’, ‘mod_req’, ‘ms’])
Is there an easy way to merge on substring such that bp1 will connect with rq1 and so forth? This would mean taking a melted dataframe which only looks at bp1:bp21 and a other melted dataframe rq1:rq21 and merging based on the substring values( bp1 rq1, not bp1 rq2)


Answer:
Solution

Set the index…
Use a clever column groupby…
Another clever function to apply…
d1 = df.set_index([‘id’, ‘name’, ‘model_#’, ‘ms’])

def melt_(df):id_vars = df.index.names
return df.reset_index().melt(id_vars=id_vars).set_index(id_vars)

d2 = d1.groupby(d1.columns.str.extract(‘(D+)’, expand=False), axis=1).apply(melt_)

d2.columns = d2.columns.swaplevel(0, 1).map(‘_’.join)
d2.reset_index()

id name model_# ms variable_bp value_bp variable_cd value_cd variable_rq value_rq variable_sa value_sa variable_sf value_sf
0 1 John 23984 1 bp1 23 cd1 234 rq1 252 sa1 25 sf1 124
1 2 John 23984 2 bp1 234 cd1 234 rq1 262 sa1 62 sf1 242
2 1 John 23984 1 bp2 252 cd2 62 rq2 234 sa2 234 sf2 194
3 2 John 23984 2 bp2 622 cd2 262 rq2 262 sa2 26 sf2 622

Old Answer

This is far from pretty and I’m not even sure this is what you want.
d1 = df.set_index([‘id’, ‘name’, ‘model_#’, ‘ms’])

cidx = pd.MultiIndex.from_tuples(
d1.columns.to_series().str.extract(‘(D+)(d+)’, expand=False).values.tolist(),
names=[None, ‘variable’]
)

d1.columns = cidx

d2 = d1.sort_index(axis=1).stack()

variables = pd.DataFrame(
(d2.columns + d2.index.get_level_values(‘variable’)[:, None]).tolist(),
d2.index, d2.columns
)

d3 = pd.concat(
[variables, d2], axis=1, keys=[‘variable’, ‘value’]
).reset_index(‘variable’, drop=True).sort_index(axis=1, level=1, sort_remaining=False)

d3.columns = d3.columns.map(‘_’.join)

d3.reset_index()

id name model_# ms variable_bp value_bp variable_cd value_cd variable_rq value_rq variable_sa value_sa variable_sf value_sf
0 1 John 23984 1 bp1 23 cd1 234 rq1 252 sa1 25 sf1 124
1 1 John 23984 1 bp2 252 cd2 62 rq2 234 sa2 234 sf2 194
2 2 John 23984 2 bp1 234 cd1 234 rq1 262 sa1 62 sf1 242
3 2 John 23984 2 bp2 622 cd2 262 rq2 262 sa2 26 sf2 622
Read more

Leave a Reply

Your email address will not be published. Required fields are marked *