Merging DataFrames in pandas while avoiding duplicate columns and preserving column order.

Question

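All of the DataFrames (dfs) have a key column named "id".
pd.merge is not a viable option, even with the suffixes option.
Each DataFrame has more than 40,000 columns, so binding the columns first and then dropping the redundant ones (for example, via the _x suffix) is not an option. The DataFrames share exactly 50,000 rows, as identified by the "id" column.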

Here is a minimal example with two shared columns:

df1 = pd.DataFrame({
  'id': ['a', 'b', 'c'],
  'col1': [123, 121, 111],
  'col2': [456, 454, 444],
  'col3': [786, 787, 777],
})

df2 = pd.DataFrame({
  'id': ['a', 'b', 'c'],
  'col1': [123, 121, 111],
  'col2': [456, 454, 444],
  'col4': [11, 44, 77],
})

df3 = pd.DataFrame({
  'id': ['a', 'b', 'c'],
  'col1': [123, 121, 111],
  'col2': [456, 454, 444],
  'col5': [1786, 1787, 1777],
})

The final merged result should be:

finaldf = pd.DataFrame({
  'id': ['a', 'b', 'c'],
  'col1': [123, 121, 111],
  'col2': [456, 454, 444],
  'col3': [786, 787, 777],
  'col4': [11, 44, 77],
  'col5': [1786, 1787, 1777],
})

Aaron Bertrand · Answer

Try the following:

import pandas as pd

df1 = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'col1': [123, 121, 111],
    'col2': [456, 454, 444],
    'col3': [786, 787, 777],
})

df2 = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'col1': [123, 121, 111],
    'col2': [456, 454, 444],
    'col4': [11, 44, 77],
})

df3 = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'col1': [123, 121, 111],
    'col2': [456, 454, 444],
    'col5': [1786, 1787, 1777],
})

# Index each frame by 'id', then use combine_first to fill in the columns from df2 and df3 that df1 lacks
final_df = df1.set_index('id').combine_first(df2.set_index('id')).combine_first(df3.set_index('id')).reset_index()

print(final_df)

Output:

  id  col1  col2  col3  col4  col5
0  a   123   456   786    11  1786
1  b   121   454   787    44  1787
2  c   111   444   777    77  1777
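
One caveat worth checking: combine_first builds the union of the column labels, and that union can come back sorted rather than in first-seen order. In the example above the two orders happen to coincide. The sketch below uses made-up frames (left, right) and hypothetical column names (zeta, shared, alpha) to show one way of restoring first-seen order afterwards; it is an illustration under those assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical frames whose first-seen column order is NOT alphabetical
left = pd.DataFrame({'id': ['a', 'b'], 'zeta': [1, 2], 'shared': [5, 6]})
right = pd.DataFrame({'id': ['a', 'b'], 'shared': [5, 6], 'alpha': [3, 4]})
frames = [left, right]

# Record each column label the first time it appears, across all frames
ordered_cols = list(dict.fromkeys(c for f in frames for c in f.columns))

combined = (
    left.set_index('id')
        .combine_first(right.set_index('id'))
        .reset_index()
)

# Reselect the columns in first-seen order, whatever order combine_first produced
combined = combined[ordered_cols]
print(combined.columns.tolist())
```

Selecting by an explicit label list keeps the result independent of how the union was ordered internally.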

devoured elysium · Answer

If memory is limited and the DataFrames are already aligned, you can set up the output first and then fill it in with the update method:

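from functools import reduce

import pandas as pd

dfs = [df1, df2, df3]
# Union of the columns of all DataFrames, without sorting
cols = reduce(lambda a, b: a.union(b, sort=False),
              (x.columns for x in dfs))

# Empty output DataFrame: same index as the first DataFrame, columns are the union of all columns
out = pd.DataFrame(index=dfs[0].index,
                   columns=cols)

# Update the output in place with each DataFrame in turn
for x in dfs:
    out.update(x)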

A variant of the last step:

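# Initialize the output from the first DataFrame, reindexed to the union of all columns
out = pd.DataFrame(dfs[0],
                   columns=cols)

# Update the output in place with each remaining DataFrame, starting from the second
for x in dfs[1:]:
    out.update(x)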

Output:

  id col1 col2 col3 col4  col5
0  a  123  456  786   11  1786
1  b  121  454  787   44  1787
2  c  111  444  777   77  1777
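
The update approach matches rows by index label, so it relies on the frames already being aligned. If the rows might come in a different order, one option is to index every frame by "id" first, so that update aligns on the key instead of on row position. The following is a minimal sketch under that assumption; the frames a and b (with b deliberately shuffled) are made up for illustration.

```python
from functools import reduce

import pandas as pd

a = pd.DataFrame({'id': ['a', 'b', 'c'],
                  'col1': [123, 121, 111],
                  'col3': [786, 787, 777]})
# Same ids as `a`, but with the rows in a different order
b = pd.DataFrame({'id': ['c', 'a', 'b'],
                  'col1': [111, 123, 121],
                  'col4': [77, 11, 44]})

# Index by the key so that update() aligns on 'id' rather than on row position
frames = [f.set_index('id') for f in (a, b)]

# Union of the columns, kept in first-seen order
cols = reduce(lambda x, y: x.union(y, sort=False),
              (f.columns for f in frames))

out = pd.DataFrame(index=frames[0].index, columns=cols)
for f in frames:
    out.update(f)

# Turn the 'id' index back into an ordinary column
out = out.reset_index()
print(out)
```

The output columns start out as object dtype because the frame is created empty; calling out.infer_objects() afterwards is one way to recover numeric dtypes if they matter.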