You can simplify this to something like the code below. First compute the required statistics and store them in a struct for each column (avoiding a repeated for loop), then use the stack expression to transpose the columns into rows and pull the statistics fields out of each struct. Here is an example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, struct, round

spark = SparkSession.builder.getOrCreate()

data = [
(1, "John", 25, None, "Male"),
(2, "Jane", None, 5000, "Female"),
(3, "Bob", 30, 6000, None),
(4, None, 35, 7000, "Male"),
(5, "Alice", 40, 8000, "Female"),
]
# Column names for the DataFrame
columns = ["ID", "Name", "Age", "Salary", "Gender"]
# Create a DataFrame that contains null values
df = spark.createDataFrame(data, columns)
# Count nulls and the null percentage for every column, storing each pair in a struct
agg_result = df.agg(
*[struct(
(count('*') - count(c)).alias('null_count'),
round(((count('*') - count(c)) / count('*')) * 100, 2).alias('null_percentage')
).alias(f'{c}') for c in df.columns]
)
# Use stack to transpose the columns into rows, then extract the stats fields from each struct.
# (The join is built outside the f-string: backslashes inside f-string expressions are a
# SyntaxError before Python 3.12.)
stack_expr = ", ".join([f"'{c}', {c}" for c in df.columns])
final_df = agg_result.selectExpr(
    f"stack({len(df.columns)}, {stack_expr}) as (column_name, stats)"
).select(
"column_name",
"stats.null_count",
"stats.null_percentage"
)
# Show the final result
final_df.show()
Output:
+-----------+----------+---------------+
|column_name|null_count|null_percentage|
+-----------+----------+---------------+
|         ID|         0|            0.0|
|       Name|         1|           20.0|
|        Age|         1|           20.0|
|     Salary|         1|           20.0|
|     Gender|         1|           20.0|
+-----------+----------+---------------+
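As a quick sanity check, the same numbers can be derived in plain Python from the sample data, without Spark. This is just a verification sketch, not part of the Spark pipeline:

```python
# Same sample data as in the DataFrame above
data = [
    (1, "John", 25, None, "Male"),
    (2, "Jane", None, 5000, "Female"),
    (3, "Bob", 30, 6000, None),
    (4, None, 35, 7000, "Male"),
    (5, "Alice", 40, 8000, "Female"),
]
columns = ["ID", "Name", "Age", "Salary", "Gender"]

total = len(data)
for i, name in enumerate(columns):
    # Count None entries in column i and express them as a percentage of all rows
    null_count = sum(1 for row in data if row[i] is None)
    null_percentage = round(null_count / total * 100, 2)
    print(name, null_count, null_percentage)
# Each non-ID column has exactly one null out of 5 rows, i.e. 20.0%
```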