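Have you considered using DictReader from the csv module? It reads the header row, then reads each following row into a dict whose keys are the field names from the header. Given a simple CSV file: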
input.csv
---------
Col1,Col2,Col3
r1c1,r1c2,r1c3
r2c1,r2c2,r2c3
r3c1,r3c2,r3c3
you can use DictReader like this:
import csv

with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)
    print(list(reader))
This prints the following (reformatted here for readability):
[
    {"Col1": "r1c1", "Col2": "r1c2", "Col3": "r1c3"},
    {"Col1": "r2c1", "Col2": "r2c2", "Col3": "r2c3"},
    {"Col1": "r3c1", "Col2": "r3c2", "Col3": "r3c3"},
]
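To try this without creating a file, here is a minimal sketch that feeds the same sample data to DictReader through io.StringIO. The in-memory buffer and the variable names are assumptions for illustration and are not part of your code.

```python
import csv
import io

# Same sample data as input.csv, held in memory instead of on disk.
sample = "Col1,Col2,Col3\nr1c1,r1c2,r1c3\nr2c1,r2c2,r2c3\nr3c1,r3c2,r3c3\n"

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# The header row supplies the keys; each later row becomes one dict.
print(reader.fieldnames)   # ['Col1', 'Col2', 'Col3']
print(rows[1]["Col2"])     # r2c2
```

Because each row is keyed by column name, reordering the columns in the file does not break code that looks up row["Col2"].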
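Based on the way you skip the first two lines and on your comment about the CSV format, my guess is that your header looks like this, with a line break after "TDCJ" inside the quoted "TDCJ Number" field: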
Execution,Date of Birth,Date of Offence,Highest Education Level,Last Name,First Name,"TDCJ
Number",Age at Execution,Date Received,Execution Date,Race,County,Eye Color,Weight,Height,Native County,Native State,Last Statement
I recommend either editing the CSV by hand to remove the line break after TDCJ, or preprocessing the file in Python, like this:
with open("input_tx_dr.csv", newline="") as f:
    text = f.read()

# The break may be LF or CRLF, so try both.
text = text.replace("TDCJ\nNumber", "TDCJ Number")
text = text.replace("TDCJ\r\nNumber", "TDCJ Number")

# newline="" writes the remaining line endings back unchanged.
with open("input_tx_dr_fixed.csv", "w", newline="") as f:
    f.write(text)
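If you would rather not write a second file, one alternative is to clean the header names in memory. The sketch below is an illustration under stated assumptions, not a drop-in replacement: it uses a shortened header and a made-up data row, reads the header with csv.reader (which already treats a line break inside quotes as part of the field), and collapses every run of whitespace in each name to a single space.

```python
import csv
import io

# Hypothetical, shortened stand-in for the real file: the quoted header
# field contains a line break, as guessed above.
raw = 'Execution,Last Name,"TDCJ\nNumber",Weight\n1,Doe,999001,180\n'

with io.StringIO(raw) as f:
    reader = csv.reader(f)
    # "TDCJ\nNumber" (or "TDCJ\r\nNumber") becomes "TDCJ Number".
    header = [" ".join(name.split()) for name in next(reader)]
    # Pair every remaining record with the cleaned header names.
    rows = [dict(zip(header, record)) for record in reader]

print(header)                   # ['Execution', 'Last Name', 'TDCJ Number', 'Weight']
print(rows[0]["TDCJ Number"])   # 999001
```

With a real file, replace the io.StringIO(raw) call with open("input_tx_dr.csv", newline="").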
Now that the header is correct, you can let DictReader do most of the work:
import csv
from datetime import datetime

samples = []
with open("input_tx_dr_fixed.csv", newline="") as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        # Keep only the first five rows as samples.
        if i == 5:
            break

        # Convert the integer columns.
        for k in [
            "Execution",
            "Highest Education Level",
            "TDCJ Number",
            "Age at Execution",
            "Weight",
        ]:
            row[k] = int(row[k])

        # Convert the date columns (assumes dates look like 1999-12-31).
        for k in [
            "Date of Birth",
            "Date of Offence",
            "Date Received",
            "Execution Date",
        ]:
            row[k] = datetime.strptime(row[k], "%Y-%m-%d")

        samples.append(row)
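The same conversions can be pulled into a small helper, which makes them easy to check on one row at a time. This is only a sketch: convert_row, the column constants, and the sample row are hypothetical, the row uses just a few of the columns, and the "%Y-%m-%d" date format is carried over from the code above as an assumption about your file.

```python
from datetime import datetime

# Subsets of the real columns, chosen for illustration.
INT_COLUMNS = ["Execution", "TDCJ Number", "Weight"]
DATE_COLUMNS = ["Date of Birth", "Execution Date"]


def convert_row(row, date_format="%Y-%m-%d"):
    """Return a copy of row with the integer and date columns parsed."""
    converted = dict(row)
    for key in INT_COLUMNS:
        converted[key] = int(converted[key])
    for key in DATE_COLUMNS:
        converted[key] = datetime.strptime(converted[key], date_format)
    return converted


# Made-up row for illustration only.
row = {
    "Execution": "1",
    "Last Name": "Doe",
    "TDCJ Number": "999001",
    "Weight": "180",
    "Date of Birth": "1960-01-31",
    "Execution Date": "2000-02-29",
}
result = convert_row(row)
print(result["Weight"] + 1)            # 181
print(result["Execution Date"].year)   # 2000
```

Both int() and datetime.strptime() raise ValueError on blank or malformed values, so if your file has empty cells you will need to decide how to handle them before converting.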