I have a file in the following format.
# Jon Doe
# 27212000-C
# Calorina, 06/03 1993
# South Calorina Jaka Km 1
# Num 009.006
# Calorina. 11710, Tp.108437347343
# joe.st'a gmail.com
# 20-09-2016 Akn

# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend, Soeprapto Gang Siaga
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info

# Jenny Doe
# 5641141 2/E.15263
# Zimbabwe, 05/06/1993
# Mujair Street Iv No.185
# Mujair, 15116. Tp.04545454
# [email protected]
# 22-09-2016/T Info

# Igor Kart
# 36412777/E,15264
# Kongo, 30/10/1994
# Kp. Pintu Air Kel. Pabuaran Kec.Boj
# onggede Kab.Bogor RT 04/09
# Bogor, 16320. Tp,107262626
# [email protected]
# 22-09-2016T Info
How can I get well-structured data out of this? I would like a CSV result like the following, saved as Good_format.csv:
Name | Code | Bday | Address | Phone | Email | Info
Jon Doe | 27212000-C | Calorina, 06/03 1993 | South Calorina Jaka Km 1 Num 009.006 Calorina. 11710 | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn
Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116. | 04545454 | [email protected] | 22-09-2016/T Info
Igor Kart | 36412777/E,15264 | Kongo, 30/10/1994 | Kp. Pintu Air Kel. Pabuaran Kec.Bojonggede Kab.Bogor RT 04/09 Bogor, 16320. | 107262626 | [email protected] | 22-09-2016T Info
Records with a bad format should be saved to log.txt. I need those malformed records so that I can fix them and process them again. For example:
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend, 
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
Answer:
import pandas as pd
from tabulate import tabulate

filepath = "SO.txt"
colList = ['Name', 'Code', 'Bday', 'Address', 'Phone', 'Email', 'Info']
df_full = pd.DataFrame(columns=colList)

with open(filepath) as fp:
    contents = fp.read()
    # Records are separated by blank lines; each line of a record starts with "# ".
    groups = [[line.split("#")[1].strip() for line in group.split("\n") if line != ""]
              for group in contents.split("\n\n")]

    for groupInd, group in enumerate(groups):
        df_temp = pd.DataFrame(columns=colList, index=[groupInd])
        # A name should not contain digits. If the first line of a group has
        # at least one digit, the name is missing and the group is skipped.
        if not any(ch.isdigit() for ch in group[0]):
            df_temp.Name = group[0]
            df_temp.Code = group[1]
            df_temp.Bday = group[2]
            # Join the address and phone lines into one string, then split on "Tp".
            temp = ' '.join(group[3:-2]).split('Tp')
            df_temp.Address = temp[0]
            # Keep only the digits of the phone part (drops commas, dots, ...).
            df_temp.Phone = ''.join(filter(lambda i: i.isdigit(), temp[1]))
            df_temp.Email = group[-2]
            df_temp.Info = group[-1]
            df_full = pd.concat([df_full, df_temp], axis=0)

print(tabulate(df_full, headers='keys', tablefmt='psql'))
Output:
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
|    | Name      | Code              | Bday                 | Address                                                                      |        Phone | Email              | Info              |
|----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------|
|  0 | Jon Doe   | 27212000-C        | Calorina, 06/03 1993 | South Calorina Jaka Km 1 Num 009.006 Calorina. 11710,                        | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn    |
|  2 | Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116.                                       |     04545454 | [email protected]  | 22-09-2016/T Info |
|  3 | Igor Kart | 36412777/E,15264  | Kongo, 30/10/1994    | Kp. Pintu Air Kel. Pabuaran Kec.Boj onggede Kab.Bogor RT 04/09 Bogor, 16320. |    107262626 | [email protected]  | 22-09-2016T Info  |
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
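The answer prints the table but does not write Good_format.csv or log.txt, which the question also asks for. Below is a minimal sketch of that remaining step, using only the standard library instead of pandas. The function names (`parse_groups`, `to_row`, `convert`) and the minimum-length check are assumptions made for illustration; the "a name contains no digits" test and the split on "Tp" are the same heuristics the answer uses, so they may need tightening for real data.

```python
import csv
import re

COLUMNS = ["Name", "Code", "Bday", "Address", "Phone", "Email", "Info"]


def parse_groups(text):
    """Split raw text into records: blank-line-separated groups of '# ...' lines."""
    groups = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip().lstrip("#").strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            groups.append(lines)
    return groups


def to_row(group):
    """Return a dict for a well-formed group, or None if it looks malformed."""
    # Assumption: a valid record has name, code, birthday, at least one
    # address line, email and info, and its name contains no digits.
    if len(group) < 6 or any(ch.isdigit() for ch in group[0]):
        return None
    # The phone number follows the last "Tp" marker in the address lines.
    address, marker, phone = " ".join(group[3:-2]).rpartition("Tp")
    if not marker:
        return None
    return {
        "Name": group[0],
        "Code": group[1],
        "Bday": group[2],
        "Address": address.strip(),
        "Phone": "".join(ch for ch in phone if ch.isdigit()),
        "Email": group[-2],
        "Info": group[-1],
    }


def convert(text, csv_path="Good_format.csv", log_path="log.txt"):
    """Write well-formed records to csv_path and malformed ones to log_path."""
    good, bad = [], []
    for group in parse_groups(text):
        row = to_row(group)
        if row is None:
            bad.append(group)
        else:
            good.append(row)
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(good)
    with open(log_path, "w", encoding="utf-8") as fh:
        # Keep the original "# " layout so rejected records can be fixed and re-fed.
        fh.write("\n\n".join("\n".join("# " + ln for ln in g) for g in bad))
    return good, bad
```

With the sample file, calling `convert(open("SO.txt").read())` should put the Jon Doe, Jenny Doe and Igor Kart records in Good_format.csv and the nameless record starting with 36412506/E.15262 in log.txt. `csv.DictWriter` quotes fields that contain commas, so addresses and birthdays stay in a single column.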