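I'm a nurse and I know Python, but I'm not an expert; I only use it to process DNA sequences.
We have hospital records written in human language, and I need to insert this data into a database or a CSV file. There are more than 5000 lines, so doing it by hand would be very hard. All the data is written in a consistent format; let me show you an example: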
11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later
From this I should get the following data:

Sex: Male
Symptoms: Nausea Vomiting
Death: True
Death Time: 11/11/2010 - 01:00pm

Another example:
11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room
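From which I get:

Sex: Female
Symptoms: Heart burn Vomiting of blood
Death: True
Death Time: 11/11/2010 - 10:00am

The order is not consistent. When I say "in ...", "in" is a keyword, and all the text after it is a place until I find another keyword.
At the beginning, "He" or "She" determines the sex. Whatever follows "got" is a group of symptoms that I should split on the separator, which can be a comma, a hyphen, or something else, but is consistent within a single line.
"died ... hours later" should also yield the number of hours. Sometimes the patient is still alive and is discharged, and so on.
In other words, we have a lot of conventions, and I think that if I can tokenize the text with keywords and patterns, I can get the job done. So if you know a useful function, module, tutorial, or tool for this, preferably in Python (if not Python, a GUI tool would be nice), please let me know.

Some additional information:

There are many rules for expressing the various medical data, but here are a few examples:

- Each line starts with the same date/time format, followed by a space, a colon, and a space, then He/She, a space, and then rules separated by "and".
- Rules:
  * got <symptoms>,<symptoms>,....
  * investigations were done <investigation>,<investigation>,<investigation>,......
  * received <drug or procedure>,<drug or procedure>,.....
  * discharged <digit> (hour|hours) later
  * kept under observation
  * died <digit> (hour|hours) later
  * died <digit> (hour|hours) later in <place>

Other rules exist too, but they follow the same idea.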
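The rules listed above could be sketched with the standard library alone, roughly as follows. This is only an illustration under several assumptions that are not stated in the records themselves: the date is month-first, clauses are joined by " and ", items inside a clause are comma-separated, and the field names (`record_time`, `death_place`, `unparsed`, and so on) are made up for the example. A symptom that itself contains " and " would be split incorrectly, so anything that matches no rule is kept in `unparsed` for manual review.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "<date> - <time> : <He|She> <clause> and <clause> ..."
RECORD_RE = re.compile(r'^(?P<stamp>.+?)\s+:\s+(?P<pronoun>He|She)\s+(?P<body>.+)$')

# Assumption: month comes first, as in a US-style date.
STAMP_FORMAT = '%m/%d/%Y - %I:%M%p'

# Hypothetical field names, each paired with the pattern for one list-style rule.
LIST_RULES = [
    ('symptoms', re.compile(r'^got\s+(.+)$')),
    ('investigations', re.compile(r'^investigations were done\s+(.+)$')),
    ('received', re.compile(r'^received\s+(.+)$')),
]
DISCHARGED_RE = re.compile(r'^discharged\s+(\d+)\s+hours?\s+later$')
OBSERVED_RE = re.compile(r'^kept under observation$')
DIED_RE = re.compile(r'^died\s+(\d+)\s+hours?\s+later(?:\s+in\s+(.+))?$')


def parse_clause(clause, start, record):
    """Apply the first matching rule to one clause; return False if none match."""
    for field, pattern in LIST_RULES:
        match = pattern.match(clause)
        if match:
            # Assumption: items within a clause are comma-separated.
            record[field] = [item.strip() for item in match.group(1).split(',')]
            return True
    match = DISCHARGED_RE.match(clause)
    if match:
        record['discharge_time'] = start + timedelta(hours=int(match.group(1)))
        return True
    if OBSERVED_RE.match(clause):
        record['observation'] = True
        return True
    match = DIED_RE.match(clause)
    if match:
        record['death'] = True
        record['death_time'] = start + timedelta(hours=int(match.group(1)))
        record['death_place'] = match.group(2)  # None when no "in <place>" part
        return True
    return False


def parse_line(line):
    """Turn one record line into a dict; unmatched clauses go to 'unparsed'."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        raise ValueError('Unrecognised record: %r' % line)
    start = datetime.strptime(match.group('stamp'), STAMP_FORMAT)
    record = {
        'record_time': start,
        'sex': 'Male' if match.group('pronoun') == 'He' else 'Female',
        'death': False,
        'unparsed': [],
    }
    for clause in re.split(r'\s+and\s+', match.group('body')):
        clause = clause.strip()
        if not parse_clause(clause, start, record):
            record['unparsed'].append(clause)
    return record
```

Calling `parse_line` on each of the two example lines gives a dict with the sex, the symptom list, and a death time computed by adding the stated hours to the record time. New rules can be added as further patterns without touching the rest.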
Answer:
This uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am') and parsedatetime to parse the relative time (e.g. '4 hours later'):
    import datetime
    import pprint
    import re
    import sys
    import time

    import dateutil.parser as dparser
    # Recent parsedatetime releases expose Calendar at package level; older
    # ones used parsedatetime.parsedatetime and parsedatetime_consts instead.
    import parsedatetime as pdt

    pdt_parser = pdt.Calendar()

    # Everything before the first " :" is the record timestamp.
    record_time_pat = re.compile(r'^(.+)\s+:')
    sex_pat = re.compile(r'\b(he|she)\b', re.IGNORECASE)
    # Accept both "hour later" and "hours later".
    death_time_pat = re.compile(r'died\s+(.+hours? later).*$', re.IGNORECASE)
    symptom_pat = re.compile(r'[,-]')
    # Match "and" only as a whole word, so "hand" or "glands" stay intact.
    and_pat = re.compile(r'\band\b')


    def parse_record(astr):
        match = record_time_pat.match(astr)
        if match:
            record_time = dparser.parse(match.group(1))
            astr, _ = record_time_pat.subn('', astr, 1)
        else:
            sys.exit('Can not find record time')

        match = sex_pat.search(astr)
        if match:
            sex = match.group(1)
            sex = 'Female' if sex.lower().startswith('s') else 'Male'
            astr, _ = sex_pat.subn('', astr, 1)
        else:
            sys.exit('Can not find sex')

        match = death_time_pat.search(astr)
        if match:
            # Resolve e.g. "4 hours later" relative to the record time.
            death_time, date_type = pdt_parser.parse(match.group(1), record_time)
            if date_type == 2:
                # A time was parsed: convert the struct_time to a datetime.
                death_time = datetime.datetime.fromtimestamp(
                    time.mktime(death_time))
            astr, _ = death_time_pat.subn('', astr, 1)
            is_dead = True
        else:
            death_time = None
            is_dead = False

        # Whatever is left is the list of symptoms.
        astr = and_pat.sub('', astr)
        symptoms = [s.strip() for s in symptom_pat.split(astr)]
        return {'Record Time': record_time,
                'Sex': sex,
                'Death Time': death_time,
                'Symptoms': symptoms,
                'Death': is_dead}


    if __name__ == '__main__':
        tests = [
            ('11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
             {'Sex': 'Male',
              'Symptoms': ['got nausea', 'vomiting'],
              'Death': True,
              'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
              'Record Time': datetime.datetime(2010, 11, 11, 9, 0)}),
            ('11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room',
             {'Sex': 'Female',
              'Symptoms': ['got heart burn', 'vomiting of blood'],
              'Death': True,
              'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
              'Record Time': datetime.datetime(2010, 11, 11, 9, 0)}),
        ]

        for record, answer in tests:
            result = parse_record(record)
            pprint.pprint(result)
            assert result == answer
            print()
This yields:

    {'Death': True,
     'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
     'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
     'Sex': 'Male',
     'Symptoms': ['got nausea', 'vomiting']}

    {'Death': True,
     'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
     'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
     'Sex': 'Female',
     'Symptoms': ['got heart burn', 'vomiting of blood']}
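Since the question asks for a CSV file, dictionaries shaped like the ones above could be written out with the standard `csv` module. The following is a minimal sketch, not part of the original script: the helper name `records_to_csv`, the column order, and the "; " separator used to pack the symptom list into one cell are all assumptions made for illustration.

```python
import csv
import datetime
import io

# Assumed column order; the keys match the dictionaries shown above.
FIELDS = ['Record Time', 'Sex', 'Symptoms', 'Death', 'Death Time']


def records_to_csv(records, out):
    """Write parsed records to an open text file object as CSV rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        # A list does not fit in one CSV cell, so join it ("; " is arbitrary).
        row['Symptoms'] = '; '.join(rec['Symptoms'])
        row['Record Time'] = rec['Record Time'].isoformat(sep=' ')
        # Patients who did not die have no death time; leave the cell empty.
        death_time = rec['Death Time']
        row['Death Time'] = death_time.isoformat(sep=' ') if death_time else ''
        writer.writerow(row)


example = {'Death': True,
           'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
           'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
           'Sex': 'Male',
           'Symptoms': ['got nausea', 'vomiting']}

buffer = io.StringIO()  # stands in for a real file in this sketch
records_to_csv([example], buffer)
print(buffer.getvalue())
```

To write a real file, pass a file opened with `open(path, 'w', newline='')` instead of the `StringIO` buffer; the `newline=''` argument is what the `csv` module documentation recommends so that line endings are not doubled.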
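Note: be careful when parsing dates. Does '8/9/2010' mean August 9 or September 8? Do all record keepers use the same convention? If you choose dateutil (and I really think it is the best option when the date strings are not rigidly structured), be sure to read the "Format precedence" section of the dateutil documentation so that '8/9/2010' is (hopefully) parsed correctly. If you cannot guarantee that every record keeper writes dates with the same convention, the output of this script will need to be checked by hand. That may be wise in any case.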
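To make the ambiguity concrete, here is a small sketch that swaps dateutil for the standard library's `datetime.strptime`, which forces the convention to be spelled out. This is an alternative technique, not what the script above does; the constant names and the `parse_stamp` helper are hypothetical. With dateutil itself, `parse` accepts a `dayfirst` argument that serves a similar purpose.

```python
from datetime import datetime

MONTH_FIRST = '%m/%d/%Y'  # US-style: 8/9/2010 is August 9
DAY_FIRST = '%d/%m/%Y'    # European-style: 8/9/2010 is September 8

ambiguous = '8/9/2010'
as_month_first = datetime.strptime(ambiguous, MONTH_FIRST)
as_day_first = datetime.strptime(ambiguous, DAY_FIRST)
print(as_month_first.date())  # 2010-08-09
print(as_day_first.date())    # 2010-09-08


def parse_stamp(text, date_format=MONTH_FIRST):
    """Parse a record timestamp such as '11/11/2010 - 09:00am'.

    Raises ValueError when the text does not fit the chosen convention,
    e.g. a "month" of 13, which is a useful signal for manual review.
    """
    return datetime.strptime(text, date_format + ' - %I:%M%p')


print(parse_stamp('11/11/2010 - 09:00am'))  # 2010-11-11 09:00:00
```

A strict format will not guess, so a line such as '13/9/2010 - 09:00am' fails loudly under the month-first convention instead of being silently reinterpreted. Dates where both numbers are 12 or less still cannot be told apart automatically.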