Apologies in advance if this isn't the right place to ask. If there's a more suitable Stack Exchange site, please let me know.
I'm currently developing a crime-prediction algorithm that overlays a grid on a city and predicts, for each grid cell, whether it will become a hotspot (at least one assault crime) within the next 30 days.
I'm currently using the city of Nashville, covered by 3,446 grid cells. I have a grid dataset containing everything needed to display the cells: each cell's map coordinates and its neighboring cells (the neighbor below, the neighbor to the right, and so on).
In this context, green means a correct prediction, red a false negative, and purple a false positive; these are the results of the machine-learning algorithm.
Here, hotspot is the target value (1 or 0). Week, month, and year are crime statistics over the past year (crimes in the past week, past month, and past year). My problem is that building these feature sets takes an enormous amount of time (the script runs for more than six hours).
    #Loop through each grid in the dataset
    for grid_index, grid_row in grid.iterrows():
        print("On grid number: ", grid_row['id'])
        near = 0
        countWeek = 0
        countMonth = 0
        countYear = 0
        #Loop through all of the crimes
        for crime_index, crime_row in crime.iterrows():
            #Parse out the month, day, and year
            date = crime_row['Incident Occurred']
            date_pars = date.split('/')
            month = int(date_pars[0])
            day = int(date_pars[1])
            year = int(date_pars[2].split(' ')[0])
            if grid_row['top '] == crime_row['grid']:
                near += 1
            if grid_row['bottom '] == crime_row['grid']:
                near += 1
            if grid_row['left '] == crime_row['grid']:
                near += 1
            if grid_row['right '] == crime_row['grid']:
                near += 1
            if grid_row['topleft'] == crime_row['grid']:
                near += 1
            if grid_row['topright'] == crime_row['grid']:
                near += 1
            if grid_row['bottomright'] == crime_row['grid']:
                near += 1
            if grid_row['bottomleft'] == crime_row['grid']:
                near += 1
            if month == 12 and grid_row['id'] == crime_row['grid']:
                countMonth += 1
            if day >= 25 and month == 12 and grid_row['id'] == crime_row['grid']:
                countWeek += 1
            if year == 2017 and grid_row['id'] == crime_row['grid']:
                countYear += 1
        #Update the output for the specific grid
        output = output.append({'Grid': grid_row['id'], 'Hotspot': 0, 'week': countWeek,
                                'month': countMonth, 'year': countYear, 'near': near},
                               ignore_index=True)
Currently this code loops over every grid cell (3,446 in total), and within each cell loops over every crime (about 18,000), counting them and appending the results to a pandas DataFrame. 3,446 * 18,000 is roughly 62 million iterations to build this dataset. I didn't expect that to take very long, but in practice it takes far longer than it should.
Is there any way to speed this up effectively? I need to run this algorithm for every month of the past three years, i.e. 36 runs, and at more than five hours per run that is far too long for my time constraints.
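One easy win, independent of the loop structure itself, is to stop parsing dates with string splits inside the inner loop and instead parse the whole column once with a vectorized `pd.to_datetime` call. A minimal sketch, assuming (hypothetically) that the `Incident Occurred` column holds strings like `'12/26/2017 08:00'`:

```python
import pandas as pd

# Hypothetical miniature version of the crime data set,
# mirroring the 'Incident Occurred' column format from the question
crime = pd.DataFrame({
    'grid': [10, 10, 42],
    'Incident Occurred': ['12/26/2017 08:00', '12/01/2017 13:30', '06/15/2017 22:10'],
})

# One vectorized parse for the whole column instead of a split() per row
occurred = pd.to_datetime(crime['Incident Occurred'], format='%m/%d/%Y %H:%M')
crime['month'] = occurred.dt.month   # [12, 12, 6]
crime['day'] = occurred.dt.day       # [26, 1, 15]
crime['year'] = occurred.dt.year     # [2017, 2017, 2017]
```

With `month`, `day`, and `year` precomputed as columns, the inner loop (or any vectorized replacement for it) only does comparisons, not string work.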
Thanks in advance for any insight.
Edit: to clarify, 'grid_row' is each record in the grid CSV file, whose columns I listed above (each cell's location and its neighboring cells), and 'crime_row' is each crime that occurred within the past year:
Answer:
What you are doing boils down to
    for each grid:
        for each crime:
            if crime.cell == grid.cell:
                do something
which has complexity O(|grid| * |crimes|).
If you have 3k crimes and 5k grid cells, that results in 15 million iterations.
A better approach is to iterate over the crimes and push each one into its associated grid cell, stacking all crimes that share the same grid index in... the same place:
    gridIdxToCrimes = {}  # map each grid index to all the crimes in that cell
    for crime_index, crime_row in crime.iterrows():
        grid_index = crime_row['grid']
        if grid_index not in gridIdxToCrimes:
            gridIdxToCrimes[grid_index] = []
        gridIdxToCrimes[grid_index].append(crime_row)

    for grid_index, grid_row in grid.iterrows():
        near = 0
        topIndex = grid_row['top ']
        if topIndex in gridIdxToCrimes:
            # you get all the crimes in the cell above the current one
            near += len(gridIdxToCrimes[topIndex])
        # ...and likewise for the other seven neighbor columns
That way you do O(|crimes| + |grid|) work, i.e. about 8k iterations for the numbers above.
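The same idea maps directly onto pandas: count crimes per grid once with `value_counts`, then look up each cell's neighbor counts with vectorized `map` calls instead of any Python-level loop. A minimal sketch on hypothetical toy data, assuming the neighbor column names from the question (including their trailing spaces) and `-1` as a "no neighbor" placeholder:

```python
import pandas as pd

# Hypothetical miniature versions of the two data sets
grid = pd.DataFrame({
    'id': [1, 2],
    'top ': [2, -1], 'bottom ': [-1, 1],
    'left ': [-1, -1], 'right ': [-1, -1],
    'topleft': [-1, -1], 'topright': [-1, -1],
    'bottomleft': [-1, -1], 'bottomright': [-1, -1],
})
crime = pd.DataFrame({'grid': [1, 1, 2]})

# Count crimes per grid cell once: O(|crimes|)
per_grid = crime['grid'].value_counts()

# Sum neighbor counts with vectorized lookups: one map per neighbor column
neighbor_cols = ['top ', 'bottom ', 'left ', 'right ',
                 'topleft', 'topright', 'bottomleft', 'bottomright']
near = sum(grid[c].map(per_grid).fillna(0) for c in neighbor_cols)
grid['near'] = near.astype(int)
# grid['near'] is [1, 2]: cell 1's neighbor (cell 2) has one crime,
# cell 2's neighbor (cell 1) has two
```

The week/month/year counts can be built the same way: filter `crime` by the date window, take `value_counts` of its `grid` column, and `map` it onto `grid['id']`, which removes the nested loop entirely.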