I have a large batch of data (about 20,000 rows) that looks like this:
    Caller1 5:30AM Mexico USA 2-22-19
    Caller2 1:30AM Mexico USA 2-22-19
    Caller3 2:30AM Mexico USA 2-22-19
    Caller1 5:30AM Mexico USA 2-22-19
    Caller5 3:30AM Mexico USA 2-22-19
    Caller3 4:30AM Mexico USA 2-22-19
    Caller2 5:30AM Mexico USA 2-22-19
    Caller1 7:30AM Mexico USA 2-22-19
    Caller12 9:39AM Mexico USA 2-22-19
    Caller14 8:36AM Mexico USA 2-22-19
    Caller15 2:39AM Mexico USA 2-22-19
    Caller16 3:32AM Mexico USA 2-22-19
I want to group the data by CallerID, like this:
    Caller1 5:30AM Mexico USA 2-22-19
    Caller1 5:30AM Mexico USA 2-22-19
    Caller1 7:30AM Mexico USA 2-22-19
    ---------------------------------
    Caller2 1:30AM Mexico USA 2-22-19
    Caller2 5:30AM Mexico USA 2-22-19
    ---------------------------------
    ..
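Initially I stored this data in a dictionary, and any new data gets added to that dictionary.
Because the leading field, CallerID, also varies from row to row, I ran into trouble grouping the data.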
My code:
>>> input = [('caller1', 'data....'), ('caller2', 'data,,,,,')]
>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for k, v in input:
...     res[k].append(v)
Because the data set is so large, I can't use this approach.
Is there a Python package that can group data by the first word of each line?
Answer:
You can try this approach: store the data in a dictionary of lists, keyed by the string you want to group on, i.e. Caller1, Caller2, and so on.
    data = [
        "Caller1 5:30AM Mexico USA 2-22-19",
        "Caller2 1:30AM Mexico USA 2-22-19",
        "Caller3 2:30AM Mexico USA 2-22-19",
        "Caller1 5:30AM Mexico USA 2-22-19",
        "Caller5 3:30AM Mexico USA 2-22-19",
        "Caller3 4:30AM Mexico USA 2-22-19",
        "Caller2 5:30AM Mexico USA 2-22-19",
        "Caller1 7:30AM Mexico USA 2-22-19",
        "Caller12 9:39AM Mexico USA 2-22-19",
        "Caller14 8:36AM Mexico USA 2-22-19",
        "Caller15 2:39AM Mexico USA 2-22-19",
        "Caller16 3:32AM Mexico USA 2-22-19",
    ]

    grouped_data = {}

    # Iterate over the input and store each row in a dictionary of lists
    for x in data:
        # The first space-separated token (the caller ID) is the grouping key
        key = x.split(' ')[0]
        # setdefault creates an empty list the first time a key is seen
        grouped_data.setdefault(key, []).append(x)

    # Print the data group by group
    for k, v in grouped_data.items():
        print(f"data for {k}")
        for d in v:
            print(d)
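As a variation on the same idea, here is a minimal sketch that uses `collections.defaultdict` from the standard library and accepts any iterable of lines, so the rows do not have to be split into tuples first. The function names `group_by_first_word` and `format_groups` are hypothetical, chosen only for this illustration, and the separator line is an assumption modeled on the output shown in the question.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_first_word(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group lines by their first whitespace-separated token (the caller ID)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        caller_id = line.split(maxsplit=1)[0]
        groups[caller_id].append(line)
    return groups


def format_groups(groups: Dict[str, List[str]], separator: str = "-" * 33) -> str:
    """Join each group's rows, placing a separator line between groups."""
    blocks = ["\n".join(rows) for rows in groups.values()]
    return f"\n{separator}\n".join(blocks)


# Small sample taken from the rows in the question
sample = [
    "Caller1 5:30AM Mexico USA 2-22-19",
    "Caller2 1:30AM Mexico USA 2-22-19",
    "Caller1 7:30AM Mexico USA 2-22-19",
]
groups = group_by_first_word(sample)
print(format_groups(groups))
```

Because an open text file is itself an iterable of lines, the same function should also work when passed a file object directly, which avoids building an intermediate list. Groups come out in the order each caller ID is first seen, since dictionaries preserve insertion order in Python 3.7 and later.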