Identifying the most common value combinations across 2 or more columns in a CSV file

How can I find the most common combinations of values across 2 or more columns within the rows of a CSV file? Example:

event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica

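I have already tried building lists for a few of the combinations I care about and then using the most_common() method of collections.Counter to find common patterns. But I need an algorithm that finds the common records for any possible combination of 2 or more columns.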

My code so far:

This is the output for the records above:

('northamerica-web-cpu', 3)  # 3 rows match the values northamerica, web and cpu
('northamerica-web', 3)  # 3 rows match only the values northamerica and web
('northamerica-cpu', 3)  # 3 rows match the values northamerica and cpu
('europe-oom', 2)  # 2 rows match the values europe and oom
('africa-mobile-network', 1)
('asia-mobile-network', 1)
('europe-desktop-oom', 1)
('europe-mobile-oom', 1)
('africa-mobile-north', 1)
('asia-mobile-east', 1)
('europe-desktop-south', 1)
('northamerica-web-east', 1)
('europe-mobile-north', 1)
('northamerica-web-south', 1)
('northamerica-web-west', 1)
('africa-mobile', 1)
('asia-mobile', 1)
('europe-desktop', 1)
('europe-mobile', 1)
('africa-network', 1)
('asia-network', 1)

Answer:

Let me start by defining the data structure, since reading the CSV is beside the real problem:

lines = [line.split(',') for line in """\
event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica""".splitlines()]

for line in lines:
    print(line)

This prints:

['event', 'rack', 'role', 'dc']
['network', 'north', 'mobile', 'africa']
['network', 'east', 'mobile', 'asia']
['oom', 'south', 'desktop', 'europe']
['cpu', 'east', 'web', 'northamerica']
['oom', 'north', 'mobile', 'europe']
['cpu', 'south', 'web', 'northamerica']
['cpu', 'west', 'web', 'northamerica']
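If the data sits in a real file rather than a string, the same list-of-lists shape can probably be obtained with the standard csv module. The following is only a sketch: io.StringIO stands in for an opened file, the file name in the comment is hypothetical, and only three of the example rows are included to keep it short.

```python
import csv
import io

# Stand-in for open('events.csv', newline=''); the file name is hypothetical.
csv_file = io.StringIO(
    "event,rack,role,dc\n"
    "network,north,mobile,africa\n"
    "cpu,east,web,northamerica\n"
)

reader = csv.reader(csv_file)
header = next(reader)  # first row holds the column names
rows = list(reader)    # remaining rows, each a list of strings

print(header)
for row in rows:
    print(row)
```

Separating the header this way keeps the column names out of any later frequency count.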

Now let's create, from each row, every possible combination of 2 or more words. There are 11 ways to choose 2, 3 or 4 items out of 4 (4C2 + 4C3 + 4C4 == 6 + 4 + 1 == 11).
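As a quick sanity check on that count, the sketch below computes the sum of binomial coefficients with math.comb (available from Python 3.8) and also enumerates the subsets explicitly for one example row. The variable names are illustrative only.

```python
from itertools import combinations
from math import comb

row = ['cpu', 'east', 'web', 'northamerica']  # one data row from the example

# Number of subsets of size 2, 3 and 4, via the binomial coefficient.
expected = sum(comb(len(row), k) for k in range(2, len(row) + 1))

# Enumerate the same subsets explicitly and count them.
subsets = [c for k in range(2, len(row) + 1) for c in combinations(row, k)]

print(expected, len(subsets))  # prints: 11 11
```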

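The algorithm I use to find the combinations looks at the 4-bit binary numbers (i.e. 0000, 0001, 0010, 0011, 0100, and so on) and, for each such number, builds a combination of words according to which bits are 1. For example, for 0101 the second and fourth words are selected: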

def find_combinations(line):
    combinations = []
    for i in range(2**len(line)):
        bits = bin(i)[2:].zfill(len(line))
        if bits.count('1') < 2:  # skip numbers with fewer than two 1 bits
            continue
        combination = set()
        for bit, word in zip(bits, line):
            if bit == '1':
                combination.add(word)
        combinations.append('-'.join(sorted(combination)))
    return combinations
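The bit-mask loop can also be expressed with itertools.combinations from the standard library. The sketch below is an alternative, not the answer's own method; the function name find_combinations_itertools and the min_size parameter are made up for illustration. For rows whose values are all distinct it should yield the same strings as the bit-mask version, though in a different order.

```python
from itertools import combinations

def find_combinations_itertools(line, min_size=2):
    """Return every '-'-joined combination of at least min_size words."""
    # Sort once so each joined key is alphabetical; set() drops repeated values.
    words = sorted(set(line))
    return ['-'.join(combo)
            for size in range(min_size, len(words) + 1)
            for combo in combinations(words, size)]

for key in find_combinations_itertools(['cpu', 'east', 'web', 'northamerica']):
    print(key)
```

Sorting before combining means the same set of values always maps to the same key, regardless of column order.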

Now we can iterate over all combinations and count their frequencies:

from collections import defaultdict

counter = defaultdict(int)
for line in lines:
    for c in find_combinations(line):
        counter[c] += 1

Finally, we can sort by frequency (in descending order):

for combination_freq in sorted(counter.items(), key=lambda item: item[1], reverse=True):
    print(combination_freq)

The result is (the order of entries with equal counts may vary):

('cpu-northamerica', 3)
('northamerica-web', 3)
('cpu-northamerica-web', 3)
('cpu-web', 3)
('mobile-north', 2)
('mobile-network', 2)
('europe-oom', 2)
('east-network', 1)
('asia-east-mobile', 1)
('asia-east-network', 1)
('cpu-south-web', 1)
('east-northamerica-web', 1)
('europe-north', 1)
('cpu-east', 1)
... and so on.
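Since the question mentions collections.Counter, the counting and sorting steps can likely be folded into a single Counter, whose most_common() method returns entries in descending order of frequency. This is a self-contained sketch under two assumptions: the header row is left out of rows so that column names are not counted, and combinations are generated with itertools.combinations rather than the bit-mask function.

```python
from collections import Counter
from itertools import combinations

# Data rows from the example, without the header row (an assumption of this sketch).
rows = [
    ['network', 'north', 'mobile', 'africa'],
    ['network', 'east', 'mobile', 'asia'],
    ['oom', 'south', 'desktop', 'europe'],
    ['cpu', 'east', 'web', 'northamerica'],
    ['oom', 'north', 'mobile', 'europe'],
    ['cpu', 'south', 'web', 'northamerica'],
    ['cpu', 'west', 'web', 'northamerica'],
]

# Count every alphabetically ordered combination of 2 or more values per row.
counts = Counter(
    '-'.join(combo)
    for row in rows
    for size in range(2, len(row) + 1)
    for combo in combinations(sorted(row), size)
)

# Show the combinations that occur in more than one row.
for key, freq in counts.most_common():
    if freq < 2:
        break
    print(key, freq)
```

On this data, four combinations occur three times and three occur twice; everything else is unique to a single row.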
