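This question is related to one I asked on Cross Validated, but it focuses on finding a concrete solution in Python, so I am posting it here.
I am trying to classify events based on how frequently they occur. My dataset looks roughly like this: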

    month_year,geographic_zone,event_type,count_of_occurrences
    '2016-01',1,'A',50
    '2016-01',1,'B',20
    '2016-01',2,'A',10
    '2016-01',2,'B',18
    '2016-02',1,'A',62
    '2016-02',1,'B',29
    '2016-02',2,'A',14
    '2016-02',2,'B',22
    '2016-03',1,'A',59
    '2016-03',1,'B',27
    '2016-03',2,'A',16
    '2016-03',2,'B',23

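The data is collected monthly and covers n zones and m event types (2 and 2 in this simplified case). I know how often each event occurred at a given time and place.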
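I would like to predict how likely these events are to occur in the future, given [month_year, geographic_zone]. I am not sure how to use the count_of_occurrences column to train a classifier. The problem is that I cannot know the event counts for unseen data, so I cannot query the model with something like clf.predict([month_year, geographic_zone, count_of_occurrences]). Perhaps a probabilistic classifier would be a better fit?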
Here is a simplified version of my current code, with comments on the part I am struggling with:

    from sklearn import svm
    from sklearn.model_selection import train_test_split

    X = [
        # [month_year, geographic_zone, count_of_occurrences] after encoding
        [1, 1, 50],
        [1, 1, 20],
        [1, 2, 10],
        [1, 2, 18],
        [2, 1, 62],
        [2, 1, 29],
        [2, 2, 14],
        [2, 2, 22],
        [3, 1, 59],
        [3, 1, 27],
        [3, 2, 16],
        [3, 2, 23],
    ]

    # event_types, 1=A, 2=B
    y = [
        1, 2, 1, 2, 1, 2,
        1, 2, 1, 2, 1, 2,
    ]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

    clf = svm.SVC(probability=True)

    # I am fitting the model using the count_of_occurrences feature, however
    # I won't have knowledge about this value for unseen data all I will really
    # know is the month_year and geographic_zone for which I want to make predictions
    clf.fit(X_train, y_train)

    print(clf.predict_proba(X_test))

How can I use the occurrence/frequency counts of events in a classifier?
Answer:
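You can put each event into the training set as many times as it occurred, that is, repeat each [year, month, geographic_zone] row count_of_occurrences times with its event type as the label, and let the model work out the relative probabilities itself. That way you do not have to deal with count_of_occurrences as a feature during preprocessing or when querying the model :-).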
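By the way, although the question does not mention it directly, if your data is seasonal you should remember to split month and year into separate features.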
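A minimal sketch of that split, assuming the month_year values are 'YYYY-MM' strings as in the question. The helper names split_month_year and cyclic_month are hypothetical, and the sine/cosine encoding is an optional extra that this answer does not use; it is just one common way to make December and January end up next to each other.

```python
import math


def split_month_year(month_year):
    """Split a 'YYYY-MM' string into integer (year, month)."""
    year_str, month_str = month_year.split("-")
    return int(year_str), int(month_str)


def cyclic_month(month):
    """Map a month in 1..12 onto the unit circle so that 12 and 1 are neighbours."""
    angle = 2.0 * math.pi * (month - 1) / 12.0
    return math.sin(angle), math.cos(angle)


year, month = split_month_year("2016-03")
# Feature row: year, raw month, and the optional cyclic encoding of the month.
features = [year, month, *cyclic_month(month)]
print(features)
```

Whether the cyclic columns help depends on the model and on how many years of data are available, so treat them as something to try rather than a requirement.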

    from sklearn import svm

    data = [
        # year, month, geo, type, count
        [2016, 1, 1, 'A', 50],
        [2016, 1, 1, 'B', 20],
        [2016, 1, 2, 'A', 10],
        [2016, 1, 2, 'B', 18],
        [2016, 2, 1, 'A', 62],
        [2016, 2, 1, 'B', 29],
        [2016, 2, 2, 'A', 14],
        [2016, 2, 2, 'B', 22],
        [2016, 3, 1, 'A', 59],
        [2016, 3, 1, 'B', 27],
        [2016, 3, 2, 'A', 16],
        [2016, 3, 2, 'B', 23],
    ]

    # Repeat each (year, month, geo) row once per observed event,
    # so the counts are expressed through the number of training samples.
    X = []
    y = []
    for year, month, geo, t, count in data:
        for _ in range(count):
            X.append([year, month, geo])
            y.append(t)

    clf = svm.SVC(probability=True)
    clf.fit(X, y)

    # Query every month of 2016 and 2017 for both zones; no count is needed here.
    test = [
        [year, month, geo]
        for year in [2016, 2017]
        for month in range(1, 13)
        for geo in [1, 2]
    ]

    prediction = clf.predict_proba(test)

    for (year, month, geo), proba in zip(test, prediction):
        s = " ".join("%s=%.2f" % (cls, p) for cls, p in zip(clf.classes_, proba))
        print("%d-%02d geo=%d: %s" % (year, month, geo, s))

Result:

    2016-01 geo=1: A=0.69 B=0.31
    2016-01 geo=2: A=0.39 B=0.61
    2016-02 geo=1: A=0.69 B=0.31
    2016-02 geo=2: A=0.39 B=0.61
    2016-03 geo=1: A=0.69 B=0.31
    2016-03 geo=2: A=0.39 B=0.61
    2016-04 geo=1: A=0.65 B=0.35
    2016-04 geo=2: A=0.43 B=0.57
    2016-05 geo=1: A=0.59 B=0.41
    2016-05 geo=2: A=0.50 B=0.50
    2016-06 geo=1: A=0.55 B=0.45
    2016-06 geo=2: A=0.54 B=0.46
    2016-07 geo=1: A=0.55 B=0.45
    2016-07 geo=2: A=0.54 B=0.46
    2016-08 geo=1: A=0.55 B=0.45
    2016-08 geo=2: A=0.54 B=0.46
    2016-09 geo=1: A=0.55 B=0.45
    2016-09 geo=2: A=0.55 B=0.45
    2016-10 geo=1: A=0.55 B=0.45
    2016-10 geo=2: A=0.55 B=0.45
    2016-11 geo=1: A=0.55 B=0.45
    2016-11 geo=2: A=0.55 B=0.45
    2016-12 geo=1: A=0.55 B=0.45
    2016-12 geo=2: A=0.55 B=0.45
    2017-01 geo=1: A=0.65 B=0.35
    2017-01 geo=2: A=0.43 B=0.57
    2017-02 geo=1: A=0.65 B=0.35
    2017-02 geo=2: A=0.43 B=0.57
    2017-03 geo=1: A=0.65 B=0.35
    2017-03 geo=2: A=0.43 B=0.57
    2017-04 geo=1: A=0.62 B=0.38
    2017-04 geo=2: A=0.46 B=0.54
    2017-05 geo=1: A=0.58 B=0.42
    2017-05 geo=2: A=0.51 B=0.49
    2017-06 geo=1: A=0.55 B=0.45
    2017-06 geo=2: A=0.54 B=0.46
    2017-07 geo=1: A=0.55 B=0.45
    2017-07 geo=2: A=0.54 B=0.46
    2017-08 geo=1: A=0.55 B=0.45
    2017-08 geo=2: A=0.54 B=0.46
    2017-09 geo=1: A=0.55 B=0.45
    2017-09 geo=2: A=0.55 B=0.45
    2017-10 geo=1: A=0.55 B=0.45
    2017-10 geo=2: A=0.55 B=0.45
    2017-11 geo=1: A=0.55 B=0.45
    2017-11 geo=2: A=0.55 B=0.45
    2017-12 geo=1: A=0.55 B=0.45
    2017-12 geo=2: A=0.55 B=0.45

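As a sanity check on probabilities like these, it may help to compare them with the plain relative frequencies in the training data. The sketch below is not part of the original answer; it only uses the standard library, and the names totals and empirical are made up for illustration. It computes, for each (year, month, geo) group, the share of events of each type.

```python
from collections import defaultdict

data = [
    # year, month, geo, type, count
    [2016, 1, 1, 'A', 50],
    [2016, 1, 1, 'B', 20],
    [2016, 1, 2, 'A', 10],
    [2016, 1, 2, 'B', 18],
    [2016, 2, 1, 'A', 62],
    [2016, 2, 1, 'B', 29],
    [2016, 2, 2, 'A', 14],
    [2016, 2, 2, 'B', 22],
    [2016, 3, 1, 'A', 59],
    [2016, 3, 1, 'B', 27],
    [2016, 3, 2, 'A', 16],
    [2016, 3, 2, 'B', 23],
]

# Total number of events per (year, month, geo), summed over event types.
totals = defaultdict(int)
for year, month, geo, _event_type, count in data:
    totals[(year, month, geo)] += count

# Empirical share of each event type within its (year, month, geo) group.
empirical = {
    (year, month, geo, event_type): count / totals[(year, month, geo)]
    for year, month, geo, event_type, count in data
}

for (year, month, geo, event_type), share in sorted(empirical.items()):
    print("%d-%02d geo=%d %s=%.2f" % (year, month, geo, event_type, share))
```

For January 2016 in zone 1, for example, the share of A is 50/70, roughly 0.71, which is close to the 0.69 the model reports for that cell. The model has no such direct reference for months it has not seen, so its predictions there should be read with more caution.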