Advice on K-Means text clustering in Mahout 0.9

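I am trying to cluster education programmes. Each data entry contains a name and a description, like this:

MSc Aeronautical Engineering

The Master's programme in Aeronautical Engineering at Linköping University offers a holistic view of aircraft design. An aircraft is a complex, integrated, closely connected system of various technologies and disciplines such as aerodynamics, structures, propulsion, actuation systems and other on-board systems. All these disciplines need to be optimised to achieve the functionality and efficiency required of an aircraft. In its later stages the programme involves a project where all these disciplines come together, challenging the students to design, build and fly an aircraft, or a subscale version of one.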

I created sequence files from my Lucene index:

    LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(
            conf,
            Arrays.asList(indexFilesPath),
            sequenceFilesPath,
            "name",                                 // field used as the document id
            Arrays.asList("name", "description"));  // fields whose text is extracted

    SequenceFilesFromLuceneStorage sequenceFilefromLuceneStorage = new SequenceFilesFromLuceneStorage();
    sequenceFilefromLuceneStorage.run(luceneStorageConf);

Then I generated sparse vectors. I set args to the correct paths, MaxDFSigma to 5, and sequential to true. I don't know whether these parameters are suitable for my purpose.

    ToolRunner.run(new SparseVectorsFromSequenceFiles(), args);

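Then I ran CanopyDriver to generate the initial clusters for K-Means. I use the Tanimoto distance because I read that it works well for text clustering. The T1 threshold is set to 3.1, the T2 threshold to 2.1, run clustering to false, the cluster classification threshold to 0, and run sequential to true.

    CanopyDriver.run(conf,
                     tfidfVectorsPath,
                     outputPath,
                     new TanimotoDistanceMeasure(),
                     3.1,     // T1
                     2.1,     // T2
                     false,   // runClustering
                     0.0,     // clusterClassificationThreshold
                     true);   // runSequential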
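A point worth checking here is how the chosen thresholds relate to the range of the distance measure. The block below is a minimal sketch, independent of Mahout, of the Tanimoto distance 1 - a·b / (|a|² + |b|² - a·b) on sparse vectors stored as index-to-weight maps; the class and method names are made up for illustration. For non-negative weights such as TF-IDF scores this value stays between 0 and 1, so thresholds such as 3.1 and 2.1 would never be exceeded and every document would likely land in the first canopy.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (not Mahout's TanimotoDistanceMeasure):
// Tanimoto distance on sparse vectors stored as index -> weight maps.
class TanimotoSketch {

    // Dot product over the indices that both sparse vectors contain.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b).
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double ab = dot(a, b);
        double denominator = dot(a, a) + dot(b, b) - ab;
        if (denominator == 0.0) {
            return 0.0; // both vectors are empty; treat them as identical
        }
        return 1.0 - ab / denominator;
    }

    public static void main(String[] args) {
        Map<Integer, Double> x = new HashMap<>();
        x.put(1, 2.735);
        x.put(4, 4.441);

        Map<Integer, Double> y = new HashMap<>();
        y.put(30, 1.754);
        y.put(34, 2.447);

        System.out.println(distance(x, x)); // identical vectors: smallest possible distance
        System.out.println(distance(x, y)); // no shared terms: largest possible distance
    }
}
```

With a measure bounded like this, T1 and T2 only have an effect when they are chosen inside the range of distances that actually occur in the data.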

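Finally I ran K-Means with the convergence delta set to 0.001, the maximum number of iterations to 10, run clustering to true, the cluster classification threshold to 0, and run sequential to true:

    KMeansDriver.run(conf,
                     tfidfVectorsPath,
                     new Path(outputPath, "clusters-0-final"),
                     kmeansOutput,
                     0.001,   // convergenceDelta
                     10,      // maxIterations
                     true,    // runClustering
                     0.0,     // clusterClassificationThreshold
                     false);  // runSequential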
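Because K-Means is seeded here with the Canopy output in clusters-0-final, the number of clusters it can produce is the number of canopy centers it is given. The following sketch, which is not Mahout code and uses squared Euclidean distance purely as a stand-in for the real measure, shows the assignment step of K-Means: with a single seed every point is assigned to index 0, and more iterations cannot change that.

```java
// Illustrative sketch only: the assignment step of K-Means.
// Squared Euclidean distance is an assumption made for brevity.
class NearestCentroidSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the index of the centroid closest to the point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = squaredDistance(point, centroids[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{0.1, 0.2}, {0.9, 0.8}, {0.5, 0.4}};

        double[][] oneSeed = {{0.5, 0.5}};
        double[][] twoSeeds = {{0.0, 0.0}, {1.0, 1.0}};

        for (double[] p : points) {
            System.out.println("one seed -> cluster " + nearest(p, oneSeed)
                    + ", two seeds -> cluster " + nearest(p, twoSeeds));
        }
    }
}
```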

I can print my clustering results like this:

    // reader is a SequenceFile.Reader opened elsewhere (its creation is not shown in this post)
    IntWritable key = new IntWritable();
    WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
    while (reader.next(key, value)) {
        System.out.println("Cluster " + key.toString() + " got the following vector " + value.toString());
    }
    reader.close();

This prints the following:

    Cluster 0 got the following vector wt: 1.0 distance: 0.861373565304716  vec: Acting = [1:2.735, 4:4.441, 5:4.441, 13:2.165, 25:2.224, 26:2.224, 30:1.754, 35:2.447, 36:2.735, 51:2.447, 116:2.735, 118:1.887, 174:2.480, 178:2.447, 179:2.735, 187:2.735, 205:2.447, 224:2.735, 240:2.735, 242:3.460, 248:2.447, 260:2.041, 267:1.887]
    Cluster 0 got the following vector wt: 1.0 distance: 0.868019533374171  vec: Adult Learning and Global Change = [30:1.754, 34:2.447, 43:2.735, 56:2.447, 72:2.447, 80:2.735, 105:3.460, 106:2.735, 117:2.735, 142:2.447, 143:2.447, 148:2.447, 173:2.735, 176:2.447, 181:2.735, 199:2.735, 203:2.224, 214:2.447, 233:2.447, 247:2.735, 262:2.735, 268:3.460]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8630506879479874  vec: Agricultural Economics and Management = [8:5.469, 9:4.736, 21:2.447, 28:2.735, 29:2.735, 31:2.447, 33:2.735, 34:2.447, 39:2.735, 60:2.447, 70:2.735, 71:5.439, 94:2.447, 108:2.447, 111:2.447, 136:2.447, 149:3.460, 152:1.754, 167:2.735, 171:2.735, 189:2.447, 203:2.224, 206:2.224, 210:4.441, 242:2.447, 249:3.460, 257:2.735, 273:2.480]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8382953832498294  vec: Agroecology = [2:4.441, 3:2.447, 8:2.735, 9:4.736, 10:4.441, 12:2.735, 25:3.852, 26:3.145, 27:2.041, 30:1.754, 32:4.441, 44:2.447, 56:2.447, 61:2.735, 64:3.460, 69:2.735, 70:2.735, 79:2.447, 82:2.735, 83:4.441, 85:2.735, 86:4.441, 87:2.041, 93:2.447, 94:2.447, 105:2.447, 110:2.447, 118:1.887, 121:2.224, 128:2.735, 131:2.735, 133:1.887, 137:2.735, 139:1.636, 143:3.460, 144:4.441, 148:3.460, 152:1.754, 155:2.447, 165:2.735, 166:2.447, 167:2.735, 170:2.447, 171:2.735, 178:2.447, 182:2.735, 187:3.867, 189:4.894, 192:3.814, 198:3.867, 199:2.735, 202:1.636, 203:3.852, 206:2.224, 214:2.447, 215:2.447, 216:4.441, 226:3.775, 227:2.447, 228:2.041, 229:4.441, 230:2.224, 231:3.145, 237:2.224, 243:3.460, 248:2.447, 252:2.447, 254:2.735, 260:2.041, 263:2.224, 264:2.735, 267:1.887, 269:2.735]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8546104020199703  vec: Analytical Finance = [14:2.447, 15:3.867, 65:2.735, 72:2.447, 78:2.447, 89:5.439, 90:4.441, 97:3.145, 100:3.145, 133:2.669, 142:2.447, 149:2.447, 151:2.735, 183:2.447, 184:2.735, 192:2.335, 212:2.735, 268:2.447, 273:3.038]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8525453440258359  vec: Animal Science = [13:1.531, 16:7.021, 17:7.021, 27:2.041, 30:1.754, 42:2.735, 74:2.041, 85:2.735, 94:2.447, 98:2.447, 107:2.447, 108:2.447, 110:2.447, 111:2.447, 112:2.735, 119:4.441, 121:2.224, 133:1.887, 174:1.754, 175:5.471, 183:2.447, 189:2.447, 191:2.224, 202:2.314, 205:2.447, 207:2.669, 208:2.735, 217:4.441, 243:2.447, 251:4.441, 255:2.735, 259:3.460, 260:2.041, 261:2.447, 266:2.735, 267:1.887, 273:1.754]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8691568007982957  vec: Animated Film = [0:2.735, 13:1.531, 18:6.280, 35:2.447, 48:2.735, 52:3.867, 63:2.735, 117:2.735, 129:2.735, 164:2.224, 190:2.447, 232:2.735, 242:2.447, 247:2.735, 252:2.447, 253:2.735, 257:2.735, 259:3.460, 264:2.735, 267:3.269, 273:2.480]
    Cluster 0 got the following vector wt: 1.0 distance: 0.845250503777627  vec: Applied English linguistics = [6:2.447, 13:1.531, 20:2.887, 23:1.754, 29:2.735, 53:2.224, 54:2.735, 63:2.735, 74:2.041, 78:5.994, 81:2.735, 88:2.447, 93:2.447, 101:2.735, 103:2.735, 129:2.735, 138:2.735, 139:1.636, 140:6.115, 146:5.439, 154:2.735, 159:2.041, 164:2.224, 170:2.447, 174:1.754, 192:2.335, 196:2.447, 200:2.735, 202:1.636, 214:2.447, 215:2.447, 223:4.441, 228:2.041, 246:2.735, 254:2.735, 263:3.145, 269:2.735]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8441577500264077  vec: Applied Mathematics Programme = [13:1.531, 20:2.887, 23:1.754, 26:2.224, 37:2.735, 47:2.447, 53:3.852, 59:5.439, 68:2.735, 72:3.460, 77:2.041, 78:2.447, 87:2.041, 88:2.447, 103:2.735, 104:4.441, 107:2.447, 114:2.735, 116:2.735, 139:2.314, 142:2.447, 152:1.754, 156:4.441, 157:8.881, 158:2.735, 159:2.887, 161:2.735, 163:2.735, 168:2.735, 182:2.735, 184:2.735, 191:4.973, 192:1.348, 193:3.460, 207:1.887, 218:2.735, 221:4.441, 227:4.894, 228:2.887, 241:2.735, 252:3.460, 260:2.041, 272:2.735, 273:4.960]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8214250552767353  vec: Applied Mechanics = [6:2.447, 13:1.531, 14:2.447, 20:2.887, 31:2.447, 42:2.735, 43:2.735, 46:2.735, 64:2.447, 66:2.447, 74:2.887, 77:3.536, 84:4.238, 93:2.447, 97:2.224, 107:2.447, 120:2.735, 121:2.224, 125:2.735, 127:2.735, 134:2.735, 136:2.447, 139:1.636, 155:2.447, 158:3.867, 162:3.145, 168:2.735, 174:1.754, 183:2.447, 186:2.447, 190:2.447, 192:1.907, 196:2.447, 202:1.636, 204:4.736, 213:2.735, 226:1.887, 235:4.441, 243:2.447, 256:2.735, 267:1.887]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8705019419490072  vec: Applied Physics = [20:2.041, 23:1.754, 25:2.224, 46:2.735, 57:2.447, 76:2.735, 77:2.041, 82:2.735, 84:2.447, 100:2.224, 162:2.224, 177:2.735, 180:6.280, 181:2.735, 192:1.348, 219:3.867, 232:2.735, 237:2.224, 238:2.735]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8549884296474971  vec: Aquatic Ecology Master Programme = [3:2.447, 6:3.460, 13:1.531, 19:2.735, 21:2.447, 22:7.692, 23:1.754, 27:2.041, 30:1.754, 33:2.735, 41:4.441, 44:2.447, 54:2.735, 55:4.441, 57:2.447, 65:2.735, 68:2.735, 69:6.698, 73:2.735, 74:2.887, 79:2.447, 87:2.887, 88:2.447, 91:2.735, 96:4.441, 97:3.145, 98:2.447, 100:2.224, 114:3.867, 118:1.887, 123:4.238, 132:2.735, 133:1.887, 135:4.441, 139:2.314, 149:3.460, 150:7.021, 152:2.480, 154:2.735, 160:4.441, 162:2.224, 164:2.224, 169:2.735, 172:2.735, 186:2.447, 192:2.335, 194:2.224, 200:2.735, 202:2.834, 203:4.448, 207:2.669, 209:2.735, 213:2.735, 219:2.735, 226:2.669, 231:3.145, 236:2.735, 239:4.441, 246:2.735, 250:2.735, 258:4.441, 271:2.735, 272:3.867]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8441981452499265  vec: Astronomy: Master's Degree Project = [3:2.447, 13:1.531, 27:2.041, 30:1.754, 37:2.735, 52:3.867, 57:2.447, 60:2.447, 102:4.441, 139:1.636, 147:2.735, 153:2.224, 159:2.041, 165:2.735, 172:2.735, 185:5.439, 194:3.145, 202:1.636, 204:3.867, 224:2.735, 228:2.041, 237:2.224, 245:2.735, 248:2.447, 263:2.224, 271:2.735, 273:3.508, 274:3.867]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8312263297298608  vec: Atmospheric Science, Master's Programme = [13:1.531, 15:2.735, 23:2.480, 24:7.021, 26:2.224, 34:4.238, 39:3.867, 56:2.447, 58:2.447, 61:2.735, 62:2.224, 64:2.447, 66:2.447, 73:3.867, 75:4.441, 79:2.447, 81:2.735, 84:2.447, 87:2.041, 99:2.735, 105:3.460, 110:2.447, 112:2.735, 118:2.669, 121:2.224, 122:2.447, 123:4.238, 130:2.447, 131:2.735, 133:1.887, 136:2.447, 139:1.636, 147:2.735, 153:2.224, 155:2.447, 159:2.041, 161:2.735, 174:2.480, 175:2.447, 179:2.735, 188:4.441, 191:2.224, 192:1.348, 195:2.735, 202:1.636, 207:3.775, 220:2.735, 225:2.735, 228:2.041, 230:2.224, 244:4.441, 249:4.238, 253:2.735, 260:2.041, 266:2.735]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8410581023430623  vec: Computer Science, Master's Programme = [1:3.867, 14:2.447, 19:2.735, 20:2.041, 23:3.038, 27:2.041, 28:3.867, 30:1.754, 31:2.447, 35:2.447, 36:2.735, 38:4.441, 40:2.735, 45:4.441, 47:7.341, 48:3.867, 49:5.469, 50:2.735, 51:2.447, 53:3.145, 58:2.447, 60:2.447, 62:4.973, 67:4.441, 74:2.887, 76:2.735, 77:2.041, 92:4.441, 95:4.441, 97:3.852, 99:2.735, 100:2.224, 101:2.735, 106:2.735, 108:3.460, 109:4.441, 111:3.460, 113:4.441, 115:4.441, 118:1.887, 122:2.447, 123:2.447, 124:3.140, 125:3.867, 126:2.735, 130:2.447, 132:2.735, 133:1.887, 134:2.735, 137:2.735, 139:1.636, 141:2.735, 143:2.447, 145:5.439, 148:2.447, 151:2.735, 152:1.754, 153:3.145, 159:2.041, 162:2.224, 164:4.448, 169:2.735, 175:3.460, 177:2.735, 178:2.447, 186:2.447, 190:3.460, 191:2.224, 192:3.015, 193:5.471, 194:2.224, 195:2.735, 196:2.447, 197:4.441, 198:2.735, 201:2.735, 202:3.272, 206:2.224, 207:5.968, 211:2.735, 212:2.735, 215:3.460, 218:3.867, 220:2.735, 225:2.735, 226:1.887, 227:2.447, 230:2.224, 231:4.973, 234:2.735, 236:4.736, 240:2.735, 249:2.447, 259:3.460, 261:2.447, 262:2.735, 263:3.145, 265:5.439, 267:2.669, 270:5.439, 273:5.546, 274:3.867]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8527476806601426  vec: MSc Aeronautical Engineering = [7:4.441, 11:6.280, 12:3.867, 40:2.735, 44:2.447, 50:2.735, 62:3.145, 66:4.238, 77:2.887, 98:2.447, 128:2.735, 152:1.754, 166:2.447, 170:2.447, 173:2.735, 174:1.754, 176:2.447, 192:1.907, 194:2.224, 201:2.735, 206:2.224, 226:1.887, 230:2.224, 231:3.145, 233:2.447, 245:2.735, 250:2.735, 255:2.735, 256:2.735, 261:2.447]
    Cluster 0 got the following vector wt: 1.0 distance: 0.8189710533774026  vec: Master Programme in Computer Science = [0:2.735, 21:2.447, 23:1.754, 25:2.224, 47:7.341, 49:3.867, 51:2.447, 53:3.145, 58:2.447, 62:3.145, 80:2.735, 87:2.041, 91:2.735, 118:1.887, 120:2.735, 122:2.447, 126:2.735, 127:2.735, 130:2.447, 138:2.735, 140:2.735, 141:2.735, 152:2.480, 153:2.224, 163:2.735, 166:2.447, 174:1.754, 176:2.447, 192:2.335, 193:3.460, 205:2.447, 207:4.220, 208:2.735, 209:2.735, 211:2.735, 222:3.140, 226:1.887, 233:2.447, 234:4.736, 237:3.852, 238:2.735, 241:2.735, 268:2.447]

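All the education programmes end up in the same cluster. I tried different distance measures, but that did not help, and I tried different maximum iteration counts (up to 100) with the same result. As we can see, their distances all lie between roughly 0.81 and 0.87; is that the reason they end up in the same cluster? With other distance measures the distances span a wider range, but everything still lands in one cluster. Do I need to remove more terms that are irrelevant for characterising the programmes? How can I view my top terms?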
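On the last question, one rough way to inspect the heaviest terms of a single document is to sort the index:weight pairs of its printed vector by weight. The sketch below does this on the text form shown above; it is plain Java with made-up names, not a Mahout API, and it returns term indices only. Mapping an index back to a word would need the dictionary written during vectorisation, which is not shown in this post. Mahout also ships a clusterdump utility that can print top terms per cluster when given that dictionary; its documentation has the exact options.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: ranks the entries of a printed sparse vector by weight.
class TopTermsSketch {

    // Parses text such as "[20:2.041, 180:6.280]" into (index, weight) pairs.
    static List<Map.Entry<Integer, Double>> parse(String vector) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>();
        String body = vector.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the surrounding [ ]
        if (body.isEmpty()) {
            return entries;
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split(":");
            entries.add(new AbstractMap.SimpleEntry<Integer, Double>(
                    Integer.valueOf(parts[0]), Double.valueOf(parts[1])));
        }
        return entries;
    }

    // Returns the indices of the n highest-weighted entries, heaviest first.
    static List<Integer> topIndices(String vector, int n) {
        List<Map.Entry<Integer, Double>> entries = parse(vector);
        entries.sort((left, right) -> Double.compare(right.getValue(), left.getValue()));
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // A few entries taken from the "Applied Physics" vector in the output above.
        String sample = "[20:2.041, 180:6.280, 192:1.348, 219:3.867]";
        System.out.println(topIndices(sample, 2)); // prints [180, 219]
    }
}
```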

This is a long post with a lot of questions, and I would really appreciate some help.

Thanks!


Answer:

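After reading up on my approach, I found that I had not tuned the parameters for CanopyDriver enough. In the book Mahout in Action I read:

"Canopy clustering doesn't require you to specify the number of cluster centroids as a parameter. The number of centroids formed depends only on the choice of distance measure, T1, and T2."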
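To make the dependence on the thresholds concrete, here is a small sketch of canopy formation in plain Java. It is written for illustration under simplifying assumptions (Euclidean distance on tiny dense points, hypothetical class and method names) and is not Mahout's implementation. A point starts a new canopy only when it is not within T2 of any existing center, so thresholds larger than every possible distance yield exactly one canopy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: single-pass canopy formation with thresholds T1 > T2.
class CanopySketch {

    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
            members.add(center);
        }
    }

    // Euclidean distance, used here as a stand-in for the real measure.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] point : points) {
            boolean stronglyBound = false;
            for (Canopy canopy : canopies) {
                double d = distance(canopy.center, point);
                if (d < t1) {
                    canopy.members.add(point); // loosely covered by this canopy
                }
                if (d < t2) {
                    stronglyBound = true;      // too close to seed a new canopy
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(point));
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        // One-dimensional points whose pairwise distances all stay below 1.
        List<double[]> points = Arrays.asList(
                new double[]{0.0}, new double[]{0.1}, new double[]{0.5},
                new double[]{0.55}, new double[]{0.9});

        System.out.println(cluster(points, 3.1, 2.1).size()); // prints 1
        System.out.println(cluster(points, 0.3, 0.2).size()); // prints 3
    }
}
```

In this sketch the canopy count is driven by T2, while T1 controls how many points each canopy loosely covers.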

Tuning T1 and T2 turned out to give me very good results!
