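I am trying to cluster educations (degree programmes). Each data entry consists of a name and a description, like this:
MSc Aeronautical Engineering
The Master's programme in Aeronautical Engineering at Linköping University offers a holistic view of aircraft design. An aircraft is a complex, integrated, closely connected system of various technologies and disciplines such as aerodynamics, structure, propulsion, actuation systems and other on-board systems. All of these disciplines need to be optimized to achieve the functionality and efficiency required of an aircraft. At a later stage the programme involves a project in which all these disciplines come together, challenging the students to design, build and fly an aircraft, or a subscale version of it.
I created sequence files from my Lucene index: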
LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(
    conf, Arrays.asList(indexFilesPath), sequenceFilesPath, "name", Arrays.asList("name", "description"));
SequenceFilesFromLuceneStorage sequenceFilefromLuceneStorage = new SequenceFilesFromLuceneStorage();
sequenceFilefromLuceneStorage.run(luceneStorageConf);
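Then I generated sparse vectors. I set args to the correct paths, set MaxDFSigma to 5, and set sequential to true. I do not know whether these parameters are suitable for my purpose.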
ToolRunner.run(new SparseVectorsFromSequenceFiles(), args);
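Then I ran CanopyDriver to generate the initial clusters for K-means. I used the Tanimoto distance because I read that it works well for text clustering. The T1 threshold is set to 3.1, the T2 threshold to 2.1, run clustering to false, the cluster classification threshold to 0, and run sequential to true.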
CanopyDriver.run(conf, tfidfVectorsPath, outputPath, new TanimotoDistanceMeasure(), 3.1, 2.1, false, 0.0, true);
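Finally I ran K-Means with the convergence delta set to 0.001, a maximum of 10 iterations, run clustering set to true, the cluster classification threshold set to 0, and run sequential set to true: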
KMeansDriver.run(conf, tfidfVectorsPath, new Path(outputPath,"clusters-0-final"), kmeansOutput, 0.001, 10, true, 0.0, false);
I can print my clustering result like this:
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
    System.out.println("Cluster " + key.toString() + " got the following vector " + value.toString());
}
reader.close();
This prints the following:
Cluster 0 got the following vector wt: 1.0 distance: 0.861373565304716 vec: Acting = [1:2.735, 4:4.441, 5:4.441, 13:2.165, 25:2.224, 26:2.224, 30:1.754, 35:2.447, 36:2.735, 51:2.447, 116:2.735, 118:1.887, 174:2.480, 178:2.447, 179:2.735, 187:2.735, 205:2.447, 224:2.735, 240:2.735, 242:3.460, 248:2.447, 260:2.041, 267:1.887]
Cluster 0 got the following vector wt: 1.0 distance: 0.868019533374171 vec: Adult Learning and Global Change = [30:1.754, 34:2.447, 43:2.735, 56:2.447, 72:2.447, 80:2.735, 105:3.460, 106:2.735, 117:2.735, 142:2.447, 143:2.447, 148:2.447, 173:2.735, 176:2.447, 181:2.735, 199:2.735, 203:2.224, 214:2.447, 233:2.447, 247:2.735, 262:2.735, 268:3.460]
Cluster 0 got the following vector wt: 1.0 distance: 0.8630506879479874 vec: Agricultural Economics and Management = [8:5.469, 9:4.736, 21:2.447, 28:2.735, 29:2.735, 31:2.447, 33:2.735, 34:2.447, 39:2.735, 60:2.447, 70:2.735, 71:5.439, 94:2.447, 108:2.447, 111:2.447, 136:2.447, 149:3.460, 152:1.754, 167:2.735, 171:2.735, 189:2.447, 203:2.224, 206:2.224, 210:4.441, 242:2.447, 249:3.460, 257:2.735, 273:2.480]
Cluster 0 got the following vector wt: 1.0 distance: 0.8382953832498294 vec: Agroecology = [2:4.441, 3:2.447, 8:2.735, 9:4.736, 10:4.441, 12:2.735, 25:3.852, 26:3.145, 27:2.041, 30:1.754, 32:4.441, 44:2.447, 56:2.447, 61:2.735, 64:3.460, 69:2.735, 70:2.735, 79:2.447, 82:2.735, 83:4.441, 85:2.735, 86:4.441, 87:2.041, 93:2.447, 94:2.447, 105:2.447, 110:2.447, 118:1.887, 121:2.224, 128:2.735, 131:2.735, 133:1.887, 137:2.735, 139:1.636, 143:3.460, 144:4.441, 148:3.460, 152:1.754, 155:2.447, 165:2.735, 166:2.447, 167:2.735, 170:2.447, 171:2.735, 178:2.447, 182:2.735, 187:3.867, 189:4.894, 192:3.814, 198:3.867, 199:2.735, 202:1.636, 203:3.852, 206:2.224, 214:2.447, 215:2.447, 216:4.441, 226:3.775, 227:2.447, 228:2.041, 229:4.441, 230:2.224, 231:3.145, 237:2.224, 243:3.460, 248:2.447, 252:2.447, 254:2.735, 260:2.041, 263:2.224, 264:2.735, 267:1.887, 269:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8546104020199703 vec: Analytical Finance = [14:2.447, 15:3.867, 65:2.735, 72:2.447, 78:2.447, 89:5.439, 90:4.441, 97:3.145, 100:3.145, 133:2.669, 142:2.447, 149:2.447, 151:2.735, 183:2.447, 184:2.735, 192:2.335, 212:2.735, 268:2.447, 273:3.038]
Cluster 0 got the following vector wt: 1.0 distance: 0.8525453440258359 vec: Animal Science = [13:1.531, 16:7.021, 17:7.021, 27:2.041, 30:1.754, 42:2.735, 74:2.041, 85:2.735, 94:2.447, 98:2.447, 107:2.447, 108:2.447, 110:2.447, 111:2.447, 112:2.735, 119:4.441, 121:2.224, 133:1.887, 174:1.754, 175:5.471, 183:2.447, 189:2.447, 191:2.224, 202:2.314, 205:2.447, 207:2.669, 208:2.735, 217:4.441, 243:2.447, 251:4.441, 255:2.735, 259:3.460, 260:2.041, 261:2.447, 266:2.735, 267:1.887, 273:1.754]
Cluster 0 got the following vector wt: 1.0 distance: 0.8691568007982957 vec: Animated Film = [0:2.735, 13:1.531, 18:6.280, 35:2.447, 48:2.735, 52:3.867, 63:2.735, 117:2.735, 129:2.735, 164:2.224, 190:2.447, 232:2.735, 242:2.447, 247:2.735, 252:2.447, 253:2.735, 257:2.735, 259:3.460, 264:2.735, 267:3.269, 273:2.480]
Cluster 0 got the following vector wt: 1.0 distance: 0.845250503777627 vec: Applied English linguistics = [6:2.447, 13:1.531, 20:2.887, 23:1.754, 29:2.735, 53:2.224, 54:2.735, 63:2.735, 74:2.041, 78:5.994, 81:2.735, 88:2.447, 93:2.447, 101:2.735, 103:2.735, 129:2.735, 138:2.735, 139:1.636, 140:6.115, 146:5.439, 154:2.735, 159:2.041, 164:2.224, 170:2.447, 174:1.754, 192:2.335, 196:2.447, 200:2.735, 202:1.636, 214:2.447, 215:2.447, 223:4.441, 228:2.041, 246:2.735, 254:2.735, 263:3.145, 269:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8441577500264077 vec: Applied Mathematics Programme = [13:1.531, 20:2.887, 23:1.754, 26:2.224, 37:2.735, 47:2.447, 53:3.852, 59:5.439, 68:2.735, 72:3.460, 77:2.041, 78:2.447, 87:2.041, 88:2.447, 103:2.735, 104:4.441, 107:2.447, 114:2.735, 116:2.735, 139:2.314, 142:2.447, 152:1.754, 156:4.441, 157:8.881, 158:2.735, 159:2.887, 161:2.735, 163:2.735, 168:2.735, 182:2.735, 184:2.735, 191:4.973, 192:1.348, 193:3.460, 207:1.887, 218:2.735, 221:4.441, 227:4.894, 228:2.887, 241:2.735, 252:3.460, 260:2.041, 272:2.735, 273:4.960]
Cluster 0 got the following vector wt: 1.0 distance: 0.8214250552767353 vec: Applied Mechanics = [6:2.447, 13:1.531, 14:2.447, 20:2.887, 31:2.447, 42:2.735, 43:2.735, 46:2.735, 64:2.447, 66:2.447, 74:2.887, 77:3.536, 84:4.238, 93:2.447, 97:2.224, 107:2.447, 120:2.735, 121:2.224, 125:2.735, 127:2.735, 134:2.735, 136:2.447, 139:1.636, 155:2.447, 158:3.867, 162:3.145, 168:2.735, 174:1.754, 183:2.447, 186:2.447, 190:2.447, 192:1.907, 196:2.447, 202:1.636, 204:4.736, 213:2.735, 226:1.887, 235:4.441, 243:2.447, 256:2.735, 267:1.887]
Cluster 0 got the following vector wt: 1.0 distance: 0.8705019419490072 vec: Applied Physics = [20:2.041, 23:1.754, 25:2.224, 46:2.735, 57:2.447, 76:2.735, 77:2.041, 82:2.735, 84:2.447, 100:2.224, 162:2.224, 177:2.735, 180:6.280, 181:2.735, 192:1.348, 219:3.867, 232:2.735, 237:2.224, 238:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8549884296474971 vec: Aquatic Ecology Master Programme = [3:2.447, 6:3.460, 13:1.531, 19:2.735, 21:2.447, 22:7.692, 23:1.754, 27:2.041, 30:1.754, 33:2.735, 41:4.441, 44:2.447, 54:2.735, 55:4.441, 57:2.447, 65:2.735, 68:2.735, 69:6.698, 73:2.735, 74:2.887, 79:2.447, 87:2.887, 88:2.447, 91:2.735, 96:4.441, 97:3.145, 98:2.447, 100:2.224, 114:3.867, 118:1.887, 123:4.238, 132:2.735, 133:1.887, 135:4.441, 139:2.314, 149:3.460, 150:7.021, 152:2.480, 154:2.735, 160:4.441, 162:2.224, 164:2.224, 169:2.735, 172:2.735, 186:2.447, 192:2.335, 194:2.224, 200:2.735, 202:2.834, 203:4.448, 207:2.669, 209:2.735, 213:2.735, 219:2.735, 226:2.669, 231:3.145, 236:2.735, 239:4.441, 246:2.735, 250:2.735, 258:4.441, 271:2.735, 272:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8441981452499265 vec: Astronomy: Master's Degree Project = [3:2.447, 13:1.531, 27:2.041, 30:1.754, 37:2.735, 52:3.867, 57:2.447, 60:2.447, 102:4.441, 139:1.636, 147:2.735, 153:2.224, 159:2.041, 165:2.735, 172:2.735, 185:5.439, 194:3.145, 202:1.636, 204:3.867, 224:2.735, 228:2.041, 237:2.224, 245:2.735, 248:2.447, 263:2.224, 271:2.735, 273:3.508, 274:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8312263297298608 vec: Atmospheric Science, Master's Programme = [13:1.531, 15:2.735, 23:2.480, 24:7.021, 26:2.224, 34:4.238, 39:3.867, 56:2.447, 58:2.447, 61:2.735, 62:2.224, 64:2.447, 66:2.447, 73:3.867, 75:4.441, 79:2.447, 81:2.735, 84:2.447, 87:2.041, 99:2.735, 105:3.460, 110:2.447, 112:2.735, 118:2.669, 121:2.224, 122:2.447, 123:4.238, 130:2.447, 131:2.735, 133:1.887, 136:2.447, 139:1.636, 147:2.735, 153:2.224, 155:2.447, 159:2.041, 161:2.735, 174:2.480, 175:2.447, 179:2.735, 188:4.441, 191:2.224, 192:1.348, 195:2.735, 202:1.636, 207:3.775, 220:2.735, 225:2.735, 228:2.041, 230:2.224, 244:4.441, 249:4.238, 253:2.735, 260:2.041, 266:2.735]
Cluster 0 got the following vector wt: 1.0 distance: 0.8410581023430623 vec: Computer Science, Master's Programme = [1:3.867, 14:2.447, 19:2.735, 20:2.041, 23:3.038, 27:2.041, 28:3.867, 30:1.754, 31:2.447, 35:2.447, 36:2.735, 38:4.441, 40:2.735, 45:4.441, 47:7.341, 48:3.867, 49:5.469, 50:2.735, 51:2.447, 53:3.145, 58:2.447, 60:2.447, 62:4.973, 67:4.441, 74:2.887, 76:2.735, 77:2.041, 92:4.441, 95:4.441, 97:3.852, 99:2.735, 100:2.224, 101:2.735, 106:2.735, 108:3.460, 109:4.441, 111:3.460, 113:4.441, 115:4.441, 118:1.887, 122:2.447, 123:2.447, 124:3.140, 125:3.867, 126:2.735, 130:2.447, 132:2.735, 133:1.887, 134:2.735, 137:2.735, 139:1.636, 141:2.735, 143:2.447, 145:5.439, 148:2.447, 151:2.735, 152:1.754, 153:3.145, 159:2.041, 162:2.224, 164:4.448, 169:2.735, 175:3.460, 177:2.735, 178:2.447, 186:2.447, 190:3.460, 191:2.224, 192:3.015, 193:5.471, 194:2.224, 195:2.735, 196:2.447, 197:4.441, 198:2.735, 201:2.735, 202:3.272, 206:2.224, 207:5.968, 211:2.735, 212:2.735, 215:3.460, 218:3.867, 220:2.735, 225:2.735, 226:1.887, 227:2.447, 230:2.224, 231:4.973, 234:2.735, 236:4.736, 240:2.735, 249:2.447, 259:3.460, 261:2.447, 262:2.735, 263:3.145, 265:5.439, 267:2.669, 270:5.439, 273:5.546, 274:3.867]
Cluster 0 got the following vector wt: 1.0 distance: 0.8527476806601426 vec: MSc Aeronautical Engineering = [7:4.441, 11:6.280, 12:3.867, 40:2.735, 44:2.447, 50:2.735, 62:3.145, 66:4.238, 77:2.887, 98:2.447, 128:2.735, 152:1.754, 166:2.447, 170:2.447, 173:2.735, 174:1.754, 176:2.447, 192:1.907, 194:2.224, 201:2.735, 206:2.224, 226:1.887, 230:2.224, 231:3.145, 233:2.447, 245:2.735, 250:2.735, 255:2.735, 256:2.735, 261:2.447]
Cluster 0 got the following vector wt: 1.0 distance: 0.8189710533774026 vec: Master Programme in Computer Science = [0:2.735, 21:2.447, 23:1.754, 25:2.224, 47:7.341, 49:3.867, 51:2.447, 53:3.145, 58:2.447, 62:3.145, 80:2.735, 87:2.041, 91:2.735, 118:1.887, 120:2.735, 122:2.447, 126:2.735, 127:2.735, 130:2.447, 138:2.735, 140:2.735, 141:2.735, 152:2.480, 153:2.224, 163:2.735, 166:2.447, 174:1.754, 176:2.447, 192:2.335, 193:3.460, 205:2.447, 207:4.220, 208:2.735, 209:2.735, 211:2.735, 222:3.140, 226:1.887, 233:2.447, 234:4.736, 237:3.852, 238:2.735, 241:2.735, 268:2.447]
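All of the educations end up in the same cluster. I tried different distance measures, but that did not help. I also tried different maximum iteration counts (up to 100), with the same result. As we can see, the distances all lie between about 0.81 and 0.87; is that why everything lands in the same cluster? With other distance measures the distances span a wider range, but everything still ends up in one cluster. Do I need to remove more terms that are not characteristic of the programmes? And how can I see my top terms?
This is a long post with a lot of questions, and I would really appreciate some help.
Thanks!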
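Answer:
After reading up on my approach, I found that I had not tuned the CanopyDriver parameters enough. In the book Mahout in Action I read:
"Canopy clustering doesn't require you to specify the number of cluster centroids as a parameter. The number of centroids formed depends only on the choice of distance measure, T1 and T2."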
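To make the quote concrete, here is a minimal stand-alone sketch of the basic canopy procedure with a Tanimoto distance. It is not Mahout's CanopyDriver; the class name, the method names and the four toy vectors are all made up for illustration. For non-negative weights such as TF-IDF, the Tanimoto distance stays between 0 and 1, so thresholds above 1 (like T1 = 3.1 and T2 = 2.1) should let the first point absorb every other point.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; this is not Mahout's implementation.
class CanopySketch {

    // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b).
    // For non-negative weights the result lies in [0, 1].
    static double tanimotoDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        return denominator == 0.0 ? 0.0 : 1.0 - dot / denominator;
    }

    // Basic canopy procedure: take the next remaining point as a center,
    // add every remaining point closer than t1 to its canopy, and drop every
    // remaining point closer than t2 so it cannot become a center itself.
    static List<List<double[]>> buildCanopies(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] candidate = it.next();
                double d = tanimotoDistance(center, candidate);
                if (d < t1) {
                    canopy.add(candidate);
                }
                if (d < t2) {
                    it.remove();
                }
            }
            canopies.add(canopy);
        }
        return canopies;
    }

    // Four made-up document vectors over a three-term vocabulary.
    static List<double[]> samplePoints() {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {1, 0, 0});
        points.add(new double[] {1, 1, 0});
        points.add(new double[] {0, 0, 1});
        points.add(new double[] {0, 1, 1});
        return points;
    }

    public static void main(String[] args) {
        List<double[]> points = samplePoints();
        System.out.println(buildCanopies(points, 3.1, 2.1).size()); // prints 1
        System.out.println(buildCanopies(points, 0.8, 0.6).size()); // prints 2
    }
}
```

On this toy data, the thresholds from the question give a single canopy, while thresholds inside the 0 to 1 range split the points into two. The right values for real data still have to be found by experiment.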
Adjusting T1 and T2 turned out to give me very good results!
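The question also asked how to look at the top terms. One rough way, sketched below under assumptions, is to sort a vector's entries by weight and look the indices up in the term dictionary. The class name, the method name and the sample dictionary are hypothetical; in practice the index-to-term mapping would have to be loaded from whatever dictionary the vectorization step wrote. Mahout also ships a ClusterDumper utility intended for this kind of inspection, though its options vary by version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the dictionary contents below are invented for illustration.
class TopTermsSketch {

    // Returns the n highest-weighted terms of one sparse vector, highest weight first.
    // Indices missing from the dictionary are reported as "#<index>".
    static List<String> topTerms(Map<Integer, Double> vector, Map<Integer, String> dictionary, int n) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
        entries.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, Double> entry : entries.subList(0, Math.min(n, entries.size()))) {
            String term = dictionary.get(entry.getKey());
            result.add(term != null ? term : "#" + entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up term index -> weight pairs in the style of the printed vectors.
        Map<Integer, Double> vector = new HashMap<>();
        vector.put(0, 6.280);
        vector.put(1, 3.867);
        vector.put(2, 2.041);

        // Made-up index -> term dictionary.
        Map<Integer, String> dictionary = new HashMap<>();
        dictionary.put(0, "aircraft");
        dictionary.put(1, "design");
        dictionary.put(2, "student");

        System.out.println(topTerms(vector, dictionary, 2)); // prints [aircraft, design]
    }
}
```

If the top terms turn out to be generic words shared by most programme descriptions, that would suggest pruning more high-frequency terms before clustering.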