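I have an assignment that requires me to create three classifiers with sklearn (two "out of the box" and one "optimized") for sentiment analysis prediction.
The instructions are: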
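- Ingest the training set and train the classifiers
- Save the classifiers to disk
- In a separate program, load the classifiers from disk
- Use the test set to make predictions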
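Steps 1-3 are no problem and in fact work well; the problem is with model.predict(). I am using sklearn's TfidfVectorizer, which creates feature vectors from text. My problem is that the feature vector created for the training set differs from the feature vector created for the test set, because the text provided is different.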
Below is a sample from the train.tsv file:
4|z8DDztUxuIoHYHddDL9zQ|So let me set the scene first, My church social group took a trip here last saturday. We are not your mothers church. The churhc is Community Church of Hope, We are the valleys largest GLBT church so when we desended upon Organ stop Pizza, in LDS land you know we look a little out of place. We had about 50 people from our church come and boy did we have fun. There was a baptist church a couple rows down from us who didn't see it coming. Now we aren't a bunch of flamers frolicking around or anything but we do tend to get a little loud and generally have a great time. I did recognized some of the music so I was able to sing along with those. This is a great place to take anyone over 50. I do think they might be washing dirtymob money or something since the business is cash only.........which I think caught a lot of people off guard including me. The show starts at 530 so dont be late !!!!!!
2|BIeDBg4MrEd1NwWRlFHLQQ|Decent but terribly inconsistent food. I've had some great dishes and some terrible ones, I love chaat and 3 out of 4 times it was great, but once it was just a fried greasy mess (in a bad way, not in the good way it usually is.) Once the matar paneer was great, once it was oversalted and the peas were just plain bad. I don't know how they do it, but it's a coinflip between good food and an oversalted overcooked bowl. Either way, portions are generous.
4|NJHPiW30SKhItD5E2jqpHw|Looks aren't everything....... This little divito looks a little scary looking, but like I've said before "you can't judge a book by it's cover". Not necessarily the kind of place you will take your date (unless she's blind and hungry), but man oh man is the food ever good! We have ordered breakfast, lunch, & dinner, and it is all fantastico. They make home-made corn tortillas and several salsas. The breakfast burritos are out of this world and cost about the same as a McDonald's meal. We are a family that eats out frequently and we are frankly tired of pretty places with below average food. This place is sure to cure your hankerin for a tasty Mexican meal.
2|nnS89FMpIHz7NPjkvYHmug|Being a creature of habit anytime I want good sushi I go to Tokyo Lobby. Well, my group wanted to branch out and try something new so we decided on Sakana. Not a fan. And what's shocking to me is this place was packed! The restaurant opens at 5:30 on Saturday and we arrived at around 5:45 and were lucky to get the last open table. I don't get it... Messy rolls that all tasted the same. We ordered the tootsie roll and the crunch roll, both tasted similar, except of course for the crunchy captain crunch on top. Just a mushy mess, that was hard to eat. Bland tempura. No bueno. I did, however, have a very good tuna poke salad, but I would not go back just for that. If you want good sushi on the west side, or the entire valley for that matter, say no to Sakana and yes to Tokyo Lobby.
2|FYxSugh9PGrX1PR0BHBIw|I recently told a friend that I cant figure out why there is no good Mexican restaurants in Tempe. His response was what about MacAyo's? I responded with "why are there no good Mexican food restaurants in Tempe?" Seriously if anyone out there knows of any legit Mexican in Tempe let me know. And don't say restaurant Mexico!
This is the train.py file:
...
This is the Tester.py file:
...
What I end up with is an error:
...
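I know this error is related to the difference in feature vector size, since the vectors are created from the text in the data. I don't know enough about NLP or machine learning to come up with a solution. How can I get the model to predict using the feature set from the test data?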
I tried editing my code per the answer below to save the feature vector.
Train.py now looks like this:
...
And Test.py now looks like this:
...
But this results in:
...
Answer:
You should not use fit_transform() on the test dataset. You should use only the vocabulary learned from the training dataset.
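To see why, here is a minimal sketch using two tiny made-up corpora (the texts and variable names are illustrative assumptions, not taken from the question's files). It compares refitting a vectorizer on the test text with reusing the vectorizer that was fitted on the training text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpora, only for illustration.
train_texts = ["the food was great", "terrible service and bland food"]
test_texts = ["great sushi but slow service", "the tacos were amazing"]

vectorizer = TfidfVectorizer()
# fit_transform() learns the vocabulary from the training text.
X_train = vectorizer.fit_transform(train_texts)

# Wrong: fitting again on the test text learns a different vocabulary,
# so the number of columns no longer matches what a classifier was trained on.
X_test_wrong = TfidfVectorizer().fit_transform(test_texts)

# Right: reuse the already fitted vectorizer so the columns line up with X_train.
X_test = vectorizer.transform(test_texts)

print("train:", X_train.shape)
print("test, refitted:", X_test_wrong.shape)
print("test, transformed:", X_test.shape)
```

The refitted matrix has one column per distinct word in the test text, while the transformed matrix has the same columns as the training matrix, which is what predict() expects.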
Here is an example solution,
...
When you use transform(), it considers only the vocabulary learned from the training corpus and ignores any new words found in the test set.
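Since the predictions happen in a separate program, the fitted vectorizer has to be saved along with the classifier. One possible way to do that, sketched below under assumptions, is to wrap both in a single sklearn Pipeline and persist it with joblib. The training texts, the labels, the file name, and the choice of LogisticRegression are all hypothetical; the question does not say which classifiers are used.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data; labels are hypothetical (1 = positive, 0 = negative).
train_texts = [
    "great food and friendly staff",
    "loved the tacos",
    "bland rolls and slow service",
    "terrible greasy mess",
]
train_labels = [1, 1, 0, 0]

# Training program: fit vectorizer and classifier as one object, then save it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_model.joblib")  # hypothetical file name
joblib.dump(model, model_path)

# Test program: load the pipeline and call predict() on raw text.
# Inside predict() the pipeline calls transform(), never fit_transform().
loaded = joblib.load(model_path)
test_texts = ["the sushi was great", "zzz qqq"]  # second text has only unseen words
predictions = loaded.predict(test_texts)

# Words absent from the training vocabulary contribute nothing to the vector.
unseen_vector = loaded.named_steps["tfidfvectorizer"].transform(["zzz qqq"])
print(predictions)
print("non-zero entries for unseen words:", unseen_vector.nnz)
```

Saving the pipeline as one object means the test program cannot accidentally refit the vectorizer, because it never creates one.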