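I am new to machine learning and have been trying to implement this task, but it is still unclear to me. I have been at it for two months, so please help me resolve my error.
What I am actually trying to do is:
- Train an SVM classifier on TRAIN_features and TRAIN_labels extracted from TRAIN_dataset, which has shape (98962,) and size 98962.
- Test the SVM classifier on TEST_features extracted from another dataset, TEST_dataset, which has the same shape and size as TRAIN_dataset, i.e. (98962,) and 98962.
After preprocessing TRAIN_features and TEST_features, I vectorized both with TfidfVectorizer. I then computed the shape and size of both feature sets again: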

    vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
    processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

The size of processed_TRAIN_features becomes 1032665 and its shape becomes (98962, 9434).

    vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
    processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

The size of processed_TEST_features becomes 1457961 and its shape becomes (98962, 10782).
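I know that when I train the SVM classifier on processed_TRAIN_features and then use the same classifier to predict on processed_TEST_features, it raises an error, because the two feature matrices no longer have the same shape and size.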
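For illustration, the same mismatch can be reproduced on a tiny scale. This is a minimal sketch, not the real data: the toy sentences and the names `train_docs` and `test_docs` are invented, and the vectorizers use default settings rather than the `min_df`/`max_df` values above. Each independently fitted vectorizer learns its own vocabulary, so the column counts differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
train_docs = ["the cat sat", "the dog ran", "a cat ran"]
test_docs = ["birds fly high", "fish swim deep in water", "birds swim"]

# Two independent vectorizers, each learning its own vocabulary.
train_matrix = TfidfVectorizer().fit_transform(train_docs)
test_matrix = TfidfVectorizer().fit_transform(test_docs)

# Same number of rows (documents), different number of columns (terms).
print(train_matrix.shape)
print(test_matrix.shape)
```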
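I think the only way to solve this is to reshape one of the sparse matrices (numpy.float64), either processed_TEST_features or processed_TRAIN_features. I believe only processed_TRAIN_features can be reshaped, because it is smaller than processed_TEST_features. Or is there another way to achieve the two steps listed above? I have not been able to work this out and am still looking for a way to make processed_TRAIN_features equal to processed_TEST_features in shape and size.
I would be very grateful if any of you could help me with this. Thank you in advance.
The full code is as follows: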

    import re

    import pandas as pd
    from sklearn import preprocessing, svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    DataPath2 = ".../train.csv"
    TRAIN_dataset = pd.read_csv(DataPath2)
    DataPath1 = "..../completeDATAset.csv"
    TEST_dataset = pd.read_csv(DataPath1)

    TRAIN_features = TRAIN_dataset.iloc[:, 1].values
    TRAIN_labels = TRAIN_dataset.iloc[:, 0].values
    TEST_features = TEST_dataset.iloc[:, 1].values
    TEST_labeels = TEST_dataset.iloc[:, 0].values

    lab_enc = preprocessing.LabelEncoder()
    TEST_labels = lab_enc.fit_transform(TEST_labeels)

    processed_TRAIN_features = []
    for sentence in range(0, len(TRAIN_features)):
        # Remove all the special characters
        processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))
        # Remove all single characters
        processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
        # Remove special symbols
        processed_feature = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature)
        # Remove special symbols
        processed_feature = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature)
        # Remove special symbols
        processed_feature = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature)
        # Remove single characters from the start
        processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)
        # Substitute multiple spaces with a single space
        processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
        # Remove links
        processed_feature = re.sub(r"http\S+", "", processed_feature)
        # Remove prefixed 'b'
        processed_feature = re.sub(r'^b\s+', '', processed_feature)
        # Remove rt
        processed_feature = re.sub(r'^rt\s+', '', processed_feature)
        # Convert to lowercase
        processed_feature = processed_feature.lower()
        processed_TRAIN_features.append(processed_feature)

    vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
    processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

    processed_TEST_features = []
    for sentence in range(0, len(TEST_features)):
        # Remove all the special characters
        processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))
        # Remove all single characters
        processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)
        # Remove special symbols
        processed_feature1 = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature1)
        # Remove special symbols
        processed_feature1 = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature1)
        # Remove special symbols
        processed_feature1 = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature1)
        # Remove single characters from the start
        processed_feature1 = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature1)
        # Substitute multiple spaces with a single space
        processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)
        # Remove links
        processed_feature1 = re.sub(r"http\S+", "", processed_feature1)
        # Remove prefixed 'b'
        processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)
        # Remove rt
        processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)
        # Convert to lowercase
        processed_feature1 = processed_feature1.lower()
        processed_TEST_features.append(processed_feature1)

    vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
    processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

    X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(
        processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)

    text_classifier = svm.SVC(kernel='linear', class_weight="balanced",
                              probability=True, C=1, random_state=0)
    text_classifier.fit(X_train_data, y_train_data)
    text_classifier.predict(processed_TEST_features)

Title edit: "Predicting the classification of a dataset" => "Predicting a dataset"
Answer:
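
    from scipy.sparse import csr_matrix

    # new_row_length and new_column_length are placeholders for the target shape
    processed_TRAIN_features = csr_matrix((processed_TRAIN_features), shape=(new_row_length, new_column_length))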
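
An alternative that avoids reshaping altogether is to fit a single TfidfVectorizer on the training text and reuse it on the test text with `transform` instead of `fit_transform`. Both matrices then share the training vocabulary, and with it the column count. Reshaping alone would not make the columns of two separately fitted vectorizers refer to the same words. The following is only a minimal sketch under assumptions: the toy sentences, the labels, and the names `train_texts`, `train_labels` and `test_texts` are invented for illustration, and `min_df`, `max_df` and `probability=True` from the question are left out because the toy corpus is too small for them.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film", "awful unseen words"]

# One vectorizer: learn the vocabulary on the training text only.
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer; words not seen in training are ignored.
X_test = vectorizer.transform(test_texts)

text_classifier = svm.SVC(kernel="linear", class_weight="balanced", C=1, random_state=0)
text_classifier.fit(X_train, train_labels)
predictions = text_classifier.predict(X_test)

# Both matrices now have the same number of columns.
print(X_train.shape, X_test.shape)
print(predictions)
```

Applied to the code in the question, this would mean dropping `vectorizer1` and calling `vectorizer.transform(processed_TEST_features)` after the same preprocessing loop.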