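Machine learning: predicting a second dataset with a classifier trained on the first dataset

I'm new to machine learning and have been trying to implement the following, but it is still not clear to me. I have been at it for two months, so please help me fix my mistake.

What I'm actually trying to do is:

  1. Train an SVM classifier on TRAIN_features and TRAIN_labels extracted from TRAIN_dataset, which has shape (98962,) and size 98962.
  2. Test the SVM classifier on TEST_features extracted from another dataset, TEST_dataset, which has the same shape and size as TRAIN_dataset, i.e. (98962,) and 98962.

After preprocessing TRAIN_features and TEST_features, I vectorized both with TfidfVectorizer. I then computed the shape and size of both feature sets again: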

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

The size of processed_TRAIN_features becomes 1032665 and its shape becomes (98962, 9434).

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

The size of processed_TEST_features becomes 1457961 and its shape becomes (98962, 10782).

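I know that when I train the SVM classifier on processed_TRAIN_features and then use the same classifier to predict on processed_TEST_features, it raises an error, because the two feature matrices now differ in shape and size.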
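The mismatch can be reproduced on a small scale. The following is a minimal sketch, assuming two made-up toy corpora (train_docs and test_docs are hypothetical stand-ins, not the real datasets): each separately fitted TfidfVectorizer learns its own vocabulary, so the number of columns follows the words in each corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the two datasets.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot twist", "bad acting overall", "great cast and crew"]

# Fitting a separate vectorizer on each corpus learns a separate vocabulary.
train_matrix = TfidfVectorizer().fit_transform(train_docs)
test_matrix = TfidfVectorizer().fit_transform(test_docs)

print(train_matrix.shape)  # (3, 5): five distinct words in train_docs
print(test_matrix.shape)   # (3, 10): ten distinct words in test_docs
```

A classifier fitted on the 5-column matrix expects 5 features per row, which is why predicting on the 10-column matrix fails.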

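I think the only way to solve this is to reshape one of the sparse matrices (numpy.float64), either processed_TEST_features or processed_TRAIN_features. I believe only processed_TRAIN_features can be reshaped, because it is smaller than processed_TEST_features. Or is there another way to achieve points 1 and 2 above? I have not managed to implement this and am still looking for a way to make processed_TRAIN_features equal to processed_TEST_features in shape and size.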
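One thing worth checking before reshaping: even when two matrices have the same number of columns, a given column index may refer to different words in each. A small sketch with hypothetical documents (the corpora and the names train_vec and test_vec are assumptions for illustration) shows this through the vocabulary_ attribute, which maps each term to its column index:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpora chosen so both vocabularies have three terms.
train_docs = ["apple banana", "banana cherry"]
test_docs = ["aardvark apple", "apple yak"]

train_vec = TfidfVectorizer().fit(train_docs)
test_vec = TfidfVectorizer().fit(test_docs)

# Same number of columns in both vocabularies...
print(len(train_vec.vocabulary_), len(test_vec.vocabulary_))

# ...but the word "apple" sits in a different column in each.
print(train_vec.vocabulary_["apple"])
print(test_vec.vocabulary_["apple"])
```

So padding or truncating a matrix to the other's shape would make the shapes agree without making the columns comparable.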

I would be very grateful if any of you could help me with this. Thanks in advance.

The full code is below:

import re

import pandas as pd
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

DataPath2 = ".../train.csv"
TRAIN_dataset = pd.read_csv(DataPath2)
DataPath1 = "..../completeDATAset.csv"
TEST_dataset = pd.read_csv(DataPath1)

# Column 1 holds the text, column 0 holds the label
TRAIN_features = TRAIN_dataset.iloc[:, 1].values
TRAIN_labels = TRAIN_dataset.iloc[:, 0].values
TEST_features = TEST_dataset.iloc[:, 1].values
TEST_labeels = TEST_dataset.iloc[:, 0].values

lab_enc = preprocessing.LabelEncoder()
TEST_labels = lab_enc.fit_transform(TEST_labeels)

processed_TRAIN_features = []
for sentence in range(0, len(TRAIN_features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))
    # remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature)
    # remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature)
    # remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature)
    # Remove single characters from the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)
    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # remove links
    processed_feature = re.sub(r"http\S+", "", processed_feature)
    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # removing rt
    processed_feature = re.sub(r'^rt\s+', '', processed_feature)
    # Converting to Lowercase
    processed_feature = processed_feature.lower()
    processed_TRAIN_features.append(processed_feature)

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

processed_TEST_features = []
for sentence in range(0, len(TEST_features)):
    # Remove all the special characters
    processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))
    # remove all single characters
    processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)
    # remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature1)
    # remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature1)
    # remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature1)
    # Remove single characters from the start
    processed_feature1 = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature1)
    # Substituting multiple spaces with single space
    processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)
    # remove links
    processed_feature1 = re.sub(r"http\S+", "", processed_feature1)
    # Removing prefixed 'b'
    processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)
    # removing rt
    processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)
    # Converting to Lowercase
    processed_feature1 = processed_feature1.lower()
    processed_TEST_features.append(processed_feature1)

# A second vectorizer is fitted on the test text here
vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)

text_classifier = svm.SVC(kernel='linear', class_weight="balanced", probability=True, C=1, random_state=0)
text_classifier.fit(X_train_data, y_train_data)
text_classifier.predict(processed_TEST_features)
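The two cleaning loops apply the same steps, so they could be factored into one function and reused for both datasets. The following is a rough sketch rather than a drop-in replacement: clean_text is a hypothetical helper name, it keeps only some of the substitutions, and it deliberately reorders them. It lowercases first so that `^rt\s+` can match an uppercase "RT", and it removes links before stripping punctuation, because once `\W` has replaced ":" and "/" with spaces a URL no longer matches `http\S+`.

```python
import re


def clean_text(raw):
    """Normalize one raw text value (hypothetical helper, not from the original post)."""
    text = str(raw).lower()
    text = re.sub(r"http\S+", " ", text)      # drop links while they are still intact
    text = re.sub(r"\W", " ", text)           # replace non-word characters with spaces
    text = re.sub(r"\s+[a-z]\s+", " ", text)  # drop isolated single letters
    text = re.sub(r"^rt\s+", "", text)        # drop a leading retweet marker
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


print(clean_text("RT Check this: http://t.co/abc It's GREAT!!"))  # check this it great
```

With such a helper, each dataset could be cleaned with a list comprehension, for example `[clean_text(t) for t in TRAIN_features]`.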

Title edit: Classification of the predicted dataset => Predicting the dataset


Answer:

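from scipy.sparse import csr_matrix

# new_row_length and new_column_length are placeholders for the target dimensions
processed_TRAIN_features = csr_matrix(processed_TRAIN_features, shape=(new_row_length, new_column_length))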
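Another approach that avoids reshaping altogether, offered here as a sketch and not as part of the answer above: fit a single TfidfVectorizer on the training text only, then call transform (not fit_transform) on the test text. The test matrix then gets the same columns, in the same order, as the training matrix, and words that never appeared in training are ignored. The toy texts and labels below are made up for illustration; min_df, max_df and probability=True from the question are left out only because the toy corpus is too small for them.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the cleaned TRAIN and TEST texts.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film", "awful movie indeed"]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer so the test matrix has the same columns.
X_test = vectorizer.transform(test_texts)

classifier = svm.SVC(kernel="linear", class_weight="balanced", C=1, random_state=0)
classifier.fit(X_train, train_labels)
predictions = classifier.predict(X_test)

print(X_train.shape, X_test.shape)  # column counts match
print(predictions)
```

Applied to the question's code, this would mean dropping vectorizer1 and computing processed_TEST_features with vectorizer.transform(...) on the cleaned test text.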
