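I have a school project that requires machine learning. After several rounds of troubleshooting I have hit a dead end and don't know how to fix the problem.
Here is my code: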
    import numpy as np
    import pandas as pd
    from sqlalchemy import create_engine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    db_connection = 'mysql+pymysql://root:@localhost/databases'
    conn = create_engine(db_connection)
    df = pd.read_sql("SELECT * from barang", conn)
    cth_data = pd.DataFrame(df)
    #print(cth_data.head())
    cth_data = cth_data.dropna()
    y = cth_data['kode_aset']
    x = cth_data[['merk','ukuran','bahan','harga']]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=100)
    vectorizer = CountVectorizer(max_features=50000, ngram_range=(1, 50))
    d_feture = vectorizer.fit_transform(x_train)
    #d_label = vectorizer.transform(y_train)
    clf.fit(d_feture, y_train)
    t_data = vectorizer.transform(x_test)
    y_pred = clf.predict(t_data)
    print("Model_Accuracy: " + str(np.mean(y_pred == y_test)))
I load the data from a MySQL database. Here is a screenshot of the table:
[database screenshot]
In the end I get this error:
      File "Machine_learn_V_0.0.1.py", line 41, in <module>
        clf.fit(d_feture, y_train)
      File "C:\Python35\lib\site-packages\sklearn\ensemble\forest.py", line 333, in fit
        for i, t in enumerate(trees))
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
        if self.dispatch_one_batch(iterator):
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
        self._dispatch(tasks)
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
        result = ImmediateResult(func)
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
        self.results = batch()
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
        for func, args, kwargs in self.items]
      File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
        for func, args, kwargs in self.items]
      File "C:\Python35\lib\site-packages\sklearn\ensemble\forest.py", line 119, in _parallel_build_trees
        tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
      File "C:\Python35\lib\site-packages\sklearn\tree\tree.py", line 801, in fit
        X_idx_sorted=X_idx_sorted)
      File "C:\Python35\lib\site-packages\sklearn\tree\tree.py", line 236, in fit
        "number of samples=%d" % (len(y), n_samples))
    ValueError: Number of labels=223 does not match number of samples=4
Answer:
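CountVectorizer only works on an iterable of strings; it does not handle DataFrame columns the way you expect. When it is given a DataFrame it iterates over the column names, which is why it produced 4 samples (one per column) instead of one per row. You should therefore concatenate the strings in cth_data[['merk','ukuran','bahan','harga']] into a single column, for example: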
    cols = ['merk','ukuran','bahan','harga']
    cth_data['combined'] = cth_data[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    x = cth_data["combined"]
From here on, your code should work.
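To make the cause and the fix concrete, here is a minimal sketch that reproduces the sample-count mismatch and then trains on a combined column. It assumes a small in-memory DataFrame standing in for the `barang` table (all row values are invented for illustration), and it joins the fields with a space rather than `_`; that is a substitution on my part, made because the default CountVectorizer tokenizer treats `_` as a word character and would otherwise see each joined row as one long token.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `barang` table; the values are made up.
cth_data = pd.DataFrame({
    "merk": ["acer", "asus", "acer", "dell", "asus", "dell", "acer", "asus"],
    "ukuran": ["14", "15", "14", "13", "15", "13", "14", "15"],
    "bahan": ["plastik", "metal", "plastik", "metal",
              "metal", "metal", "plastik", "metal"],
    "harga": [5000, 7000, 5200, 9000, 7100, 8800, 5100, 6900],
    "kode_aset": ["A", "B", "A", "C", "B", "C", "A", "B"],
})
cols = ["merk", "ukuran", "bahan", "harga"]

# Iterating over a DataFrame yields its column names, so the vectorizer
# sees one "document" per column, regardless of the number of rows.
n_docs_from_frame = CountVectorizer().fit_transform(cth_data[cols]).shape[0]

# Fix: build one string per row. Joined with a space (an assumption of this
# sketch) so that each field becomes its own token.
cth_data["combined"] = cth_data[cols].apply(
    lambda row: " ".join(row.values.astype(str)), axis=1
)

x = cth_data["combined"]
y = cth_data["kode_aset"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

vectorizer = CountVectorizer()
d_feature = vectorizer.fit_transform(x_train)  # one row per training sample

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(d_feature, y_train)
y_pred = clf.predict(vectorizer.transform(x_test))

print("documents seen when passing the DataFrame:", n_docs_from_frame)
print("rows in feature matrix:", d_feature.shape[0], "labels:", len(y_train))
```

With the combined column, the feature matrix has one row per training sample, so its row count matches the number of labels and `clf.fit` no longer raises the ValueError. Whether a bag-of-words encoding suits a numeric column like `harga` is a separate modelling question that this sketch does not address.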