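I'm new to machine learning and have run into some problems with image classification. I'm trying to tell cats from dogs using a simple classification technique, k-nearest neighbors.
Here is my code so far: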
    import os

    import cv2
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline

    DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
    CATEGORIES = ['Dog', 'Cat']
    IMG_SIZE = 30

    data = []
    categories = []

    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        categ_id = CATEGORIES.index(category)
        for img in os.listdir(path):
            try:
                # Flag 0 reads the image as grayscale
                img_array = cv2.imread(os.path.join(path, img), 0)
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
                data.append(new_array)
                categories.append(categ_id)
            except Exception as e:
                # print(e)
                pass

    print(data[0])

    s1 = pd.Series(data)
    s2 = pd.Series(categories)
    frame = {'Img array': s1, 'category': s2}
    df = pd.DataFrame(frame)

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)
When I try to fit the data, I get this error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-76-9d98d7b11202> in <module>
          2 from sklearn.neighbors import KNeighborsClassifier
          3 
    ----> 4 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
          5 
          6 print(X_train)

    ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
       2094         raise TypeError("Invalid parameters passed: %s" % str(options))
       2095 
    -> 2096     arrays = indexable(*arrays)
       2097 
       2098     n_samples = _num_samples(arrays[0])

    ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
        228         else:
        229             result.append(np.array(X))
    --> 230     check_consistent_length(*result)
        231     return result
        232 

    ~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
        203     if len(uniques) > 1:
        204         raise ValueError("Found input variables with inconsistent numbers of"
    --> 205                          " samples: %r" % [int(l) for l in lengths])
        206 
        207 

    ValueError: Found input variables with inconsistent numbers of samples: [24946, 22451400]
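How do I prepare the training data correctly? By the way, I don't want to use deep learning yet; that will be my next step.
Any help would be greatly appreciated.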
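Answer:
If you are not using deep learning for image classification, you have to prepare the data yourself in a form suitable for supervised classification.
Steps
1) Resize all images to the same size. You can loop over the images, resize each one, and save it.
2) Get the pixel vector of each image and build a dataset from those vectors. For example, if your cat images are in the "Cat" folder and your dog images are in the "Dog" folder, iterate over all images in each folder and read their pixel values. At the same time, label the data as "cat" (cat=1) or "non-cat" (non-cat=0).
3) Combine catdf and dogdf, and shuffle the resulting data frame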
    data = pd.concat([catdf, dogdf])
    data = data.sample(frac=1)
Now you have a dataset with image labels.
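Steps 1) to 3) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the helper `images_to_frame` and the `label` column name are made up for illustration, and random arrays stand in for images that were already read and resized with cv2 as in the question. The point is that each image becomes one row of IMG_SIZE * IMG_SIZE pixel values.

```python
import numpy as np
import pandas as pd

IMG_SIZE = 30  # same side length as in the question


def images_to_frame(images, label):
    """Hypothetical helper: one row of flattened pixels per image, plus a label."""
    # Stack to (n_images, IMG_SIZE, IMG_SIZE), then flatten each image to a row.
    pixels = np.stack(images).reshape(len(images), -1)
    frame = pd.DataFrame(pixels)
    frame["label"] = label
    return frame


# Assumption: random arrays stand in for grayscale images already resized with cv2.
rng = np.random.default_rng(0)
cat_images = [rng.integers(0, 256, (IMG_SIZE, IMG_SIZE), dtype=np.uint8) for _ in range(5)]
dog_images = [rng.integers(0, 256, (IMG_SIZE, IMG_SIZE), dtype=np.uint8) for _ in range(7)]

catdf = images_to_frame(cat_images, label=1)  # cat = 1
dogdf = images_to_frame(dog_images, label=0)  # non-cat = 0

# Combine and shuffle; ignore_index and reset_index keep the row index unique.
data = pd.concat([catdf, dogdf], ignore_index=True)
data = data.sample(frac=1, random_state=0).reset_index(drop=True)

print(data.shape)  # (12, 901): 900 pixel columns plus the label column
```

Passing `random_state` to `sample` is optional; it just makes the shuffle reproducible.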
4) Split the dataset into training and test sets, and fit the model on the training set.
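For step 4), a rough sketch with scikit-learn is below. The random arrays and `N_IMAGES` are placeholders rather than real data, so the accuracy it prints means nothing. What matters is the shape: X should be 2-D, (n_samples, n_features), with exactly as many rows as y has labels. The error in the question reports 24946 and 22451400 samples, and 22451400 = 24946 * 30 * 30, which suggests that all pixels ended up in one long 1-D array instead of one row per image.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

IMG_SIZE = 30
N_IMAGES = 200  # placeholder count, not the real dataset size

# Assumption: random data stands in for the resized grayscale images and labels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, (N_IMAGES, IMG_SIZE, IMG_SIZE))
y = rng.integers(0, 2, N_IMAGES)  # 1 = cat, 0 = non-cat

# One row per image: (N_IMAGES, 30, 30) -> (N_IMAGES, 900).
X = images.reshape(N_IMAGES, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(X.shape, X_train.shape, X_test.shape, accuracy)
```

If the pixels and labels live together in one data frame, as after step 3), you would first separate the label column from the pixel columns to get y and X.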