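I've been trying to clean my data in JupyterLab by following several tutorials, but every attempt ends in one error or another, so I'm asking here on Stack Overflow in case someone can help.
This is the CSV file I want to clean: https://1drv.ms/u/s!AvOXB8kb-IHBgjaveis044GVoPpk
I'm building a machine learning model, so I want to convert all the object-typed values to numbers, but I don't know how.
Edit: I tried cleaning the data again from scratch.
My code: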
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    criminal_data = pd.read_csv('database2.csv')
    X = criminal_data.drop(columns=['Agency Type', 'City', 'State', 'Crime Solved'])
    y = criminal_data['City']
    model = DecisionTreeClassifier()
    model.fit(X, y)
    criminal_data
The error message:
    ValueError                                Traceback (most recent call last)
    <ipython-input-117-4b6968f9994f> in <module>
          6 y = criminal_data['City']
          7 model = DecisionTreeClassifier()
    ----> 8 model.fit(X, y)
          9 criminal_data

    ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
        896         """
        897
    --> 898         super().fit(
        899             X, y,
        900             sample_weight=sample_weight,

    ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
        154             check_X_params = dict(dtype=DTYPE, accept_sparse="csc")
        155             check_y_params = dict(ensure_2d=False, dtype=None)
    --> 156             X, y = self._validate_data(X, y,
        157                                        validate_separately=(check_X_params,
        158                                                             check_y_params))

    ~\anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
        428                 # :(
        429                 check_X_params, check_y_params = validate_separately
    --> 430                 X = check_array(X, **check_X_params)
        431                 y = check_array(y, **check_y_params)
        432             else:

    ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
         61             extra_args = len(args) - len(all_args)
         62             if extra_args <= 0:
    ---> 63                 return f(*args, **kwargs)
         64
         65             # extra_args > 0

    ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
        614                     array = array.astype(dtype, casting="unsafe", copy=False)
        615                 else:
    --> 616                     array = np.asarray(array, order=order, dtype=dtype)
        617             except ComplexWarning as complex_warning:
        618                 raise ValueError("Complex data not supported\n"

    ~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
        100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
        101
    --> 102     return array(a, dtype, copy=False, order=order)
        103
        104

    ~\anaconda3\lib\site-packages\pandas\core\generic.py in __array__(self, dtype=None) -> np.ndarray:
    -> 1899         return np.asarray(self._values, dtype=dtype)
       1900
       1901     def __array_wrap__(

    ~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
        100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
        101
    --> 102     return array(a, dtype, copy=False, order=order)
        103
        104

    ValueError: could not convert string to float: 'Anchorage'
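Answer:
You are trying to train the model on non-numeric data. The estimator converts X to a float array before fitting, which is why the string 'Anchorage' triggers the ValueError. Encode the string columns before fitting the model. One option is LabelEncoder: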
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()

    # Encode every object-typed (string) column as integer codes;
    # numeric columns are left untouched.
    for column_name in X.columns:
        if X[column_name].dtype == object:
            X[column_name] = le.fit_transform(X[column_name])
If a column contains values of mixed types (for example, strings alongside NaN), try the following instead:
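    import numpy as np
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()

    for column_name in X.columns:
        # Replace missing values with a placeholder string, then cast
        # everything to str so each column has a single, sortable type.
        X[column_name] = X[column_name].replace(np.nan, 'none')
        X[column_name] = le.fit_transform(X[column_name].astype(str))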
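One thing to watch with the loops above: a single LabelEncoder is refit on every column, so afterwards it only remembers the mapping for the last one. Below is a minimal, self-contained sketch that keeps one encoder per column so the integer codes can be mapped back to the original strings. The toy DataFrame, its column names, and the `encoders` dict are made up for illustration and are not taken from database2.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real CSV (assumption).
df = pd.DataFrame({
    "City": ["Anchorage", "Juneau", "Anchorage", "Nome"],
    "Weapon": ["Knife", None, "Handgun", "Knife"],
    "Victim Age": [30, 42, 27, 55],
    "Crime Solved": ["Yes", "No", "Yes", "No"],
})

X = df.drop(columns=["Crime Solved"]).copy()
y = df["Crime Solved"]

# One fitted encoder per non-numeric column, so each mapping is preserved.
encoders = {}
for column_name in X.columns:
    if not pd.api.types.is_numeric_dtype(X[column_name]):
        le = LabelEncoder()
        # Fill missing values with a placeholder and cast to str before encoding.
        cleaned = X[column_name].fillna("none").astype(str)
        X[column_name] = le.fit_transform(cleaned)
        encoders[column_name] = le

# All columns are numeric now, so fitting no longer raises the ValueError.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Map the integer codes back to the original city names.
original_cities = encoders["City"].inverse_transform(X["City"])
```

Note that scikit-learn documents LabelEncoder as meant for target labels; for feature columns, OrdinalEncoder or OneHotEncoder from sklearn.preprocessing may be a better fit, depending on whether an arbitrary integer ordering of categories is acceptable for your model.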