Could not convert string to float when running a machine learning algorithm (Python 3, Anaconda)

I am currently following a video on applying machine learning algorithms to the KDD Cup 99 dataset. When I run the code below, I get the error "could not convert string to float: 'normal'". 'normal' is one of the labels found in the Y feature set shown below. The Y feature set has 23 labels; when I tested the algorithm predicting only 3 of them (normal, smurf and neptune) it worked perfectly well, but as soon as I try to have it predict all the labels I get this error. Any guidance would be greatly appreciated, as I have been at this for two days now.

feature_cols = ['duration', 'src_bytes', 'dst_bytes', 'land',
    'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
    'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files',
    'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
    'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
    'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
    'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate',
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
    'proto__icmp', 'proto__tcp', 'proto__udp']
x = dataset[feature_cols]
y = dataset.label
y.value_counts(normalize=True)

Y feature labels

smurf.
neptune.
normal.
back.
satan.
ipsweep.
portsweep.
warezclient.
teardrop.
pod.
nmap.
guess_passwd.
buffer_overflow.
land.
warezmaster.
imap.
rootkit.
loadmodule.
ftp_write.
multihop.
phf.
perl.
spy.
Name: label, dtype: float64
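For reference, value_counts(normalize=True) returns the class proportions as a float64 Series, which is why the output above ends with dtype: float64. A minimal sketch with a made-up miniature label column (the real column has the 23 classes listed above):

```python
import pandas as pd

# Hypothetical miniature label column standing in for dataset.label.
labels = pd.Series(["smurf.", "smurf.", "neptune.", "normal."], name="label")

# normalize=True reports each class as a fraction of all rows
# rather than as a raw count, so the values sum to 1.0.
proportions = labels.value_counts(normalize=True)
print(proportions)
```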

Code and error

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))

ValueError                                Traceback (most recent call last)
<ipython-input-70-722f95b657f5> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 dt = DecisionTreeClassifier()
----> 3 scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
      4 print (scores)
      5 print ("Accuracy: %2.10f" % np.mean(scores))

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1579                                               train, test, verbose, None,
   1580                                               fit_params)
-> 1581                       for train, test in cv)
   1582     return np.array(scores)[:, 0]
   1583

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333
    334     def get(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1673             estimator.fit(X_train, **fit_params)
   1674         else:
-> 1675             estimator.fit(X_train, y_train, **fit_params)
   1676
   1677     except Exception as e:

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434
    435         if ensure_2d:

ValueError: could not convert string to float: 'normal.'
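The bottom frame of the traceback shows where the error actually originates: check_array casts the whole feature matrix to a float dtype via np.array(...), and that cast fails on the first string cell it meets. A minimal reproduction of the same failure, independent of sklearn:

```python
import numpy as np

# sklearn's check_array ends in np.array(array, dtype=DTYPE, ...);
# a string cell such as the label 'normal.' cannot be cast to float.
try:
    np.array([["normal."]], dtype=np.float32)
except ValueError as err:
    print(err)
```

So the error is not about which labels y contains; it means a string column is still present somewhere inside x.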

Full code, as requested

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

# warning, this takes a while to load
dataset = pd.read_csv('../data/kddcup.data', header=None, names=col_names)

# create dummy variables for protocol_type
protocol_dummies = pd.get_dummies(dataset['protocol_type'], prefix='proto_')
# concatenate the dummy columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
dataset = pd.concat([dataset, protocol_dummies], axis=1)
del dataset['protocol_type']

x = dataset.drop(['label'], axis=1)
y = dataset.label

from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.cross_validation import train_test_split
from datetime import datetime

feature_cols = ['duration','src_bytes','dst_bytes','land',
       'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
       'num_compromised', 'root_shell', 'su_attempted', 'num_root',
       'num_file_creations', 'num_shells', 'num_access_files',
       'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
       'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
       'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
       'dst_host_serror_rate', 'dst_host_srv_serror_rate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
       'proto__icmp', 'proto__tcp', 'proto__udp']
x = dataset[feature_cols]
y = dataset.label

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))
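One detail worth noting in the code above: get_dummies(..., prefix='proto_') combines the given prefix with the default prefix_sep='_', which is why the dummy columns come out with a doubled underscore (proto__tcp and so on). A small sketch with a hypothetical stand-in for the protocol column:

```python
import pandas as pd

# Hypothetical miniature protocol_type column.
proto = pd.Series(["tcp", "udp", "icmp", "tcp"], name="protocol_type")

# prefix='proto_' plus the default prefix_sep='_' produces the
# doubled underscore seen in the question's feature_cols list.
dummies = pd.get_dummies(proto, prefix="proto_")
print(list(dummies.columns))  # ['proto__icmp', 'proto__tcp', 'proto__udp']
```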

One row from the KDD dataset

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.


Answer:

I just realized that I had left the label column in the x feature set. I removed it and it works now.
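Concretely, the fix is to build x from every column except the target rather than from a hand-typed feature_cols list that still contains 'label'. A minimal sketch on a toy DataFrame (column names borrowed from the question; note that on a modern scikit-learn, cross_val_score lives in sklearn.model_selection, since sklearn.cross_validation has been removed):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the KDD frame: numeric features plus the string target.
dataset = pd.DataFrame({
    "duration":  [0, 1, 2, 3, 0, 1, 2, 3],
    "src_bytes": [181, 0, 5, 7, 181, 0, 5, 7],
    "label":     ["normal.", "smurf."] * 4,
})

# Keeping 'label' inside x is what fed the string column to the tree.
# Dropping the target from the features resolves the ValueError;
# the string labels are fine in y, since classifiers accept string targets.
x = dataset.drop(columns=["label"])
y = dataset["label"]

dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring="accuracy", cv=2)
print(scores)
```

The same dataset.drop(['label'], axis=1) line already appears earlier in the full code; the bug was that x was then reassigned from feature_cols, which reintroduced the label column.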
