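I am trying to apply Isolation Forest to data converted from an event log, but I get the error "TypeError: invalid type promotion". Is this caused by the datetime columns? I don't understand what I am doing wrong!
Part of my table (after processing):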
    +--------------+----------------------+--------------+--------------------+--------------------+-------------------+-----------------+
    | org:resource | lifecycle:transition | concept:name | time:timestamp     | case:REG_DATE      | case:concept:name | case:AMOUNT_REQ |
    +--------------+----------------------+--------------+--------------------+--------------------+-------------------+-----------------+
    | 52           | 0                    | 9            | 2011 10-01 38:44.5 | 2011 10-01 38:44.5 | 0                 | 20000           |
    | 52           | 0                    | 6            | 2011 10-01 38:44.9 | 2011 10-01 38:44.5 | 2                 | 20000           |
    | 52           | 0                    | 7            | 2011 10-01 39:37.9 | 2011 10-01 38:44.5 | 0                 | 20000           |
    | 52           | 1                    | 19           | 2011 10-01 39:38.9 | 2011 10-01 38:44.5 | 1                 | 20000           |
    | 68           | 2                    | 19           | 2011 10-01 36:46.4 | 2011 10-01 38:44.5 | 3                 | 20000           |
    +--------------+----------------------+--------------+--------------------+--------------------+-------------------+-----------------+
Printing df.info() gives:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 262200 entries, 0 to 262199
    Data columns (total 7 columns):
     #   Column                Non-Null Count   Dtype
    ---  ------                --------------   -----
     0   org:resource          262200 non-null  int64
     1   lifecycle:transition  262200 non-null  int64
     2   concept:name          262200 non-null  int64
     3   time:timestamp        262200 non-null  datetime64[ns]
     4   case:REG_DATE         262200 non-null  datetime64[ns]
     5   case:concept:name     262200 non-null  int64
     6   case:AMOUNT_REQ       262200 non-null  int32
    dtypes: datetime64[ns](2), int32(1), int64(4)
    memory usage: 13.0 MB
My code is:

    from sklearn.ensemble import IsolationForest

    contamination = 0.05

    model = IsolationForest(contamination=contamination, n_estimators=10000)
    model.fit(df)

    df["iforest"] = pd.Series(model.predict(df))
    df["iforest"] = df["iforest"].map({1: 0, -1: 1})
    df["score"] = model.decision_function(df)
    df.sort_values("score")
However, I get the following error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-23-5edb86351ac8> in <module>
          4
          5 model = IsolationForest(contamination=contamination, n_estimators=10000)
    ----> 6 model.fit(df)
          7
          8 df["iforest"] = pd.Series(model.predict(df))

    ~\.conda\envs\process_mining\lib\site-packages\sklearn\ensemble\_iforest.py in fit(self, X, y, sample_weight)
        261         )
        262
    --> 263         X = check_array(X, accept_sparse=['csc'])
        264         if issparse(X):
        265             # Pre-sort indices to avoid that each individual tree of the

    ~\.conda\envs\process_mining\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
         70                           FutureWarning)
         71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
    ---> 72         return f(**kwargs)
         73     return inner_f
         74

    ~\.conda\envs\process_mining\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
        531
        532         if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
    --> 533             dtype_orig = np.result_type(*dtypes_orig)
        534
        535         if dtype_numeric:

    <__array_function__ internals> in result_type(*args, **kwargs)

    TypeError: invalid type promotion
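Answer:
I found the solution through this answer: Python – linear regression TypeError: invalid type promotion
Yes, the datetime columns are the cause: as the traceback shows, check_array calls np.result_type on the column dtypes, and NumPy has no common type for datetime64[ns] and the integer columns. You need to convert the timestamps to ordinals, after which the fit works. I did the conversion with the following code: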
    import datetime as dt
    # Replace each timestamp with its ordinal, an integer day number
    df['time:timestamp'] = df['time:timestamp'].map(dt.datetime.toordinal)
    df['case:REG_DATE'] = df['case:REG_DATE'].map(dt.datetime.toordinal)
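
One caveat: toordinal keeps only the calendar date, so events that happen on the same day all get the same value. If the time of day matters for the anomaly scores, converting to seconds since the Unix epoch is one possible alternative. The sketch below is only an illustration under assumptions: the two-row frame is made up to mirror the dtypes in the question, and to_seconds is a hypothetical helper, not part of pandas or scikit-learn. It first reproduces the promotion failure with np.result_type, then converts every datetime column to a float.

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the question's dtypes (integer + datetime columns).
df = pd.DataFrame({
    "org:resource": [52, 68],
    "time:timestamp": pd.to_datetime(
        ["2011-10-01 00:38:44.5", "2011-10-01 00:39:37.9"]
    ),
})

# Reproduce the failure: NumPy cannot find a common dtype for
# an integer column and a datetime64 column.
try:
    np.result_type(*df.dtypes)
    promotion_ok = True
except TypeError:
    promotion_ok = False


def to_seconds(frame):
    """Hypothetical helper: return a copy in which every datetime column
    is replaced by float seconds since 1970-01-01 (time of day is kept)."""
    out = frame.copy()
    epoch = pd.Timestamp("1970-01-01")
    for col in out.select_dtypes(include=["datetime"]).columns:
        out[col] = (out[col] - epoch) / pd.Timedelta(seconds=1)
    return out


numeric = to_seconds(df)
print(promotion_ok)
print(numeric.dtypes)
```

The resulting frame contains only numeric dtypes, so it should pass the same check_array step that failed in the traceback, and model.fit(numeric) can be tried in place of model.fit(df).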