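I am working with the California housing dataset from scikit-learn. I want to create two binary features: "within 10 km of San Francisco" and "within 10 km of Los Angeles". I wrote a custom transformer that works fine on its own, but it throws a TypeError when I put it inside a ColumnTransformer. Here is the code: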

    from math import radians
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.compose import ColumnTransformer
    from sklearn.metrics.pairwise import haversine_distances
    from sklearn.datasets import fetch_california_housing
    import numpy as np
    import pandas as pd

    # Import data into DataFrame
    data = fetch_california_housing()
    X = pd.DataFrame(data['data'], columns=data['feature_names'])
    y = data['target']

    # Custom transformer for 'Latitude' and 'Longitude' cols
    class NearCity(BaseEstimator, TransformerMixin):
        def __init__(self, distance=10):
            self.la = (34.05, -118.24)
            self.sf = (37.77, -122.41)
            self.dis = distance

        def calc_dist(self, coords_1, coords_2):
            coords_1 = [radians(_) for _ in coords_1]
            coords_2 = [radians(_) for _ in coords_2]
            result = haversine_distances([coords_1, coords_2])[0,-1]
            return result * 6_371

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            dist_to_sf = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.sf)
            dist_to_sf = (dist_to_sf < self.dis).astype(int)

            dist_to_la = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.la)
            dist_to_la = (dist_to_la < self.dis).astype(int)

            X_trans = np.column_stack((X, dist_to_sf, dist_to_la))
            return X_trans

    ct = ColumnTransformer([('near_city', NearCity(), ['Latitude', 'Longitude'])],
                           remainder='passthrough')
    ct.fit_transform(X)
    #> /Users/.../anaconda3/envs/data3/lib/python3.7/site-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
    #>   FutureWarning)
    #> Traceback (most recent call last):
    #> <ipython-input-13-603f6cd4afd3> in transform(self, X)
    #>      17     def transform(self, X):
    #>      18         dist_to_sf = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.sf)
    #> ---> 19         dist_to_sf = (dist_to_sf < self.dis).astype(int)
    #>      20
    #>      21         dist_to_la = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.la)
    #> TypeError: '<' not supported between instances of 'float' and 'NoneType'

Created on 2020-04-23 by the reprexpy package
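The problem is that the self.dis attribute is not preserved. If I instantiate the transformer on its own, there is no problem: self.dis = distance = 10. Inside the ColumnTransformer, however, it ends up as NoneType. Oddly, if I simply hard-code self.dis = 10, it works.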
What do you think is going on?

    Session info --------------------------------------------------------------------
    Platform: Darwin-18.7.0-x86_64-i386-64bit (64-bit)
    Python: 3.7
    Date: 2020-04-23
    Packages ------------------------------------------------------------------------
    numpy==1.18.1
    pandas==1.0.1
    reprexpy==0.3.0
    scikit-learn==0.22.1

Answer:
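It turns out the problem lies in sklearn.base:

    deep_items = value.get_params().items()

The get_params() function inspects the __init__ parameters to work out what the class's parameters are, and then assumes that each one is stored on the instance under the same name. My parameter is called distance but I stored it as self.dis, so get_params() found no self.distance and reported None for it (this is what the FutureWarning is about). The copy of the transformer that ColumnTransformer builds from those parameters was therefore created with distance=None.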
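The mechanism can be reproduced without scikit-learn. The following is a minimal sketch, not scikit-learn's actual implementation: SketchEstimator, sketch_clone, Mismatched and Matched are hypothetical names invented for this illustration. They only imitate the idea that parameter names are read from the __init__ signature, looked up as attributes of the same name, and then used to build a copy.

```python
import inspect


class SketchEstimator:
    """Hypothetical stand-in for BaseEstimator (illustration only)."""

    def get_params(self):
        # Read the parameter names from the __init__ signature ...
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        # ... and look up an attribute with the same name on the instance.
        # A missing attribute becomes None here, imitating the behaviour
        # that the FutureWarning says applied before version 0.24.
        return {name: getattr(self, name, None) for name in names}


def sketch_clone(estimator):
    # Hypothetical helper: rebuild the estimator from its reported
    # parameters, roughly what happens when a copy is made before fitting.
    return type(estimator)(**estimator.get_params())


class Mismatched(SketchEstimator):
    def __init__(self, distance=10):
        self.dis = distance  # attribute name differs from the parameter name


class Matched(SketchEstimator):
    def __init__(self, distance=10):
        self.distance = distance  # attribute name equals the parameter name


print(Mismatched(distance=25).get_params())         # {'distance': None}
print(sketch_clone(Mismatched(distance=25)).dis)    # None
print(sketch_clone(Matched(distance=25)).distance)  # 25
```

Note that the default of 10 never applies in the copy, because None is passed explicitly as distance. That would also explain why hard-coding self.dis = 10 appeared to work.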
So I was able to fix the problem by changing my __init__ method:

    def __init__(self, distance=10):
        self.la = (34.05, -118.24)
        self.sf = (37.77, -122.41)
        # <-- use the same name as the parameter;
        #     transform() must then read self.distance instead of self.dis
        self.distance = distance

Many thanks to my colleague for spotting this!