I want to convert one of my features into separate binary features:
From:

    df["pattern_id"]
    Out[202]:
    0       3
    1       3
    ...
    7440    2
    7441    2
    7442    3
    Name: pattern_id, Length: 7443, dtype: int64

To:

    df["pattern_id"]
    Out[202]:
    0       0 0 1
    1       0 0 1
    ...
    7440    0 1 0
    7441    0 1 0
    7442    0 0 1
    Name: pattern_id, Length: 7443, dtype: int64
I want to use OneHotEncoder. The data is already integer, so it should not need label encoding first:
    onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
    df = onehotencoder.fit_transform(df).toarray()

    ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'
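Interestingly, I get an error: sklearn tries to encode a different column, not the one I asked for.
We have to encode pattern_id as an integer value.
I used this link: Issue with OneHotEncoder for categorical features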
    # Convert the pattern_id feature to integers
    encoding_feature = ["pattern_id"]
    enc = LabelEncoder()
    enc.fit(encoding_feature)
    working_feature = enc.transform(encoding_feature)
    working_feature = working_feature.reshape(-1, 1)
    ohe = OneHotEncoder(sparse=False)

    # Convert the pattern_id feature to separate binary features
    onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
    df = onehotencoder.fit_transform(df).toarray()
I get the same error. What am I doing wrong?
Edit
Source: https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py
    df
    Out[259]:
        found_img  is_http  link_img  \
    0   True  0  img/aahoteles.svg
        //www.zaragoza.es/cont/paginas/img/sede/logo_e...

        pattern_id  current_link  site_id  \
    0   3  https://www.aa-hoteles.com/es/reservas  3
    6   3  https://www.aa-hoteles.com/es/ofertas-hoteles  3
    7   2  http://about.pressreader.com/contact-us/  4
    8   3  http://about.pressreader.com/contact-us/  4

        status  link_id
    0   200  https://www.aa-hoteles.com/
    1   200  https://www.365travel.asia/
    2   200  https://www.365travel.asia/
    3   200  https://www.365travel.asia/
    4   200  https://www.aa-hoteles.com/
    5   200  https://www.aa-hoteles.com/
    6   200  https://www.aa-hoteles.com/
    7   200  http://about.pressreader.com
    8   200  http://about.pressreader.com
    9   200  https://www.365travel.asia/
    10  200  https://www.365travel.asia/
    11  200  https://www.365travel.asia/
    12  200  https://www.365travel.asia/
    13  200  https://www.365travel.asia/
    14  200  https://www.365travel.asia/
    15  200  https://www.365travel.asia/
    16  200  https://www.365travel.asia/
    17  200  https://www.365travel.asia/
    18  200  http://about.pressreade

    [7443 rows x 8 columns]
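Answer:
If you look at the documentation for OneHotEncoder, you will see that the categorical_features parameter expects '"all" or array of indices or mask', not a string. Because the column name was not a valid index, the encoder was applied to the whole DataFrame, string columns included. You can make your code work by changing the following lines: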
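    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Create a DataFrame of random integers
    df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)), columns=['pattern_id', 'B', 'C', 'D'])

    # Pass the positional index of the column, not its name.
    # Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22,
    # so this form only runs on older releases.
    onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
    df = onehotencoder.fit_transform(df)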
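On scikit-learn releases where categorical_features is no longer available, one possible equivalent is to select the column with ColumnTransformer. This is only a sketch under assumptions: the toy DataFrame, its values, and the choice of remainder="drop" are illustrative and not taken from the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real data: one integer category column
# plus a string column that must NOT be passed to the encoder.
df = pd.DataFrame({
    "pattern_id": [3, 3, 2, 2, 3, 1],
    "current_link": ["http://example.com/"] * 6,
})

# Encode only pattern_id; every other column is dropped, so the
# string column never reaches OneHotEncoder.
ct = ColumnTransformer(
    [("pattern", OneHotEncoder(), ["pattern_id"])],
    remainder="drop",
)
encoded = ct.fit_transform(df)

# The result may be sparse or dense depending on its density,
# so normalise it to a dense NumPy array either way.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
```

Each row of encoded then holds exactly one 1, in the column that corresponds to that row's pattern_id (categories are ordered 1, 2, 3 here).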
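However, df will no longer be a DataFrame after this (fit_transform returns a matrix, not a DataFrame), so I suggest working directly with numpy arrays.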
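If keeping a DataFrame matters, pandas' own get_dummies is an alternative to OneHotEncoder. The sketch below swaps in that technique and is not part of the original answer; the sample values and the pattern_id prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as in the question.
df = pd.DataFrame({
    "pattern_id": [3, 3, 2, 2, 3, 1],
    "current_link": ["http://example.com/a"] * 6,
})

# One binary column per distinct pattern_id value, as integers.
dummies = pd.get_dummies(df["pattern_id"], prefix="pattern_id", dtype=int)

# Replace the original column with its binary expansion; the result
# is still a DataFrame and the other columns are left untouched.
result = pd.concat([df.drop(columns="pattern_id"), dummies], axis=1)
```

The trade-off is that get_dummies derives the categories from whatever data it is given, so it does not remember them for later, unseen data the way a fitted encoder does.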