I have two data frames. The first is city_state:
             city     state
    0  huntsville   alabama
    1  montgomery   alabama
    2  birmingham   alabama
    3      mobile   alabama
    4      dothan   alabama
    5     chicago  illinois
    6       boise     idaho
    7  des moines      iowa
The second is the sentence data frame:
       sentence
    0  marthy was born in dothan
    1  michelle reads some books at her home
    2  hasan is highschool student in chicago
    3  hartford of the west is the nickname of des moines
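I want to derive a new feature named city in the sentence data frame. The city column is extracted from sentence: if a sentence contains a city name from the city_state['city'] column, that name is extracted; otherwise the value is Null.
The expected new data frame looks like this: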
       sentence                                      city
    0  marthy was born in dothan                     dothan
    1  michelle reads some books at her home         Null
    2  hasan is highschool student in chicago        chicago
    3  capital of dream is the motto of des moines   des moines
I ran the following code:
    sentence['city'] ={}

    for city in city_state.city:
        for text in sentence.sentence:
            words = text.split()
            for word in words:
                if word == city:
                    sentence['city'].append(city)
                    break
                else:
                    sentence['city'].append(None)
but this code fails with the following error:
ValueError: Length of values does not match length of index
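If you have feature-engineering experience with a similar case, could you give me some advice on how to write the code to get the expected result?
Thanks.
Note: this is the full error log: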
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-205-8a9038a015ee> in <module>
    ----> 1 sentence['city'] ={}
          2 
          3 for city in city_state.city:
          4     for text in sentence.sentence:
          5         words = text.split()

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
       3117         else:
       3118             # set column
    -> 3119             self._set_item(key, value)
       3120 
       3121     def _setitem_slice(self, key, value):

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
       3192 
       3193         self._ensure_valid_index(value)
    -> 3194         value = self._sanitize_column(key, value)
       3195         NDFrame._set_item(self, key, value)
       3196 

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
       3389 
       3390             # turn me into an ndarray
    -> 3391             value = _sanitize_index(value, self.index, copy=False)
       3392             if not isinstance(value, (np.ndarray, Index)):
       3393                 if isinstance(value, list) and len(value) > 0:

    ~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
       3999 
       4000     if len(data) != len(index):
    -> 4001         raise ValueError('Length of values does not match length of ' 'index')
       4002 
       4003     if isinstance(data, ABCIndexClass) and not copy:

    ValueError: Length of values does not match length of index
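Answer:
Here is a quick-and-dirty approach using apply. I have not tested it on large data frames, so use it with caution. First define a function that extracts the city names: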
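    def ex_city(col, cities):
        # col is one sentence string; cities is a list of city names
        output = []
        for w in cities:
            # substring test: keep every city name found in the sentence
            if w in col:
                output.append(w)
        # join multiple matches with commas; None when nothing matched
        return ','.join(output) if output else None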
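As a quick check of the function on single strings, here is a minimal sketch. The third sample sentence is made up for illustration and is not from the question. It shows that `w in col` is a substring test, so a city name embedded in a longer word also counts as a match.

```python
def ex_city(col, cities):
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    return ','.join(output) if output else None

# A few city names taken from the question's city_state frame
cities = ['dothan', 'chicago', 'des moines', 'mobile']

print(ex_city('marthy was born in dothan', cities))              # dothan
print(ex_city('michelle reads some books at her home', cities))  # None
# Hypothetical sentence: 'mobile' is found inside 'automobiles'
print(ex_city('he fixes automobiles in chicago', cities))        # chicago,mobile
```

If whole-word matching matters for your data, a regular expression with word boundaries is one way to tighten this.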
Then apply it to your sentence data frame:
    city_list = city_state.city.unique().tolist()
    sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
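As a possible alternative, not part of the approach above, the same column can be sketched with `Series.str.extract` and a word-boundary regular expression. This assumes the two frames are rebuilt from the question's sample data and that only the first matching city per sentence is wanted (unlike `ex_city`, which joins every match). Sentences without a match get NaN rather than None.

```python
import re
import pandas as pd

# Sample data rebuilt from the question
city_state = pd.DataFrame({
    'city': ['huntsville', 'montgomery', 'birmingham', 'mobile',
             'dothan', 'chicago', 'boise', 'des moines'],
    'state': ['alabama'] * 5 + ['illinois', 'idaho', 'iowa'],
})
sentence = pd.DataFrame({'sentence': [
    'marthy was born in dothan',
    'michelle reads some books at her home',
    'hasan is highschool student in chicago',
    'hartford of the west is the nickname of des moines',
]})

city_list = city_state['city'].unique().tolist()

# Try longer names first so multi-word cities are not shadowed by shorter ones,
# and escape each name so it is matched literally.
alternatives = '|'.join(
    re.escape(c) for c in sorted(city_list, key=len, reverse=True)
)
pattern = r'\b(' + alternatives + r')\b'

# One capture group with expand=False returns a Series; non-matches are NaN.
sentence['city'] = sentence['sentence'].str.extract(pattern, expand=False)
print(sentence)
```

The `\b` anchors keep a name such as mobile from matching inside a longer word, which the plain substring test would allow.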