Extracting a new feature from a sentence column - Python

I have two data frames:

The city_state data frame

        city        state
    0   huntsville  alabama
    1   montgomery  alabama
    2   birmingham  alabama
    3   mobile      alabama
    4   dothan      alabama
    5   chicago     illinois
    6   boise       idaho
    7   des moines  iowa

and the sentence data frame

        sentence
    0   marthy was born in dothan
    1   michelle reads some books at her home
    2   hasan is highschool student in chicago
    3   hartford of the west is the nickname of des moines

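I want to extract a new feature named city from the sentence data frame. The city column is derived from sentence: if a sentence contains a city name from the city_state['city'] column, that name is extracted; otherwise the value is Null.

The expected new data frame would look like this: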

        sentence                                        city
    0   marthy was born in dothan                       dothan
    1   michelle reads some books at her home           Null
    2   hasan is highschool student in chicago          chicago
    3   capital of dream is the motto of des moines     des moines

I ran the following code

    sentence['city'] ={}
    for city in city_state.city:
        for text in sentence.sentence:
            words = text.split()
            for word in words:
                if word == city:
                    sentence['city'].append(city)
                    break
        else:
            sentence['city'].append(None)

but this code produces the following result

ValueError: Length of values does not match length of index

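If you have feature-engineering experience with a similar case, could you give me some advice on how to write the code to get the expected result?

Thanks

Note: this is the full error log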

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-205-8a9038a015ee> in <module>
    ----> 1 sentence['city'] ={}
          2 
          3 for city in city_state.city:
          4     for text in sentence.sentence:
          5         words = text.split()

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
       3117         else:
       3118             # set column
    -> 3119             self._set_item(key, value)
       3120 
       3121     def _setitem_slice(self, key, value):

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
       3192 
       3193         self._ensure_valid_index(value)
    -> 3194         value = self._sanitize_column(key, value)
       3195         NDFrame._set_item(self, key, value)
       3196 

    ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
       3389 
       3390             # turn me into an ndarray
    -> 3391             value = _sanitize_index(value, self.index, copy=False)
       3392             if not isinstance(value, (np.ndarray, Index)):
       3393                 if isinstance(value, list) and len(value) > 0:

    ~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
       3999 
       4000     if len(data) != len(index):
    -> 4001         raise ValueError('Length of values does not match length of ' 'index')
       4002 
       4003     if isinstance(data, ABCIndexClass) and not copy:

    ValueError: Length of values does not match length of index

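Answer:

Here is a quick-and-dirty approach using apply. I have not tested it on a large data frame, so use it with caution. First, define a function that extracts city names: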

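    def ex_city(col, cities):
        # Collect every city name that occurs as a substring of the sentence
        output = []
        for w in cities:
            if w in col:
                output.append(w)
        # Join multiple matches with commas; return None when nothing matched
        return ','.join(output) if output else None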

Then apply it to your sentence data frame

    city_list = city_state.city.unique().tolist()
    sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
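For reference, the ValueError in the question comes from its first line: `sentence['city'] = {}` tries to assign a zero-length value as a column of a four-row frame, which is the `len(data) != len(index)` check shown in the traceback. Building the whole column first and assigning it once avoids that. Below is a minimal, self-contained sketch of one alternative, assuming whole-word matching is wanted and that a single city per sentence is enough. It swaps the substring test above for a regular expression with word boundaries. The data frames are rebuilt from the question's tables, and the names `cities` and `pattern` are made up for this illustration.

```python
import re

import pandas as pd

# Data frames rebuilt from the tables in the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})

# Try longer names first so a longer city name wins over a shorter one
# that it happens to contain.
cities = sorted(city_state["city"].unique(), key=len, reverse=True)

# One alternation with word boundaries; re.escape guards against any
# regex metacharacters inside a city name.
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# str.extract keeps the first match per row and leaves NaN where no
# city name is found.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
print(sentence)
```

Unlike `ex_city`, this keeps only the first city found in each sentence instead of a comma-joined list, and it matches multi-word names such as des moines, which the word-by-word comparison in the question could never match. It is still purely lexical, so a sentence that uses "mobile" as an ordinary word would be tagged with the city.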
