Scikit-learn: How to extract features from text?

Suppose I have an array of strings:

['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']

I want to extract features like the following from these descriptions:

item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...

Should I prepare predefined lists of known features first? For example:

brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']

I am not sure whether I need CountVectorizer or TfidfVectorizer here. DictVectorizer seems more appropriate, but how do I pull values out of the whole string to fill the dictionary keys?

Can this be done with scikit-learn's feature extraction tools, or should I write my own .fit() and .transform() methods?

Update: @人名, please check whether I understood you correctly:

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... more descriptions
for d in data:
    for brand in brands:
        if brand in d.lower():  # the known values are lowercase
            pass  # ok, brand is found
    for model in models:
        if model in d.lower():
            pass  # ok, model is found

So I would write one loop per feature, N loops in total? That might work, but I am not sure it is correct or flexible.

Answer:

Yes, something like that.

Sorry, you may need to fix up the code below.

import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... more descriptions

features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc',
            r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo'],
    # and other features
}

cat_data = []  # the categories, which you need to convert to numbers
not_found_columns = []

for line in data:
    line_cats = {}
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                # Numeric category within the column, e.g. dell is 2, acer is 5.
                line_cats[col] = i + 1
                found = True
                break  # the category is decided by the first pattern that matches
        # The loop ended without a match: fall back to the default "not present" value.
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)

# cat_data now holds one dict per line; each column maps to a category
# (index + 1) if a feature was found, otherwise 0.

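You now also have the column names and the lines where nothing matched (not_found_columns). Look through them; you may have forgotten some features.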
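One way to review the unmatched entries is to group them by column, which makes it easier to see which pattern list has gaps. This is only a sketch: the sample pairs below, including the 'Notebook Sony Vaio...' description, are made up for illustration and stand in for whatever the loop above actually collected.

```python
from collections import defaultdict

# Hypothetical sample of (column, line) pairs, in the same shape as
# not_found_columns from the answer's loop.
not_found_columns = [
    ('cpu', 'Laptop Dell Latitude...'),
    ('brand', 'Notebook Sony Vaio...'),
    ('cpu', 'Notebook Sony Vaio...'),
]

# Group the unmatched lines by the column that failed to match.
missing_by_column = defaultdict(list)
for col, line in not_found_columns:
    missing_by_column[col].append(line)

# Print a short report per column.
for col, lines in missing_by_column.items():
    print(f"{col}: {len(lines)} unmatched line(s)")
    for line in lines:
        print("    " + line)
```

A column with many unmatched lines is a hint that its pattern list is incomplete.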

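We could also store strings (instead of numbers) as the categories and then use DictVectorizer (DV). In the end the two approaches are equivalent.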
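A minimal sketch of the string-category variant, assuming each line has already been reduced to a dict of column name to matched string. The sample dicts and the 'unknown' placeholder for a missing match are assumptions for illustration, not part of the original answer. DictVectorizer one-hot encodes string values into "column=value" features.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-line results with string categories instead of numeric codes.
# 'unknown' is an assumed placeholder for "no pattern matched".
cat_data = [
    {'brand': 'apple', 'cpu': 'core i7'},
    {'brand': 'dell', 'cpu': 'unknown'},
]

# sparse=False returns a dense NumPy array, which is easier to inspect.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(cat_data)

print(vectorizer.feature_names_)  # one "column=value" name per output column
print(X)                          # one row per description
```

Each distinct string value becomes its own 0/1 column, so no manual numbering of categories is needed.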
