I have some sentences from research papers, and I have manually extracted the keyword phrases I want from each sentence. Now I would like to vectorize each sentence together with its keyword in order to build training data for an SVM classifier. Please see the code below.
I am considering using a dictionary and then applying DictVectorizer from the sklearn library.
Code:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
D = [{"sentence": "the laboratory information system was evaluated", "keyword": "laboratory information system"},
     {"sentence": "the electronic health record system was evaluated", "keyword": "electronic health record system"}]
X = v.fit_transform(D)
print(X)
content = X.toarray()
print(content)
print(v.get_feature_names())

Results:

  (0, 1)	1.0
  (0, 3)	1.0
  (1, 0)	1.0
  (1, 2)	1.0
[[0. 1. 0. 1.]
 [1. 0. 1. 0.]]
['keyword=electronic health record system', 'keyword=laboratory information system', 'sentence=the electronic health record system was evaluated', 'sentence=the laboratory information system was evaluated']
Is this approach methodologically correct, or how should I combine each sentence with its manually extracted keyword for vectorization so as to obtain the training data? Many thanks.
Answer:
I don't think this is ideal, because you are treating each entire sentence as a single categorical feature: DictVectorizer one-hot encodes string values, so every distinct sentence becomes its own column, and two sentences that share most of their words still have nothing in common in the feature space. For a large dataset this quickly becomes unwieldy.
For example, with
D = [{"sentence": "This is sentence one", "keyword": "key 1"},
     {"sentence": "This is sentence two", "keyword": "key 2"},
     {"sentence": "This is sentence three", "keyword": "key 3"},
     {"sentence": "This is sentence four", "keyword": "key 2"},
     {"sentence": "This is sentence five", "keyword": "key 1"}]
X will be

[[1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0.]]
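The feature names make the problem explicit: each distinct sentence is one-hot encoded into a single column of its own, so no word-level overlap between sentences is captured. You can verify this by running the snippet below on the D above (note that get_feature_names() is from older scikit-learn versions; newer ones use get_feature_names_out()):

v = DictVectorizer()
v.fit(D)
print(v.get_feature_names())
# ['keyword=key 1', 'keyword=key 2', 'keyword=key 3',
#  'sentence=This is sentence five', 'sentence=This is sentence four',
#  'sentence=This is sentence one', 'sentence=This is sentence three',
#  'sentence=This is sentence two']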
You probably just need to apply TfidfVectorizer from scikit-learn, which should bring out the informative words in each sentence.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [d['sentence'] for d in D]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
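Since the end goal is an SVM over (sentence, keyword) pairs, one way to combine the two is to TF-IDF-vectorize both texts separately and stack the resulting matrices side by side. This is only a minimal sketch, not a prescribed pipeline; the labels y and the token_pattern setting are placeholder assumptions for the toy D above:

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = [d["sentence"] for d in D]
keywords = [d["keyword"] for d in D]

# Separate vectorizers, so sentence vocabulary and keyword vocabulary
# get their own columns (and their own IDF statistics).
sent_vec = TfidfVectorizer()
# token_pattern widened so single-character tokens like "1" in the
# toy keywords ("key 1", "key 2", ...) are not discarded.
kw_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

# One row per (sentence, keyword) pair: sentence features followed
# by keyword features.
X = hstack([sent_vec.fit_transform(sentences),
            kw_vec.fit_transform(keywords)])

# y is hypothetical: one class label per pair, whatever your task defines.
y = [0, 1, 1, 0, 0]

clf = LinearSVC()
clf.fit(X, y)

Whether to share one vectorizer between sentences and keywords, or keep two as here, depends on whether you want a keyword term and the same term inside a sentence to map to the same column.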