I am using sklearn's Pipeline to classify text.

In this example pipeline, I have a TfIDF vectorizer plus some custom features, wrapped in a FeatureUnion, and the classifier as the final Pipeline step. I fit on the training data and then predict:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# load the custom features and build the FeatureUnion with the vectorizer
features = []
measure_features = MeasureFeatures()  # this class holds my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
features.append(('ngram', countVecWord))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.10000000000000001)

pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.
The code above works fine, but there is a twist: I also want to POS-tag the text and run a different vectorizer over the tagged text.
X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X)  # X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)

# load the custom features and build the FeatureUnion with the vectorizers
features = []
measure_features = MeasureFeatures()  # this class holds my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
# new POS vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)

features.append(('ngram', countVecWord))
features.append(('pos_ngram', countVecPOS))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.10000000000000001)

pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])

# how do I fit both X and X_tagged here?
# how does each vectorizer get X or X_tagged, respectively?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.
How do I fit this kind of data correctly? How can the two vectorizers tell the raw text and the POS text apart? What are my options?

I also have custom features, some of which take the raw text and others the POS-tagged text.
EDIT: added MeasureFeatures()
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np

class MeasureFeatures(BaseEstimator):

    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):

        X_type_token = list()
        X_count_nouns = list()

        for sentence in x_dataset:

            # takes the raw text and calculates the type/token ratio
            X_type_token.append(type_token_ratio(sentence))

            # takes the POS-tagged text and counts the noun POS tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))

        X = np.array([X_type_token, X_count_nouns]).T

        print(X)
        print(X.shape)

        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)

        return self.scalar.transform(X)
This transformer therefore needs the tagged text for the count_nouns() function but the raw text for the type_token_ratio() function.
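One option I can think of (just a sketch, reusing my own do_tagging, type_token_ratio and count_nouns helpers from above) would be to tag inside the transformer itself, so that only the raw text ever has to be passed to pipeline.fit(), but I am not sure this is the right approach:

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np

class MeasureFeatures(BaseEstimator):

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        X_type_token = []
        X_count_nouns = []

        for sentence in x_dataset:
            # the type/token ratio works on the raw sentence
            X_type_token.append(type_token_ratio(sentence))

            # tag the same sentence on the fly and count the nouns,
            # so no separate X_tagged input is needed
            tagged_sentence = do_tagging([sentence])[0]
            X_count_nouns.append(count_nouns(tagged_sentence))

        X = np.array([X_type_token, X_count_nouns]).T

        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)

        return self.scalar.transform(X)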
Answer:
I think you need a FeatureUnion over the two transformers (the TfidfTransformer and a POSTransformer). Of course, you have to define that POSTransformer yourself.

Maybe this article will help you.

Your Pipeline might then look something like this:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts_ngram', CountVectorizer()),
            ('tf_idf_ngram', TfidfTransformer())
        ])),
        ('pos_tf_idf', Pipeline([
            ('pos', POSTransformer()),
            ('counts_pos', CountVectorizer()),
            ('tf_idf_pos', TfidfTransformer())
        ])),
        ('measure_features', MeasureFeatures())
    ])),
    ('classifier', LinearSVC())
])
This assumes that MeasureFeatures and POSTransformer are transformers that conform to the sklearn API.
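For illustration, here is a minimal sketch of what that POSTransformer could look like, assuming it can call your do_tagging() helper from the question (the actual tagging logic is up to you):

from sklearn.base import BaseEstimator, TransformerMixin

class POSTransformer(BaseEstimator, TransformerMixin):
    """Maps each raw document to its string of POS tags,
    so the downstream CountVectorizer sees e.g. 'PP AUX DET NN'."""

    def fit(self, documents, y=None):
        # nothing to learn, the tagger is stateless
        return self

    def transform(self, documents):
        # do_tagging is your own helper from the question,
        # e.g. do_tagging(['I am a sentence']) -> ['PP AUX DET NN']
        return do_tagging(documents)

With something like this in place, you fit the whole pipeline on the raw X only; the POS branch produces the tagged text internally, so you no longer need to pass X_tagged into fit().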