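I have data on various customer attributes (a self-description and age), along with a binary outcome indicating whether each customer would buy a particular product: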
{"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
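I want to use MultinomialNB on self_description to predict would_buy, and then feed those predictions into a logistic regression model that takes not only the would_buy prediction but also age as a covariate.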
Here is my code for the text model so far (I'm new to scikit-learn!), using a simplified dataset:
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Customer data: whether they would buy the product (the outcome I care about),
    # their self-description, and their age.
    data = [
        {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
        {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
        {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56},
        {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
        {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
        {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56}
    ]

    def naive_bayes_model(customer_data):
        self_descriptions = [customer['self_description'] for customer in customer_data]
        decisions = [customer['would_buy'] for customer in customer_data]

        vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
        X = vectorizer.fit_transform(self_descriptions, decisions)
        naive_bayes = MultinomialNB(alpha=0.01)
        naive_bayes.fit(X, decisions)
        train(naive_bayes, X, decisions)

    def train(classifier, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)
        classifier.fit(X_train, y_train)
        # classification_report expects (y_true, y_pred), in that order
        print(classification_report(y_test, classifier.predict(X_test)))

    def main():
        naive_bayes_model(data)

    main()
Answer:
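The short answer is to use the predict_proba or predict_log_proba method of the trained naive_bayes model to create the inputs for the logistic regression model. These outputs can be concatenated with the Age values to form the training and test sets for the logistic regression.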
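As a minimal sketch of what those two methods return and how the concatenation looks, the snippet below uses a small toy dataset that is made up for illustration (the texts, labels, and ages are assumptions, not the asker's real data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only (hypothetical values).
texts = ["college student studying biology", "blue-collar worker",
         "college student studying economics", "UPS worker"]
labels = ["No", "Yes", "No", "Yes"]
ages = np.array([19, 20, 20, 35])

X = TfidfVectorizer().fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

proba = nb.predict_proba(X)          # shape (n_samples, n_classes)
log_proba = nb.predict_log_proba(X)  # elementwise log of proba

# Columns are ordered as in nb.classes_ (sorted labels: 'No', then 'Yes').
# Append age as one extra column to build the logistic regression input.
features = np.hstack((proba, ages.reshape(-1, 1)))
print(nb.classes_, features.shape)
```

Since the two probability columns sum to 1 for a binary outcome, keeping only the 'Yes' column would carry the same information; which form works better as a feature is a modeling choice.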
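However, note that your code as written gives you no access to the naive_bayes model after training, because it exists only inside naive_bayes_model and is never returned. So you will definitely need to restructure your code.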
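One possible restructuring, as a sketch only: have the function return both the fitted vectorizer and the fitted model so the caller can reuse them. The helper name fit_naive_bayes is hypothetical and not part of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def fit_naive_bayes(customer_data):
    """Hypothetical helper: fit the text model and return it with its vectorizer."""
    descriptions = [c["self_description"] for c in customer_data]
    decisions = [c["would_buy"] for c in customer_data]
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    X = vectorizer.fit_transform(descriptions)
    model = MultinomialNB(alpha=0.01).fit(X, decisions)
    return vectorizer, model

# Same simplified records as in the question.
data = [
    {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
    {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56},
    {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
    {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56},
]

vectorizer, naive_bayes = fit_naive_bayes(data)

# The fitted objects are now available outside the function.
new_X = vectorizer.transform(["I'm a college student"])
new_proba = naive_bayes.predict_proba(new_X)
print(new_proba)
```

Returning the vectorizer matters as much as returning the model: new text has to be transformed with the same fitted vocabulary before it can be passed to predict_proba.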
That issue aside, here is how I would incorporate the output of naive_bayes into a logistic regression:
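    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # `data` is the list of customer records from the question.
    descriptions = np.array([customer['self_description'] for customer in data])
    decisions = np.array([customer['would_buy'] for customer in data])
    ages = np.array([customer['Age'] for customer in data])

    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
    desc_vec = vectorizer.fit_transform(descriptions, decisions)
    naive_bayes = MultinomialNB(alpha=0.01)

    # Split text features, ages, and labels together so the rows stay aligned.
    desc_train, desc_test, age_train, age_test, dec_train, dec_test = train_test_split(
        desc_vec, ages, decisions, test_size=0.25, random_state=22)

    naive_bayes.fit(desc_train, dec_train)
    nb_train_preds = naive_bayes.predict_proba(desc_train)

    # Logistic regression on the naive Bayes probabilities plus age.
    lr = LogisticRegression()
    lr_X_train = np.hstack((nb_train_preds, age_train.reshape(-1, 1)))
    lr.fit(lr_X_train, dec_train)

    lr_X_test = np.hstack((naive_bayes.predict_proba(desc_test), age_test.reshape(-1, 1)))
    print(lr.score(lr_X_test, dec_test))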
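Once both models are fitted, scoring a new customer follows the same two steps. The sketch below is an illustration under assumptions: it fits on all six simplified records without a train/test split to stay short, and predict_customer is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Same simplified records as in the question.
data = [
    {"would_buy": "No", "self_description": "I'm a college student studying biology", "Age": 19},
    {"would_buy": "Yes", "self_description": "I'm a blue-collar worker", "Age": 20},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56},
    {"would_buy": "No", "self_description": "I'm a college student studying economics", "Age": 20},
    {"would_buy": "Yes", "self_description": "I'm a UPS worker", "Age": 35},
    {"would_buy": "No", "self_description": "I'm a Stack Overflow denzien", "Age": 56},
]
descriptions = [c["self_description"] for c in data]
decisions = np.array([c["would_buy"] for c in data])
ages = np.array([c["Age"] for c in data])

# Stage 1: text model (fitted on all rows here, for brevity only).
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
desc_vec = vectorizer.fit_transform(descriptions)
naive_bayes = MultinomialNB(alpha=0.01).fit(desc_vec, decisions)

# Stage 2: logistic regression on naive Bayes probabilities plus age.
lr_X = np.hstack((naive_bayes.predict_proba(desc_vec), ages.reshape(-1, 1)))
lr = LogisticRegression().fit(lr_X, decisions)

def predict_customer(description, age):
    """Hypothetical helper: run one new customer through both stages."""
    nb_proba = naive_bayes.predict_proba(vectorizer.transform([description]))
    features = np.hstack((nb_proba, [[age]]))
    return lr.predict(features)[0], lr.predict_proba(features)[0]

label, proba = predict_customer("I'm a college student studying chemistry", 21)
print(label, proba)
```

The new description must go through vectorizer.transform (not fit_transform) so that it is encoded with the vocabulary learned during training.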