Computing document weights using machine learning

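Suppose I have a list of n documents (resumes), and I want to weight each document (resume) against a reference "Job description.txt" from the same category. I plan to weight the documents as described below. My question is: is there any other way to weight documents in this kind of scenario? Thanks in advance.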

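Action plan:

a) Get the resumes (say, 10) of the same category (for example, Java)

b) Get the bag of words from all the documents

For each document:

c) Get the feature names using the TF-IDF vectorizer scores

d) Now I have a list of feature words

e) Compare these features against the bag of words of the "Job Description"

f) Compute the document's score by summing the columns, and weight the document accordingly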

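Answer:

From the question, I understand that you want to score each resume (document) by how similar it is to the job description document. One approach is to convert all the documents, including the job description, into a TF-IDF matrix. Each document can then be treated as a vector in word space. Once you have the TF-IDF matrix, you can use cosine similarity to measure how similar any two documents are, and in particular how similar each resume is to the job description.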
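To make the "documents as vectors" idea concrete, here is a minimal sketch of cosine similarity computed by hand. It uses raw word counts rather than TF-IDF weights to keep the arithmetic easy to follow, and the three short texts are made up purely for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts, used only to illustrate the calculation.
job = Counter("java developer with spring experience".split())
resume_a = Counter("senior java developer spring hibernate".split())
resume_b = Counter("sales manager in the fashion industry".split())

score_a = cosine(job, resume_a)  # shares three of five terms with the job text
score_b = cosine(job, resume_b)  # shares no terms with the job text
print(score_a, score_b)
```

A score close to 1 means the two documents use nearly the same terms in similar proportions, while 0 means they share no terms at all. TF-IDF changes the weights in each vector but not the way the similarity is computed.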

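You should also do some extra preprocessing, such as removing stop words, lemmatizing, and handling the text encoding. In addition, you may want to use n-grams.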
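As a rough sketch of how some of this can be done in scikit-learn itself: `TfidfVectorizer` accepts a `stop_words` option and an `ngram_range` option. The two sample sentences below are invented for illustration, and note that the vectorizer does not lemmatize, so that step would still need a separate tool such as spaCy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts, for illustration only.
docs = [
    "Experienced project manager with a background in software delivery",
    "Software engineer experienced in project planning",
]

# stop_words="english" drops common English words using the built-in list;
# ngram_range=(1, 2) keeps single words and two-word phrases as features.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each feature (unigram or bigram) to its column index.
features = sorted(vectorizer.vocabulary_)
print(features)
```

Bigrams let a phrase such as "project manager" count as its own feature, which can matter for job titles. The trade-off is a larger and sparser matrix, so it is worth checking whether the scores actually improve on your data.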

You can also refer to this book for more information.

Edit:

Adding some setup code

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import string
    import spacy

    # The 'en' shortcut no longer works in spaCy 3+; load the small English model by its full name
    nlp = spacy.load('en_core_web_sm')

    # Translation table for stripping punctuation
    translator = str.maketrans('', '', string.punctuation)

    # Some sample documents
    resumes = [
        "Executive Administrative Assistant with over 10 years of experience providing thorough and skillful support to senior executives.",
        "Experienced Administrative Assistant, successful in project management and systems administration.",
        "10 years of administrative experience in educational settings; particular skill in establishing rapport with people from diverse backgrounds.",
        "Ten years as an administrative support professional in a corporation that provides confidential case work.",
        "A highly organized and detail-oriented Executive Assistant with over 15 years' experience providing thorough and skillful administrative support to senior executives.",
        "More than 20 years as a knowledgeable and effective psychologist working with individuals, groups, and facilities, with particular emphasis on geriatrics and the multiple psychopathologies within that population.",
        "Ten years as a sales professional with management experience in the fashion industry.",
        "More than 6 years as a librarian, with 15 years' experience as an active participant in school-related events and support organizations.",
        "Energetic sales professional with a knack for matching customers with optimal products and services to meet their specific needs. Consistently received excellent feedback from customers.",
        "More than six years of senior software engineering experience, with strong analytical skills and a broad range of computer expertise.",
        "Software Developer/Programmer with history of productivity and successful project outcomes.",
    ]

    job_doc = ["""Executive Administrative with a knack for matching and effective psychologist with particular emphasis on geriatrics"""]

    # Combine the two
    _all = resumes + job_doc

    # Convert each text to a spaCy document
    docs = [nlp(document) for document in _all]

    # Lemmatize, remove stop words, strip punctuation
    docs_pp = [' '.join(token.lemma_.translate(translator) for token in doc if not token.is_stop) for doc in docs]

    # Get the TF-IDF matrix (kept sparse; cosine_similarity accepts sparse input)
    tfidf_vec = TfidfVectorizer()
    tfidf_matrix = tfidf_vec.fit_transform(docs_pp)

    # Similarity of the job description (last row) to every resume (all other rows)
    cosine_similarity(tfidf_matrix[-1:], tfidf_matrix[:-1])
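Once you have the similarity scores, turning them into a ranking is a matter of sorting. The following is a small self-contained sketch of that last step. It skips the spaCy preprocessing for brevity, and the three resumes and the job description are shortened, made-up examples.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, shortened sample data for illustration.
resumes = [
    "Ten years as a sales professional in the fashion industry",
    "Executive administrative assistant supporting senior executives",
    "Software developer with successful project outcomes",
]
job_description = "Executive administrative assistant for senior executives"

# Fit on resumes plus the job description so they share one vocabulary.
matrix = TfidfVectorizer().fit_transform(resumes + [job_description])

# The last row is the job description; compare it with every resume row.
scores = cosine_similarity(matrix[-1:], matrix[:-1]).ravel()

# Indices of the resumes, from most to least similar.
ranking = np.argsort(scores)[::-1]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {resumes[idx]}")
```

The scores can be used directly as document weights, or only the ordering can be kept if all you need is a shortlist.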
