NLP classification labels with many near-duplicates — how to keep only one of each

I have been trying Python's fuzzywuzzy library to find the percentage similarity between strings in my labels. The problem I'm running into is that even after attempting a find-and-replace, many strings remain very similar.

I'm wondering whether anyone here has a method for cleaning up labels. To give an example, I have these labels that look very similar:

 'Cable replaced', 'Cable replaced.', 'Camera is up and recording', 'Chat closed due to inactivity.', 'Closing as duplicate', 'Closing as duplicate.', 'Closing duplicate ticket.', 'Closing ticket.',

Ideally, I'd like to do a find-and-replace on a common substring so that we keep only one instance of 'closing as duplicate'. Any ideas or suggestions would be much appreciated.

To give a more detailed example, here is what I've tried:

import fuzzywuzzy
from fuzzywuzzy import process
import chardet

res = h['resolution'].unique()
res.sort()
res

'All APs are up and stable hence resoling TT  Logs are updated in WL',
'Asset returned to IT hub closing ticket.',
'Auto Resolved - No reply from requester', 'Cable replaced',
'Cable replaced.', 'Camera is up and recording',
'Chat closed due to inactivity.', 'Closing as duplicate',
'Closing as duplicate.', 'Closing duplicate ticket.',
'Closing ticket.', 'Completed', 'Connection to IDF restored',

Now, let's see whether we can find strings like 'cable replaced'.

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100), ('cable replaced.', 100), ('replaced cable', 100),
 ('replaced scanner cable', 78), ('replaced scanner cable.', 78),
 ('scanner cable replaced', 78), ('battery replaced', 73), ('replaced', 73),
 ('replaced battery', 73), ('replaced battery.', 73)]

Hmm, maybe I should create a function that replaces strings whose similarity score is above 90.

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

# use the function we just wrote to replace close matches to "cable replaced" with "cable replaced"
replace_matches_in_column(df=h, column='resolution', string_to_match="cable replaced")

# get all the unique values in the 'resolution' column
res = h['resolution'].unique()

# sort them alphabetically and then take a closer look
res.sort()
res

'auto resolved - no reply from requester', 'battery replaced',
'cable replaced', 'camera is up and recording',
'chat closed due to inactivity.', 'check ok',

Great! Now I only have one instance of 'cable replaced'. Let's verify:

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100), ('replaced scanner cable', 78),
 ('replaced scanner cable.', 78), ('scanner cable replaced', 78),
 ('battery replaced', 73), ('replaced', 73), ('replaced battery', 73),
 ('replaced battery.', 73), ('replaced.', 73), ('hardware replaced', 71)]

Yes! That looks good. Now, this example works well, but as you can see it's fairly manual. Ideally, I'd like to automate this process for all the strings in my resolution column. Any ideas?


Answer:

Using the function from this link, you can build a mapping like the following:

from fuzzywuzzy import fuzz

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping

Let's see how to use it:

# Let's assume items in labels are unique.
# If they are not unique, it will work anyway but will be slower.
labels = [
    "Cable replaced",
    "Cable replaced.",
    "Camera is up and recording",
    "Chat closed due to inactivity.",
    "Closing as duplicate",
    "Closing as duplicate.",
    "Closing duplicate ticket.",
    "Closing ticket.",
    "Completed",
    "Connection to IDF restored",
]
mapping = generate_mapping(labels)

# Print to see mapping
print("\n".join(["{:<50}: {}".format(k, v) for k, v in mapping.items()]))

Output:

Cable replaced                                    : Cable replaced
Cable replaced.                                   : Cable replaced
Camera is up and recording                        : Camera is up and recording
Chat closed due to inactivity.                    : Chat closed due to inactivity.
Closing as duplicate                              : Closing as duplicate
Closing as duplicate.                             : Closing as duplicate
Closing duplicate ticket.                         : Closing duplicate ticket.
Closing ticket.                                   : Closing ticket.
Completed                                         : Completed
Connection to IDF restored                        : Connection to IDF restored

So you can build a mapping for h['resolution'].unique() and then use that mapping to update the h['resolution'] column. Since I don't have your dataframe, I can't try it myself, but based on the above, I'd guess you can use the following code:

for k, v in mapping.items():
    if k != v:
        h.loc[h['resolution'] == k, 'resolution'] = v
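As a side note, pandas can also apply the whole mapping in a single vectorized call with Series.replace, instead of looping over the dict. A minimal sketch, using a made-up stand-in for your h dataframe and a hand-written mapping of the kind generate_mapping would produce (both are assumptions, since I don't have your data):

```python
import pandas as pd

# hypothetical stand-in for the real ticket dataframe
h = pd.DataFrame({"resolution": [
    "Closing as duplicate",
    "Closing as duplicate.",
    "Cable replaced",
    "Cable replaced.",
]})

# a mapping of the shape generate_mapping() returns: each label -> its canonical form
mapping = {
    "Closing as duplicate":  "Closing as duplicate",
    "Closing as duplicate.": "Closing as duplicate",
    "Cable replaced":        "Cable replaced",
    "Cable replaced.":       "Cable replaced",
}

# apply the whole mapping in one call; keys equal to their value are left unchanged
h["resolution"] = h["resolution"].replace(mapping)

print(h["resolution"].unique())  # → ['Closing as duplicate' 'Cable replaced']
```

This does the same thing as the loop above but lets pandas handle the matching internally.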
