How to find frequently occurring phrases in a text document

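I have a text document containing several paragraphs. I need to find the phrases (groups of words that occur together) that appear frequently.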

For example:

Patient name xyz phone no 12345 emailid [email protected].
Patient name abc address some us address

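Comparing these two lines, the common phrase is "Patient name". This phrase can appear anywhere in a paragraph. My requirement is to use natural language processing (NLP) techniques to find the most frequent phrases in the document, regardless of where they occur.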

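Answer:

You should use n-grams for this: you simply count how many times each run of n consecutive words occurs. Since you don't know how many words the repeated phrase contains, you can generate n-grams for several values of n, for example from 2 to 6.

Java n-grams example, tested on JDK 1.8.0: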

import java.util.*;

public class NGramExample {

    // Counts how often each run of n consecutive words occurs in the text.
    public static HashMap<String, Integer> ngrams(String text, int n) {
        ArrayList<String> words = new ArrayList<String>();
        for (String word : text.split(" ")) {
            words.add(word);
        }

        HashMap<String, Integer> map = new HashMap<String, Integer>();

        int c = words.size();
        for (int i = 0; i < c; i++) {
            // Only build an n-gram if n words remain starting at position i.
            if ((i + n - 1) < c) {
                int stop = i + n;
                String ngramWords = words.get(i);

                for (int j = i + 1; j < stop; j++) {
                    ngramWords += " " + words.get(j);
                }

                // Insert with count 1, or add 1 to the existing count.
                map.merge(ngramWords, 1, Integer::sum);
            }
        }

        return map;
    }

    public static void main(String[] args) {
        System.out.println("Ngrams: ");
        HashMap<String, Integer> res = ngrams("Patient name xyz phone no 12345 emailid [email protected]. Patient name abc address some us address", 2);
        for (Map.Entry<String, Integer> entry : res.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue().toString());
        }
    }
}

Output:

Ngrams: 
name abc:1
[email protected]. Patient:1
emailid [email protected].:1
phone no:1
12345 emailid:1
Patient name:2
xyz phone:1
address some:1
us address:1
name xyz:1
some us:1
no 12345:1
abc address:1

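So you can see that "Patient name" occurs most often, with a count of 2. You can call this function with different values of n and pick out the phrases with the highest counts.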
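Trying several values of n can be sketched roughly as follows. This is a minimal illustration in plain Python rather than part of the Java answer: the helper names `ngram_counts` and `repeated_phrases`, the `min_count` threshold, and the shortened sample text are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in the list."""
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


def repeated_phrases(text, min_n=2, max_n=6, min_count=2):
    """Map each n to the phrases of n words seen at least min_count times."""
    words = text.split()  # naive whitespace split, like the Java example
    result = {}
    for n in range(min_n, max_n + 1):
        counts = ngram_counts(words, n)
        frequent = [(p, c) for p, c in counts.most_common() if c >= min_count]
        if frequent:  # skip values of n that produced no repeated phrase
            result[n] = frequent
    return result


# Shortened, hypothetical sample text based on the question.
text = ("Patient name xyz phone no 12345. "
        "Patient name abc address some us address")
print(repeated_phrases(text))
```

Only n = 2 yields a repeated phrase for this sample, so the result has a single entry for "Patient name". On a real document, longer phrases would show up under larger n, and you may want to raise `min_count`.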

Edit: I'm leaving this Python code here for historical reference only.

A simple working Python example (using nltk) to show what I mean:

from nltk import ngrams
from collections import Counter

paragraph = 'Patient name xyz phone no 12345 emailid [email protected]. Patient name abc address some us address'
n = 2
words = paragraph.split(' ')  # of course you should split sentences in a better way
bigrams = ngrams(words, n)
c = Counter(bigrams)
c.most_common()[0]

This gives you the following output:

>> (('Patient', 'name'), 2)
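The comment in the code above notes that splitting on a single space is too crude. One possible improvement, sketched here with only the standard library, is to lowercase the text and extract word tokens with a regular expression so that case and punctuation do not hide repeats. The `tokenize` and `most_common_ngram` helpers and the sample string are hypothetical, and the pattern is a simplifying assumption: it would break an email address into pieces, and n-grams can still span sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def most_common_ngram(text, n):
    """Return the single most frequent n-gram as [(tuple_of_words, count)]."""
    words = tokenize(text)
    # zip over n shifted copies of the list gives a sliding window of n words
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams).most_common(1)


# Hypothetical sample with mixed case and punctuation.
sample = ("Patient name: xyz, phone no 12345. "
          "patient Name abc, address some us address")
print(most_common_ngram(sample, 2))
```

With a plain `split(' ')`, "name:" and "Name" would be counted as different words and the repeated phrase would be missed; after normalisation both occurrences count toward ('patient', 'name').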
