Extracting IDs and their corresponding tokens from a file into a dictionary in Python

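I am trying to create a dictionary from some text files that define a corpus (the RCV1 dataset tokens). I have cleaned out some stop words using regex. The file originally looked like this: https://ibb.co/h3eG5v. I cleaned the stop words with the following code: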

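def cleanFile():
    infile = "lyrl2004_tokens_train.dat"
    outfile = "cleaned_file.dat"
    # Markers to strip from every line
    delete_list = [".W", ".I "]
    # The with-statement closes both files, even if an error occurs
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            for word in delete_list:
                line = line.replace(word, "")
            # Write each cleaned line inside the loop, not only the last one
            fout.write(line)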

Then I used a short snippet to remove any empty lines. The text file now basically looks like this: https://ibb.co/e7Ww5v

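So the format is now one line containing the document ID, which is an integer (2286-26150 for the training data), followed by several lines of tokens separated by single spaces, and then the block repeats: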

2286

token token token token token token

token token token token token

2287

token token token..

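What I am trying to achieve is a function that reads the whole file (luckily it fits in memory) and builds a dictionary mapping each document ID to its list of tokens. It should look like this: {'2286': [token, token, token, ...], '2287': [token, token, token, ...], ...}. I am out of ideas, because I cannot find a way to repeatedly process the text between two consecutive numbers; every approach I have found assumes a delimiter that is not a number. FYI, I will next use this data to build a text classifier (which is why I need the dictionary). The test tokens have the same format as the training tokens, only with higher integers, up to 800,000.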

Answer:

If you are sure that no token consists solely of digits, you can try the following code:

text = '''123
dpqvjp jpo fjiqo iq[woe f
wf q  q[ewfp wqervg
436
oiwbrveojibnpibn eprvnj erv p eprwoij
536
oiberv ih reip rewp wepri pirvnep
erpvj pre
er
erpno poj rgwe epo'''

dct = {}
current_id = None
# Split on any whitespace; an all-digit element starts a new document
for el in text.split():
    if el.isdigit():
        current_id = el
        dct[current_id] = []
    else:
        dct[current_id].append(el)
print(dct)

Result:

{'123': ['dpqvjp', 'jpo', 'fjiqo', 'iq[woe', 'f', 'wf', 'q', 'q[ewfp', 'wqervg'], '436': ['oiwbrveojibnpibn', 'eprvnj', 'erv', 'p', 'eprwoij'], '536': ['oiberv', 'ih', 'reip', 'rewp', 'wepri', 'pirvnep', 'erpvj', 'pre', 'er', 'erpno', 'poj', 'rgwe', 'epo']}
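The same idea can be applied to the file line by line instead of to one big string. The following is a minimal sketch rather than part of the answer above: the function name `build_token_dict` is hypothetical, and it assumes that every ID sits alone on its own line and that no line of tokens consists of a single all-digit token. Because it checks whole lines, a numeric token in the middle of a token line is not mistaken for an ID.

```python
def build_token_dict(lines):
    """Map each document ID to the list of tokens that follow it.

    `lines` is any iterable of strings, for example an open file object.
    """
    dct = {}
    current_id = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) == 1 and parts[0].isdigit():
            # Assumption: a line holding only an integer starts a new document
            current_id = parts[0]
            dct[current_id] = []
        elif current_id is not None:
            # Token lines are appended to the most recent ID
            dct[current_id].extend(parts)
    return dct


# Made-up sample lines in the cleaned format described in the question
sample = [
    "2286\n",
    "alpha beta gamma\n",
    "delta 42 epsilon\n",
    "\n",
    "2287\n",
    "zeta eta\n",
]
result = build_token_dict(sample)
print(result)
```

For the real data, the function could be called as `build_token_dict(f)` inside `with open("cleaned_file.dat") as f:`, since a file object iterates over its lines.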

If tokens may contain numbers, you can design a pattern for your IDs, for example:

<ID=2647>

so that your document would look like this:

<ID=123>
dpqvjp jpo fjiqo iq[woe f
wf q  q[ewfp wqervg
<ID=436>
oiwbrveojibnpibn eprvnj erv p eprwoij
<ID=536>
oiberv ih reip rewp wepri pirvnep
erpvj pre
er
erpno poj rgwe epo

You could then parse it with the following code:

import re

text = '''<ID=123>
dpqvjp jpo fjiqo iq[woe f
wf q  q[ewfp wqervg
<ID=436>
oiwbrveojibnpibn eprvnj erv p eprwoij
<ID=536>
oiberv ih reip rewp wepri pirvnep
erpvj pre
er
erpno poj rgwe epo'''

dct = {}
current_id = None
for el in text.split():
    # An element of the form <ID=digits> starts a new document
    match = re.match(r'<ID=(\d+)>', el)
    if match:
        current_id = match.group(1)
        dct[current_id] = []
    else:
        dct[current_id].append(el)
print(dct)
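Rather than inventing a new ID pattern, it may be possible to rely on markers the raw file already has. The cleaning code in the question strips `.I ` and `.W`, which suggests that in the uncleaned file each document begins with a line of the form `.I <id>` followed by a `.W` line. Under that assumption, which the answer does not confirm, a sketch that parses the raw lines directly might look like this; `parse_raw_tokens` is a hypothetical name.

```python
def parse_raw_tokens(lines):
    """Build {doc_id: [tokens]} from raw lines that still carry .I/.W markers."""
    dct = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            # Assumption: ".I <id>" opens a new document
            current_id = line[3:].strip()
            dct[current_id] = []
        elif line == ".W" or not line:
            continue  # skip the ".W" marker and blank lines
        elif current_id is not None:
            dct[current_id].extend(line.split())
    return dct


# Made-up sample in the assumed raw format; note the numeric token "2287"
raw = [
    ".I 2286\n",
    ".W\n",
    "alpha 2287 beta\n",
    "gamma\n",
    "\n",
    ".I 2287\n",
    ".W\n",
    "delta\n",
]
parsed = parse_raw_tokens(raw)
print(parsed)
```

With this approach the ID is recognized by its marker rather than by being numeric, so numeric tokens do not need special handling.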
