
Removing URLs, @ mentions, and similar content with Python

IT Technology · xiaolong · May 22, 2025

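I am new to Python and to programming. I have a text file that contains URLs, @ mentions, # tags and similar content, and I want to strip them out to get clean text data for a machine learning algorithm. For example, the text data looks like this: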

@Su2ieQ13 But you're IMing with meeeeee. "@apogeum whoooaa, thats soo awesome  my eyes look like black.. except if you have a yellow light bulb close to my eyes then u can"The shop of the day  http://"i couldn't sleep so i stayed awake watching @lilbsuremusic on this live stream thingy and now i'm taking my butt to bed, so sweet dreams "@Lee_Knight ok haha thanks i will try that lol

The code I wrote is as follows:

import re
import string

# load text negative
filename_neg = '/path/to/my/text_file'
file = open(filename_neg, encoding="ISO-8859-1")
text_neg = file.read()
text_neg = re.sub(r'^https?:\/\/.*[\r\n]*', '', text_neg, flags=re.MULTILINE)
file.close()

# split into words by white space
words_neg = text_neg.split()
print(words_neg)

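But the URLs and the other unwanted content are still not removed. I would be very grateful if someone could help me solve this. Thank you.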

Answer:

text_neg = re.sub('@|http://|"', '', text_neg, flags=re.MULTILINE)

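The strings you want to remove should be separated with |, which is the alternation operator in a regular expression. Note that this pattern deletes only the literal @, http:// and double-quote characters themselves; the rest of a username or URL stays in the text.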
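If the goal is to drop whole usernames and URLs rather than just the marker characters, a slightly broader set of patterns may work better. The following is a minimal sketch, not the answerer's method: the helper name clean_text, the three compiled patterns, and the choice to stop a URL at whitespace or a double quote are all assumptions made for illustration. It also shows why the pattern in the question appears to do nothing: ^ anchors the match to the start of a line, so a URL in the middle of a line is never matched.

```python
import re

# Illustrative patterns (assumptions, not part of the original answer).
URL_RE = re.compile(r'https?://[^\s"]*')   # URL up to whitespace or a double quote
MENTION_RE = re.compile(r'@\w+')           # @username
HASHTAG_RE = re.compile(r'#\w+')           # #tag


def clean_text(text):
    """Hypothetical helper: strip URLs, @mentions, #tags and double quotes."""
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    text = text.replace('"', ' ')
    # Collapse the runs of whitespace left behind by the removals.
    return ' '.join(text.split())


# The question's pattern only matches a URL at the very start of a line,
# so a URL in the middle of a line is left untouched.
mid_line = 'see http://x.com now'
anchored = re.sub(r'^https?:\/\/.*[\r\n]*', '', mid_line, flags=re.MULTILINE)

sample = '@Lee_Knight ok haha thanks "The shop of the day  http://example.com #deal'
cleaned = clean_text(sample)
print(anchored)
print(cleaned)
```

In the question's script, this would replace the single re.sub call, for example text_neg = clean_text(text_neg) before splitting into words. Be aware that the @\w+ pattern would also cut into e-mail addresses, so it may need tightening for other kinds of data.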

machine-learning python-3.x
