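I'm new to Python and to programming. I have a text file that contains things like URLs, @, and #, and I want to remove them to get clean text data for a machine learning algorithm. For example, the text data looks like this: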
@Su2ieQ13 But you're IMing with meeeeee. "@apogeum whoooaa, thats soo awesome my eyes look like black.. except if you have a yellow light bulb close to my eyes then u can"The shop of the day http://"i couldn't sleep so i stayed awake watching @lilbsuremusic on this live stream thingy and now i'm taking my butt to bed, so sweet dreams "@Lee_Knight ok haha thanks i will try that lol
The code I wrote is below:
import re
import string

# load text negative
filename_neg = '/path/to/my/text_file'
file = open(filename_neg, encoding="ISO-8859-1")
text_neg = file.read()
text_neg = re.sub(r'^https?:\/\/.*[\r\n]*', '', text_neg, flags=re.MULTILINE)
file.close()

# split into words by white space
words_neg = text_neg.split()
print(words_neg)
But I still can't remove the URLs and the other items. I would be very grateful if someone could help me solve this. Thanks.
Answer:
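text_neg = re.sub('@|http://|"', '', text_neg, flags=re.MULTILINE)
Separate the symbols you want to remove with |, which means "or" (alternation) in a regular expression. Note that this strips only the listed symbols themselves (@, http://, and "); the username after an @ and the rest of a URL after http:// stay in the text.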
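If the goal is to drop whole URLs, @mentions, and # tags rather than just the symbols, something like the following might work. It is only a minimal sketch: the helper name clean_text and the three patterns are assumptions made for illustration, not part of the original code. The pattern in the question starts with ^, so it can only match a URL at the very beginning of a line; the patterns here are not anchored, so they match anywhere in the text.

```python
import re

# Hypothetical patterns; adjust them to what counts as noise in your data.
URL_RE = re.compile(r'https?://\S*')   # "http://" or "https://" plus everything up to the next whitespace
MENTION_RE = re.compile(r'@\w+')       # "@" followed by letters, digits, or underscores
HASHTAG_RE = re.compile(r'#\w+')       # "#" followed by letters, digits, or underscores


def clean_text(text):
    """Remove URLs, @mentions, # tags, and double quotes, then collapse whitespace."""
    text = URL_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    text = HASHTAG_RE.sub('', text)
    text = text.replace('"', '')
    # split() with no arguments splits on any run of whitespace
    return ' '.join(text.split())


sample = '@Lee_Knight ok haha thanks i will try that lol'
print(clean_text(sample))
```

To use it with the code from the question, call clean_text(text_neg) after file.read() and then split the result into words. Be aware that the mention pattern would also cut into e-mail addresses, so it may need tightening if the file contains any.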