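I'm learning NLP and working through the basic preprocessing steps. I'm trying to separate punctuation from the beginning and end of words so the text can be used for embeddings. While doing this, I don't want to break up words like can't and I'm, because I will handle those separately.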
s = "This is what I'm trying to do, but I can't figure out how."
Expected output:
s_separated = "This is what I'm trying to do , but I can't figure out how ."
Answer:
You could try the following:
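import re
s = "This is what I'm trying to do, but I can't figure out how."
# Insert a space at every position that follows a word character and
# precedes one of , . ! ; :  (the apostrophe is not in the set, so
# can't and I'm are left intact).
res = re.sub(r'(?<=\w)(?=[,.!;:])', ' ', s)
print(res)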
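The regex above only handles punctuation that comes right after a word. Since the question also mentions punctuation at the beginning of words, here is a minimal sketch of a token-based alternative. It is not the approach from the answer: the names `TOKEN` and `separate_punctuation` are made up for illustration, and it assumes that "punctuation" means any character that is neither a word character nor whitespace.

```python
import re

# Hypothetical helper pattern: leading punctuation, a lazily matched core,
# then trailing punctuation. Punctuation inside the core (the apostrophe in
# can't, the dot in 3.14) is never touched.
TOKEN = re.compile(r"^([^\w\s]*)(.*?)([^\w\s]*)$")

def separate_punctuation(text):
    pieces = []
    for token in text.split():          # split on any run of whitespace
        lead, core, trail = TOKEN.match(token).groups()
        pieces.extend(lead)             # each leading mark becomes its own piece
        if core:
            pieces.append(core)
        pieces.extend(trail)            # each trailing mark becomes its own piece
    return " ".join(pieces)

s = "This is what I'm trying to do, but I can't figure out how."
print(separate_punctuation(s))
# This is what I'm trying to do , but I can't figure out how .
```

Two caveats with this sketch: it collapses runs of whitespace into single spaces, and it also peels off apostrophes that sit at the very start or end of a token (for example a trailing possessive as in dogs'), so those cases would need extra handling.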