I have some sentences that contain quoted text, for example:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?
I tried using a regex to mask the quoted parts, but the result is not accurate. For example, for the last sentence:
import re

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))
The output is:
Reread these sentences: "<quote>" mean?
whereas the correct output should be:
Reread these sentences: "<quote>" What does the word "courtship" mean?
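Since I have more than 10,000 instances, it is very hard to find one general regex pattern that works for all of them.
My question is: is there any library (perhaps neural-network based?) or method that solves this problem?
Answer:
For these examples, use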
import re

txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""

# Straight quotes: match a whole "..." pair and mask it only if the inner text is 20+ characters.
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
# Curly quotes: mask “...” pairs whose inner text is 20+ characters.
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)
See the Python proof. Use a separate command for each type of quotation mark; that makes the behavior easier to control.
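If the same masking has to run over many sentences, the two substitutions could be wrapped in one helper. This is only a minimal sketch: the function name `mask_quotes`, the `min_len` and `placeholder` parameters, and the `QUOTE_PATTERNS` list are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical helper; names and defaults are illustrative assumptions.
# One pattern per quote style, each consuming the quote characters themselves.
QUOTE_PATTERNS = [
    re.compile(r'"([^"]*)"'),    # straight double quotes
    re.compile(r'“([^“”]*)”'),   # curly double quotes
]

def mask_quotes(text, min_len=20, placeholder="<quote>"):
    """Replace quoted spans whose inner text has at least min_len characters."""
    def repl(m):
        # Short quotations (single words, short phrases) are left untouched.
        return placeholder if len(m.group(1)) >= min_len else m.group()

    for pattern in QUOTE_PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(mask_quotes('Why were "the lights of his town growing smaller below them"?'))
print(mask_quotes('What is a fdsfdsf for the word "adjust"?'))
```

Keeping one pattern per quote style means a new style (for example single quotes) can be added or tuned without touching the others.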
Result:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
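As for why the pattern in the question fails: `.{20,}` is greedy and `.` also matches quote characters, so the match runs from the first opening quote to the last closing quote. Swapping `.` for `[^"]` is not enough either, because lookarounds do not consume the quotes, so a closing quote can be reused as an opening one. The sketch below compares the three variants on the question's sentence; the variable names are illustrative, and the last variant puts the quotes back to produce the output the question asked for.

```python
import re

txt = ('Reread these sentences: "This was his courtship, and it lasted '
       'all through the summer." What does the word "courtship" mean?')

# 1. The question's pattern: greedy `.` crosses quote characters.
greedy = re.sub(r'(?<=").{20,}(?=")', "<quote>", txt)

# 2. Negated class with lookarounds: the text between two separate
#    quotations is also treated as a quotation.
lookaround = re.sub(r'(?<=")[^"]{20,}(?=")', "<quote>", txt)

# 3. Consuming both quotes pairs them up correctly; the replacement
#    re-adds the quotes around the placeholder.
paired = re.sub(r'"([^"]*)"',
                lambda m: '"<quote>"' if len(m.group(1)) >= 20 else m.group(),
                txt)

print(greedy)
print(lookaround)
print(paired)
```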