I have a list of substrings of a story. They all start at the same place but end at different points. Here is my sample input:
["Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud", "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu","Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.", "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. This is some extra text i don't care about"]
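The real data looks like this, except that I have about 40 of these substrings. My goal is to use machine learning to try to produce a single string containing the whole story, which in this example would be: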
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
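It doesn't need to be perfectly exact; I just need a way to extract it as accurately as possible.
I tried finding the longest part of each substring and stitching the parts together, but that didn't work. I need some kind of algorithm that tries to come up with its best guess at the story.
I can't just use the last string, because some strings also contain extra information.
Of the 40 strings I have, some are longer than the desired story and some are shorter. The shorter strings start at the beginning and end somewhere in the middle of the story. The longer strings start at the beginning, contain the complete story, and then have unwanted extra information at the end. The extra information in each longer string is unique (if it weren't unique, it would be considered part of the story).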
Answer:
This should do it (sentences shortened for readability):
stories = [
    "Lorem ipsum",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. This is some extra text i don't care about",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. A different gibberish this time.",
    "Lorem ipsum dolor sit amet, consectetur",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",  # This is the full story
]

# Sort from shortest to longest so every candidate is compared only with longer strings.
stories.sort(key=len)

story = ""
for i, short_story in enumerate(stories[:-1]):
    for long_story in stories[i+1:]:
        if not long_story.startswith(short_story):
            break
    else:
        # No break: every longer string starts with this candidate.
        story = short_story

print(story)
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
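Note that this code assumes at least one story has gibberish at the end (the longest string is never considered as a candidate); without that assumption it could not handle the sample input in your question.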
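For around 40 strings it may be convenient to wrap the same idea in a function. The following is only a minimal sketch of the approach above, not a different algorithm: the name `extract_story` and the parameter `fragments` are hypothetical, and it keeps the same assumption that at least one string carries trailing extra text, so the longest string is never treated as a candidate.

```python
def extract_story(fragments):
    """Return the longest fragment that is a prefix of every longer fragment.

    Assumption (same as the code above): at least one fragment ends with
    extra text, so the longest fragment is never considered a candidate.
    """
    # sorted() leaves the caller's list untouched, unlike list.sort().
    ordered = sorted(fragments, key=len)
    story = ""
    for i, candidate in enumerate(ordered[:-1]):
        # Accept the candidate only if all longer fragments start with it.
        if all(longer.startswith(candidate) for longer in ordered[i + 1:]):
            story = candidate
    return story


fragments = [
    "Lorem ipsum",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. This is some extra text i don't care about",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. A different gibberish this time.",
    "Lorem ipsum dolor sit amet, consectetur",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
]
print(extract_story(fragments))
```

If no string has trailing extra text, this sketch returns the second-longest string rather than the full story, which is the limitation mentioned in the note above.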