统计机器翻译的短语提取算法

我已经编写了以下用于统计机器翻译的短语提取算法的代码。

# -*- coding: utf-8 -*-def phrase_extraction(srctext, trgtext, alignment):    """    短语提取算法。    """    def extract(f_start, f_end, e_start, e_end):        phrases = set()        # 如果 f_end 为 0，则返回 { }        if f_end == 0:            return        # 对于所有 (e,f) ∈ A        for e,f in alignment:            # 如果 e < e_start 或 e > e_end，则返回 { }            if e < e_start or e > e_end:                        return        fs = f_start        # 重复-        while True:            fe = f_end            # 重复-            while True:                # 将短语对 ( e_start .. e_end , f_s .. f_e ) 添加到集合 E 中                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe))                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end))                phrases.add("\t".join([src_phrase, trg_phrase]))                fe+=1 # fe++                # -直到 fe 对齐                if fe in f_aligned or fe > trglen:                    break            fs-=1 # fe--            # -直到 fs 对齐            if fs in f_aligned or fs < 0:                break        return phrases    # 计算源文本和目标文本中的词元数量。    srctext = srctext.split()    trgtext = trgtext.split()    srclen = len(srctext)    trglen = len(trgtext)    # 保持源/目标词的对齐索引。    e_aligned = [i for i,_ in alignment]    f_aligned = [j for _,j in alignment]     bp = set() # 短语对 BP 的集合    # 对于 e_start = 1 ... length(e) 做    for e_start in range(srclen):        # 对于 e_end = e_start ... length(e) 做               for e_end in range(e_start, srclen):            # // 查找最小的匹配外语短语            # (f_start , f_end ) = ( length(f), 0 )            f_start, f_end = trglen, 0            # 对于所有 (e,f) ∈ A 做            for e,f in alignment:                # 如果 e_start ≤ e ≤ e_end 则                if e_start <= e <= e_end:                    f_start = min(f, f_start)                    f_end = max(f, f_end)            # 将 extract (f_start , f_end , e_start , e_end ) 添加到集合 BP 中            phrases = extract(f_start, f_end, e_start, e_end)            if phrases:                bp.update(phrases)    return bpsrctext = "michael assumes that he will stay in the house"trgtext = "michael geht davon aus , dass er im haus bleibt"alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]phrases = phrase_extraction(srctext, trgtext, alignment)for i in phrases:    print i

菲利普·科恩的《统计机器翻译》书中第133页的短语提取算法如下所示：

enter image description here

期望的输出应为：

enter image description here

然而，使用我的代码，我只能得到以下输出：

michael assumes that he will stay in the - michael geht davon aus , dass er im haus

michael assumes that he will stay in the - michael geht davon aus , dass er im haus bleibt

有谁能发现我的实现有什么问题吗？

代码确实提取了短语，但并非如上表所示的完整期望输出：

enter image description here

回答：

我已经研究了这个问题，现在可以重现期望的输出。原来存在一些问题：

算法并不完全。在书的在线版本（2012年第三次印刷）中，extract 函数的第4行已更新。（可能有勘误表）
算法假设索引从1开始，直到并包括结束。
Python假设从0开始的索引，直到但不包括结束。
特别是这影响了一些对 range 的调用和一些比较的停止值。
集合中的项目已被修改，以便更容易重现期望的输出。

示例输出（匹配19个短语，其中一些匹配重复了5次，总共提取了24个）：

$ python2.7 phrase_extract_new.py ( 1) (0, 1) michael — michael( 2) (0, 2) michael assumes — michael geht davon aus ; michael geht davon aus ,( 3) (0, 3) michael assumes that — michael geht davon aus , dass( 4) (0, 4) michael assumes that he — michael geht davon aus , dass er( 5) (0, 9) michael assumes that he will stay in the house — michael geht davon aus , dass er im haus bleibt( 6) (1, 2) assumes — geht davon aus ; geht davon aus ,( 7) (1, 3) assumes that — geht davon aus , dass( 8) (1, 4) assumes that he — geht davon aus , dass er( 9) (1, 9) assumes that he will stay in the house — geht davon aus , dass er im haus bleibt(10) (2, 3) that — dass ; , dass(11) (2, 4) that he — dass er ; , dass er(12) (2, 9) that he will stay in the house — dass er im haus bleibt ; , dass er im haus bleibt(13) (3, 4) he — er(14) (3, 9) he will stay in the house — er im haus bleibt(15) (4, 6) will stay — bleibt(16) (4, 9) will stay in the house — im haus bleibt(17) (6, 8) in the — im(18) (6, 9) in the house — im haus(19) (8, 9) house — haus$ python2.7 phrase_extract_new.py | grep -c ';'5

下面是算法的建议解释。这个算法需要进行相当多的重构，但在进行重构之前，需要使用不同的示例进行更多测试。重现书中的示例只是一个开始：

# -*- coding: utf-8 -*-def phrase_extraction(srctext, trgtext, alignment):    """    短语提取算法。    """    def extract(f_start, f_end, e_start, e_end):        if f_end < 0:  # 0-based indexing.            return {}        # 检查对齐点是否一致。        for e,f in alignment:            if ((f_start <= f <= f_end) and               (e < e_start or e > e_end)):                return {}        # 添加短语对（包括额外的未对齐的 f）        # 备注：如何解释“额外的未对齐的 f”？        phrases = set()        fs = f_start        # 重复-        while True:            fe = f_end            # 重复-            while True:                # 将短语对 ([e_start, e_end], [fs, fe]) 添加到集合 E 中                # 需要在 range 中 +1 以包括端点。                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end+1))                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe+1))                # 包含更多数据以便后续排序。                phrases.add(((e_start, e_end+1), src_phrase, trg_phrase))                fe += 1 # fe++                # -直到 fe 对齐或超出边界                if fe in f_aligned or fe == trglen:                    break            fs -=1  # fe--            # -直到 fs 对齐或超出边界            if fs in f_aligned or fs < 0:                break        return phrases    # 计算源文本和目标文本中的词元数量。    srctext = srctext.split()   # e    trgtext = trgtext.split()   # f    srclen = len(srctext)       # len(e)    trglen = len(trgtext)       # len(f)    # 保持源/目标词的对齐索引。    e_aligned = [i for i,_ in alignment]    f_aligned = [j for _,j in alignment]    bp = set() # 短语对 BP 的集合    # 对于 e_start = 1 ... length(e) 做    # 从 0 到 len(e) - 1 索引 e_start    for e_start in range(srclen):        # 对于 e_end = e_start ... length(e) 做        # 从 e_start 到 len(e) - 1 索引 e_end        for e_end in range(e_start, srclen):            # // 查找最小的匹配外语短语            # (f_start , f_end ) = ( length(f), 0 )            # f_start ∈ [0, len(f) - 1]; f_end ∈ [0, len(f) - 1]            f_start, f_end = trglen-1 , -1  #  0-based indexing            # 对于所有 (e,f) ∈ A 做            for e,f in alignment:                # 如果 e_start ≤ e ≤ e_end 则                if e_start <= e <= e_end:                    f_start = min(f, f_start)                    f_end = max(f, f_end)            # 将 extract (f_start , f_end , e_start , e_end ) 添加到集合 BP 中            phrases = extract(f_start, f_end, e_start, e_end)            if phrases:                bp.update(phrases)    return bp# 参考匹配矩阵。#             0      1      2   3  4     5   6   7    8srctext = "michael assumes that he will stay in the house"#             0      1    2    3  4  5   6  7   8     9trgtext = "michael geht davon aus , dass er im haus bleibt"alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]phrases = phrase_extraction(srctext, trgtext, alignment)# 跟踪 srctext 中每个短语的翻译及其# 对齐，使用一个字典，键为短语，值为# 列表 [e_alignement pair, [f_extractions, ...] ]dlist = {}for p, a, b in phrases:    if a in dlist:        dlist[a][1].append(b)    else:        dlist[a] = [p, [b]]# 根据它们的长度对翻译列表进行排序。较短的短语优先。for v in dlist.values():    v[1].sort(key=lambda x: len(x))# 帮助根据书中示例进行排序的函数。def ordering(p):    k,v = p    return v[0]#for i, p in enumerate(sorted(dlist.items(), key = ordering), 1):    k, v = p    print "({0:2}) {1} {2} — {3}".format( i, v[0], k, " ; ".join(v[1]))

学技术

统计机器翻译的短语提取算法

发表回复取消回复

相关文章：

Related Posts

使用LSTM在Python中预测未来值

如何在gensim的word2vec模型中查找双词组的相似性

dask_xgboost.predict 可以工作但无法显示 – 数据必须是一维的

ML Tuning – Cross Validation in Spark

如何在React JS中使用fetch从REST API获取预测

如何分析ML.NET中多类分类预测得分数组？

发表回复 取消回复

发表回复取消回复