I need to automatically split a #hashtag into its meaningful words.
Example input:
- iloveusa
- mycrushlike
- mydadhero
Example output:
- i love usa
- my crush like
- my dad hero
Is there any tool or open API I can use to do this?
Answer:
from __future__ import division
from collections import Counter
import nltk

# The Brown corpus must be available locally; run nltk.download('brown') once if it is not.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x] / N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) for i in range(start, min(len(text), L) + 1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest) for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('iloveusa'))     # ['i', 'love', 'us', 'a']
print(segment('mycrushlike'))  # ['my', 'crush', 'like']
print(segment('mydadhero'))    # ['my', 'dad', 'hero']
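A practical note on the code above: the recursive segment re-explores the same suffixes many times, so it slows down quickly on longer inputs. A minimal memoized variant is sketched below; it is not part of the original answer, assumes Python 3 (functools.lru_cache), and reuses splits and Pwords defined above. It returns tuples instead of lists so the cached values are hashable and can be concatenated safely.

from functools import lru_cache

@lru_cache(maxsize=None)
def segment_memo(text):
    "Memoized segment: cache the best split found for each suffix of the input."
    if not text:
        return ()
    candidates = ((first,) + segment_memo(rest) for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords)

print(segment_memo('iloveusa'))  # e.g. ('i', 'love', 'us', 'a')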
If you need something better than this, use a bigram/trigram model; a rough sketch of the bigram idea is given below.
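The following is only an illustration of that idea, not code from the original answer: the names BIGRAMS, cPword, Pwords2, and segment2 and the '<S>' start token are introduced here for the sketch, and it reuses WORDS, COUNTS, P, product, and splits from the code above. Each word is conditioned on the previous one, backing off to the unigram estimate when a pair was never observed in the corpus.

W = list(WORDS)
BIGRAMS = Counter(zip(W, W[1:]))  # counts of adjacent word pairs in the Brown corpus

def cPword(word, prev):
    "P(word | prev); back off to the unigram P(word) when the pair is unseen."
    if BIGRAMS[(prev, word)] > 0 and COUNTS[prev] > 0:
        return BIGRAMS[(prev, word)] / COUNTS[prev]
    return P(word)

def Pwords2(words, prev='<S>'):
    "Probability of a word sequence under the bigram model."
    return product(cPword(w, prev if i == 0 else words[i - 1])
                   for i, w in enumerate(words))

def segment2(text):
    "Most probable segmentation of text under the bigram model."
    if not text:
        return []
    candidates = ([first] + segment2(rest) for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords2)

print(segment2('mycrushlike'))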
For more examples, see: Word Segmentation Task