
Unique ID for sentence

IT Technology · xiaolong · May 22, 2025 · 0 Comments

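I have several hundred text snippets in different languages (Unicode). I need to assign a unique ID to each sentence so that I can train a machine learning algorithm. I wrote my own algorithm and found that it produced about 30,000 duplicate numbers. Then I found this solution: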

def remapWord(word):
    return int.from_bytes(word.encode(), 'little')

But apparently this integer is too large for numpy, because fitting the data raises an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Is there another way to get unique IDs, or a way to prevent this ValueError?
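A likely reason for the size problem: int.from_bytes adds 8 bits to the result for every byte of the encoded sentence, while a float64 cannot hold values above roughly 1.8e308 (just under 2**1024). The sketch below only illustrates that growth. The name remap_word and the sample strings are made up for the example, and how scikit-learn ends up reporting the overflow internally is not shown here.

```python
import sys

def remap_word(word):
    # Same idea as the snippet in the question: read the UTF-8 bytes as one integer.
    return int.from_bytes(word.encode(), 'little')

short_id = remap_word("hi")        # 2 bytes, a small integer
long_id = remap_word("a" * 200)    # 200 bytes, roughly 1600 bits

# Python compares int and float exactly, so this check is safe to run.
print(short_id <= sys.float_info.max)   # the short ID fits in a float64
print(long_id > sys.float_info.max)     # the long ID is beyond the float64 range
```

Any sentence longer than 128 bytes after encoding can exceed the float64 range this way.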

Answer:

import hashlib

def remap(word):
    h = hashlib.md5()
    # hashlib works on bytes, so encode the str first (Python 3)
    h.update(word.encode())
    return int(h.hexdigest(), 16)
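An MD5 digest is a 128-bit integer. That is within the float64 range, but a float64 represents integers exactly only up to 2**53, so distinct IDs may be rounded to the same value once the data is converted. Two alternatives are sketched here, neither of them from the original answer: consecutive IDs assigned from a dictionary, which are unique by construction, and a hash truncated to 52 bits, which is stateless but can collide in principle. The names build_id_map and hash_id and the bits parameter are hypothetical.

```python
import hashlib

def build_id_map(sentences):
    """Assign consecutive integer IDs to distinct sentences, in first-seen order."""
    ids = {}
    for sentence in sentences:
        if sentence not in ids:
            ids[sentence] = len(ids)
    return ids

def hash_id(sentence, bits=52):
    """Stateless ID: keep only the top `bits` bits of the MD5 digest.

    Assumption: 52 bits is chosen so the value survives conversion to float64
    unchanged. Collisions are unlikely for a small corpus but not impossible.
    """
    digest = hashlib.md5(sentence.encode("utf-8")).digest()  # 16 bytes
    return int.from_bytes(digest, "big") >> (128 - bits)

corpus = ["hello", "héllo", "hello", "你好"]
id_map = build_id_map(corpus)
print(id_map)             # repeated sentences share one ID
print(hash_id("héllo"))   # same input always gives the same ID
```

The dictionary approach needs the mapping to be kept around so that the same sentence gets the same ID later. The hashed variant needs no stored state but gives up the uniqueness guarantee.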

machine-learning numpy python-3.x scikit-learn
