Recently, while reading some code, I came across a sensitive-word filter implemented with a prefix tree (trie). It looked quite fast, and it is the classic space-for-time trade: looking up a word costs time proportional to the word's length, independent of how many words the dictionary holds. The drawback is just as obvious: the text being checked has to match an entry in the word list exactly, so the approach suits short, well-defined terms.
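For a sense of that trade-off, here is a hypothetical brute-force baseline (the function naive_check and the inline word list are illustrative, not part of the original code): it compares each token against every entry in the list, so the cost grows with the size of the word list, whereas a trie lookup walks at most as many nodes as the token has characters.

```python
# Hypothetical baseline for comparison: cost grows with the number of forbidden words,
# while a trie lookup depends only on the length of the token being checked.
def naive_check(token, forbidden):
    token = token.lower()
    return any(token == word for word in forbidden)

print(naive_check("silly", ["fuck", "fucker", "damn", "silly"]))  # True
```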
How it works (a short sketch follows this list):
- Each node can have multiple child nodes.
- A node "stores" a character, and the links between nodes spell out words: the link from an 'a' node to a 'd' node forms the word "ad".
- A node may be a leaf, in which case it is also the "end" of a word; otherwise it is a "waypoint" for other words that share the same prefix. Either way, its wordcount is incremented by every insertion that passes through it.
- To delete a word, decrement the "waypoint" count of every node along its path, all the way down to the word's final node (pruning any node whose count drops to zero).
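Before the full implementation below, here is a minimal sketch of the same idea using nested dicts instead of a fixed 26-slot child array (MiniTrie, the "#end" sentinel key, and the dict-of-dicts layout are illustrative choices, not taken from the original code). A dict keyed by character also accepts characters outside a-z, which matters if the word list is not pure lowercase ASCII.

```python
# Minimal dict-based sketch of a trie (illustrative; the version below uses a 26-slot array).
class MiniTrie(object):
    def __init__(self):
        self.root = {}  # each node is a dict mapping a character to its child node

    def add(self, word):
        node = self.root
        for char in word.lower():
            node = node.setdefault(char, {})  # walk the path, creating nodes as needed
        node["#end"] = True  # sentinel key: a word ends at this node

    def contains(self, word):
        node = self.root
        for char in word.lower():
            if char not in node:
                return False
            node = node[char]
        return node.get("#end", False)


t = MiniTrie()
t.add("ad")
print(t.contains("ad"), t.contains("a"))  # True False
```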
# coding: utf8

MAX_TREE_WIDTH = 26  # one child slot per lowercase ASCII letter
INIT_CHAR = 'a'      # index 0 corresponds to 'a'

forbiddenwords = """
fuck
fucker
damn
silly
"""


class TrieNode(object):
    def __init__(self):
        self.nodes = [None] * MAX_TREE_WIDTH  # child links, one per letter
        self.wordcount = 0                    # number of stored words passing through this node
        self.isend = 0                        # 1 if some word ends exactly at this node


class TrieTree(object):
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert a word, bumping wordcount on every node along its path."""
        word = word.lower()
        curnode = self.root
        for char in word:
            index = ord(char) - ord(INIT_CHAR)
            if curnode.nodes[index] is None:
                curnode.nodes[index] = TrieNode()
            curnode = curnode.nodes[index]
            curnode.wordcount += 1
        curnode.isend = 1

    def search(self, word):
        """Return how many stored words pass through word's final node, or -1 if the path is absent."""
        word = word.lower()
        curnode = self.root
        for char in word:
            index = ord(char) - ord(INIT_CHAR)
            if curnode.nodes[index] is None:
                return -1
            curnode = curnode.nodes[index]
        return curnode.wordcount

    def countstartswith(self, prefix):
        """Return how many stored words start with prefix, or -1 if none do."""
        prefix = prefix.lower()  # keep lookups consistent with add(), which lowercases
        curnode = self.root
        for char in prefix:
            index = ord(char) - ord(INIT_CHAR)
            if curnode.nodes[index] is None:
                return -1
            curnode = curnode.nodes[index]
        return curnode.wordcount

    def delete(self, word):
        """Remove one word: decrement each node on its path, pruning nodes whose count hits zero."""
        if self.countstartswith(word) > 0:
            word = word.lower()
            curnode = self.root
            for char in word:
                index = ord(char) - ord(INIT_CHAR)
                curnode.nodes[index].wordcount -= 1
                if curnode.nodes[index].wordcount == 0:
                    curnode.nodes[index] = None  # no word passes through any more: prune the subtree
                    return
                curnode = curnode.nodes[index]
            curnode.isend = 0  # the path survives, but no word ends here any longer


if __name__ == "__main__":
    print("hello trie tree.")
    tree = TrieTree()
    tree.add("h")
    tree.add("He")
    tree.add("hEl")
    tree.add("helL")
    tree.add("hello")
    print(tree.search('he'))
    print(tree.countstartswith("h"))
    print(tree.countstartswith("hel"))
    tree.delete("hel")
    print(tree.countstartswith("hel"))
    print(tree.countstartswith("guo"))

    words = [item for item in forbiddenwords.strip().split("\n") if item != '']
    for word in words:
        print("adding word: {}".format(word))
        tree.add(word)

    # test sequence: check each space-delimited token against the trie
    teststr = "you a silly mother fucker"
    tests = teststr.split(" ")
    for test in tests:
        print("{} --> {}".format(test, tree.search(test)))
Output:
hello trie tree.
4
5
3
2
-1
adding word: fuck
adding word: fucker
adding word: damn
adding word: silly
you --> -1
a --> -1
silly --> 1
mother --> -1
fucker --> 1
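The token-by-token check above only flags words that appear as separate, space-delimited tokens, which is the exact-match limitation mentioned at the top. If the helper below were appended to the script above (contains_forbidden is a hypothetical addition, not part of the original post; it assumes the isend flag is maintained as in the add method), the trie could also catch a forbidden word embedded inside a longer string by restarting the walk from every offset:

```python
def contains_forbidden(tree, text):
    # Hypothetical helper: walk the trie from every starting offset and use the
    # isend flag to report a stored word embedded anywhere in the text.
    text = text.lower()
    for start in range(len(text)):
        node = tree.root
        for char in text[start:]:
            index = ord(char) - ord(INIT_CHAR)
            if index < 0 or index >= MAX_TREE_WIDTH or node.nodes[index] is None:
                break  # character outside a-z, or no stored word continues this way
            node = node.nodes[index]
            if node.isend:
                return True  # some stored word ends exactly here
    return False

print(contains_forbidden(tree, "damnit!"))  # True: "damn" is embedded, though search("damnit!") returns -1
```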
Compared with the earlier example that combined Golang and PHP with a naive Bayes classifier, this approach falls short in efficiency, accuracy, and general applicability. But that depends heavily on the scenario: pick the right one, and the prefix tree will give a very satisfying answer.