Sentences: A sentence tokenizer

polaris · · 456 views

This is a shared resource; the information in it may have developed or changed since it was posted.

<a href="https://github.com/neurosnap/sentences">Sentences</a> is a multilingual command-line sentence tokenizer. This Go package converts a blob of text into a list of sentences. The ultimate goal is to become one of the fastest and most accurate sentence tokenizers, with an emphasis on extending it to fit developers' needs. <ul> <li><a href="http://sentences.erock.io/">Demo</a></li> <li><a href="https://godoc.org/gopkg.in/neurosnap/sentences.v1">Docs</a></li> <li><a href="https://github.com/neurosnap/sentences">https://github.com/neurosnap/sentences</a></li> </ul> Any feedback is greatly appreciated. <hr/>**Comments:** epiris: <pre>Whoa, that's a big public API. Do you maybe have a good bit of internal detail in there? I saw you asked about Unicode testing. Luckily, working with UTF-8 is super easy in Go; I love it. There are 4 Unicode code points that I use for all my tests: 0x41, 0xC0, 0xFF21, and 0x1D400 (check the last one, I always have to look it up). They have properties that I think are great for testing. <ul> <li>They should all look like a capital A but be distinct enough to tell apart</li> <li>Each range is linear to Z from that code point, i.e. 'A' + 32 = 'a'</li> <li>The above invariant is nice because you can convert regular ASCII sentences to UTF-8 sentences of various lengths by generating a simple lookup table that you index from zero and look up with w4[runeVal-'A'] // 0x1d400 <ul> <li>It covers the 4 UTF-8 widths (1-4 bytes), so it covers all the issues you run into when iterating as bytes, like off-by-ones and such</li> </ul></li> </ul></pre>
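The lookup idea in the comment can be sketched roughly as follows. This is a minimal illustration, not code from the Sentences package: `bases` and `widen` are hypothetical names, and the mapping simply offsets each ASCII capital letter from one of the four base code points. Two caveats apply. The range starting at 0xC0 holds accented Latin-1 letters rather than look-alikes of A to Z, and includes the multiplication sign U+00D7 where `X` would land. The `+ 32` lowercase offset holds for 0x41, 0xC0 and 0xFF21, but in the block starting at 0x1D400 the lowercase letters begin 26 code points later, so this sketch maps capitals only.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bases holds the four "capital A" code points from the comment, one per
// UTF-8 encoded width (1, 2, 3 and 4 bytes). The name is an assumption.
var bases = [4]rune{0x41, 0xC0, 0xFF21, 0x1D400}

// widen is a hypothetical helper: it maps every ASCII capital letter in s
// onto the range that starts at base (base + (r - 'A')) and leaves every
// other rune, including punctuation and spaces, untouched.
func widen(s string, base rune) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if r >= 'A' && r <= 'Z' {
			r = base + (r - 'A')
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	for _, b := range bases {
		s := widen("HELLO. WORLD.", b)
		// Byte length grows with the encoded width; rune count stays the same.
		fmt.Printf("U+%04X width=%d bytes=%d runes=%d %q\n",
			b, utf8.RuneLen(b), len(s), utf8.RuneCountInString(s), s)
	}
}
```

Feeding the widened strings to a tokenizer exercises the same sentence boundaries at every encoded width, which is where byte-versus-rune indexing mistakes tend to show up.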
