<p><a href="https://github.com/neurosnap/sentences">Sentences</a> is a multilingual command line sentence tokenizer. This Go package converts a blob of text into a list of sentences. The ultimate goal is to become one of the fastest and most accurate sentence tokenizers, with an emphasis on making it extensible to fit developers' needs.</p>
<ul>
<li><a href="http://sentences.erock.io/">Demo</a></li>
<li><a href="https://godoc.org/gopkg.in/neurosnap/sentences.v1">Docs</a></li>
<li><a href="https://github.com/neurosnap/sentences">https://github.com/neurosnap/sentences</a></li>
</ul>
<p><strong>Any feedback is greatly appreciated.</strong></p>
<hr/>**Comments:**<br/><br/>epiris: <pre><p>Woah, that's a big public API. Do you maybe have a good bit of internal detail in there? I saw you asked about Unicode testing. Luckily, working with UTF-8 is super easy in Go, I love it. There are four Unicode code points that I use for all my tests: 0x41, 0xC0, 0xFF21 and 0x1D400 (check the last one, I always have to look it up). They have properties that I think are great for testing.</p>
<ul>
<li>They should all look like a capital A, but distinct enough to tell apart.</li>
<li>Each range is linear to Z from that code point, so offset arithmetic works, e.g. 'A' + 32 = 'a' in ASCII.</li>
<li>The above invariant is nice because you can convert regular ASCII sentences into UTF-8 sentences of various byte lengths by generating a simple lookup table that you index from zero, e.g. w4[runeVal-'A'] // 0x1D400
<ul>
<li>It covers the four UTF-8 widths (1-4 bytes), so it covers all the issues you run into when iterating over bytes, like off-by-one errors and such.</li>
</ul></li>
</ul></pre>