Sentences: A sentence tokenizer

polaris · 347 views
<p><a href="https://github.com/neurosnap/sentences">Sentences</a> is a multilingual command-line sentence tokenizer. This Go package converts a blob of text into a list of sentences. The ultimate goal is to become one of the fastest and most accurate sentence tokenizers, with an emphasis on extending it to fit developers&#39; needs.</p> <ul> <li><a href="http://sentences.erock.io/">Demo</a></li> <li><a href="https://godoc.org/gopkg.in/neurosnap/sentences.v1">Docs</a></li> <li><a href="https://github.com/neurosnap/sentences">https://github.com/neurosnap/sentences</a></li> </ul> <p><strong>Any feedback is greatly appreciated.</strong></p> <hr/>**Comments:**<br/><br/>epiris: <pre><p>Whoa, that&#39;s a big public API. Do you maybe have a good bit of internal detail exposed in there? I saw you asked about Unicode testing. Luckily, working with UTF-8 is super easy in Go; I love it. There are four Unicode code points that I use for all my tests: 0x41, 0xC0, 0xFF21 and 0x1D400 (check the last one, I always have to look it up). They have properties that I think are great for testing.</p> <ul> <li>They all look like a capital A but are distinct enough to tell apart.</li> <li>Each range is linear to Z from that code point. I.e. &#39;A&#39; + 32 = &#39;a&#39; for the first three (the 0x1D400 block has its lowercase letters at +26).</li> <li>The invariant above is nice because you can convert regular ASCII sentences into UTF-8 sentences of various byte widths by generating a simple lookup table that you index from zero, looking letters up with w4[runeVal-&#39;A&#39;] // 0x1d400 <ul> <li>It covers the four UTF-8 widths (1-4 bytes), so it covers the issues you run into when iterating over bytes, like off-by-one errors.</li> </ul></li> </ul></pre>
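
The lookup-table idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `bases` and `widen` are hypothetical names that do not come from the Sentences package, and the sample text is made up. The helper shifts each ASCII capital letter by its offset from a chosen base code point, so the same test sentence can be produced at 1, 2, 3 and 4 bytes per letter.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bases holds the four code points named in the comment. Each one looks like
// a capital A and encodes to a different UTF-8 width (1, 2, 3 and 4 bytes).
var bases = [4]rune{0x41, 0xC0, 0xFF21, 0x1D400}

// widen is a hypothetical helper: it replaces every ASCII capital letter in s
// with the code point at the same offset from base and leaves all other
// runes (spaces, punctuation, lowercase) untouched.
func widen(s string, base rune) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if r >= 'A' && r <= 'Z' {
			r = base + (r - 'A')
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	const sample = "HELLO WORLD. GOODBYE."
	for _, base := range bases {
		w := widen(sample, base)
		// The byte length grows with the width of the base; the rune count stays the same.
		fmt.Printf("U+%04X width=%d bytes=%d runes=%d %s\n",
			base, utf8.RuneLen(base), len(w), utf8.RuneCountInString(w), w)
	}
}
```

Feeding the widened strings to a tokenizer and comparing sentence counts against the ASCII original is one way to catch byte-versus-rune indexing mistakes. One caveat: with base 0xC0 the letter X lands on U+00D7, the multiplication sign, which is still two bytes wide but is not a letter.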
