Why are the delim parameters in Go APIs bytes instead of runes?

xuanbao · · 553 views
<p>For example, the ReadString method of bufio.Reader:</p> <pre><code>func (b *Reader) ReadString(delim byte) (line string, err error) </code></pre> <p>It takes a <em>byte</em> as the <em>delim</em>. Is there an API that takes a <em>rune</em> as the <em>delim</em>?</p> <hr/>**Comments:**<br/><br/>
nosmileface: <pre><p>Because it&#39;s easier to implement. Having multibyte separators means doing lookahead reads. For such advanced cases there is a Scanner type right there in the same package: <a href="http://golang.org/pkg/bufio/#Scanner">http://golang.org/pkg/bufio/#Scanner</a></p></pre>
FUZxxl: <pre><p>As to “why”: The <code>bufio</code> package only allows you to put back a single byte, which makes putting back entire runes impossible. This restriction comes from C&#39;s <code>stdio</code> and is pretty sensible, because allowing arbitrary pushback involves extra overhead in every single IO operation or other complications, such as occasionally resizing the IO buffers, all of which are detrimental to performance.</p></pre>
bradfitz: <pre><blockquote> <p>This restriction comes from C&#39;s stdio</p> </blockquote> <p>That&#39;s not correct. Go doesn&#39;t use C at all. And even the first versions of Go 5+ years ago didn&#39;t use C for the io packages. It&#39;s always been pure Go for io, bufio, fmt, etc.</p></pre>
FUZxxl: <pre><p>Yes, the reference implementation of Go is not written against libc, but the <code>bufio</code> API has a similar design to the <code>stdio</code> API, and this was one of the things they carried over. Remember that Go was designed by a team that also designed parts of UNIX and almost all of Plan 9, so it&#39;s no wonder that they did things the same way as back then.</p></pre>
moinboin: <pre><p>A bufio.Reader has an <a href="http://godoc.org/bufio#Reader.UnreadRune" rel="nofollow">UnreadRune</a> method.</p></pre>
FUZxxl: <pre><p>This only works if the last operation was a <code>ReadRune</code>, for some reason. 
Valid point though.</p></pre>
calebdoxsey: <pre><p>Because searching for multiple bytes is difficult.</p> <p>The <code>bufio.Reader</code> reads data into a fixed-size buffer that is periodically refilled. Suppose I was looking for the string <code>ab</code> inside of <code>xyzabc</code>. The code would do a <code>bytes.Index</code> and find it at index <code>3</code>.</p> <p>But if the buffer size were <code>4</code>, the buffer would first hold <code>xyza</code>, so the search would miss it, then read the next chunk, <code>bc</code>, and miss it there too.</p> <p>Certainly you could write code to work around that, but I think <code>bufio.Reader</code> is generally considered a low-level API.</p> <p><a href="https://github.com/golang/go/issues/2164" rel="nofollow">https://github.com/golang/go/issues/2164</a> <a href="https://github.com/golang/go/issues/511" rel="nofollow">https://github.com/golang/go/issues/511</a></p> <p>I&#39;m not sure if there&#39;s a helper to do this, but here&#39;s how you can write it yourself: (<a href="https://play.golang.org/p/vgpRbPSmVt" rel="nofollow">https://play.golang.org/p/vgpRbPSmVt</a>)</p> <pre><code>// Requires the "bytes" and "io" packages.
func readString(rdr io.RuneReader, delim rune) (string, error) {
	var buf bytes.Buffer
	for {
		r, _, err := rdr.ReadRune()
		if err != nil {
			// Like bufio.Reader.ReadString, return whatever was read
			// before the error (often io.EOF) instead of dropping it.
			return buf.String(), err
		}
		buf.WriteRune(r)
		if r == delim {
			break
		}
	}
	return buf.String(), nil
}
</code></pre></pre>
moinboin: <pre><p>It&#39;s more efficient to scan for a single byte. The delimiters in many data formats can be found by scanning for a single byte.</p></pre>
drvd: <pre><p>No. Mapping between a rune and a []byte requires an encoding, and dealing with failures becomes ugly. 
Of course, the hypothetical method <code>ReadStringRuneDelimited(delim rune)</code> could state &#34;I work properly only for UTF-8-encoded input&#34;, but people tend to ignore such documentation, run it on EBCDIC-encoded data, and complain that ReadStringRuneDelimited is buggy.</p></pre>
noydoc: <pre><p>My guess: eurocentrism.</p></pre>
moinboin: <pre><p>Are commonly used delimiters like \t, \n and &#39; &#39; representable as a single byte because of eurocentrism?</p></pre>
shekispeaks: <pre><p>I don&#39;t understand this, can you explain?</p></pre>
noydoc: <pre><p>I was being a bit of a smart ass, as most European languages can get away with, or are still understandable with, only one byte per letter.</p></pre>
sorennielsen: <pre><p>Well... As a Dane who cannot write my own name using a-z, I think you must be thinking of the US and their lovely insensitive ASCII ;)</p></pre>
FUZxxl: <pre><p>Ehem, no. It&#39;s Americacentrism, if anything, since English can get away with ASCII only, but as soon as you start with languages like German, you have characters outside of ASCII. Remember, Go is all about Unicode (encoded as UTF-8), so everything outside of ASCII is a multi-byte character.</p></pre>
tgaz: <pre><p>That is indeed ugly, given the string output.</p> <p>The function has two inputs: the byte stream and delim. While the input byte stream can contain full UTF-8, delim cannot. And the output type indicates they should both support UTF-8. Not the prettiest API I&#39;ve seen, then.</p> <p>It&#39;s implemented in a straightforward way <a href="http://golang.org/src/bufio/bufio.go?s=11642:11706#L435" rel="nofollow">using ReadBytes and a conversion to string</a>.</p></pre>
