Why is []byte used as a string type?

xuanbao · 430 views


<code>string</code> seems, on the face of it, to be much better than <code>[]byte</code>. Strings yield Unicode code points when iterated over, can be used as map keys, can be compared with <code>==</code> <code><</code> <code>></code>, and have a convenient concatenation operator. Why, then, does the library force us to deal in <code>[]byte</code>s by returning them and using them in important interfaces? Edit: and strings can be compile-time constants. <hr/>**Comments:** uncle_bad_touches: <pre>string types are immutable, so if you want to modify the contents of the buffer you'll need to use a []byte.</pre>TheMerovius: <pre>To give a full explanation, as there is quite a bit of confusion and misinformation in this thread: <ul> <li>Both <code>string</code>s and <code>[]byte</code>s have a very similar data layout. They are values that contain a pointer to the data and a length; a <code>[]byte</code> also contains a capacity. You can think of them as <code>type string struct { ptr *byte; length int }</code> and <code>type []byte struct { ptr *byte; length int; capacity int }</code>, respectively. In particular, slicing creates a new <code>[]byte</code> or <code>string</code> struct that points into the same underlying data, but with a different <code>length</code> (and, for a <code>[]byte</code>, <code>capacity</code>).</li> <li>The only difference between <code>string</code> and <code>[]byte</code> is that <code>string</code>s are immutable. You can't change the content of a <code>string</code>, but you can change the content of a <code>[]byte</code>. This is expressed in the fact that <code>s[i]</code> isn't assignable if <code>s</code> is a <code>string</code>, but <code>b[i]</code> is if <code>b</code> is a <code>[]byte</code>.</li> <li>The fact that strings are immutable is the reason why they don't have a capacity. The use case for <code>capacity</code> is mainly <code>append</code>: if a <code>[]byte</code> has additional capacity, <code>append</code> can just reuse that and save allocations. 
That doesn't work with <code>string</code>s, as another <code>string</code> might share that additional space, so modifying it would violate the immutability of <code>string</code>s.</li> <li>There is no difference in what data they can contain. A <code>string</code> doesn't have to contain valid UTF-8. It is most often assumed that it does, and the language gives special operations for that (e.g. <code>range</code>'s behavior on strings), but it also copes just fine with arbitrary bytes. The reason is that anything else would require expensive runtime checks. If you need to rely on a <code>string</code> containing valid UTF-8, check it with <code>utf8.ValidString</code>.</li> <li>There is no difference in whether one of them is a reference type or not. Gophers like to say that in Go, everything is pass-by-value, and they are right, in a sense. But realistically, you can think of pointers, <code>chan</code>, <code>map</code>, slices and strings all as reference types (in the case of <code>string</code>, the reference is an immutable one, however) and of <code>{,u}int{,8,16,32,64}</code>, <code>byte</code>, <code>rune</code>, <code>float{32,64}</code>, <code>complex{64,128}</code>, <code>struct</code>s and arrays (so e.g. <code>[8]byte</code>) as value types. I know that this is considered heresy, but calling it differently is just confusing to newcomers. In any case: either both are value types, or both are reference types; there is no difference between them.</li> <li>You can slice both a <code>string</code> and a <code>[]byte</code> without copying the underlying data, as outlined above. For a <code>string</code>, this is the only thing you really can do to it, whereas for a <code>[]byte</code>, you can also change the contents of the underlying array (but must be aware of that when passing it around). </li> <li>Whenever you convert from <code>string</code> to <code>[]byte</code> or the other way around, however, you have to allocate and copy data. 
Otherwise you can't preserve both the mutability of <code>[]byte</code> and the immutability of <code>string</code>. If a <code>string</code> and a <code>[]byte</code> shared data, you could modify the string by changing the contents of the <code>[]byte</code>. (There is one, uninteresting, exception, with the expression <code>m[string(b)]</code>, if <code>m</code> is a <code>map[string]T</code> and <code>b</code> is a <code>[]byte</code>. But that's it.)</li> </ul> So, the reasons why people largely avoid <code>string</code>s are: <ul> <li><code>[]byte</code>s will, in general, need fewer allocations if you don't share them. That is, if you know that a <code>[]byte</code> only ever has one owner, that owner can do more useful transformations on it without allocating. For example, to replace a sub-<code>string</code>, you always need to allocate, whereas to replace a sub-<code>[]byte</code>, you might not need to (if the original isn't needed afterwards and the modified slice fits into the capacity of the original).</li> <li>Every <code>string</code>, except when coming from a literal, will incur at least one extra allocation, as it must have been read as a <code>[]byte</code> originally and then converted to a <code>string</code>. So if you return a <code>string</code> from a library, your user might incur an ultimately useless allocation; returning a <code>[]byte</code> is less safe, but more performant, in general.</li> <li>All the "features" of a <code>string</code> you mention are just minor conveniences; they have equivalents in the <code>bytes</code> package. And if you really want them, you can always decide to do the conversion yourself. 
That doesn't work the other way around: if the library returns a <code>string</code>, the user can no longer opt out of the performance penalty.</li> </ul> It's kind of a sad state, as <code>[]byte</code>s are less safe, so every function and method using a <code>[]byte</code> must document its behavior with regard to ownership and concurrent modifications, and it is relatively easy to fuck that up (pop quiz: Without looking at the documentation, what parts of a buffer can a <code>Reader</code> modify? What about a <code>Writer</code>? Can a <code>Reader</code> keep the slice around? Can a <code>Writer</code>?). However, it is the only way around egregious copying when massaging data. In the long term I would wish for better optimizations in the compiler around <code>string</code><-><code>[]byte</code> conversions, so that we can use them more interchangeably without worrying about allocations (just like escape analysis lets us mostly forget about whether something lives on the heap or the stack). But we're not there yet.</pre>adonovan76: <pre>Nice summary. It's very tempting to think that in a hypothetical post-generics Go 2.0, <code>string</code> and <code>[]byte</code> might be two implementations of some generic "string view" type, and that we could write more functions to be agnostic about which of these two types they are dealing with. There are many subtleties though, as the two types really have very little in common besides the <code>len(x)</code> and <code>x[i]</code> operations. Crucially, a string guarantees immutability whereas a "string view" type would only prevent modifications through that reference (like <code>const</code> in C), and this has fundamental implications for the way in which you would use the value. The meaning of equality is unclear, and a string view would not be a valid map key, for example. Even the <code>range</code> statement behaves differently for the two types (and personally I rarely want the UTF-8 decoding semantics for strings). 
I think there might be some benefit in avoiding some copies during bulk data operations, but less benefit for strings whose values are interpreted by the program logic.</pre>uncreativemynameis: <pre>What <del>library or libraries</del> packages are you referring to? Most things come into and leave your Go process as byte slices and there isn't a need to have them become strings. If you need rune iteration or want to use the data as a map key, then make it a string. It's not difficult.</pre>Partageons: <pre>I'm referring to the standard library. We all know <code>io.Reader</code> and <code>io.Writer</code>, which probably influences the following, but <code>[]byte</code> is used nearly every other time a function requires or returns text.</pre>barsonme: <pre>Because []byte is used for arbitrary data that you get from a file or over the wire or whatever, and returning string from various stdlib functions would require an allocation since it copies the []byte. So, not only is it good to keep things as primitive as possible for as long as possible, it's also a performance thing.</pre>jahayhurst: <pre>There are lots of times when <code>io.Reader</code> or <code>io.Writer</code> reads/writes data that's not really a string at all. For instance, a list of IP addresses could be stored raw in a file. <code>[]byte</code> is kind of more of a raw storage (without any structure) than <code>string</code>.</pre>TornadoTerran: <pre>A byte slice is more convenient when you need to pass a big chunk of data across an app. It's passed by pointer. A string, on the other hand, is passed by value. It's a battle of fewer allocations vs. immutability.</pre>szabba: <pre>Being passed by reference is irrelevant -- the compiler could in principle (if it doesn't already, which I'd find odd) pass strings by reference safely precisely because they're immutable. 
The reason <code>[]byte</code>s can reduce allocation (which is where you were right -- they can) is that, since they're mutable, you can reuse the same chunk of memory many times with different content without the GC having to free and allocate stuff.</pre>thockin: <pre>It is exactly this that makes the lack of 'const' decorations so egregious in Go. It's a huge opportunity for optimization that is largely discarded. Disclaimer: not a compiler engineer, I bet some deeper level of whole-program analysis could enable similar optimizations.</pre>barsonme: <pre><blockquote> the compiler could in principle (if it doesn't already, which I'd find odd) pass strings by reference safely precisely because they're immutable. </blockquote> Huh? If I write <pre><code>s := "foo"
doThing(s)
</code></pre> What good would using a reference to <code>s</code> be? <code>s</code> doesn't contain the data, only a pointer to it (plus its length). (I'm assuming you mean references like C++.) You'd be saving 8 or 4 bytes, depending on your arch.</pre>szabba: <pre>Reference is a heavily overloaded term, sorry for muddying things up! In this case, when saying "pass strings by reference" I meant without copying the underlying text data. In this sense a two-word struct containing a pointer to the actual data that is passed by copy is by reference enough, especially if it's an unexposed implementation detail. I'm sorry you got downvoted, you're raising a perfectly valid question about what I've said.</pre>barsonme: <pre>It's okay! :-) "Reference" can be used a hundred different ways, and I was assuming you meant reference a la C++ (standard "pass-by-reference"). Cheers.</pre>TheMerovius: <pre>They are saying "assuming in Go, strings would be copied around whenever you pass them. Then the compiler could optimize away the copying anyway, because they are immutable". 
It wasn't about current semantics.</pre>barsonme: <pre><blockquote> Then the compiler could optimize away the copying anyway, because they are immutable". It wasn't about current semantics. </blockquote> This seems kind of pointless and could easily trip up new people. It's hard enough getting people to realize strings contain internal pointers -- telling them they could, theoretically, behave like <code>int</code> does doesn't seem to be a good idea imo. I mean, even people like Rob tell others on the mailing list that strings are two-word structs. As long as the string is distinct from its data, references aren't exactly possible -- or, at least, I'm not sure how they'd implement it. I'd love to know. A lot of reference implementations just use pointers to the data. Since Go's strings are structs, that'd require a pointer to the struct... how would that react if another goroutine assigned to the variable? Go would lose its copy semantics. I don't mean to be combative at all, but I don't see how it'd be possible.</pre>natefinch: <pre>Everything is pass by value. A string is a pointer to an array of bytes and a length field. A slice of bytes is a pointer to an array of bytes and a length and a cap. Copying them will be almost exactly the same.</pre>TornadoTerran: <pre>Yeah, you are completely right. <a href="https://play.golang.org/p/l7Ra6N0Sn4" rel="nofollow">https://play.golang.org/p/l7Ra6N0Sn4</a> So, if I understand correctly: passing a slice creates a new {ptr, len, cap} value that points to the same array, but passing a string always creates an entire copy of the underlying data.</pre>Partageons: <pre>I benchmarked a while ago to be sure and found that passing a <code>string</code> to a function is actually 20-25% faster than passing a <code>[]byte</code>. Confused, I did some research on the issue and found that, according to a post on the golang-nuts mailing list I don't care to search out right now, <code>string</code>s are in fact not passed by value.</pre>barsonme: <pre>Huh? 
Everything in Go is passed by value. []byte is a slice, so it's a three-word struct. <pre><code>type slice struct {
	data unsafe.Pointer
	len  int
	cap  int
}
</code></pre> string is similar: <pre><code>type string struct {
	data unsafe.Pointer
	len  int
}
</code></pre> So, when you pass a string or a []byte you're only passing two or three words, respectively. Unsure why there'd be a major performance issue.</pre>ctbel: <pre>I'm not all that familiar with the internals but might point you in the right direction on this one; I'd say a more apt comparison would be to compare strings and fixed-size arrays. If there's any performance penalty using the latter, then (afaik) it's most likely caused by optimizations leveraging the property that arrays are mutable while strings are not. There's also the slice overhead. I've encountered some <a href="https://play.golang.org/p/shJS-jV8-0" rel="nofollow">unexpected behavior</a> while setting up an example. This does, however, clearly illustrate how the underlying fixed-size arrays are two additional objects that must be tracked by the GC. To figure out the overhead of the latter, you could compare slices to fixed-size arrays.</pre>barsonme: <pre>Re: optimizations, yes. For example, returning a string from the various strconv functions is more efficient than returning a []byte (provided you eventually convert the return value into a string!) because the internal array used inside the function is (IIRC) just converted into a string by swapping pointers instead of making a copy. </pre>rek2gnulinux: <pre>good question</pre>

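
The allocation argument from the thread can be sketched as follows: a `[]byte` with a single owner can be edited in place, whereas replacing part of a `string` produces a new string. The helpers `replaceInPlace` and `lookup` are hypothetical names used only for illustration, and `replaceInPlace` assumes the replacement has the same length as the pattern. The comment about `m[string(b)]` avoiding a copy repeats the claim made in the thread; the code itself does not verify it.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// replaceInPlace (hypothetical helper) overwrites every occurrence of old
// in buf with repl and returns the number of replacements. It needs no new
// buffer, but only handles the case len(old) == len(repl).
func replaceInPlace(buf, old, repl []byte) int {
	if len(old) == 0 || len(old) != len(repl) {
		return 0
	}
	n := 0
	for i := 0; ; {
		j := bytes.Index(buf[i:], old)
		if j < 0 {
			return n
		}
		copy(buf[i+j:], repl) // writes into the existing backing array
		i += j + len(old)
		n++
	}
}

// lookup (hypothetical helper) uses the m[string(b)] form that the thread
// names as the one conversion that does not need to copy.
func lookup(m map[string]int, key []byte) int {
	return m[string(key)]
}

func main() {
	buf := []byte("foo bar foo")
	n := replaceInPlace(buf, []byte("foo"), []byte("baz"))
	fmt.Println(n, string(buf)) // 2 baz bar baz

	s := "foo bar foo"
	t := strings.ReplaceAll(s, "foo", "baz") // s is unchanged; t is a new string
	fmt.Println(s, "->", t)                  // foo bar foo -> baz bar baz

	// The bytes package mirrors the string conveniences, e.g. equality.
	fmt.Println(bytes.Equal(buf, []byte(t))) // true

	fmt.Println(lookup(map[string]int{"baz": 1}, buf[:3])) // 1
}
```

The in-place version is only safe because `buf` has one owner here; any other holder of the same slice would see its contents change.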