What is the benefit of decoding strings with utf8 or converting the string to a []rune?

blov · · 2212 views
<p>I&#39;ve been working on porting a really small (and worthless) scripting language from Ruby to Go, mostly for learning but also for other projects I plan to build in the future. In doing this, I&#39;ve been building the lexer based on <a href="https://www.youtube.com/watch?v=HxaD_trXwRE">this video by Rob Pike</a>. It&#39;s a decent structure and I&#39;ve made some good progress with it so far.</p> <p>One thing I notice, though, is that they decode runes from the string using utf8 (reference <a href="https://www.youtube.com/watch?v=HxaD_trXwRE&amp;t=25m48s">25:48</a>). Using this method, they store the size of the last rune read so they can &#34;backup&#34; easily. But with this method, backup is only valid once per call to <code>next()</code>. </p> <p>I&#39;ve done some more experimenting and discovered that you can simply cast a string into an array of <code>rune</code>s.</p> <pre><code>runeList := []rune(&#34;some string value&#34;) </code></pre> <p>And later easily cast that back:</p> <pre><code>str := string(someRuneSlice) </code></pre> <p>This would allow for any number of backups without worrying about storing the last rune read or the size of that rune.</p> <p>Are there potential downsides to this that I&#39;m missing?</p> <hr/>**Comments:**<br/><br/>djherbis: <pre><p>That&#39;s a type conversion, not a cast. There is a run-time cost to those operations. []rune(&#34;some string value&#34;) will allocate a new []rune and copy the string into it as runes; similarly, string(someRuneSlice) will allocate a new string containing the UTF-8 encoding of the runes in someRuneSlice.</p> <p><a href="https://golang.org/ref/spec#Conversions">https://golang.org/ref/spec#Conversions</a></p></pre>izuriel: <pre><p>That is what I thought would be taking place. So is that more expensive than decoding each rune individually? It seems the initial conversion would have more up-front cost, but again with improved backtracking compared to just decoding as we go along. 
</p> <p>The backtracking could also be implemented with a slice of previously read runes, which incurs no up-front cost but uses more memory. </p></pre>djherbis: <pre><p>Don&#39;t worry about picky optimizations over a correct, readable/maintainable design. That&#39;s basically one of the things that Rob mentions in the video: he designed the lexer with concurrency in mind not because it made it faster, but because it made it easier to reason about. </p> <p>That being said, here are a few comments on your question.</p> <p>The &#34;improved backtracking&#34; could be implemented easily by keeping a slice of widths; however, for the lexer being written in the video I believe he only required a backup of one anyway, so it wouldn&#39;t have been helpful to keep more. </p> <p>Also consider the case where the lexer returns an error after reading only a few characters. Your pre-processing would be wasted effort. </p> <p>I&#39;m not going to outright say that your method is better or worse than the one in the video; oftentimes what is &#34;more expensive&#34; or &#34;more efficient&#34; is totally dependent on the kinds of inputs you expect.</p></pre>jerf: <pre><p>&#34;So is that more expensive than decoding each rune individually?&#34;</p> <p>So, generic answer so I don&#39;t feel bad: &#34;profile&#34;.</p> <p>However, specific answer: yes, converting to <code>[]rune</code> is more expensive than iteration. I can say this confidently because the work is a strict superset of the iteration, since the conversion to <code>[]rune</code> is done via iteration, <em>plus</em> a <code>rune</code> is <a href="http://golang.org/pkg/builtin/#rune">actually an <code>int32</code></a>, so you&#39;re creating a brand new array that contains 4 bytes per Unicode code point. 
And I don&#39;t know exactly how that will be allocated, but be it by scanning the string once to find the necessary size then a second time to copy, or be it by scanning through once and dynamically resizing the <code>[]rune</code> when it becomes too small, it&#39;s likely to be even more expensive than I&#39;m making it sound.</p> <p>It may be the case that none of this matters, which is why I start with &#34;profile&#34; as the generic answer. If it makes the code flow better, and you do a one-time conversion to <code>[]rune</code> and then take extensive advantage of a representation more convenient for your code, it may still be a win. (You may also be able to work out how to start with that representation instead, or chunk, or something that amortizes away most of the expense I was talking about.) On the other hand, in the worst case where you casually convert back and forth all over the place, probably without even realizing it, this can easily come to dominate your entire program. It Depends (TM).</p></pre>izuriel: <pre><p>The part I hadn&#39;t really thought through when asking the question was that the <code>[]rune</code> isn&#39;t seamlessly pushed back into a <code>string</code> (which I mix up most of the time, forgetting that it&#39;s <code>string</code> and <code>[]byte</code> that map onto each other byte for byte).</p> <p>Thanks for the advice though - I do see a reasonable amount of transitioning between <code>[]rune</code> and <code>string</code>, where this would become very problematic.</p></pre>FUZxxl: <pre><p>You already know all facets of the two approaches. Go form your own opinion.</p></pre>izuriel: <pre><p>I do love the tough love answer. I was being lazy and I apologize for doing so. Although I hadn&#39;t read through the entire spec yet, so that was enlightening - and of course has now been added to my reading list.</p></pre>izuriel: <pre><p>Thank you sir, you pointed me to the spec, which I haven&#39;t read through (only perused sections as they pertained to me). 
I will be giving this a read over for a better understanding. I thought something like this might have been happening, but on an &#34;optimized&#34; level behind the scenes. It looks like it&#39;s not necessarily the best approach for this project.</p></pre>mc_hammerd: <pre><p>wait so my string vars that im using now -- do they support runes or not? i was hoping to get full unicode support, but i just built everything with string type</p></pre>izuriel: <pre><p>As has been pointed out, yes, they do support runes, however <code>string</code> is a type alias for <code>[]byte</code> with some syntactic sugar sprinkled on top of it (i.e. <code>&#34;Hello, I&#39;m a string&#34;</code> representation).</p> <p>That means, in a string like <code>&#34;Hello, World!&#34;</code>, if you grab (via indexing) <code>str[0]</code> you will get <code>72 (byte)</code>. Now 72 is the actual rune for &#34;H&#34;, but we&#39;re looking at two different data types (not just <code>byte</code> and <code>rune</code> but realistically <code>uint8</code> and <code>int32</code>). The reason this is the case is the way that UTF-8 encodes Unicode text: it keeps the standard 0-127 (ASCII) values as single bytes and uses multi-byte sequences for code points 128 and above (If you don&#39;t know much about Unicode I highly advise you read <a href="http://www.joelonsoftware.com/articles/Unicode.html">this article by Joel Spolsky</a>).</p> <p>In this case it&#39;s what I would call &#34;coincidence&#34; that you can grab the first index from a string and have it map to a character. Because if you alter the string to <code>&#34;こんいちはせかい&#34;</code> and then grab <code>str[0]</code>, you get 227, and when printed as a character you get <code>ã</code>, which is obviously wrong. 
</p> <p>So when working with strings it&#39;s important to decode runes from the string - like in this example where I&#39;m using <code>&#34;unicode/utf8&#34;</code>:</p> <pre><code>// Assumes the imports "fmt" and "unicode/utf8".
str := &#34;こんにちはえかい&#34;
// Decode the first rune and its width in bytes.
r, size := utf8.DecodeRuneInString(str)
if size &gt; 0 {
    fmt.Printf(&#34;Decoded %q with a length of %d byte(s)\n&#34;, r, size)
    fmt.Printf(&#34;Rest of the string %q\n&#34;, str[size:])
}
</code></pre> <p>Output:</p> <pre><code>Decoded &#39;こ&#39; with a length of 3 byte(s)
Rest of the string &#34;んにちはえかい&#34;
</code></pre> <p>(<a href="http://play.golang.org/p/qB-6eLQiup">playground</a>)</p> <p><code>DecodeRuneInString</code> gives us the <code>rune</code> and the size (in bytes) of the rune, and to get the &#34;rest&#34; of the string after the rune I have to slice from size to the end of the string.</p> <p><strong>tl;dr</strong> So yes, strings are &#34;unicode&#34; aware, but they aren&#39;t &#34;unicode&#34; smart. In other words, you as the developer need to know if the string was encoded with UTF8 or UTF16 or whatever (UTF8 can read ASCII, FYI) and code with that in mind in order to take advantage of the unicode awareness.</p></pre>FUZxxl: <pre><blockquote> <p>however string is a type alias for []byte with some syntactic sugar sprinkled on top of it (i.e. &#34;Hello, I&#39;m a string&#34; representation).</p> </blockquote> <p>You&#39;re almost correct. Strings don&#39;t have a <code>cap</code> field as you cannot write to them. They only have a <code>len</code> field. Also, they are immutable.</p> <blockquote> <p>So yes, strings are &#34;unicode&#34; aware, but they aren&#39;t &#34;unicode&#34; smart. 
In other words, you as the developer need to know if the string was encoded with UTF8 or UTF16 or whatever (UTF8 can read ASCII, FYI) and code with that in mind in order to take advantage of the unicode awareness.</p> </blockquote> <p>The concept of text processing in Go is that the first thing you do when you receive textual data from the outside is to translate it into UTF-8. The only place where a <code>string</code> might not contain UTF-8 encoded data should be the IO-layer of your program.</p></pre>FUZxxl: <pre><p>Yes, they do. Strings are always UTF-8 encoded.</p></pre>gohacker: <pre><p>Not true. You can also get invalid UTF-8 encoding when converting from []byte.</p> <p><a href="http://play.golang.org/p/jAlAP4tjAv">http://play.golang.org/p/jAlAP4tjAv</a></p></pre>mc_hammerd: <pre><p>interesting, thx for response.. all my data is json so i will have to test how that handles your case</p></pre>Exaltred: <pre><p>Isn&#39;t the printing discrepancy solved via: <a href="http://play.golang.org/p/54-yBLYCdY" rel="nofollow">http://play.golang.org/p/54-yBLYCdY</a> ? Although it seems play.golang.org doesn&#39;t support whatever character is given.</p></pre>mc_hammerd: <pre><p>ty good to know</p> <p>if you know a list of things that break strings but not runes or compatibility article id love to read it, if not ill google it in a bit :&gt;</p></pre>drvd: <pre><p>Half true: String literals in Go code are always UTF-8 encoded. String values are just a bag of bytes.</p></pre>FUZxxl: <pre><p>Yes, that&#39;s correct. I didn&#39;t want to type up a long but correct explanation just to let <a href="/u/mc_hammerd" rel="nofollow">/u/mc_hammerd</a> think he shouldn&#39;t use strings (he really should).</p></pre>
