Is doing this simple thing with strings really so painful?

<p>I was converting some Python code to Go recently as an exercise for starting to pick up Go, and I went to print out my results, which is a map[string]int. I noticed garbage in the resulting output. After looking into it more, I found that using slice notation for a string actually treats the string as bytes rather than characters. Wow... Completely different from Python, and completely the opposite of what most people would probably want and expect...</p> <p>After about an hour of googling around, I can&#39;t figure out the way to just get part of a string given some index range for the characters. What the fucking fuck? Isn&#39;t this type of stuff supposed to be really easy? I&#39;ve seen that I can use the <em>range</em> operator to iterate through a string, but I don&#39;t want to iterate through something. I just want a slice of characters (not bytes that may or may not represent whole characters). Does Go even have suitable high-level functions for dealing with strings like this?</p> <p>I read a really long article by Rob Pike where it seems like he just skirts around the issue of storing characters, and says strings are just a bag of bytes.</p> <p><em>Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of &#34;character&#34; is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.</em></p> <p>Everything is built from bytes, Rob Pike. That doesn&#39;t mean that string operations need to be such a pain in the ass. :-(</p> <hr/>**Comments:**<br/><br/>cs-guy: <pre><p>You might have better luck converting the string to a []rune before indexing. <a href="http://play.golang.org/p/ORgImtR5Y2" rel="nofollow">http://play.golang.org/p/ORgImtR5Y2</a></p></pre>jp599: <pre><p>This did it. I have to alternate between using []rune and string in my function. 
Thanks for your help.</p></pre>dilap: <pre><p>cs-guy has your answer.</p> <p>I like Go&#39;s approach.</p> <p>It&#39;s perhaps a little bit clumsier, but once you realize what&#39;s going on, it&#39;s not error-prone, and easy enough.</p> <p>(I have been burned many times by Python 2 unicode vs string nonsense exceptions (when all I wanted to do was transport a blob of a string around), as well as buggy unicode handling (due to old UCS-2 nastiness), as well as Objective-C weirdness.)</p></pre>jp599: <pre><p>The Unicode handling in Python 2 is also really bad. Ruby 1.9+ was the first language I used that had really, really good text handling. Matz is Japanese, and in Japan, they use about a half dozen different encodings for different things because they don&#39;t like Han unification in Unicode. In Ruby, every string is text, but the string has an encoding attribute attached. You can convert to and from encodings very easily for any string.</p> <pre><code>s = &#34;A&#34;
=&gt; &#34;A&#34;
s.encoding
=&gt; #&lt;Encoding:UTF-8&gt;
s.encode!(&#34;UTF-32BE&#34;)
=&gt; &#34;A&#34;
s.encoding
=&gt; #&lt;Encoding:UTF-32BE&gt;
Encoding.list.size
=&gt; 100
</code></pre> <p>There is always a concept of an external encoding and an internal encoding as well. It&#39;s easy to use, but very fine-grained and controllable. Python 3 converts strings to Unicode internally, unless they are specifically read as byte strings. IMO, this is the absolute easiest way to handle the matter, because it doesn&#39;t have to be managed much at all by the programmer.</p> <p>Go seems to use more of a Python 2 approach, but piecemeal. I still don&#39;t know if I&#39;m using the right functions. For example, if strings.Replace can act safely on Unicode strings... 
I know Rob and Ken invented UTF-8, and that&#39;s awesome, but the string handling seems to be done from the ASCII C perspective of chars = bytes, whereas the rest of the world seems to consider strings to be text composed of characters (what Go would consider &#34;runes&#34;).</p></pre>dilap: <pre><p>How I use Go is this: a string is always utf8-encoded.</p> <p>And that is the perspective Go takes as well! String handling is definitely <em>not</em> from a byte perspective.</p> <p>So this:</p> <pre><code>for i, r := range someString </code></pre> <p>iterates through each code point of the string, assuming it is utf8. strings.ToUpper will uppercase each code point of the string, assuming it is utf8. strings.Replace assumes utf8 strings (though that&#39;s a little bit of a trick, because utf8 is self-synchronizing: a valid utf8 substring can only match at code point boundaries, so a byte replace is a unicode replace). If a string is ever <em>not</em> being treated as a sequence of utf8-encoded unicode code points, it&#39;s explicitly called out, like in IndexByte.</p> <p>This is completely unlike the Python 2 situation, where none of the string functions were in any way unicode aware.</p> <p>That said, you can, if you want to, store non-utf8 bytes in a string -- there&#39;s nothing in the language that will break if you store some arbitrary sequence of bytes, but the standard library assumes it&#39;s utf8.</p> <p>So what if you need something that isn&#39;t utf8? What I would recommend, and what&#39;s worked for me so far, is to always convert to utf8 at your system boundary. 
Just embrace utf8, it&#39;s a great encoding.</p> <p>But if I really <em>did</em> need to use some other encoding, I would make a new type to represent that:</p> <p><code>type FooBarString string</code></p> <p>Of course, then you&#39;d be on your own for stuff like strings.ToUpper.</p> <p>But it&#39;d be straightforward to implement by using an encoding package (such as golang.org/x/text/encoding) to convert to utf8 and then using the built-in strings package, or you could implement your own methods on the type if you needed something faster.</p> <p>With this approach, the only thing you&#39;ve lost is the ability to write functions that can operate simultaneously on FooBarString and string, but I think that&#39;s reasonable.</p></pre>pierrrre: <pre><p>show your fucking code</p></pre>WellAdjustedOutlaw: <pre><p>I think if you read that article you might find the answer to your question. Also, the tutorials found on the golang site describe how to dissect strings.</p></pre>fubo: <pre><p><a href="https://golang.org/pkg/strings/" rel="nofollow">https://golang.org/pkg/strings/</a></p></pre>YEPHENAS: <pre><p>What do you mean by &#34;character&#34;? Code point (&#34;rune&#34;)? CCS? Grapheme cluster?</p> <p>Where do the broken indexes for the slicing operation come from in your code? Maybe you want to use strings.Index* / strings.LastIndex*.</p></pre>fluffl3: <pre><blockquote> <p>Is doing this simple thing with strings really so painful?</p> </blockquote> <p>No. 
</p></pre>drunken_thor: <pre><blockquote> <p>switching from a dynamic language to a typed language isn&#39;t the same WTFBBQOMG </p> </blockquote> <p>if you had even played around you might have found that</p> <pre><code>str := &#34;test&#34;
println(str[0:2]) // &#34;te&#34;
</code></pre> <p><a href="http://play.golang.org/" rel="nofollow">http://play.golang.org/</a> is your friend</p> <p>EDIT: Also as an add-on, when you start using Go more often you will find out how convenient strings as byte arrays end up being </p></pre>jp599: <pre><p>Except I&#39;m using CJK text, which is all multi-byte Unicode, for which pretending everything is ASCII is not a useful option.</p></pre>weberc2: <pre><blockquote> <p>I noticed garbage in the resulting output</p> </blockquote> <p>I&#39;m not seeing any garbage: <code>fmt.Println(map[string]int{&#34;hello&#34;: 1, &#34;world&#34;: 2})</code> produces <code>map[hello:1 world:2]</code>. What are you expecting to happen?</p> <blockquote> <p>Isn&#39;t this type of stuff supposed to be really easy?</p> </blockquote> <p>No. Strings and characters are deceptively complicated, and assuming everything is ASCII is a bad idea. How can the compiler know the encoding of a string at compile time so as to know what code to generate for the index <code>[]</code> operator?</p></pre>
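For readers landing here with the same CJK problem, cs-guy's suggestion above (convert to []rune, slice, convert back) can be sketched roughly as follows. The helper names `substrRunes` and `sizes` are made up for this illustration, and clamping out-of-range indexes is an assumption of the sketch, not something the thread specifies.

```go
package main

import "fmt"

// substrRunes returns runes [start, end) of s, counting code points, not bytes.
// Hypothetical helper: out-of-range indexes are clamped instead of panicking.
func substrRunes(s string, start, end int) string {
	r := []rune(s) // decodes the UTF-8 bytes into one rune per code point
	if start < 0 {
		start = 0
	}
	if end > len(r) {
		end = len(r)
	}
	if start >= end {
		return ""
	}
	return string(r[start:end]) // re-encodes the selected runes as UTF-8
}

// sizes reports the byte length and the rune count of s.
func sizes(s string) string {
	return fmt.Sprintf("%d bytes, %d runes", len(s), len([]rune(s)))
}

func main() {
	s := "世界你好hello"
	fmt.Println(sizes(s))             // 17 bytes, 9 runes
	fmt.Println(s[0:3])               // byte slice that happens to cover exactly "世"
	fmt.Println(substrRunes(s, 1, 3)) // 界你
}
```

Note that a rune is a code point, not a grapheme cluster (the distinction YEPHENAS raises), so combining sequences can still be split by this approach. The conversion to []rune also copies the whole string, which is usually fine for short text.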

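dilap's points above about `range` and `strings.Replace` can also be checked directly. The sketch below collects the byte offset at which each code point starts, so an ordinary byte slice can be cut at character boundaries, and shows a replace on Japanese text. `runeOffsets`, `replaceWord`, and `report` are hypothetical helpers written for this example, not standard library functions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeOffsets returns the byte offset where each code point of s begins,
// followed by len(s), so s[offsets[i]:offsets[j]] covers runes i..j-1.
func runeOffsets(s string) []int {
	offsets := make([]int, 0, utf8.RuneCountInString(s)+1)
	for i := range s { // range advances one code point at a time; i is a byte offset
		offsets = append(offsets, i)
	}
	return append(offsets, len(s))
}

// replaceWord replaces every occurrence of old in s. With valid UTF-8 input,
// matches can only begin at code point boundaries.
func replaceWord(s, old, repl string) string {
	return strings.Replace(s, old, repl, -1)
}

// report summarizes s as its quoted form, byte length, and rune count.
func report(s string) string {
	return fmt.Sprintf("%q: %d bytes, %d runes", s, len(s), utf8.RuneCountInString(s))
}

func main() {
	s := "日本語のテキスト"
	off := runeOffsets(s)
	fmt.Println(report(s))                        // "日本語のテキスト": 24 bytes, 8 runes
	fmt.Println(s[off[0]:off[3]])                 // 日本語
	fmt.Println(replaceWord(s, "テキスト", "文字列")) // 日本語の文字列
}
```

Slicing with precomputed offsets avoids re-encoding the string, at the cost of an extra []int; whether that matters depends on how large the text is.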