Golang string, byte slices, rune

Author: 打倒美帝

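We often need to convert between string, byte slices, and rune. Here is a brief introduction.

A string is, in essence, a read-only slice of bytes.

Indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.

rune is an alias for int32. It represents a character's Unicode code point and occupies 4 bytes. Converting a string to a rune slice ([]rune) means that every character is stored in 4 bytes holding its Unicode value, so each step of an iteration returns a Unicode value rather than a byte.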
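The difference between indexing a string and converting it to runes can be sketched as follows. The helper names firstByte and firstRune are made up for this illustration and assume a non-empty input.

```go
package main

import "fmt"

// firstByte returns the byte stored at index 0 of s (s must be non-empty).
func firstByte(s string) byte {
	return s[0]
}

// firstRune decodes s into a rune slice and returns its first element
// (s must be non-empty).
func firstRune(s string) rune {
	return []rune(s)[0]
}

func main() {
	s := "中国话"
	fmt.Printf("%x\n", firstByte(s))    // e4, the first byte of the UTF-8 encoding of 中
	fmt.Printf("%c\n", firstRune(s))    // 中
	fmt.Println(len(s), len([]rune(s))) // 9 3
}
```

Indexing gives one byte of a multi-byte sequence, while the rune slice gives the whole character.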

  • A string is an immutable byte sequence.
  • A byte slice is a mutable byte sequence.
  • A rune slice is the same text decoded into code points, so that each index is a character rather than a byte.
 // rune is an alias for int32 and is equivalent to int32 in all ways. It is
 // used, by convention, to distinguish character values from integer values.
 type rune = int32
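A minimal sketch of what immutability means in practice: to change a string you convert it to a byte slice or rune slice (which copies the data), modify that, and convert back. The helpers replaceFirstByte and replaceFirstRune are hypothetical names used only for this example.

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is b.
// The conversion []byte(s) copies the data, so s itself is unchanged.
func replaceFirstByte(s string, b byte) string {
	buf := []byte(s)
	if len(buf) > 0 {
		buf[0] = b
	}
	return string(buf)
}

// replaceFirstRune returns a copy of s whose first character is r.
// Working on []rune keeps multi-byte characters intact.
func replaceFirstRune(s string, r rune) string {
	rs := []rune(s)
	if len(rs) > 0 {
		rs[0] = r
	}
	return string(rs)
}

func main() {
	s := "hello"
	// s[0] = 'H' would not compile: strings are immutable.
	fmt.Println(replaceFirstByte(s, 'H'))     // Hello
	fmt.Println(s)                            // hello
	fmt.Println(replaceFirstRune("中国话", '外')) // 外国话
}
```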

Below we define placeOfInterest as a raw string. It is enclosed in back quotes, so it can contain only literal text.

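package main

import "fmt"

func main() {
    // A raw string literal (back quotes) holding a single character.
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    // %+q prints a quoted string with non-ASCII characters escaped.
    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    // Indexing walks the string byte by byte.
    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    // for range decodes one UTF-8 sequence (one rune) per iteration.
    for _, ch := range placeOfInterest {
        fmt.Printf("\nUnicode character: %c", ch)
    }
    // len reports the number of bytes, not characters.
    fmt.Printf("\nThe length of placeOfInterest: %d", len(placeOfInterest))
    fmt.Printf("\n")

    const Chinese = "中国话"
    fmt.Println(len(Chinese))
    // index is the byte offset at which each rune starts.
    for index, runeValue := range Chinese {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
}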

Output:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
Unicode character: ⌘
The length of placeOfInterest: 3
9
U+4E2D '中' starts at byte position 0
U+56FD '国' starts at byte position 3
U+8BDD '话' starts at byte position 6

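From the output above we can see:

  1. The symbol ⌘ has the Unicode code point U+2318 and is made up of three bytes: e2 8c 98. These bytes are the UTF-8 encoding of the hexadecimal value 2318.
  2. When a string is traversed with for range, each value obtained is of type rune, whereas the indexed for loop outputs the individual bytes.
  3. Go uses UTF-8: Go source code is defined to be UTF-8 text, and no other representation is allowed. This means that when we write ⌘ in the code, the editor places the UTF-8 encoding of the symbol ⌘ into the source text. So when we print the hexadecimal bytes, we are simply dumping the data the editor placed in the file.
  4. The length of a string returned by len is the number of bytes, not the number of characters.
  5. The Unicode standard uses the term code point to refer to the item represented by a single value. For example, the symbol ⌘, with hexadecimal value 2318, has the code point U+2318.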
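The points about byte length versus character count can be checked with the standard unicode/utf8 package. The sketch below also decodes a string by hand to show roughly what a for range loop does; runeOffsets is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each rune of s starts,
// decoding one UTF-8 sequence at a time.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size
	}
	return offsets
}

func main() {
	s := "中国话"
	fmt.Println(len(s))                    // 9 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 3 (code points)
	fmt.Println(runeOffsets(s))            // [0 3 6]
	fmt.Printf("% x\n", "⌘")               // e2 8c 98
}
```

The offsets match the byte positions reported by the for range loop in the program above.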

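However, "code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears often in libraries and source code and means essentially the same thing as code point. Go additionally defines rune as an alias for int32, which makes it clear when an integer value represents a code point. As a result, what you might think of as a character constant is called a rune constant in Go. The expression '⌘' has type rune and integer value 0x2318.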
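As a small illustration of a rune constant, the following sketch prints the type and value of '⌘' with several fmt verbs. The function describe is a made-up helper for this example.

```go
package main

import "fmt"

// describe formats r with four verbs: its Go type, its value in hex,
// its Unicode notation, and the character itself.
func describe(r rune) string {
	return fmt.Sprintf("%T %#x %U %c", r, r, r, r)
}

func main() {
	// Because rune is an alias for int32, %T reports int32.
	fmt.Println(describe('⌘')) // int32 0x2318 U+2318 ⌘

	// A rune can be assigned to an int32 without conversion.
	var n int32 = '⌘'
	fmt.Println(n) // 8984
}
```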

Summary

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. Some people think Go strings are always UTF-8, but they are not: string values can contain arbitrary bytes, and only string literals without byte-level escapes are guaranteed to be UTF-8 text.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.
  • A string is a convenient way to handle short sequences of bytes or characters. Because strings are immutable, any operation that changes the content, such as find-and-replace or concatenation, creates a new string. This is inefficient when the string is huge, such as the content of a file.
  • A byte slice is like a string, but mutable: you can modify each byte in place. This is efficient for working with file content, whether a text file, a binary file, or an I/O stream from the network.
  • A rune slice is like a byte slice, except that each index is a character instead of a byte. This works best for text with many non-ASCII characters, such as Chinese text, math formulas (∑), or text with emoji (♥).
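The claim that strings may hold arbitrary bytes can be illustrated with a literal that uses byte-level escapes. This is only a sketch: countReplacement is a hypothetical helper, and it also counts any genuine U+FFFD characters already present in the input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countReplacement returns how many times a for range loop over s yields
// utf8.RuneError (U+FFFD), which is what range produces for each byte
// that is not part of a valid UTF-8 sequence.
func countReplacement(s string) int {
	n := 0
	for _, r := range s {
		if r == utf8.RuneError {
			n++
		}
	}
	return n
}

func main() {
	literal := "中国话"        // plain literal: valid UTF-8
	raw := "\xbd\xb2=\xbc" // byte-level escapes: arbitrary bytes

	fmt.Println(utf8.ValidString(literal)) // true
	fmt.Println(utf8.ValidString(raw))     // false
	fmt.Println(len(raw))                  // 4
	fmt.Println(countReplacement(raw))     // 3
}
```

Only the "=" byte in raw decodes to a real character. The other three bytes each yield the replacement rune.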

Source: 简书 (Jianshu)
