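Converting between strings, byte slices, and runes comes up often in Go, so here is a brief overview.
A string is in effect a read-only slice of bytes.
Indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.
rune is an alias for int32 and represents the Unicode code point of a character, stored in 4 bytes. Converting a string to []rune therefore stores every character as a 4-byte Unicode value, so iterating over the result yields code points rather than bytes.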
- String is an immutable byte sequence.
- Byte slice is a mutable byte sequence.
- Rune slice is a re-grouping of a byte slice so that each index is a character.

The builtin package declares rune as follows:

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
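The three views described above can be compared side by side. The following is a minimal sketch; `byteAndRuneLens` is a hypothetical helper name introduced only for this illustration.

```go
package main

import "fmt"

// byteAndRuneLens is a hypothetical helper: it converts s to both slice
// types and reports how many elements each one has.
func byteAndRuneLens(s string) (int, int) {
	b := []byte(s) // copies the UTF-8 bytes into a mutable slice
	r := []rune(s) // decodes the bytes into one int32 code point per element
	return len(b), len(r)
}

func main() {
	s := "中国话"
	b := []byte(s)
	r := []rune(s)

	nb, nr := byteAndRuneLens(s)
	fmt.Println(nb, nr) // 9 3

	// Both conversions round-trip back to the original string.
	fmt.Println(string(b) == s, string(r) == s) // true true

	fmt.Printf("%x\n", b[0]) // e4, the first UTF-8 byte of 中
	fmt.Printf("%c\n", r[0]) // 中, the first code point
}
```

Note that both conversions copy the data, so changing the slice afterwards never affects the original string.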
Below, placeOfInterest is defined as a raw string: it is enclosed in back quotes, so it can contain only literal text.
package main

import "fmt"

func main() {
	// A raw string literal holding a single character.
	const placeOfInterest = `⌘`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", placeOfInterest)
	fmt.Printf("\n")

	// %+q escapes non-ASCII characters as \u sequences.
	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", placeOfInterest)
	fmt.Printf("\n")

	// Indexing walks the string byte by byte.
	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}

	// for range decodes one rune per iteration.
	for _, ch := range placeOfInterest {
		fmt.Printf("\nUnicode character: %c", ch)
	}

	// len reports the number of bytes, not characters.
	fmt.Printf("\nThe length of placeOfInterest: %d", len(placeOfInterest))
	fmt.Printf("\n")

	const Chinese = "中国话"
	fmt.Println(len(Chinese))
	for index, runeValue := range Chinese {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}
The output is:
plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
Unicode character: ⌘
The length of placeOfInterest: 3
9
U+4E2D '中' starts at byte position 0
U+56FD '国' starts at byte position 3
U+8BDD '话' starts at byte position 6
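The output above shows the following:
- The symbol ⌘ has the Unicode character value U+2318 and is made up of three bytes: e2 8c 98. These bytes are the UTF-8 encoding of the hexadecimal value 2318.
- Iterating over a string with for range yields a value of type rune on each iteration, whereas the indexed for loop outputs the individual bytes.
- Go uses UTF-8: Go source code is defined to be UTF-8 text, and no other representation is allowed. So when we write ⌘ in the source, the text editor places the UTF-8 encoding of the symbol ⌘ into the source text. When we print the hexadecimal bytes, we are simply dumping the data the editor placed in the file.
- The length returned by len for a string is the number of bytes, not the number of characters.
- The Unicode standard uses the term code point to refer to the item represented by a single value. For example, the symbol ⌘, with hexadecimal value 2318, has the code point U+2318.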
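Because "code point" is a bit of a mouthful, Go introduces a shorter term for the concept: rune. The term appears frequently in the libraries and in source code, and it means essentially the same thing as code point. In addition, Go defines rune as an alias for int32, which makes it clearer when an integer value represents a code point. As a result, what might be called a character constant is called a rune constant in Go. The expression '⌘' has type rune and integer value 0x2318.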
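The points about rune constants and about len counting bytes can be checked directly. This is a minimal sketch; `countBytesAndRunes` is a hypothetical helper name, while `utf8.RuneCountInString` comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countBytesAndRunes is a hypothetical helper: it returns the byte length
// of s and the number of runes (code points) it decodes to.
func countBytesAndRunes(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// A rune constant: its default type is rune, which is an alias for int32.
	r := '⌘'
	fmt.Printf("%T %#x %U\n", r, r, r) // int32 0x2318 U+2318

	nb, nr := countBytesAndRunes("⌘")
	fmt.Println(nb, nr) // 3 1
}
```

Counting runes this way avoids allocating a []rune when only the character count is needed.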
Summary
- Go source code is always UTF-8.
- A string holds arbitrary bytes.
- A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. Some people think Go strings are always UTF-8, but they are not: string values can contain arbitrary bytes, and only string literals without byte-level escapes are guaranteed to contain UTF-8 text.
- Those sequences represent Unicode code points, called runes.
- No guarantee is made in Go that characters in strings are normalized.
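The difference between a plain string literal and a string with byte-level escapes can be sketched as follows. `describe` is a hypothetical helper name, and the escaped bytes `\xbd\xb2` are chosen only as an example of an invalid UTF-8 sequence.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports whether s is valid UTF-8
// and collects the runes that a for range loop produces for it.
func describe(s string) (bool, []rune) {
	var runes []rune
	for _, r := range s {
		runes = append(runes, r)
	}
	return utf8.ValidString(s), runes
}

func main() {
	literal := "⌘"        // plain literal: valid UTF-8
	escaped := "\xbd\xb2" // byte-level escapes: arbitrary bytes

	ok, _ := describe(literal)
	fmt.Println(ok) // true

	ok, runes := describe(escaped)
	fmt.Println(ok) // false
	for _, r := range runes {
		// Each invalid byte is decoded as the replacement character.
		fmt.Printf("%U\n", r) // U+FFFD
	}
}
```

A for range loop never fails on invalid input; it substitutes U+FFFD and advances one byte at a time.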
- String is a nice way to deal with a short sequence of bytes or characters. Because strings are immutable, any operation that changes the content, such as find-and-replace, creates a new string. This is very inefficient if the string is huge, such as file content.
- Byte slice is just like a string, but mutable: you can modify each byte in place. This is very efficient for working with file content, whether a text file, a binary file, or an IO stream from networking.
- Rune slice is like a byte slice, except that each index is a character instead of a byte. This is best if you work with text that has lots of non-ASCII characters, such as Chinese text, math formulas (∑), or text with emoji (♥).
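The mutability difference described above can be illustrated with a short sketch. `replaceRuneAt` is a hypothetical helper name introduced only for this example; it indexes by character, not by byte.

```go
package main

import "fmt"

// replaceRuneAt is a hypothetical helper: it returns a copy of s in which
// the character at index i (counted in runes, not bytes) is replaced by r.
func replaceRuneAt(s string, i int, r rune) string {
	rs := []rune(s) // one element per code point
	rs[i] = r
	return string(rs)
}

func main() {
	// A byte slice can be modified in place.
	b := []byte("hello")
	b[0] = 'H'
	fmt.Println(string(b)) // Hello

	// A rune slice lets us address the third character directly,
	// even though it starts at byte offset 6 in the string.
	fmt.Println(replaceRuneAt("中国话", 2, '人')) // 中国人
}
```

For ASCII-only data the byte slice is enough; the rune slice costs 4 bytes per character but makes character positions line up with indices.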
References
- Strings, bytes, runes and characters in Go
- Go series: the differences between string, bytes, and rune (Go系列 string、bytes、rune的区别)
- Golang: String, Byte Slice, Rune Slice
- unicode/utf8/
- Character encoding notes: ASCII, Unicode, and UTF-8 (字符编码笔记:ASCII,Unicode 和 UTF-8)
