In this article, I’ll share with you a few ways to break a string into characters in Go:
- Using strings.Split()
- Using a []rune conversion – this is my preferred approach
- Using a []byte conversion
1. Break String into Character Substrings using strings.Split()
Given a string and a separator, strings.Split() cuts the string at each occurrence of the separator and returns a slice of the substrings.
When the separator is empty, the function instead splits after each UTF-8 sequence, so it returns a slice holding one single-character string per Unicode character.
[Example]
str := "Hello World"
res := strings.Split(str, "")
fmt.Printf("Data type: %T\n", res[0])
fmt.Printf("Characters: %q\n", res)
Note: the fmt and strings packages need to be imported.
[Output]
Data type: string
Characters: ["H" "e" "l" "l" "o" " " "W" "o" "r" "l" "d"]
This approach is helpful when you want the result to be a slice of strings.
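As a quick check that the empty-separator form splits by character rather than by byte, here is a minimal sketch that runs strings.Split() on a non-ASCII string. The helper name splitChars is my own, introduced only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitChars is a hypothetical helper: it returns one single-character
// string per UTF-8 encoded character in s.
func splitChars(s string) []string {
	return strings.Split(s, "")
}

func main() {
	chars := splitChars("我是程式员")
	// Five characters, although the string occupies 15 bytes.
	fmt.Println(len(chars), len("我是程式员")) // 5 15
	fmt.Printf("%q\n", chars)              // ["我" "是" "程" "式" "员"]
}
```

Each element is itself a string, so a multi-byte character stays intact inside its own element.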
2. String to Rune Slice
The second approach is to convert the string to a slice of runes with []rune(). This is my preferred way to break a string into individual characters.
In Go, rune is an alias for int32 and is used to represent Unicode code points. Since Go strings are conventionally UTF-8 encoded, rune is the data type you should use to store the individual characters of a string in most applications. Naturally, rune also covers ASCII, because ASCII is a subset of Unicode.
The []rune conversion works similarly to strings.Split() with an empty separator, except that it returns a slice of runes rather than a slice of strings.
To convert a string to runes:
outputRuneSlice := []rune(inputStr)
[Example]
str1 := "Hello World"
res1 := []rune(str1)
fmt.Printf("Data type: %T\n", res1[0])
fmt.Printf("Raw values: %v\n", res1)
fmt.Printf("Characters: %q\n", res1)
fmt.Printf("Slice size: %d bytes\n", binary.Size(res1))
str2 := "我是程式员"
res2 := []rune(str2)
fmt.Printf("Data type: %T\n", res2[0])
fmt.Printf("Raw values: %v\n", res2)
fmt.Printf("Characters: %q\n", res2)
fmt.Printf("Slice size: %d bytes\n", binary.Size(res2))
Note: import the fmt and encoding/binary packages.
[Output]
Data type: int32
Raw values: [72 101 108 108 111 32 87 111 114 108 100]
Characters: ['H' 'e' 'l' 'l' 'o' ' ' 'W' 'o' 'r' 'l' 'd']
Slice size: 44 bytes
Data type: int32
Raw values: [25105 26159 31243 24335 21592]
Characters: ['我' '是' '程' '式' '员']
Slice size: 20 bytes
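One practical consequence of working with a rune slice is that indexing counts characters rather than bytes, and the slice converts back to a string with string(). Here is a minimal sketch of both; charAt is a hypothetical helper name of mine, not something from the standard library.

```go
package main

import "fmt"

// charAt is a hypothetical helper: it returns the n-th character
// (0-based) of s, or false when n is out of range.
func charAt(s string, n int) (rune, bool) {
	runes := []rune(s)
	if n < 0 || n >= len(runes) {
		return 0, false
	}
	return runes[n], true
}

func main() {
	r, ok := charAt("我是程式员", 2)
	fmt.Printf("%q %v\n", r, ok) // '程' true

	// A rune slice converts back to the original string.
	runes := []rune("我是程式员")
	fmt.Println(string(runes) == "我是程式员") // true
}
```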
Implicit Conversion to Rune when Ranging Over a String
If the goal is to loop through the characters in a string, you might not need to convert the string into a slice explicitly.
The range clause lets you walk through the string rune by rune.
[Syntax]
for i, v := range <string> {
}
i is the byte index at which the rune of the current iteration starts.
v is the rune of the current iteration, i.e. its numeric code point value.
<string> is a placeholder for the string to range over.
[Example]
str1 := "Hello World"
fmt.Println("Ranging over:", str1)
for _, v := range str1 {
    fmt.Printf("%v:%q ", v, v)
}
fmt.Printf("\n")
str2 := "我是程式员"
fmt.Println("Ranging over:", str2)
for _, v := range str2 {
    fmt.Printf("%v:%q ", v, v)
}
fmt.Printf("\n")
[Output]
Ranging over: Hello World
72:'H' 101:'e' 108:'l' 108:'l' 111:'o' 32:' ' 87:'W' 111:'o' 114:'r' 108:'l' 100:'d'
Ranging over: 我是程式员
25105:'我' 26159:'是' 31243:'程' 24335:'式' 21592:'员'
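The example above discards the index, so it is easy to miss that range yields byte offsets rather than character positions. The following sketch collects those offsets for a string that mixes 1-byte and 3-byte characters; runeOffsets is a hypothetical helper name used only for illustration.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it collects the index that
// range yields for each rune in s.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// 'a' takes 1 byte, '我' takes 3 bytes, 'b' takes 1 byte.
	fmt.Println(runeOffsets("a我b")) // [0 1 4]
}
```

The jump from 1 to 4 is the three bytes occupied by '我'.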
3. String to Byte Slice
A []byte conversion can be used to split a string into characters when the string is made up entirely of ASCII characters.
[Syntax]
byteSlice := []byte(inputStr)
[]byte() can convert any string to bytes, but if the string contains non-ASCII characters, the output will not be a slice of individual characters.
If this is confusing, I encourage you to read this article for a better understanding of how Unicode works.
But here’s what you need to know for now.
Remember that UTF-8 is a variable-length encoding: each character takes between 1 and 4 bytes.
ASCII is a subset of UTF-8, and every ASCII character takes exactly 1 byte.
When you convert a string containing a 2-byte character to []byte, that character is broken into two bytes. Make sense?
But the character is defined by the 2-byte combination. Once it is broken into two separate bytes, neither byte on its own represents a valid character.
For illustration, check out this example.
str1 := "Hello World"
res1 := []byte(str1)
fmt.Printf("Data type: %T\n", res1[0])
fmt.Printf("Raw values: %v\n", res1)
// %q on a whole []byte prints it as one quoted string, so print byte by byte.
fmt.Print("Characters:")
for _, b := range res1 {
    fmt.Printf(" %q", b)
}
fmt.Println()
fmt.Printf("Slice size: %d bytes\n", binary.Size(res1))
str2 := "我是程式员"
res2 := []byte(str2)
fmt.Printf("Data type: %T\n", res2[0])
fmt.Printf("Raw values: %v\n", res2)
fmt.Print("Characters:")
for _, b := range res2 {
    fmt.Printf(" %q", b)
}
fmt.Println()
fmt.Printf("Slice size: %d bytes\n", binary.Size(res2))
[Output]
Data type: uint8
Raw values: [72 101 108 108 111 32 87 111 114 108 100]
Characters: 'H' 'e' 'l' 'l' 'o' ' ' 'W' 'o' 'r' 'l' 'd'
Slice size: 11 bytes
Data type: uint8
Raw values: [230 136 145 230 152 175 231 168 139 229 188 143 229 145 152]
Characters: 'æ' '\u0088' '\u0091' 'æ' '\u0098' '¯' 'ç' '¨' '\u008b' 'å' '¼' '\u008f' 'å' '\u0091' '\u0098'
Slice size: 15 bytes
See how the characters become gibberish for the second string, str2 := "我是程式员", which is made up of multi-byte UTF-8 characters?
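To make the variable-length point concrete, here is a small sketch that reports how many bytes UTF-8 needs for each character, using utf8.RuneLen from the standard unicode/utf8 package. The helper name encodedLengths and the sample string are my own choices for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLengths is a hypothetical helper: it returns the number of
// bytes UTF-8 needs for each character in s.
func encodedLengths(s string) []int {
	var lengths []int
	for _, r := range s {
		lengths = append(lengths, utf8.RuneLen(r))
	}
	return lengths
}

func main() {
	// 'H' (ASCII), U+00E9 (e with acute accent), U+6211 (我), U+1F600 (an emoji).
	fmt.Println(encodedLengths("H\u00e9\u6211\U0001F600")) // [1 2 3 4]
}
```

Any character whose length is greater than 1 is split across several elements when the string is converted to []byte.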
Implicit Conversion to Bytes: String in For-loop
Individual bytes in a string can be accessed by using the square bracket operator in Go.
E.g. str[0] returns the first byte of the string. When every character in the string occupies a single byte (e.g. ASCII characters), str[0] is also the first character of the string.
And since len() returns the number of bytes in a string, we can use the typical three-component for loop to go through a string:
for i := 0; i < len(str); i++ {
}
[Example]
str1 := "Hello World"
for i := 0; i < len(str1); i++ {
    fmt.Printf("%v:%q ", str1[i], str1[i])
}
fmt.Printf("\n")
str2 := "我是程式员"
for i := 0; i < len(str2); i++ {
    fmt.Printf("%v:%q ", str2[i], str2[i])
}
[Output]
// ASCII chars: OK
72:'H' 101:'e' 108:'l' 108:'l' 111:'o' 32:' ' 87:'W' 111:'o' 114:'r' 108:'l' 100:'d'
// Multi-byte character: gibberish
230:'æ' 136:'\u0088' 145:'\u0091' 230:'æ' 152:'\u0098' 175:'¯' 231:'ç' 168:'¨' 139:'\u008b' 229:'å' 188:'¼' 143:'\u008f' 229:'å' 145:'\u0091' 152:'\u0098'
As shown in the output above, the loop goes through the string byte by byte. That is equivalent to character by character only when every character is a single byte (see the first string).
When the string contains multi-byte characters, this approach is probably not what you’re looking for: see how the second string becomes unrecognizable in the output above.
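If you do want an index-based loop over a string with multi-byte characters, one option is to decode a rune at each position and advance by its width. The sketch below uses utf8.DecodeRuneInString from the standard unicode/utf8 package; decodeAll is a hypothetical helper name, and this is just one way to do it, not the only one.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll is a hypothetical helper: it walks s with a three-component
// loop, but advances by the width of each decoded rune instead of by 1.
func decodeAll(s string) []rune {
	var out []rune
	for i, w := 0, 0; i < len(s); i += w {
		r, width := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		w = width
	}
	return out
}

func main() {
	fmt.Printf("%q\n", decodeAll("我是程式员")) // ['我' '是' '程' '式' '员']
}
```

In most cases ranging over the string is simpler and gives the same runes.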
Discussion & Conclusion
In this article, we’ve looked at 3 different ways to break a string into a slice of characters.
The first two approaches, strings.Split() and the []rune conversion, work for all strings in Go.
The third approach, the []byte conversion, requires a little more caution and only yields characters when every character is a single byte, as with ASCII. This approach has one advantage over the other solutions: it is more memory efficient. If you look at the “slice size” metric from the examples above, you’ll see that the rune slice for “Hello World” consumes 44 bytes while the byte slice only requires 11 bytes.
Why is that? Because byte is an alias for uint8 and rune is an alias for int32 in Go. While a UTF-8 encoded character varies in length, a rune has a fixed size and always takes up 4 bytes (32 bits / 8 = 4 bytes).
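Putting the two points together, one way to get the memory saving safely is to check that a string is pure ASCII before choosing the byte slice, and fall back to runes otherwise. This is only a sketch under that assumption; isASCII is a hypothetical helper of mine, and it relies on the utf8.RuneSelf constant (0x80) from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper: it reports whether every byte of s
// is below utf8.RuneSelf (0x80), i.e. whether s is pure ASCII.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"Hello World", "我是程式员"} {
		if isASCII(s) {
			// Safe: each byte is one character (1 byte per element).
			fmt.Printf("%q: byte slice, %d bytes\n", s, len([]byte(s)))
		} else {
			// Fall back to runes (4 bytes per element).
			fmt.Printf("%q: rune slice, %d bytes\n", s, 4*len([]rune(s)))
		}
	}
}
```

For the two sample strings this reports 11 bytes for the byte slice and 20 bytes for the rune slice, matching the slice sizes shown earlier.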