## Storing UTF-8 Encoded Text with Strings
We talked about strings in Chapter 4, but we'll look at them in more depth now.
New Rustaceans commonly get stuck on strings for a combination of three
reasons: Rust's propensity for exposing possible errors, strings being a more
complicated data structure than many programmers give them credit for, and
UTF-8. These factors combine in a way that can seem difficult when you're
coming from other programming languages.
We discuss strings in the context of collections because strings are
implemented as a collection of bytes, plus some methods to provide useful
functionality when those bytes are interpreted as text. In this section, we'll
talk about the operations on `String` that every collection type has, such as
creating, updating, and reading. We'll also discuss the ways in which `String`
is different from the other collections, namely how indexing into a `String` is
complicated by the differences between how people and computers interpret
`String` data.
### What Is a String?
We'll first define what we mean by the term _string_. Rust has only one string
type in the core language, which is the string slice `str` that is usually seen
in its borrowed form `&str`. In Chapter 4, we talked about _string slices_,
which are references to some UTF-8 encoded string data stored elsewhere. String
literals, for example, are stored in the program's binary and are therefore
string slices.
The `String` type, which is provided by Rust's standard library rather than
coded into the core language, is a growable, mutable, owned, UTF-8 encoded
string type. When Rustaceans refer to strings in Rust, they might be
referring to either the `String` or the string slice `&str` types, not just one
of those types. Although this section is largely about `String`, both types are
used heavily in Rust's standard library, and both `String` and string slices
are UTF-8 encoded.
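The difference between the two types can be sketched in a few lines. This is a minimal illustration rather than one of the chapter's listings, and the function names `literal` and `owned` are made up for the example.

```rust
// A string literal has type `&'static str`: a borrowed view of UTF-8 bytes
// stored in the program's binary.
fn literal() -> &'static str {
    "hello"
}

// A `String` owns its bytes, so it can grow and be mutated.
fn owned() -> String {
    let mut s = String::from(literal());
    s.push_str(", world");
    s
}

fn main() {
    let slice: &str = literal();
    let string: String = owned();
    // Borrowing a `String` gives a view that can be used as a `&str`.
    let view: &str = &string;
    println!("{slice} / {view}");
}
```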
### Creating a New String
Many of the same operations available with `Vec<T>` are available with `String`
as well because `String` is actually implemented as a wrapper around a vector
of bytes with some extra guarantees, restrictions, and capabilities. An example
of a function that works the same way with `Vec<T>` and `String` is the `new`
function to create an instance, shown in Listing 8-11.
<Listing number="8-11" caption="Creating a new, empty `String`">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-11/src/main.rs:here}}
```
</Listing>
This line creates a new, empty string called `s`, into which we can then load
data. Often, we'll have some initial data with which we want to start the
string. For that, we use the `to_string` method, which is available on any type
that implements the `Display` trait, as string literals do. Listing 8-12 shows
two examples.
<Listing number="8-12" caption="Using the `to_string` method to create a `String` from a string literal">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-12/src/main.rs:here}}
```
</Listing>
This code creates a string containing `initial contents`.
We can also use the function `String::from` to create a `String` from a string
literal. The code in Listing 8-13 is equivalent to the code in Listing 8-12
that uses `to_string`.
<Listing number="8-13" caption="Using the `String::from` function to create a `String` from a string literal">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-13/src/main.rs:here}}
```
</Listing>
Because strings are used for so many things, we can use many different generic
APIs for strings, providing us with a lot of options. Some of them can seem
redundant, but they all have their place! In this case, `String::from` and
`to_string` do the same thing, so which one you choose is a matter of style and
readability.
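As a quick check that the two approaches really do produce the same value, here is a small self-contained sketch. It is separate from the numbered listings, and the helper name `both_ways` is hypothetical.

```rust
// Hypothetical helper: builds a `String` from the same text with both APIs.
fn both_ways(text: &str) -> (String, String) {
    (text.to_string(), String::from(text))
}

fn main() {
    // `String::new` starts with no contents at all.
    let empty = String::new();
    assert!(empty.is_empty());

    // `to_string` and `String::from` yield equal values for the same input.
    let (a, b) = both_ways("initial contents");
    assert_eq!(a, b);
    println!("{a}");
}
```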
Remember that strings are UTF-8 encoded, so we can include any properly encoded
data in them, as shown in Listing 8-14.
<Listing number="8-14" caption="Storing greetings in different languages in strings">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:here}}
```
</Listing>
All of these are valid `String` values.
### Updating a String
A `String` can grow in size and its contents can change, just like the contents
of a `Vec<T>`, if you push more data into it. In addition, you can conveniently
use the `+` operator or the `format!` macro to concatenate `String` values.
#### Appending to a String with `push_str` and `push`
We can grow a `String` by using the `push_str` method to append a string slice,
as shown in Listing 8-15.
<Listing number="8-15" caption="Appending a string slice to a `String` using the `push_str` method">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-15/src/main.rs:here}}
```
</Listing>
After these two lines, `s` will contain `foobar`. The `push_str` method takes a
string slice because we don't necessarily want to take ownership of the
parameter. For example, in the code in Listing 8-16, we want to be able to use
`s2` after appending its contents to `s1`.
<Listing number="8-16" caption="Using a string slice after appending its contents to a `String`">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-16/src/main.rs:here}}
```
</Listing>
If the `push_str` method took ownership of `s2`, we wouldn't be able to print
its value on the last line. However, this code works as we'd expect!
The `push` method takes a single character as a parameter and adds it to the
`String`. Listing 8-17 adds the letter _l_ to a `String` using the `push`
method.
<Listing number="8-17" caption="Adding one character to a `String` value using `push`">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-17/src/main.rs:here}}
```
</Listing>
As a result, `s` will contain `lol`.
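Both methods can be combined in one short sketch. The helper `append` below is a hypothetical name used only for illustration; it appends a string slice with `push_str` and then a single character with `push`.

```rust
// Hypothetical helper: appends a slice and then one character to `base`.
fn append(base: &str, suffix: &str, last: char) -> String {
    let mut s = String::from(base);
    s.push_str(suffix); // borrows `suffix`; the caller can keep using it
    s.push(last); // appends a single `char`
    s
}

fn main() {
    let s2 = "bar";
    let s1 = append("foo", s2, '!');
    // `s2` is still valid here because `push_str` only borrowed it.
    println!("s1 = {s1}, s2 = {s2}");
}
```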
#### Concatenation with the `+` Operator or the `format!` Macro
Often, you'll want to combine two existing strings. One way to do so is to use
the `+` operator, as shown in Listing 8-18.
<Listing number="8-18" caption="Using the `+` operator to combine two `String` values into a new `String` value">
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-18/src/main.rs:here}}
```
</Listing>
The string `s3` will contain `Hello, world!`. The reason `s1` is no longer
valid after the addition, and the reason we used a reference to `s2`, has to do
with the signature of the method that's called when we use the `+` operator.
The `+` operator uses the `add` method, whose signature looks something like
this:
```rust,ignore
fn add(self, s: &str) -> String {
```
In the standard library, you'll see `add` defined using generics and associated
types. Here, we've substituted in concrete types, which is what happens when we
call this method with `String` values. We'll discuss generics in Chapter 10.
This signature gives us the clues we need in order to understand the tricky
bits of the `+` operator.
First, `s2` has an `&`, meaning that we're adding a _reference_ to the second
string to the first string. This is because of the `s` parameter in the `add`
function: we can only add a `&str` to a `String`; we can't add two `String`
values together. But wait: the type of `&s2` is `&String`, not `&str`, as
specified in the second parameter to `add`. So why does Listing 8-18 compile?
The reason we're able to use `&s2` in the call to `add` is that the compiler
can _coerce_ the `&String` argument into a `&str`. When we call the `add`
method, Rust uses a _deref coercion_, which here turns `&s2` into `&s2[..]`.
We'll discuss deref coercion in more depth in Chapter 15. Because `add` does
not take ownership of the `s` parameter, `s2` will still be a valid `String`
after this operation.
Second, we can see in the signature that `add` takes ownership of `self`
because `self` does _not_ have an `&`. This means `s1` in Listing 8-18 will be
moved into the `add` call and will no longer be valid after that. So, although
`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one,
this statement actually takes ownership of `s1`, appends a copy of the contents
of `s2`, and then returns ownership of the result. In other words, it looks
like it's making a lot of copies, but it isn't; the implementation is more
efficient than copying.
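The ownership rules described here can be observed directly. In this sketch, `concat` is a hypothetical helper with the same shape as the simplified `add` signature: it takes its left operand by value and its right operand by reference.

```rust
// Hypothetical helper shaped like the simplified `add` signature.
fn concat(left: String, right: &str) -> String {
    // `left` is consumed and `right` is only read.
    left + right
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // `s1` is moved into the call; `&s2` is coerced from `&String` to `&str`.
    let s3 = concat(s1, &s2);

    // println!("{s1}"); // would not compile: `s1` was moved
    println!("{s2}"); // still valid: it was only borrowed
    println!("{s3}");
}
```

Uncommenting the `println!` of `s1` should produce a compile error about a moved value, which is the behavior the signature predicts.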
If we need to concatenate multiple strings, the behavior of the `+` operator
gets unwieldy:
```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-01-concat-multiple-strings/src/main.rs:here}}
```
At this point, `s` will be `tic-tac-toe`. With all of the `+` and `"`
characters, it's difficult to see what's going on. For combining strings in
more complicated ways, we can instead use the `format!` macro:
```rust
{{#rustdoc_include ../listings/ch08-common-collections/no-listing-02-format/src/main.rs:here}}
```
This code also sets `s` to `tic-tac-toe`. The `format!` macro works like
`println!`, but instead of printing the output to the screen, it returns a
`String` with the contents. The version of the code using `format!` is much
easier to read, and the code generated by the `format!` macro uses references
so that this call doesn't take ownership of any of its parameters.
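To compare the two approaches side by side, here is a self-contained sketch. The function names `with_plus` and `with_format` are hypothetical; the point is that `format!` only borrows its arguments, while the `+` chain consumes the first `String`.

```rust
// Hypothetical helper: joins three pieces with `+`, consuming the first one.
fn with_plus(s1: String, s2: &str, s3: &str) -> String {
    s1 + "-" + s2 + "-" + s3
}

// Hypothetical helper: joins three pieces with `format!`, borrowing all of them.
fn with_format(s1: &str, s2: &str, s3: &str) -> String {
    format!("{s1}-{s2}-{s3}")
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    // All three values remain usable after `format!`.
    let formatted = with_format(&s1, &s2, &s3);

    // The `+` version moves `s1`, so it has to come last here.
    let added = with_plus(s1, &s2, &s3);

    assert_eq!(formatted, added);
    println!("{formatted}");
}
```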
### Indexing into Strings
In many other programming languages, accessing individual characters in a
string by referencing them by index is a valid and common operation. However,
if you try to access parts of a `String` using indexing syntax in Rust, you'll
get an error. Consider the invalid code in Listing 8-19.
<Listing number="8-19" caption="Attempting to use indexing syntax with a String">
```rust,ignore,does_not_compile
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-19/src/main.rs:here}}
```
</Listing>
This code will result in the following error:
```console
{{#include ../listings/ch08-common-collections/listing-08-19/output.txt}}
```
The error and the note tell the story: Rust strings don't support indexing. But
why not? To answer that question, we need to discuss how Rust stores strings in
memory.
#### Internal Representation
A `String` is a wrapper over a `Vec<u8>`. Let's look at some of our properly
encoded UTF-8 example strings from Listing 8-14. First, this one:
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:spanish}}
```
In this case, `len` will be `4`, which means the vector storing the string
`"Hola"` is 4 bytes long. Each of these letters takes one byte when encoded in
UTF-8. The following line, however, may surprise you (note that this string
begins with the capital Cyrillic letter _Ze_, not the number 3):
```rust
{{#rustdoc_include ../listings/ch08-common-collections/listing-08-14/src/main.rs:russian}}
```
If you were asked how long the string is, you might say 12. In fact, Rust's
answer is 24: that's the number of bytes it takes to encode “Здравствуйте” in
UTF-8, because each Unicode scalar value in that string takes 2 bytes of
storage. Therefore, an index into the string's bytes will not always correlate
to a valid Unicode scalar value. To demonstrate, consider this invalid Rust
code:
```rust,ignore,does_not_compile
let hello = "Здравствуйте";
let answer = &hello[0];
```
You already know that `answer` will not be `З`, the first letter. When encoded
in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would
seem that `answer` should in fact be `208`, but `208` is not a valid character
on its own. Returning `208` is likely not what a user would want if they asked
for the first letter of this string; however, that's the only data that Rust
has at byte index 0. Users generally don't want the byte value returned, even
if the string contains only Latin letters: if `&"hi"[0]` were valid code that
returned the byte value, it would return `104`, not `h`.
The answer, then, is that to avoid returning an unexpected value and causing
bugs that might not be discovered immediately, Rust doesn't compile this code
at all and prevents misunderstandings early in the development process.
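Although indexing with `[]` is rejected, we can still inspect the underlying bytes explicitly with `as_bytes`. The sketch below uses a hypothetical helper, `first_byte`, to show the gap between the first byte and the first character.

```rust
// Hypothetical helper: returns the first byte of a string, if there is one.
fn first_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied()
}

fn main() {
    let hello = "Здравствуйте";

    // `len` counts bytes, while `chars().count()` counts Unicode scalar values.
    println!("bytes: {}", hello.len());
    println!("chars: {}", hello.chars().count());

    // The first byte is only half of the first character.
    println!("first byte: {:?}", first_byte(hello));
    println!("first char: {:?}", hello.chars().next());
}
```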
#### Bytes and Scalar Values and Grapheme Clusters! Oh My!
Another point about UTF-8 is that there are actually three relevant ways to
look at strings from Rust's perspective: as bytes, scalar values, and grapheme
clusters (the closest thing to what we would call _letters_).
If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
stored as a vector of `u8` values that looks like this:
```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
```
That's 18 bytes and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust's `char` type is, those
bytes look like this:
```text
['न', 'म', 'स', '्', 'त', 'े']
```
There are six `char` values here, but the fourth and sixth are not letters:
they're diacritics that don't make sense on their own. Finally, if we look at
them as grapheme clusters, we'd get what a person would call the four letters
that make up the Hindi word:
```text
["न", "म", "स्", "ते"]
```
Rust provides different ways of interpreting the raw string data that computers
store so that each program can choose the interpretation it needs, no matter
what human language the data is in.
A final reason Rust doesn't allow us to index into a `String` to get a
character is that indexing operations are expected to always take constant time
(O(1)). But it isn't possible to guarantee that performance with a `String`,
because Rust would have to walk through the contents from the beginning to the
index to determine how many valid characters there were.
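The byte and scalar-value views of the Hindi word can be checked with the standard library alone. In this sketch the word is spelled out with `\u{...}` escapes so that each scalar value is visible; the helper names `namaste` and `nth_char` are hypothetical. Note that `nth_char` has to walk the string from the start, which is the linear-time cost described above.

```rust
// The Hindi word from the text, one escape per Unicode scalar value.
fn namaste() -> &'static str {
    "\u{0928}\u{092E}\u{0938}\u{094D}\u{0924}\u{0947}"
}

// Hypothetical helper: finds the n-th `char` by scanning from the beginning.
// This is O(n), not O(1), because characters vary in byte length.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let word = namaste();
    println!("bytes:  {:?}", word.as_bytes());
    println!("chars:  {:?}", word.chars().collect::<Vec<char>>());
    println!("byte count: {}", word.len());
    println!("char count: {}", word.chars().count());
    println!("fourth char: {:?}", nth_char(word, 3));
}
```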
### Slicing Strings
Indexing into a string is often a bad idea because it's not clear what the
return type of the string-indexing operation should be: a byte value, a
character, a grapheme cluster, or a string slice. If you really need to use
indices to create string slices, therefore, Rust asks you to be more specific.
Rather than indexing using `[]` with a single number, you can use `[]` with a
range to create a string slice containing particular bytes:
```rust
let hello = "Здравствуйте";
let s = &hello[0..4];
```
Here, `s` will be a `&str` that contains the first four bytes of the string.
Earlier, we mentioned that each of these characters was two bytes, which means
`s` will be `Зд`.
If we were to try to slice only part of a character's bytes with something like
`&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid
index were accessed in a vector:
```console
{{#include ../listings/ch08-common-collections/output-only-01-not-char-boundary/output.txt}}
```
You should use caution when creating string slices with ranges, because doing
so can crash your program.
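If a byte range comes from untrusted input, one way to avoid the panic is the `get` method on `str`, which returns `None` when a range does not fall on character boundaries. The sketch below wraps it in a hypothetical helper named `safe_slice`.

```rust
// Hypothetical helper: slices by byte range, returning `None` instead of
// panicking when the range is out of bounds or splits a character.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";

    // Bytes 0..4 cover exactly two 2-byte characters.
    println!("{:?}", safe_slice(hello, 0, 4));

    // Bytes 0..1 would split the first character in half.
    println!("{:?}", safe_slice(hello, 0, 1));

    // `is_char_boundary` reports whether a byte offset is a valid cut point.
    println!("{}", hello.is_char_boundary(1));
    println!("{}", hello.is_char_boundary(2));
}
```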
### Methods for Iterating Over Strings
The best way to operate on pieces of strings is to be explicit about whether
you want characters or bytes. For individual Unicode scalar values, use the
`chars` method. Calling `chars` on “Зд” separates out and returns two values of
type `char`, and you can iterate over the result to access each element:
```rust
for c in "Зд".chars() {
    println!("{c}");
}
```
This code will print the following:
```text
З
д
```
Alternatively, the `bytes` method returns each raw byte, which might be
appropriate for your domain:
```rust
for b in "Зд".bytes() {
    println!("{b}");
}
```
This code will print the four bytes that make up this string:
```text
208
151
208
180
```
But be sure to remember that valid Unicode scalar values may be made up of more
than one byte.
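When we need both views at once, the standard library's `char_indices` method pairs each `char` with the byte offset where it starts. The helper `layout` below is a hypothetical name; it also records each character's encoded length using `char::len_utf8`.

```rust
// Hypothetical helper: for each `char`, records its starting byte offset
// and how many bytes it occupies in UTF-8.
fn layout(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}

fn main() {
    for (offset, c, len) in layout("Зд") {
        println!("{c} starts at byte {offset} and is {len} bytes long");
    }
}
```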
Getting grapheme clusters from strings, as with the Devanagari script, is
complex, so this functionality is not provided by the standard library. Crates
are available on [crates.io](https://crates.io/)<!-- ignore --> if this is the
functionality you need.
### Strings Are Not So Simple
To summarize, strings are complicated. Different programming languages make
different choices about how to present this complexity to the programmer. Rust
has chosen to make the correct handling of `String` data the default behavior
for all Rust programs, which means programmers have to put more thought into
handling UTF-8 data up front. This trade-off exposes more of the complexity of
strings than is apparent in other programming languages, but it prevents you
from having to handle errors involving non-ASCII characters later in your
development life cycle.
The good news is that the standard library offers a lot of functionality built
off the `String` and `&str` types to help handle these complex situations
correctly. Be sure to check out the documentation for useful methods like
`contains` for searching in a string and `replace` for substituting parts of a
string with another string.
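As a brief taste of those two methods, here is a minimal sketch. The helper name `swap_word` and the sample text are made up for the example.

```rust
// Hypothetical helper: returns a new `String` with every match of `from`
// replaced by `to`. The input is only borrowed.
fn swap_word(text: &str, from: &str, to: &str) -> String {
    text.replace(from, to)
}

fn main() {
    let s = String::from("Hello, world!");

    // `contains` checks for a substring without allocating.
    println!("{}", s.contains("world"));

    // `replace` builds a new `String`; the original is left unchanged.
    let replaced = swap_word(&s, "world", "Rust");
    println!("{replaced}");
    println!("{s}");
}
```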
Let's switch to something a bit less complex: hash maps!