Safer pointers in Rust
For illustration purposes, suppose you want to have a struct First that holds a pointer to a string slice.
It should have a method that returns the first character in said string slice. Storing raw pointers instead of using
owning containers may give you a performance boost in some scenarios (always measure first!), and you may also want to do it to avoid that pesky
borrow checker.
struct First {
    data: *const u8,
}
We will be storing the data as a pointer to unsigned bytes, since a Rust string slice is a sequence of UTF-8 encoded bytes under the hood.
In order to construct one, we will add a new method:
impl First {
    fn new(data: &str) -> Self {
        Self {
            data: data.as_ptr(),
        }
    }
}
To get the first character, we will add a first method:
impl First {
    fn first(&self) -> char {
        // Read the first byte through the raw pointer.
        let first = unsafe { *self.data };
        char::from(first)
    }
}
A pointer to an array holds the memory address of the array's first element. So in order to just get the first character, it is enough to dereference it. Since dereferencing a raw pointer is unsafe, we need to wrap this in an unsafe block.
This will give us the first u8 in the array, but we want to return a char. A char is a 32-bit Unicode scalar value. If our string started with a
multi-byte non-ASCII character, then our code would be incorrect, but since this is just an experiment, we will assume that we are
only dealing with ASCII and just return it as-is.
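To make the "pointer to the first element" idea concrete, here is a minimal sketch. The helper name first_two is made up for this illustration and is not part of the First struct; it assumes the input holds at least two bytes.

```rust
/// Hypothetical helper: returns the first two bytes of `data` by reading
/// through a raw pointer instead of indexing the slice.
fn first_two(data: &str) -> (u8, u8) {
    assert!(data.len() >= 2, "need at least two bytes");
    let ptr: *const u8 = data.as_ptr();
    // SAFETY: `data` is borrowed for the whole call and holds at least two
    // bytes, so both reads stay inside a live allocation.
    unsafe { (*ptr, *ptr.add(1)) }
}

fn main() {
    let (a, b) = first_two("Josh");
    println!("{} {}", char::from(a), char::from(b)); // prints "J o"
}
```

Dereferencing the pointer itself yields the first byte, and add(1) moves one element further along.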
We can then test our code:
fn main() {
    let name = String::from("Josh");
    let first = First::new(&name);
    println!("{}", first.first());
}
This prints: J.
As a test, let’s feed it a multi-byte code-point:
fn main() {
    let name = String::from("你好");
    let first = First::new(&name);
    println!("{}", first.first());
}
This outputs: ä, which is obviously wrong, but expected, since we are not handling
UTF-8 at all. The first byte of 你 in UTF-8 is 0xE4, and char::from maps that byte to the code point U+00E4, which is ä.
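For the curious, here is a rough sketch of what UTF-8-aware decoding through a raw pointer could look like. It is not the approach this post takes: the FirstUtf8 name and the extra len field are assumptions made for this example, and first is marked unsafe because nothing stops the source string from being freed.

```rust
/// Hypothetical variant of `First` that also remembers the byte length,
/// which is needed to rebuild a string slice from the raw pointer.
struct FirstUtf8 {
    data: *const u8,
    len: usize,
}

impl FirstUtf8 {
    fn new(data: &str) -> Self {
        Self {
            data: data.as_ptr(),
            len: data.len(),
        }
    }

    /// # Safety
    /// The string passed to `new` must still be alive and unmodified.
    unsafe fn first(&self) -> Option<char> {
        // SAFETY: pointer and length came from a valid `&str`, and the
        // caller promises that the string has not been freed or changed.
        let bytes = unsafe { std::slice::from_raw_parts(self.data, self.len) };
        let text = unsafe { std::str::from_utf8_unchecked(bytes) };
        // `chars` decodes UTF-8, so a multi-byte character comes out whole.
        text.chars().next()
    }
}

fn main() {
    let name = String::from("你好");
    let first = FirstUtf8::new(&name);
    println!("{:?}", unsafe { first.first() }); // prints Some('你')
}
```

Returning Option<char> also covers the empty string, which the byte-based version cannot handle.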
This exercise is not about correct UTF-8 handling, though. It is about trying to make our unsafe code safer in Rust by leveraging the tools that the language provides.
Consider what happens when we drop the String in variable name before calling println!:
fn main() {
    let name = String::from("你好");
    let first = First::new(&name);
    drop(name);
    println!("{}", first.first());
}
This compiles just fine. We can always choose to use raw pointers in Rust and skedaddle around the borrow checker. There are many other unsafe escape hatches if we find the borrow checker too troublesome and we really know what we are doing, which I might cover in another blog post.
This code of course has a use-after-free bug. When we call first(), we end up dereferencing a pointer to memory that has been freed, since
it was owned by the String bound to the name variable. On my system, this code
outputs nothing, but it may behave differently on your system, depending on a myriad of things like the system allocator, the
version of rustc, etc. It is undefined behavior.
We can use AddressSanitizer (ASan) on nightly to check that we do in fact have a use-after-free, by running our example with
RUSTFLAGS=-Zsanitizer=address cargo +nightly run:
==30967==ERROR: AddressSanitizer: heap-use-after-free on address 0x6020000000f0 at pc 0x00010232e678 bp 0x00016dad21e0 sp 0x00016dad21d8
READ of size 1 at 0x6020000000f0 thread T0
#0 0x00010232e674 in first::First::first::h8e131cbea1cf814a main.rs:30
#1 0x00010232e3ac in first::main::h7e742f6d761002ff main.rs:39
...
Lifetimes?
We could use lifetimes to make sure the compiler checks that we are not dereferencing the pointer if the owner goes out of scope.
We can therefore make our struct First generic over some lifetime, which should match how long our data pointer stays valid.
Big Rust does not want you to know this, but you can actually name your lifetimes however you want. They don’t need to be cryptic one-letter
names like 'a.
struct First<'data> {
    data: *const u8,
}
Here I decided to make our struct First generic over a lifetime called 'data, since our First will be valid as long as our data pointer is
valid. However, this does not compile, since we aren’t actually using this lifetime anywhere in the struct. We can’t just add a lifetime to a pointer, that was
the whole point! We did not want to store a reference and worry about lifetimes! We chose a pointer because pointers allow us to bypass lifetimes
altogether, but now we would like to tie our pointer back to one. Somehow.
We can always get a reference out of a pointer by calling the unsafe as_ref method on the pointer, which returns an Option holding the reference. And if we do end up storing a reference instead of a pointer,
we will be storing a lifetime too, which is what we want. So I guess, if you want safe pointers, just use references. That’s literally why they exist.
But since this is just a convoluted example anyways, and we want to know how to make pointers themselves safer, let’s see how we can do this instead.
The Rust compiler, as helpful as ever, already spoils the solution in its error message:
error[E0392]: lifetime parameter `'data` is never used
 --> src/main.rs:1:14
  |
1 | struct First<'data> {
  |              ^^^^^ unused lifetime parameter
  |
  = help: consider removing `'data`, referring to it in a field, or using a marker such as `std::marker::PhantomData`
The easiest way is to just use PhantomData, which is a marker: we can use it on an extra field to give more information
to the compiler. A marker compiles down to nothing, though. It is only there to help the compiler. In this case, we want to mark our struct as being tied to the lifetime of our data pointer,
but we don’t want to store a reference (because reasons), so we need a way to “store the lifetime itself” so the compiler keeps track of
it for us, without actually storing anything.
struct First<'data> {
    data: *const u8,
    _marker: std::marker::PhantomData<&'data str>,
}
Adding a proper name to the lifetime makes things clearer. We act as if we were actually storing a &'data str, but we
aren’t storing an extra reference. Just our plain unsafe pointer (I would like to remind the reader that a reference is also just a pointer under the hood, plus a length in the case of &str,
and incurs no extra runtime cost).
We update our new method accordingly:
impl<'data> First<'data> {
    fn new(data: &'data str) -> Self {
        Self {
            data: data.as_ptr(),
            _marker: std::marker::PhantomData,
        }
    }
}
And now the borrow checker has got our back and forbids us from compiling the code:
error[E0505]: cannot move out of `name` because it is borrowed
  --> src/main.rs:23:10
   |
21 |     let name = String::from("Josh");
   |         ---- binding `name` declared here
22 |     let first = First::new(&name);
   |                            ----- borrow of `name` occurs here
23 |     drop(name);
   |          ^^^^ move out of `name` occurs here
24 |     println!("{}", first.first());
   |                    ----- borrow later used here
   |
help: consider cloning the value if the performance cost is acceptable
   |
22 |     let first = First::new(&name.clone());
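For reference, here is a sketch of the pieces assembled into one program that does compile, because the String is only dropped after the last use of first. Like the snippets above, it assumes a non-empty ASCII string.

```rust
use std::marker::PhantomData;

struct First<'data> {
    data: *const u8,
    // Zero-sized marker that ties `First` to the borrow of the source string.
    _marker: PhantomData<&'data str>,
}

impl<'data> First<'data> {
    fn new(data: &'data str) -> Self {
        Self {
            data: data.as_ptr(),
            _marker: PhantomData,
        }
    }

    fn first(&self) -> char {
        // SAFETY: `'data` keeps the source string borrowed for as long as
        // `self` exists, so the pointer still points to live memory.
        // Assumption: the string is not empty.
        let first = unsafe { *self.data };
        char::from(first)
    }
}

fn main() {
    let name = String::from("Josh");
    let first = First::new(&name);
    println!("{}", first.first()); // prints "J"
    drop(name); // fine: `first` is not used after this point
}
```

Moving the drop(name) call above the println! brings back the E0505 error shown above.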
If we do
println!("{}", std::mem::size_of::<First>());
with our old implementation of struct First without the PhantomData, and then with our new one that includes it, we will
see that the output is the same: 8 bytes (64 bits for the size of our pointer on my 64-bit machine).
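The comparison can also be run side by side. In this sketch, Plain and Marked are made-up names standing in for the old and new versions of First, and the sizes helper exists only for this example.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Stand-in for the original `First` (hypothetical name).
#[allow(dead_code)]
struct Plain {
    data: *const u8,
}

/// Stand-in for the `PhantomData` version of `First` (hypothetical name).
#[allow(dead_code)]
struct Marked<'data> {
    data: *const u8,
    _marker: PhantomData<&'data str>,
}

/// Returns the sizes of `Plain`, `Marked`, and a bare raw pointer, in bytes.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<Plain>(),
        size_of::<Marked<'static>>(),
        size_of::<*const u8>(),
    )
}

fn main() {
    let (plain, marked, pointer) = sizes();
    // The marker is zero-sized, so it adds nothing to the struct.
    assert_eq!(plain, marked);
    assert_eq!(plain, pointer);
    println!("{plain} {marked} {pointer}");
}
```

The exact numbers depend on the target's pointer width, so the sketch only checks that all three agree.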
This was just references with extra steps.
Please use references.
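For completeness, here is a minimal sketch of the reference-based version. FirstRef is a name made up for this example; it returns an Option<char> so that the empty string and multi-byte characters are handled as well.

```rust
/// Hypothetical reference-based counterpart of `First`.
struct FirstRef<'data> {
    data: &'data str,
}

impl<'data> FirstRef<'data> {
    fn new(data: &'data str) -> Self {
        Self { data }
    }

    /// Returns the first character, or `None` for an empty string.
    fn first(&self) -> Option<char> {
        self.data.chars().next()
    }
}

fn main() {
    let name = String::from("你好");
    let first = FirstRef::new(&name);
    println!("{:?}", first.first()); // prints Some('你')
}
```

No unsafe, no marker field, and the borrow checker rejects the drop(name) mistake in exactly the same way.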