Safely Truncating UTF-8 Text

I'm letting users POST data to a web application, which is then stored in my database. I obviously don't want to accept arbitrarily large strings so I need to truncate the text while ensuring what remains is a valid UTF-8 string.

I'm using the Tide web framework (with tide-sqlx), so the code for my endpoint looks something like:

// Provides user_id() and user_name() methods on a tide::Request.
use crate::service::auth::AuthenticatedRequest as _;

use async_std::prelude::*;
use sqlx::{postgres::Postgres, Acquire};
// Attaches a DB connection onto a tide::Request.
use tide_sqlx::SQLxRequestExt as _;


// Endpoint: POST to /profile/:user
pub async fn edit_profile<State>(mut req: tide::Request<State>) -> tide::Result
    where State: 'static + Clone + Send + Sync,
{
    let viewing_user = req.param("user").unwrap_or_default();
    if ! req.user_name().contains(&viewing_user) {
        // Nobody can modify another person's profile.
        return Ok(tide::Response::new(tide::StatusCode::Unauthorized));
    }

    // In a real application, the maximum size should be read from the configuration
    // file.
    let (user_bio, mut conn) = sanitize_content(req.take_body(), 600)
        .join(req.sqlx_conn::<Postgres>())
        .await;

    sqlx::query("UPDATE users SET info = $2 WHERE id = $1")
        .bind(req.user_id())
        .bind(user_bio?)
        .execute(conn.acquire().await?).await?;

    Ok(tide::Response::new(tide::StatusCode::Ok))
}

async fn sanitize_content(content: tide::Body, max_bytes: usize)
-> std::io::Result<String> {
    todo!()
}

The sanitize_content function reads up to max_bytes bytes from the stream, ensures that it's valid UTF-8, then uses ammonia to clean the HTML.

async fn sanitize_content(content: tide::Body, max_bytes: usize)
-> std::io::Result<String> {
    // Reserving max_bytes bytes for buf would likely be a performance
    // improvement, depending on the size of a typical body.
    let mut buf = vec![];
    content.take(max_bytes as u64)
        .read_to_end(&mut buf).await?;

    let content = String::from_utf8_lossy(truncate_utf8(&buf, max_bytes));

    let clean_content = ammonia::Builder::new()
        .tags(TAGS_ALLOWED_LIST)
        .clean(&content)
        .to_string();

    Ok(clean_content)
}

fn truncate_utf8(text: &[u8], len: usize) -> &[u8] {
    todo!()
}

And now we come to the actual truncation function. For my current use case I technically only need to check the end of the string for validity, since I'm only reading max_bytes from the stream, but I'd rather have a general truncation function that I can use elsewhere.

Notice that both the input and output will be a byte array, rather than a string type. I could use one of String's from_utf8 constructors within truncate_utf8 but whether I want a lossy conversion or not will vary by situation, so that's a decision for the calling function (especially since the only from_utf8 functions that don't traverse the string are the _unchecked variants). The fewer assumptions and promises we make in a function, the better.
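To make the caller's choice concrete, here is a minimal sketch of the two conversions applied to a byte slice. The helper names to_text_strict and to_text_lossy are made up for illustration; the standard library calls they wrap (std::str::from_utf8 and String::from_utf8_lossy) both traverse the input.

```rust
use std::borrow::Cow;

/// Strict conversion: returns an error if any byte sequence is invalid.
/// (Hypothetical helper name, for illustration only.)
fn to_text_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy conversion: replaces each invalid sequence with U+FFFD and only
/// allocates when a replacement was actually needed.
/// (Hypothetical helper name, for illustration only.)
fn to_text_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Given the first two bytes of "né" (the letter n followed by only the first byte of é), the strict version returns an error while the lossy version yields "n" followed by the replacement character.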

A single code point encoded as UTF-8 can occupy up to four bytes (a single Unicode grapheme cluster can be made up of multiple code points, but we're ignoring those here); truncating to len may leave us with a partial code point, so we may need to remove up to three more bytes. We'll start with the code, then look at how it works:

/// Truncate a UTF-8 string to no more than the specified length in bytes,
/// without splitting within a multi-byte sequence.
///
/// Returns a truncated byte slice of the input data.
///
/// See [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629) and the
/// encoding tables at <https://en.wikipedia.org/wiki/UTF-8> for more
/// information.
fn truncate_utf8(text: &[u8], len: usize) -> &[u8] {
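    if text.len() <= len { return text; }

    if len < 3 {
        // Too short for the three-byte lookback below (len-3 would
        // underflow), so validate the short prefix directly instead.
        let head = &text[..len];
        return match std::str::from_utf8(head) {
            Ok(_) => head,
            Err(e) => &head[..e.valid_up_to()],
        };
    }

    let end = &text[len-3..len];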

    if end[2] & 0x80 != 0 {
        if end[2] & 0x40 != 0 {
            &text[..len-1]
        } else if end[1] & 0xE0 == 0xE0 {
            &text[..len-2]
        } else if end[0] & 0xF0 == 0xF0 {
            &text[..len-3]
        } else {
            &text[..len]
        }
    } else {
        &text[..len]
    }
}
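One way to sanity-check a truncation routine like this is to compare it against the standard library's validator. The sketch below is a different technique from the function above: it validates the whole prefix with std::str::from_utf8 and uses Utf8Error::valid_up_to to find the cut, so it traverses the data. The name valid_prefix_len is hypothetical. For valid UTF-8 input, its result should be the length of the slice a correct truncation returns.

```rust
/// Returns how many leading bytes of `bytes[..len]` form valid UTF-8.
/// (Hypothetical helper name, for illustration only.)
fn valid_prefix_len(bytes: &[u8], len: usize) -> usize {
    // Clamp so that a `len` past the end of the input is not an error.
    let cut = &bytes[..len.min(bytes.len())];
    match std::str::from_utf8(cut) {
        Ok(s) => s.len(),
        // `valid_up_to` is the index of the first byte that is not part
        // of a complete, valid sequence.
        Err(e) => e.valid_up_to(),
    }
}
```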

In a UTF-8 encoded code point, the high-order bits of the first byte tell us the number of bytes in the sequence. Each remaining byte in a multi-byte code point begins with binary 10. We simply check the last three bytes to see how long the last sequence is supposed to be, and remove it if we've truncated within it. The table below should help you understand the byte layout.

             Byte 1 - Binary Mask   Byte 1 - Hex Mask   Byte 2 (3, 4 if applicable) - Binary Mask
One Byte     0xxx xxxx              00
Two Bytes    110x xxxx              C0                  10xx xxxx
Three Bytes  1110 xxxx              E0                  10xx xxxx
Four Bytes   1111 0xxx              F0                  10xx xxxx
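The table can be read directly as a lookup on the first byte. Here is a small sketch; the function name sequence_len is made up for illustration. It only classifies bit patterns, so a few bytes that never occur in valid UTF-8 (0xC0, 0xC1, and 0xF5 through 0xF7) are still reported by their pattern.

```rust
/// Returns the total length in bytes of the sequence that `first` begins,
/// or `None` if `first` cannot start a sequence.
/// (Hypothetical helper name, for illustration only.)
fn sequence_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // 0xxx xxxx
        0xC0..=0xDF => Some(2), // 110x xxxx
        0xE0..=0xEF => Some(3), // 1110 xxxx
        0xF0..=0xF7 => Some(4), // 1111 0xxx
        _ => None,              // 10xx xxxx (continuation) or 1111 1xxx
    }
}
```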

Let's walk through each condition of our truncation.

if end[2] & 0x80 != 0 { // 0x80 == 0b1000_0000
    // The high-order bit of the last byte is 1, so we're somewhere
    // in the middle of a multi-byte code.

    if end[2] & 0x40 != 0 { // 0x40 == 0b0100_0000
        // The second-highest bit is 1.
        // In a multi-byte code point, this bit in every byte *except*
        // the first is a 0, so we know that our last byte begins a
        // multi-byte code point and should be removed.

        &text[..len-1]
    } else if end[1] & 0xE0 == 0xE0 {
        // 0xE0 == 0b1110_0000

        // If the first three bits of the second-to-last byte are 1, it
        // begins a three- or four-byte code point. Either way, only two
        // of its bytes fit before the cut, so we need to remove the
        // last two bytes.

        &text[..len-2]
    } else if end[0] & 0xF0 == 0xF0 {
        // If the first four bits of the first byte are 1, we've begun
        // a four-byte code point. We need to remove all three bytes.

        &text[..len-3]
    } else {
        // Our last bytes form a complete UTF-8 code point.

        &text[..len]
    }
} else { // end[2] & 0x80 == 0
    // The high-order bit of the last byte is 0, so it is a single-byte
    // code.

    &text[..len]
}
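For comparison, here is a sketch of a different way to find the same cut: instead of inspecting the three bytes before len, look at the byte just past the cut and step backwards while it is a continuation byte (10xx xxxx). This is not the function walked through above, the name truncate_utf8_scan is hypothetical, and like the original it assumes the input is valid UTF-8. It has no minimum length requirement, so it also works when len is less than 3.

```rust
/// Truncates `text` to at most `len` bytes without splitting a multi-byte
/// sequence, assuming `text` is valid UTF-8.
/// (Hypothetical name; a sketch, not the function from the walkthrough.)
fn truncate_utf8_scan(text: &[u8], len: usize) -> &[u8] {
    if text.len() <= len {
        return text;
    }

    let mut end = len;
    // A cut at `end` is safe when the byte at `end` starts a new sequence.
    // Continuation bytes match 10xx xxxx, i.e. (byte & 0xC0) == 0x80.
    while end > 0 && (text[end] & 0xC0) == 0x80 {
        end -= 1;
    }
    &text[..end]
}
```

With valid input the loop runs at most three times, since a sequence has at most three continuation bytes.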

Truncating a UTF-8 string is pretty simple as long as you're willing to truncate within a grapheme cluster composed of multiple code points. I'm a bit surprised this function is not commonly provided in standard libraries -- Rust could easily have made its String::truncate method behave properly in more scenarios and with little cost for ASCII text, likely reducing the amount of incorrect code people write in the process.
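For text that is already a String, String::truncate panics when the new length does not fall on a char boundary. A minimal sketch of a wrapper that backs up to the nearest boundary first, using str::is_char_boundary, might look like this (the name truncate_string is hypothetical):

```rust
/// Shortens `s` in place to at most `max_bytes` bytes, always cutting on
/// a char boundary. (Hypothetical helper name, for illustration only.)
fn truncate_string(s: &mut String, max_bytes: usize) {
    if s.len() <= max_bytes {
        return;
    }

    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop cannot underflow.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    s.truncate(end);
}
```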