Applies to: Rust workspaces across the ecosystem. Companion to ./rust-patterns — that one covers shape, this one covers speed.
- unsafe_code = "forbid" at the workspace. A specific crate can
override to "allow" on a conservative, case-by-case basis (existing
precedent: FFI/binding crates, see ./rust-patterns §Lints).
Performance can justify the same override — see §Unsafe escape hatch.
Never per-function in an otherwise-safe crate.
- Stable Rust. No #![feature(...)], no nightly toolchains.
- tokio runtime. Thread-per-core runtimes (glommio, monoio) are
out of scope — see §Out of scope.
Optimization without profiling is guesswork. Always profile and benchmark
with --release; debug builds run 10–100× slower with different hot paths.
| Tool | Surface | When |
| ------------------- | ---------------------------------------- | -------------------------------- |
| samply | CPU sampling, flamegraphs | "Where is wall-clock going?" |
| tokio-console | Live task states, busy/idle, polls | Async stalls, starvation |
| cargo-instruments | macOS Instruments integration | Memory allocations on Apple HW |
| Cachegrind | Instruction counts, I-cache, branch miss | Verifying inline/cold heuristics |
samply is the default starting point on Linux. Reach for tokio-console
the moment an async server feels mysteriously sluggish — it surfaces tasks
that never yield, polls that take milliseconds, and workers that idle while
the queue grows.
| Crate | Metric | Strengths | | --------------- | ------------------ | --------------------------------------------------------- | | Criterion | Wall-clock + stats | Long-standing default; CI regression integrations | | Divan | Wall-clock + stats | Lighter macros, native multithreaded benches | | Iai-Callgrind | Instruction counts | Deterministic single run, no OS jitter; ideal for CI/micro |
Wall-clock benches are intuitive but sensitive to OS jitter and thermal scaling. Instruction-count benches catch single-instruction regressions but have weaker non-x86 / non-Linux support.
Never hold a sync lock across .await. parking_lot::Mutex and
std::sync::Mutex guards block the executor thread; if the holder yields
to the runtime mid-section, the runtime can deadlock or starve other tasks
that need the lock. Either:
- Scope the guard so it drops before the await point:
```rust
let value = { let g = state.lock(); g.derive() };
do_async(value).await;
```
- Or use tokio::sync::Mutex / tokio::sync::RwLock, which suspend
cleanly across yields.
parking_lot remains the default for sync sections (./rust-patterns
§Dependencies). tokio::sync::* is for critical regions that themselves
must .await. std::sync::* only when you need poisoning semantics.
Arc<RwLock<HashMap<K, V>>> serializes all readers under any contended
write, and even a pure-read workload bounces the lock's cache line across
cores. DashMap shards internally — each shard has its own lock, so
unrelated keys don't collide. Reach for it when profiling shows lock
contention on a single map. Not the default; the RwLock shape is fine
until proven otherwise.
Allocate on purpose, not by reflex. A deliberate allocation can be the right design — terminating a pipeline, decoupling lifetimes, batching work that would otherwise repeat. The tactics below target *incidental* allocations: heap traffic that snuck in by habit, not by choice. Cheap, idiomatic wins — apply without ceremony.
- Vec::with_capacity(n) when n is known or estimable. Default
growth is geometric reallocation + memcpy per resize.
- swap_remove(i) instead of remove(i) when ordering is irrelevant.
O(1) vs O(n).
- retain(|x| ...) for bulk filter-deletes — single pass, in place.
- .copied() / .cloned() on iterators of Copy references often
generates better codegen than threading &T through the chain.
- Don't .collect() mid-pipeline just to keep chaining. Fuse the
iterators (iter().map(...).filter(...).sum()) — every intermediate
Vec is a heap allocation plus drop.
- Strings: String::with_capacity, write! into a buffer, format!
only when the result is the final value.
- Take &str / &[T], not String / Vec<T>, unless you need
ownership. Already covered in ./rust-patterns §Avoid clone smells —
it's both a clarity and an allocation rule.
Let the compiler prove what you already know. Rust inserts a runtime
bounds check on every vec[i] access; inside tight loops these stall
the pipeline and inhibit auto-vectorization. The tactics below give
LLVM the invariants it needs to drop the check:
- Iterators over indexing. for x in &v — no check.
for i in 0..v.len() { v[i] } — check every iteration.
- Hoist slice lengths. let slice = &v[..n]; *before* the loop. The
compiler proves bounds once, elides per-iteration checks.
- assert!(i < v.len()) before a loop is a free optimizer hint —
LLVM propagates the invariant downward.
get_unchecked / get_unchecked_mut are off-limits in workspace-default
crates. If a benchmark proves the bounds check is the bottleneck *and*
iterator/assert rewrites can't eliminate it, isolate the hot kernel in a
dedicated crate that overrides unsafe_code = "allow" — see §Unsafe
escape hatch.
Inlining isn't primarily about saving call overhead — it's about giving
the compiler a single block to optimize across, so constants fold,
branches collapse, and dead code drops. #[inline] is a hint;
#[inline(always)] is a mandate; #[inline(never)] blocks it. Apply
surgically:
- #[inline(always)]: tiny, hot, frequently-called helpers — e.g.
branches that collapse away when a callsite passes false to a bool
parameter.
- #[inline(never)] + #[cold]: rare error formatters, panic
builders, anything off the hot path. Pulls cold code out of the critical
I-cache footprint, keeping the hot sequence dense.
- Cross-crate inlining: #[inline] only inlines within a crate by
default. The release profile sets lto = true and codegen-units = 1
(./rust-patterns §Release Profile), which enables cross-crate
inlining without per-fn annotations.
Don't sprinkle #[inline(always)] on large functions. Duplicating big
bodies at every call site bloats the binary and blows the I-cache.
Verify with Cachegrind: an inlined function's brace lines have zero event counts.
When per-thread counters/stats sit adjacent in memory, they share a 64-byte CPU cache line. A write by thread A invalidates the line on every other core, forcing thread B to re-fetch from memory even though it touched a different field. Under contention this can be a 5–10× slowdown on what looks like independent atomic increments.
#[repr(align(64))]
struct ThreadStats {
allocs: AtomicU64,
bytes: AtomicU64,
}Cost: padding bytes per struct. Win: eliminating cross-core invalidation traffic on hot atomics. Apply to per-thread / per-shard structures that get written from multiple cores.
Status: not currently in use in any workspace crate. Documented as the default choice if/when AST or parser work motivates an arena — pointer- bump allocations + single arena drop eliminate per-node free-list traversal and most pointer chasing, with cache-friendly contiguous layout.
The drop-glue trap: bumpalo::Vec and arena-allocated types do *not*
run Drop for their contents when the arena drops. If you put a String
or std::vec::Vec inside a bumpalo::Vec and call into_bump_slice, the
inner heap allocations leak permanently. Real-world incidents elsewhere
have run 18 months before detection.
Discipline:
- Arenas hold POD-only data (Copy types, &'arena str, primitives,
arena-internal references).
- For arena-allocated types with destructors, use typed-arena instead —
it walks allocations and runs Drop on teardown.
- Never round-trip global-heap collections (String, Vec, HashMap)
through into_bump_slice.
- One arena per logical phase (per-file parse, per-request render). Drop
at phase end.
Status: not currently in use. Documented as a candidate for content-
addressed object storage — blake3-hashed bodies and {path → hash}
snapshot manifests that get read repeatedly without intervening mutation.
The HTTP / SSE / JSON-RPC wire surfaces stay on serde + serde_json.
serde_json parses bytes into freshly allocated structs on every read.
rkyv skips the parse step entirely: the on-disk bytes *are* the in-
memory layout. An archive is reached via a single offset calculation
into the buffer, returning a &Archived<T> that aliases the buffer.
Internal pointers are encoded as offsets relative to the field's own
position, so the buffer is valid whether it was just mmap'd, read from
disk, or received over a socket — no fixup pass needed.
When it fits:
- mmap'd or repeatedly-read on-disk artifacts (content-addressed bodies, packed snapshot manifests, cached index pages) - the hot path is "decode → traverse → drop without mutating" - the schema is stable across versions, or evolution is rare enough to warrant explicit migration
When it doesn't:
- HTTP / SSE / JSON-RPC wire surfaces — serde_json stays the right call
- mutation-heavy workloads — writes go through AlignedVec, not in-place
edits; rkyv is for the read side
- one-shot reads where the buffer is decoded once and thrown away —
serde's cost is amortized away
Discipline:
- Pair archived reads from untrusted input with bytecheck (rkyv's
validation layer). Without it, a malformed buffer can dereference out-
of-bounds offsets — and validation is opt-in, not the default.
- Treat the archived schema as a wire format. Renaming a field or
reordering enum variants breaks every existing file on disk. The
pre-stable workspace policy still applies — break in place and
migrate, no compat shims — but the migration is "re-archive every
file," not just "update callers."
- Don't mix archived and serde-derived shapes for the same type. Pick
one per type so a reader doesn't have to guess.
Portable SIMD (std::simd) is nightly-only and out of scope. On stable:
- target-feature / target-cpu=native via RUSTFLAGS enables LLVM
auto-vectorization for the build target. The compiler emits AVX2 / SSE /
NEON when the target supports it, no source changes required.
- Crate-specific SIMD features: some crates expose simd cargo
features that gate std::arch intrinsic paths. Precedent: blake3
uses wasm32_simd for WASM SIMD (./wasm-patterns).
- Don't ship AVX-512 binaries to general consumers — most consumer
CPUs lack it, and binaries compiled with AVX-512 crash instantly on
older hardware with no unsafe block to blame. Use target-cpu=native
for in-house workloads; explicit feature flags for distributed binaries.
Hand-rolled std::arch intrinsics require unsafe. Isolate in a dedicated
crate (§Unsafe escape hatch); pair with strict bench regression tests.
Status: no workspace crate currently overrides the global allocator. Guidance here is for when a daemon's RSS or CPU profile motivates a switch.
Default glibc malloc fragments badly under long uptimes with mixed-size
async allocations — RSS climbs continuously even with no leaks. Swapping
the allocator can drop RSS and CPU substantially.
| Allocator | Strengths | Trade-offs |
| --------- | ------------------------------------------------------ | ------------------------------------------- |
| glibc | Default, no setup | Heavy fragmentation in long-running servers |
| jemalloc | Rock-stable RSS under chaotic load; great profiling | Slightly higher CPU than mimalloc |
| mimalloc | Best throughput / lowest CPU; aggressive munmap | RSS can spike during traffic bursts |
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;Candidates: long-running daemons (zzz_server). Not candidates:
CLIs and short-lived tools — startup cost matters more than fragmentation
they'll never see. Bench per service before adopting; the right answer
depends on workload shape.
C-malloc gotcha: a dependency that links a C library calling raw
malloc (e.g., LMDB) bypasses your Rust global allocator, mimicking a
leak. Fix: use mimalloc's symbol-override mode or LD_PRELOAD it so C
malloc calls also route through it.
Workspace lints forbid unsafe_code. A specific crate can override to
"allow" when justified — existing precedent is FFI/binding crates
(blake3_component, N-API bindings; see ./rust-patterns §Lints).
Performance can also justify it, but conservatively:
- Isolate. The unsafe code lives in a dedicated crate or tightly-scoped
module, never sprinkled per-function across safe code.
- Document. Every unsafe { ... } block carries a // SAFETY: comment
stating the invariants the caller must uphold.
- Bench-justify. A regression test or benchmark demonstrates the
unsafe path wins meaningfully over the safe equivalent — not "I think
this might be faster."
- Reversible. Keep a safe fallback in the same crate so the unsafe
path can be disabled if it ever causes trouble in production.
- Narrow scope — the crate-level unsafe_code = "allow" should not
become a license for unrelated unsafe elsewhere in the crate. Code
review treats new unsafe blocks as gated.
Candidates that have cleared this bar elsewhere: get_unchecked in
proven-safe inner loops, std::arch SIMD intrinsics for a specific CPU
target. Candidates that have *not*: dodging clone(), "the compiler
should be able to prove this," speed claims without measurements.
Honest notes to prevent cargo-culting:
- Thread-per-core runtimes (glommio, monoio): Linux-only, io_uring-
bound, abandon tokio. Stack uses tokio + axum throughout — adopting TPC
for one service is a major architectural break and the load-balance
trade-offs rarely favor it.
- Portable SIMD (std::simd): nightly-only. See §SIMD on stable.
- Structure-of-Arrays layouts (soapy, soa_derive): niche to bulk
numeric pipelines (graphics, physics, ML kernels). Reach for it if
profiling shows cache-line waste on a homogeneous workload; otherwise
the AoS-shaped struct is fine and clearer.
- multiversion crate for runtime CPU-feature dispatch: not currently
justified — single-target builds suffice for the workloads here.
- Manual AVX intrinsics with hand-unrolled loops: covered by §Unsafe
escape hatch; rare in practice and almost never the right first move.
- Left-right concurrency (evmap and similar): two synchronized
copies of a structure for fully lock-free reads at the cost of 2×
memory, eventual consistency, and writers blocked on slow readers.
Niche to workloads where reads outnumber writes by orders of magnitude
*and* DashMap / RwLock have been profiled as the bottleneck. Today
the read paths here haven't hit that wall.
- Hand-rolled lock-free structures with crossbeam-epoch: epoch-
based reclamation is the standard answer for "free a node only once
no reader can still see it," but writing your own lock-free stack /
queue / skiplist is rarely the right move. DashMap, tokio::sync,
and crossbeam::queue already vendor the well-tested versions —
reach for those before reaching for the primitive.