Generated: 2026-02-15 23:35:37 UTC
| Parameter | Value |
|---|---|
| Repo | vscode (git mode, warm cache) |
| Bytes scanned | 1,203,352,576 (1.12 GiB) |
| Machine | ARM Graviton3 (aarch64), 16 vCPUs, 61 GiB RAM |
| perf_event_paranoid | 2 (user-space events, :u suffix) |
| Methodology | 1 warmup + 1 measured run per (scanner, event group) |
| Event multiplexing | 6 events per group, time-multiplexed by kernel |
Original benchmark results (vscode, git, warm):
| Scanner | Wall Time | Throughput | Findings |
|---|---|---|---|
| scanner-rs | 13.7s | 84 MiB/s | 98,584 |
| Kingfisher | 31.1s | 37 MiB/s | 303 |
| TruffleHog | 178.5s | 6.4 MiB/s | 0 |
| Gitleaks | 113.7s | 10.1 MiB/s | 116 |
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| Cycles | 157,619,084,107 | 532,509,784,422 | 678,210,237,914 | 2,271,669,688,468 |
| Instructions | 411,789,892,684 | 1,426,719,985,424 | 1,696,903,305,992 | 10,690,140,886,959 |
| L1D loads | 123,210,234,217 | 452,557,253,455 | 498,049,596,053 | 4,452,278,795,146 |
| L1D misses | 1,209,415,888 | 2,447,094,121 | 5,128,182,208 | 8,775,249,857 |
| L1I loads | 80,809,948,119 | 330,390,689,117 | 355,390,878,506 | 1,523,868,766,465 |
| L1I misses | 283,561,600 | 745,428,529 | 5,392,181,661 | 2,888,133,830 |
| L2D refill | 328,843,938 | 497,284,829 | 2,618,442,067 | 1,850,602,731 |
| L2D LLC miss | 329,841,031 | 495,869,342 | 2,592,819,199 | 1,843,850,762 |
| L2D writeback | 713,524,135 | 1,517,956,160 | 4,007,442,708 | 3,145,739,388 |
| L2D allocate | 208,956,309 | 474,052,300 | 620,612,044 | 524,963,070 |
| Memory accesses | 123,985,681,355 | 454,409,357,352 | 499,102,529,628 | 4,445,246,559,947 |
| Branch predictions | 97,941,399,259 | 299,916,874,655 | 309,692,700,357 | 2,621,171,103,819 |
| Branch misses | 1,870,945,056 | 8,238,677,792 | 7,879,836,883 | 11,553,796,741 |
| Frontend stalls | 11,626,952,340 | 52,555,189,605 | 97,980,832,540 | 90,284,295,847 |
| Backend stalls | 64,097,407,942 | 141,440,571,268 | 230,734,465,747 | 663,045,657,519 |
| dTLB loads | 126,957,811,310 | 452,810,087,792 | 499,657,825,644 | 4,455,730,287,796 |
| dTLB misses | 826,966,710 | 1,533,221,981 | 4,008,936,174 | 4,157,538,969 |
| dTLB walks | 78,874,945 | 111,265,937 | 461,329,206 | 283,306,516 |
| iTLB loads | 25,600,430,196 | 143,805,579,347 | 167,599,942,712 | 167,910,816,063 |
| iTLB misses | 44,955,625 | 112,654,236 | 879,805,524 | 415,683,618 |
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| IPC | 2.61 | 2.68 | 2.50 | 4.71 |
| L1D miss rate | 0.982% | 0.541% | 1.03% | 0.197% |
| L1I miss rate | 0.351% | 0.226% | 1.52% | 0.190% |
| L2 miss rate | 6.24% | 4.27% | 9.55% | 6.42% |
| LLC miss rate | 6.25% | 4.26% | 9.46% | 6.40% |
| Branch miss rate | 1.91% | 2.75% | 2.54% | 0.441% |
| Frontend stall | 7.38% | 9.87% | 14.45% | 3.97% |
| Backend stall | 40.67% | 26.56% | 34.02% | 29.19% |
| dTLB miss rate | 0.651% | 0.339% | 0.802% | 0.093% |
| iTLB miss rate | 0.176% | 0.078% | 0.525% | 0.248% |
| Insns per L1D miss | 340.49 | 583.03 | 330.90 | 1,218.21 |
| Bytes per insn | 0.0029 | 0.0008 | 0.0007 | 0.0001 |
Each subsection maps a measured CPU-level advantage to specific design decisions in the scanner-rs source code, contrasting with the approaches taken by the competing scanners: TruffleHog and Gitleaks (Go) and Kingfisher (Rust).
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| IPC | 2.61 | 2.68 | 2.50 | 4.71 |
| Branch misses | 1,870,945,056 | 8,238,677,792 | 7,879,836,883 | 11,553,796,741 |
| Branch miss rate | 1.91% | 2.75% | 2.54% | 0.441% |
Design: scanner-rs compiles all ~223 detection rules into a single Vectorscan (Hyperscan) multi-pattern DFA that scans the input buffer in one pass using SIMD-accelerated state transitions. The DFA's deterministic state machine eliminates per-pattern branch speculation — each byte advances the state via a table lookup rather than a branch tree.
Note on IPC: Gitleaks shows the highest IPC because its inner loop (iterate rules, run regex, check match) is a simple, predictable pattern that the branch predictor handles well. However, it executes 26× more instructions; high IPC on wasted work is not an advantage. scanner-rs executes complex DFA state transitions (hence a lower IPC) but needs 3.5–26× fewer total instructions, resulting in 3.4–14.4× fewer total cycles.
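The branch profile described above can be illustrated with a toy table-driven matcher. This is not scanner-rs code: the real engine compiles ~223 patterns into one Vectorscan database, while this sketch builds a KMP-style transition table for a single literal anchor. What it demonstrates is the hot loop shape: one table lookup per input byte, with no per-pattern branch tree for the CPU to speculate through.

```rust
// Toy table-driven matcher for the single anchor "AKIA" (states 0..=4).
// Real scanner-rs compiles all patterns into one Vectorscan database;
// this only sketches the branch profile of a DFA scan loop.

const ANCHOR: &[u8] = b"AKIA";

/// Build a (len+1)-state x 256-byte transition table with KMP-style
/// fallbacks. Brute-force construction is fine for a toy.
fn build_table() -> Vec<[u8; 256]> {
    let n = ANCHOR.len();
    let mut t = vec![[0u8; 256]; n + 1];
    for state in 0..=n {
        for b in 0u16..256 {
            // Next state = longest prefix of ANCHOR that is a suffix
            // of (matched prefix + this byte).
            let mut cur: Vec<u8> = ANCHOR[..state].to_vec();
            cur.push(b as u8);
            let mut next = 0;
            for k in (1..=n.min(cur.len())).rev() {
                if cur[cur.len() - k..] == ANCHOR[..k] {
                    next = k;
                    break;
                }
            }
            t[state][b as usize] = next as u8;
        }
    }
    t
}

/// Scan: the hot loop is a pure table walk; the only conditional is the
/// rarely-taken accept check, so the branch predictor has little to miss.
fn scan(table: &[[u8; 256]], haystack: &[u8]) -> Vec<usize> {
    let accept = ANCHOR.len() as u8;
    let mut state = 0u8;
    let mut hits = Vec::new();
    for (i, &b) in haystack.iter().enumerate() {
        state = table[state as usize][b as usize];
        if state == accept {
            hits.push(i + 1 - ANCHOR.len()); // start offset of the hit
        }
    }
    hits
}

fn main() {
    let table = build_table();
    assert_eq!(scan(&table, b"key=AKIAIOSFODNN7 other AKIAXYZ"), vec![4, 24]);
}
```

The state transition itself is branch-free: each byte indexes into the table regardless of which of the compiled patterns (here, just one) is in progress.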
Code:
- `src/engine/vectorscan_prefilter.rs:112-135` — `VsPrefilterDb`: compiled database holding all patterns
- `src/engine/vectorscan_prefilter.rs:89-100` — `RawPatternMeta`: 12-byte `#[repr(C)]` per-pattern metadata for the match callback
- `src/engine/core.rs:30-44` — Scan phase algorithm: prefilter → window → validate
Contrast: TruffleHog and Gitleaks are written in Go. TruffleHog uses a per-detector regex scan loop — each detector runs its own regex engine against every matched span. Gitleaks iterates all rules sequentially with Go's regexp package. Both approaches create unpredictable branching as the CPU must speculate which rule will match next. Kingfisher (also Rust) uses Vectorscan + per-rule regex — the same two-stage architecture — but without scanner-rs's window narrowing, resulting in separate regex validation over broader regions.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| L2 refills | 328,843,938 | 497,284,829 | 2,618,442,067 | 1,850,602,731 |
| L2 miss rate | 6.24% | 4.27% | 9.55% | 6.42% |
| LLC misses | 329,841,031 | 495,869,342 | 2,592,819,199 | 1,843,850,762 |
Design: Each scanner-rs worker thread owns its own WorkerCtx containing a ScratchVec, Vectorscan VsScratch, and BufferPool — all accessed via Rc (not Arc), never shared across threads. This means each worker's hot data stays in its own L1/L2 cache slice without cross-core invalidation traffic.
Code:
- `src/scratch_memory.rs:54` — `ScratchVec<T>`: fixed-capacity, page-aligned, never reallocates
- `src/scheduler/executor.rs:472-508` — `WorkerCtx`: per-worker deque + scratch + metrics
- `src/engine/vectorscan_prefilter.rs:229-252` — `VsScratch`: per-thread, `Send` but not `Sync`
Contrast: Go scanners share detector state behind mutexes or use goroutine-local state that the Go runtime may migrate between OS threads. This migration causes cache-line bouncing as L2 lines must be fetched from remote cores. The MOESI/MESI protocol upgrade from Shared→Modified on every counter increment amplifies this cost.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| L2 writebacks | 713,524,135 | 1,517,956,160 | 4,007,442,708 | 3,145,739,388 |
| L2 allocations | 208,956,309 | 474,052,300 | 620,612,044 | 524,963,070 |
Design: scanner-rs uses #[repr(align(64))] padding on shared atomic counters (CachePaddedAtomicU64) to ensure each counter occupies its own 64-byte cache line. A compile-time assertion verifies the size and alignment. This eliminates false sharing, where incrementing one counter would otherwise invalidate adjacent counters in the same cache line.
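A minimal version of the padded counter, consistent with the description above (the real type lives at src/engine/core.rs:142-167; the Counters struct here is an illustrative stand-in for VectorscanCounters):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One atomic counter per 64-byte cache line: align(64) forces the
/// struct onto a line boundary, and the size is rounded up to the
/// alignment, so neighbouring counters can never share a line.
#[repr(align(64))]
pub struct CachePaddedAtomicU64 {
    value: AtomicU64,
}

// Compile-time guards: if a refactor shrinks or misaligns the type,
// the build fails instead of silently reintroducing false sharing.
const _: () = assert!(std::mem::size_of::<CachePaddedAtomicU64>() == 64);
const _: () = assert!(std::mem::align_of::<CachePaddedAtomicU64>() == 64);

impl CachePaddedAtomicU64 {
    pub const fn new(v: u64) -> Self {
        Self { value: AtomicU64::new(v) }
    }
    pub fn add(&self, n: u64) {
        // Relaxed ordering is enough for statistics counters.
        self.value.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

// Counter block in the style of VectorscanCounters: each field owns its
// own cache line, so threads bumping different counters never contend.
pub struct Counters {
    pub bytes_scanned: CachePaddedAtomicU64,
    pub hits: CachePaddedAtomicU64,
}

fn main() {
    let c = Counters {
        bytes_scanned: CachePaddedAtomicU64::new(0),
        hits: CachePaddedAtomicU64::new(0),
    };
    c.bytes_scanned.add(4096);
    c.hits.add(1);
    assert_eq!(c.bytes_scanned.get(), 4096);
}
```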
Code:
- `src/engine/core.rs:142-167` — `CachePaddedAtomicU64` with `#[repr(align(64))]` and compile-time size/alignment assertion
- `src/engine/core.rs:172-179` — `VectorscanCounters`: each field is a `CachePaddedAtomicU64`
Contrast: In Go, atomic.Int64 fields are typically packed together in structs without explicit padding. When multiple goroutines increment adjacent counters, the cache-line ping-pong effect causes L2 writebacks to spike as modified lines bounce between cores.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| dTLB misses | 826,966,710 | 1,533,221,981 | 4,008,936,174 | 4,157,538,969 |
| dTLB miss rate | 0.651% | 0.339% | 0.802% | 0.093% |
| dTLB walks | 78,874,945 | 111,265,937 | 461,329,206 | 283,306,516 |
Design: scanner-rs pre-allocates all major data structures at startup and never reallocates during scanning:
- `ScratchVec`: page-aligned, fixed capacity, no reallocation
- `NodePoolType`: contiguous arena with bitset free-list
- `BufferPool`: pre-allocated fixed-size (8 MiB) chunk buffers
These allocations form a compact, stable virtual address footprint. The TLB entries for these pages stay warm throughout the scan because the same pages are accessed repeatedly without new mappings.
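A toy arena in the spirit of NodePoolType, with a contiguous slot array and a bitset free-list; the names and shapes here are illustrative, not the actual src/pool/node_pool.rs implementation. All memory is allocated once up front, so the pool's pages stay mapped (and TLB-warm) for the whole scan:

```rust
/// Fixed-capacity arena: contiguous storage plus one free-bit per slot.
/// The pool never grows, so its virtual address footprint is stable.
struct NodePool<T> {
    slots: Vec<Option<T>>, // contiguous storage, sized at construction
    free: Vec<u64>,        // 1 bit per slot: 1 = free
}

impl<T> NodePool<T> {
    fn with_capacity(cap: usize) -> Self {
        let words = (cap + 63) / 64;
        let mut free = vec![u64::MAX; words];
        // Mask off bits beyond `cap` so they are never handed out.
        if cap % 64 != 0 {
            free[words - 1] = (1u64 << (cap % 64)) - 1;
        }
        let mut slots = Vec::with_capacity(cap);
        slots.resize_with(cap, || None);
        NodePool { slots, free }
    }

    /// Find a set bit and claim it. No heap allocation on this path;
    /// effectively O(1) while the pool is not nearly full.
    fn alloc(&mut self, value: T) -> Option<usize> {
        for (w, word) in self.free.iter_mut().enumerate() {
            if *word != 0 {
                let bit = word.trailing_zeros() as usize;
                *word &= !(1u64 << bit);
                let idx = w * 64 + bit;
                self.slots[idx] = Some(value);
                return Some(idx);
            }
        }
        None // pool exhausted: the caller must handle it, we never grow
    }

    fn free(&mut self, idx: usize) -> Option<T> {
        self.free[idx / 64] |= 1u64 << (idx % 64);
        self.slots[idx].take()
    }
}

fn main() {
    let mut pool: NodePool<&str> = NodePool::with_capacity(100);
    let a = pool.alloc("first").unwrap();
    assert_eq!(pool.alloc("second"), Some(1));
    assert_eq!(pool.free(a), Some("first"));
    // The freed slot is reused; no new page is ever touched.
    assert_eq!(pool.alloc("third"), Some(0));
}
```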
Code:
- `src/scratch_memory.rs:43-127` — `ScratchVec<T>`: page-aligned, never grows
- `src/pool/node_pool.rs:44-114` — `NodePoolType`: contiguous buffer + bitset free-list, O(1) allocate/free
- `src/runtime.rs:570-704` — `BufferPoolInner`: `Rc` + `UnsafeCell`, single-threaded, fixed capacity
Contrast: Go's garbage collector does not compact the heap, but the runtime does copy goroutine stacks to new addresses when they grow, relocating stack-resident data. More significantly, dynamic append() on slices triggers reallocation and copies to new virtual addresses, fragmenting the address space, spreading the working set across more pages, and increasing TLB pressure.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| Insns per L1D miss | 340.49 | 583.03 | 330.90 | 1,218.21 |
| L1D miss rate | 0.982% | 0.541% | 1.03% | 0.197% |
Design: Hot-path metadata is packed into minimal, cache-friendly structures:
- `PairMeta`: 4 bytes (`u16` len + `u8` coalesced + `u8` pad). 16 consecutive pairs fit in one 64-byte cache line.
- `RawPatternMeta`: 12 bytes (`u32` rule_id + `u32` match_width + `u32` seed_radius), `#[repr(C)]` with a compile-time size assertion. 5 entries fit per cache line.
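The layouts above can be sketched directly; the field names follow the description, and the const assertions show the compile-time size guards that keep padding from creeping in:

```rust
/// 4 bytes as described: 16 consecutive PairMeta entries fill exactly
/// one 64-byte cache line. #[repr(C)] pins the field order.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct PairMeta {
    len: u16,
    coalesced: u8,
    pad: u8,
}

/// 12 bytes: three u32 fields, no padding possible.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct RawPatternMeta {
    rule_id: u32,
    match_width: u32,
    seed_radius: u32,
}

// Compile-time size guards: a refactor that grows either struct
// breaks the build rather than silently doubling cache traffic.
const _: () = assert!(std::mem::size_of::<PairMeta>() == 4);
const _: () = assert!(std::mem::size_of::<RawPatternMeta>() == 12);

fn main() {
    // 223 rules' worth of metadata in one flat, sequential array:
    // ~2.6 KiB total, streamed linearly by the match callback.
    let metas = [RawPatternMeta { rule_id: 0, match_width: 40, seed_radius: 8 }; 223];
    assert_eq!(std::mem::size_of_val(&metas), 223 * 12);
}
```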
Code:
- `src/engine/hit_pool.rs:82-101` — `PairMeta` (4 bytes, `#[repr(C)]`)
- `src/engine/vectorscan_prefilter.rs:89-100` — `RawPatternMeta` (12 bytes, `#[repr(C)]`, compile-time size guard)
Contrast: Go interface values carry a 16-byte header (type pointer + data pointer). Each regexp.Regexp object includes multiple pointer-chased fields. A detector list of 223 interface values occupies ~3.5 KiB of headers alone — over 50 cache lines — before any pattern data is touched. scanner-rs packs the equivalent metadata into ~2.6 KiB (223 × 12 bytes) with guaranteed sequential layout.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| Backend stall % | 40.67% | 26.56% | 34.02% | 29.19% |
| Frontend stall % | 7.38% | 9.87% | 14.45% | 3.97% |
Design: scanner-rs uses a custom work-stealing executor with Chase-Lev deques (LIFO local push/pop, FIFO steal) and a tiered idle strategy (spin → yield → park). Local-first spawn maximizes cache locality; randomized stealing avoids correlated contention. Workers park with a 200μs timeout to balance responsiveness against CPU waste.
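The tiered idle strategy can be sketched with standard-library primitives. The spin and yield counts below are illustrative, not the tuned values in ExecutorConfig; only the 200µs park timeout comes from the description above:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// spin -> yield -> park, as described above. Spinning keeps the
/// worker's cache hot when work arrives quickly; parking with a short
/// timeout bounds the cost of a missed wakeup to 200µs.
fn idle_wait(work_ready: &AtomicBool, parker: &(Mutex<()>, Condvar)) {
    // Tier 1: spin; a stolen task often arrives within a few hundred ns.
    for _ in 0..64 {
        if work_ready.load(Ordering::Acquire) {
            return;
        }
        std::hint::spin_loop();
    }
    // Tier 2: yield the timeslice without fully sleeping.
    for _ in 0..4 {
        if work_ready.load(Ordering::Acquire) {
            return;
        }
        thread::yield_now();
    }
    // Tier 3: park on a condvar, re-checking under the lock first.
    let (lock, cvar) = parker;
    let guard = lock.lock().unwrap();
    if !work_ready.load(Ordering::Acquire) {
        let _ = cvar.wait_timeout(guard, Duration::from_micros(200)).unwrap();
    }
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let parker = Arc::new((Mutex::new(()), Condvar::new()));
    let (r, p) = (Arc::clone(&ready), Arc::clone(&parker));
    let producer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(1));
        r.store(true, Ordering::Release);
        p.1.notify_all();
    });
    // The idle worker cycles through the tiers until work shows up.
    while !ready.load(Ordering::Acquire) {
        idle_wait(&ready, &parker);
    }
    producer.join().unwrap();
}
```

Because the park has a timeout, a lost notification costs at most one 200µs sleep before the worker re-polls, which is the responsiveness/CPU-waste trade-off the design paragraph describes.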
Note on stall rates: scanner-rs shows a higher backend stall percentage because its Vectorscan DFA is memory-bandwidth bound — it streams data through the cache hierarchy at maximum throughput, causing the execution units to occasionally wait for data. However, in absolute stall cycles, scanner-rs has 2.2× fewer backend stalls than Kingfisher and 3.6× fewer than TruffleHog. The high stall rate reflects efficient memory-bound work, not wasted cycles.
Code:
- `src/scheduler/executor.rs:3-54` — Architecture diagram and design docs
- `src/scheduler/executor.rs:74-142` — `ExecutorConfig` with tuning knobs
- `src/scheduler/executor.rs:472-508` — `WorkerCtx`: per-worker deque, scratch, RNG for steal-victim selection
Contrast: Go's runtime scheduler uses a global run queue (GRQ) plus per-P local run queues. Goroutines can be migrated between OS threads by the runtime, causing unpredictable cache invalidation. The runtime also applies no backpressure to goroutine creation: a program that spawns one goroutine per work item accumulates goroutines idling on I/O, each consuming stack memory and scheduling overhead.
Measured:
| Metric | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|---|---|---|---|---|
| Total instructions | 411,789,892,684 | 1,426,719,985,424 | 1,696,903,305,992 | 10,690,140,886,959 |
| Ratio vs scanner-rs | 1.00× | 3.46× | 4.12× | 25.96× |
Design: scanner-rs uses an anchor-first scanning strategy. The Vectorscan prefilter identifies literal anchor hits in a single SIMD pass. Only the narrow windows around anchor hits are fed to the full regex engine for validation. Most of the input buffer is never touched by regex at all.
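A minimal two-stage sketch of anchor-first scanning, with a plain substring search standing in for the Vectorscan prefilter and a hand-written check standing in for full regex validation. The AKIA anchor, 20-byte window, and 16-character rule are all illustrative:

```rust
const ANCHOR: &str = "AKIA"; // literal prefix used by AWS access key IDs
const WINDOW: usize = 20;    // validation window from each anchor hit

/// Stage 1: find every anchor occurrence (stand-in for the SIMD
/// prefilter pass over the whole buffer).
fn prefilter(haystack: &str) -> Vec<usize> {
    haystack.match_indices(ANCHOR).map(|(i, _)| i).collect()
}

/// Stage 2: "validate" only the narrow window around a hit. Stand-in
/// rule: anchor followed by 16 uppercase alphanumerics. The expensive
/// validator never sees bytes outside a window.
fn validate(window: &str) -> bool {
    let rest = &window[ANCHOR.len()..];
    rest.len() >= 16
        && rest[..16]
            .bytes()
            .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit())
}

/// Prefilter seeds windows; validation runs only inside them. Input
/// regions with no anchor hit are never touched by stage 2 at all.
fn scan(haystack: &str) -> Vec<usize> {
    prefilter(haystack)
        .into_iter()
        .filter(|&i| {
            let end = (i + WINDOW).min(haystack.len());
            validate(&haystack[i..end])
        })
        .collect()
}

fn main() {
    let input = "x=AKIAIOSFODNN7EXAMPLE junk AKIAshort";
    // Two anchor hits, but only the first window validates.
    assert_eq!(prefilter(input).len(), 2);
    assert_eq!(scan(input), vec![2]);
}
```

The instruction-count saving comes from the asymmetry: stage 1 is a single cheap pass, and stage 2's cost scales with the number of anchor hits rather than with input size times rule count.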
Code:
- `src/engine/vectorscan_prefilter.rs` — Prefilter DB compilation and match callback
- `src/engine/core.rs:30-44` — Scan algorithm: prefilter seeds windows, regex only runs in hit windows
- `src/engine/buffer_scan.rs` — Window validation: regex runs only within seeded window bounds
Contrast: TruffleHog runs each detector's regex against every matched span, performing O(rules × spans) regex work. Gitleaks iterates all rules sequentially against the full input. Both approaches execute orders of magnitude more instructions because they cannot skip non-matching regions.
Key insight: Per-instruction rates (L1D miss rate, branch miss rate) can be misleading when comparing scanners that execute vastly different amounts of work. Gitleaks executes 26× more instructions than scanner-rs on the same input — a simple loop pattern yields good per-instruction rates but terrible absolute performance. The table below uses absolute counts which directly determine wall-clock time.
| # | Design Decision | Key Metric | scanner-rs | Closest Competitor | Advantage |
|---|---|---|---|---|---|
| 1 | Vectorscan multi-pattern DFA | Total cycles | 157,619,084,107 | Kingfisher: 532,509,784,422 | 3.4× fewer |
| 2 | Anchor-first scanning | Total instructions | 411,789,892,684 | Kingfisher: 1,426,719,985,424 | 3.5× fewer |
| 3 | Deterministic DFA transitions | Branch misses | 1,870,945,056 | TruffleHog: 7,879,836,883 | 4.2× fewer |
| 4 | Per-worker scratch (no sharing) | L2 refills | 328,843,938 | Kingfisher: 497,284,829 | 1.5× fewer |
| 5 | Compact packed metadata | L1D misses | 1,209,415,888 | Kingfisher: 2,447,094,121 | 2.0× fewer |
| 6 | Pre-allocated fixed-capacity pools | dTLB misses | 826,966,710 | Kingfisher: 1,533,221,981 | 1.9× fewer |
| 7 | Work-stealing + cache locality | Backend stall cycles | 64,097,407,942 | Kingfisher: 141,440,571,268 | 2.2× fewer |
| 8 | Cache-line aligned atomics | L2 writebacks | 713,524,135 | Kingfisher: 1,517,956,160 | 2.1× fewer |
Report generated by tools/bench-compare/analyze_perf.py from perf stat measurements on ARM Graviton3.