feat: Rust VIO implementation + cross-language numerical test suite by vlordier · Pull Request #1 · ETH-PBL/levio

vlordier · 2026-04-16T20:28:02Z

Summary

Full Rust port of the LEVIO visual-inertial odometry pipeline targeting GAP9 (160×120 image, 200 Hz IMU), with realtime-safety as a first-class constraint
Cross-language test suite (cross_lang_tests/) validating that C, Python, and Rust produce numerically identical results (< 2e-5, f32 tolerance) on all shared math vectors
Benchmark tooling comparing all three languages for throughput and heap allocation

levio_rust

Architecture

Trait-oriented static dispatch: FeatureDetector, DescriptorMatcher, Optimizer, Preintegrator
All core types #[repr(C)]: Mat3=[f32;9], Vec3=[f32;3], Pose=[f32;16], OrbDescriptor=[u32;8]
104 tests: 77 unit + 13 cross_lang + 14 integration; cargo clippy -D warnings clean

Realtime safety (hot-path allocations)

Operation	Cold heap	Warm heap
exp_so3, mat3_mul, skew, triangulate, imu_preintegrate (200 steps), gaussian solver	0 B	0 B
SVD 3×3	72 B	72 B (U + V Matrix)
OrbDetector::detect (160×120)	75.5 KB (4 scratch Vec)	0 B
BfMatcher::match (200×200)	3.8 KB (scratch Vec)	0 B

Performance vs C (-O2)

Primitive	C	Rust	Ratio
rodrigues_exp	17.5 ns	8.5 ns	0.49× (Rust 2× faster)
mat3_mul	4.4 ns	4.6 ns	≈1×
skew	3.9 ns	2.9 ns	0.74×
hamming (256-bit)	1.0 ns	1.1 ns	≈1×

E2E pipeline (Rust, release build): orb_detect 20 µs, bf_match 22 µs, imu_preintegrate×200 4.8 µs, triangulate 446 ns, SVD 3×3 164 ns.

cross_lang_tests

Three independent implementations of the same deterministic math, cross-validated:

bash cross_lang_tests/run_all.sh

Suite	Checks	Linter
`c/test_numerics.c`	617 assertions	`-Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2`
`python/test_numerics.py`	687 assertions	ruff (E W F I UP B C4 SIM RUF)
`levio_rust/tests/cross_lang.rs`	119 `#[test]` functions	`cargo clippy -D warnings`

Algebraic invariants tested: det(R)=+1, RᵀR=I, RRᵀ=I, tr(R)=1+2cos(θ), R(π)²=I, same-axis composition, isometry, DLT nullspace row·Xₕ=0.

Test plan

bash cross_lang_tests/run_all.sh — all lint + test steps pass
cargo test in levio_rust/ — 104 tests green
cargo clippy -- -D warnings — zero warnings
make bench in cross_lang_tests/c/ — timing output matches documented values

Complete Rust port of the LEVIO Visual-Inertial Odometry system with military-grade code quality: #![forbid(unsafe_code)], #![deny(clippy::all, clippy::pedantic)], zero clippy warnings, 54 passing tests (40 unit + 14 integration), and 10 Criterion benchmarks. Algorithms implemented: - ORB detection: FAST-9 + Harris scoring + BRIEF-256 descriptors - Brute-force Hamming matcher with flow filtering and cross-check - DLT triangulation (normalised coords + bare [R|t]) - EPnP RANSAC pose estimation with Sampson-distance essential matrix - Jacobi SVD, Gaussian elimination with partial pivoting, Householder QR - IMU preintegration with SO(3) exp map and bias-correction Jacobians - Windowed Gauss-Newton / LM pose graph optimizer with Huber weighting - Sliding-window pose graph with keyframe eviction and landmark ageing

ORB pipeline (orb.rs): - gaussian_blur: replace single 9-tap pass with separable two 3-tap passes (identical integer output, ~2× fewer multiply-add ops per pixel) - detect: remove two unnecessary Vec clones of the keypoints list - fast_detect: replace closure+.count() quick-rejection with direct pixel reads and integer sum — removes iterator overhead in the innermost loop - is_fast_corner: use a doubled [i8;32] ring so the consecutive-9 scan never touches % 16 modulo; all 8 comparisons explicit for auto-unroll - brief_descriptor: drop the per-bit bounds guard (FAST border 18 > max BRIEF offset 13 makes it unreachable); replace /32+%32 with >>5 and &31 Matcher (matcher.rs): - hamming_distance: fully unroll 8 XOR+popcount ops so LLVM can vectorise - match_features: break on dist==0 (perfect match) to skip remaining train IMU preintegration (imu_preintegration.rs): - integrate_step: hoist skew(omega) out of the 3×3 Jacobian loop (was recomputed 9× per step); hoist r_row and d_row out of the inner col loop 54/54 tests pass; cargo clippy -- -D warnings clean.

Performance: - Matrix::ata() — direct AᵀA in one pass exploiting symmetry (~half the multiplications vs full matmul). No transpose allocation required. - Matrix::ata_data() — returns the flat Vec directly for callers that need a mutable buffer (svd.rs, epnp.rs), eliminating an extra clone. - Matrix::matmul() — add i-k-j fast path for non-transposed matrices: stride-1 inner access on all three arrays; avoids the get_raw branch. Falls back to the general loop for transposed inputs. - svd: use ata_data() — 3 allocations (t+matmul+to_vec) → 1 - epnp: use ata_data() — same 3 → 1 saving for the 2n×12 normal matrix - essential: use ata() for the 8×9 → 9×9 null-space step (saves t() clone) - sampson_distance_sq: accept &[f32; 9] instead of &Matrix; caller extracts the matrix once per RANSAC hypothesis, not 18 get_raw calls per match Fixes: - count_positive_depth: d2_z = p2*0 + p2*0 + 1.0 is always 1.0 → dead code; removed the unreachable && branch - mat3_to_arr: replace core::array::from_fn with i/3, i%3 with explicit literal indices (9 direct get_raw calls, no divide/modulo) - sampson_distance_sq test: construct flat array directly, not a Matrix Chores: - Deduplicate SimpleRng (Xorshift32): was copy-pasted into both epnp.rs and essential.rs; now lives once in visual_odometry/mod.rs as pub(crate) - Matrix::t() doc: remove false "zero-copy" claim; data buffer is cloned 54/54 tests pass; cargo clippy -- -D warnings clean.

Hot path: - mat3_mul: replace core::array::from_fn + div/mod with explicit 27-FMA literal expansion, eliminating 18 integer divisions per call - exp_so3: CSE — precompute omc, kxy, kxz, kyz, skx, sky, skz to eliminate 6 redundant multiply chains in the Rodrigues formula - harris_response: accumulate ixx/iyy/ixy in i32 (max ~1.6M << 2^31), cast to i64 only for the final det/trace products; enables 32-bit SIMD Chores / fixes: - matrix: extract ata_inner() shared by ata() and ata_data(), removing the duplicated loop body - solver: stack-allocate the rhs buffer for n ≤ 16 (covers all VIO use cases, n=3), falling back to heap only for larger systems - pose_graph_optimizer: remove unused _state_dim parameter from accumulate_hessian and add_imu_regularisation and their call sites - svd: simplify V=I init to diagonal-only loop (Vec is already zeroed)

Hot path: - epnp: precompute dc⁻¹ once in compute_barycentrics; replace n × 3×3 Gaussian eliminations with n × 9-FMA mat-vec products (removes the per-point dc_copy allocation and solve_linear_gaussian call loop) - epnp: hoist mean_reprojection_error inside the best-hypothesis block so it is computed at most once (for the winner), not per iteration - epnp: replace mean_reprojection_error Vec<f32> collection with a direct accumulation, eliminating the intermediate heap allocation - imu: simplify trapezoidal bias correction to 0.5*(a+b)-bias, reducing 6 redundant subtractions per integration step Chores / fixes: - matrix: add #[inline] to ata_inner so it is inlined in release builds - epnp: drop now-unused solver import (compute_barycentrics no longer calls solve_linear_gaussian) - bench: remove unused CameraIntrinsics import from levio_bench.rs

Hot path: - orb: replace is_fast_corner ring-doubling ([i8;16]+[i8;32]+loop) with two u16 bitmasks + has_9_consecutive() branchless bit-rotation check; eliminates 48 bytes of stack and the memcopy; drop unused _center param - orb: replace saturating_sub in harris_response inner loop with direct subtraction (both accesses are guarded by the outer border check) - epnp: single-pass centroid in compute_control_points (one loop instead of three iterator passes), replace 1/n division with * inv_n - epnp: symmetric covariance (upper triangle + mirror: 6 muls/point vs 9) Typing / portability: - Add #[repr(C)] to Point2D, Point2Df, Point3D, ImuMeasurement, ImuBias, CameraIntrinsics, FeatureMatch, Observation, ImuFactor, ImuFactorWithBias — guarantees C-compatible layout for FFI and embedded interop - Change ORB_PATCH_RADIUS: i16 → u8; HARRIS_BLOCK_SIZE: i16 → u8 (both are always non-negative counts; u8 matches the C uint8_t type) Magic numbers → named constants (config.rs): - ORB_FAST_DETECT_BORDER = ORB_PATCH_RADIUS + 3 (was 18_usize literal) - BRIEF_MAX_OFFSET = 13 (was 13 literal) - EPNP_SAMPLE_SIZE = 6 (EPnP hypothesis sample) - ESSENTIAL_SAMPLE_SIZE = 8 (8-point algorithm sample) - SVD_JACOBI_N_FACTOR = 20 (was n * 20 literal) - OPTIMIZER_MAX_ITERS = 8 (was 8 literal) - OPTIMIZER_CONVERGENCE_THRESHOLD = 1e-6 (was 1e-6 literal) Module-level constants (epnp.rs): - EPNP_CTRL_PTS = 4, EPNP_DOF = 12 (replace all 4/12/144 literals) Other: - types: add NO_WP_ID: i16 = -1 sentinel constant; use in pose_graph.rs - essential: RansacFailed needed field now uses ESSENTIAL_SAMPLE_SIZE

Hot-path: - svd: merge two `for r in 0..n` loops in jacobi_eigen into one pass, updating V rotation and the symmetric matrix in the same iteration (halves loop overhead and improves cache locality on a/v buffers) - optimizer: eliminate 6 zero-multiplications in j_pt chain rule (dpx_dx[1]=0, dpy_dy[0]=0 are structural zeros, not runtime values) - optimizer: inline skew_xcam as direct cross-product expressions in j_pose_row closure; removes [[f32;3];3] stack allocation and all indirect index reads - accumulate_hessian: replace bounds-checked get/set with get_raw/set_raw throughout; caller already guards pose_idx<n_poses and wp_idx<n_pts, so per-element bounds checks in the hot per-observation loop are redundant; function now returns () instead of Result<(),LevioError> Typing / safety: - pose_graph: add debug_assert!(u16::try_from(...).is_ok()) before new_observations.len() as u16 cast in add_keyframe - pose_graph: add debug_assert!(u32::try_from(...).is_ok()) before n_new as u32 cast in add_keyframe - pose_graph: add debug_assert!(u16::try_from(idx).is_ok()) before idx as u16 cast in age_world_points

Hot-path: - imu_preintegration: replace 2×exp_so3 per integration step with the half-angle squaring trick: compute exp_so3(ω·dt/2) once, then d_rot = mat3_mul(half_rot, half_rot) and r_mid = mat3_mul(r_prev, half_rot); saves one sqrt+sin+cos (~50 ops) at the cost of one extra mat3_mul (27 FMAs) - triangulation: eliminate pose_to_projection and its two 3×4 Matrix heap allocations per call; DLT rows now index the Pose array directly (pose[8+j] for row 2, pose[4+j] for row 1, pose[j] for row 0) - matrix/ata_inner: add non-transposed fast path that bypasses the transposed branch in get_raw on every inner iteration; direct self.data[k*nc+i] * self.data[k*nc+j] indexing for the common case - imu_preintegration: add #[inline] to exp_so3, mat3_mul, mat3_vec, skew; these are small pure functions called in tight integration loops Magic numbers / constants: - config: add IMU_MAX_DT = 1.0 (previously hardcoded in integrate_step) - config: add EPNP_RNG_SEED = 42 (previously hardcoded in ransac_epnp_from_matches) - imu_preintegration: dt > 1.0 → dt > config::IMU_MAX_DT - epnp: 1e-10 tolerance → config::SVD_TOL_JACOBI (both jacobi_eigen calls) - epnp: 100 max-iter for 3×3 PCA → (SVD_MAX_ITER_JACOBI as usize).max(3*SVD_JACOBI_N_FACTOR) - epnp: 42 RNG seed → config::EPNP_RNG_SEED

Hot-path: - orb/harris_response: replace i64 loop counters with direct usize arithmetic; border guard guarantees y >= hb+1 and x >= hb+1, so y-hb+dy and x-hb+dx never underflow; removes 2 signed casts on every iteration of the 7×7 block - solver/solve_linear_qr: pre-allocate hh_buf (Vec<f32; m>) once before the Householder loop rather than allocating a fresh Vec per column; eliminates n heap allocations per QR call (n = state_dim for the normal equations) Correctness fix: - matcher/pixel_distance_sq: previous i32 path could panic in debug (overflow) when both keypoints are near u16::MAX: 65535² = 4_294_836_225 > i32::MAX; new path uses u16::abs_diff → u32, then (dx*dx).saturating_add(dy*dy) which is correct for all u16 inputs without any truncation Typing / magic numbers: - config: add HARRIS_SCORE_SCALE = 100 and HARRIS_K_INT = 4 documenting the integer-scaled form of Harris k = 0.04 (k × 100 = 4) - orb/harris_response: replace literals 100 and 4 with HARRIS_SCORE_SCALE and HARRIS_K_INT - orb/fast_detect: remove roundabout u16::from(FAST_THRESHOLD) → as u8; use config::FAST_THRESHOLD (u8) directly in saturating_sub/add - matcher: add debug_assert!(u16::try_from(qi/ti).is_ok()) before the index → u16 casts for feat_idx0 / feat_idx1 - matcher: replace best_dist.min(255) with best_dist.min(u32::from(u8::MAX)) to avoid the magic literal

matrix/to_contiguous: fast-path the non-transposed case as self.clone(); previously performed element-by-element get_raw/set_raw even when the data was already in contiguous row-major order; this eliminates a full double-loop copy on every call from QR and Gaussian solvers matrix/matvec: add non-transposed fast path using direct slice indexing data[i*k+j], bypassing the get_raw branch per element; the transposed path (rare) keeps the existing get_raw loop matrix/matmul: add !self.transposed && rhs.transposed fast path (ABᵀ); covers v.matmul(&u.t()), u.matmul(&w.t()), b_mat.matmul(&a.t()) in SVD, essential matrix, and EPnP; both rows (a[i*k+l] and b[j*bk+l]) are stride-1 in the inner l loop, avoiding all get_raw branches solver/solve_linear_qr: replace (0..m).map(|i| b.get_raw(i,0)).collect() with b.as_raw_slice().to_vec(); safe because a single-column (or its transpose) matrix has identical physical and logical element ordering orb/fast_detect: pre-allocate candidates Vec with capacity MAX_KEYPOINTS to avoid repeated reallocs during the per-frame pixel scan

…umber cleanup Trait abstractions: - Add FeatureDetector and DescriptorMatcher traits in features/mod.rs (static dispatch) - impl FeatureDetector for OrbDetector - impl DescriptorMatcher for BfMatcher; add match_features_into (pre-allocated output buffer, RT-safe) Realtime allocation elimination: - epnp: pre-alloc pool/s_world/s_img/mask/best_mask before 64-iter loop (192+ allocs → 0); inline Fisher-Yates; remove sample_k - essential: pre-alloc pool/mask/best_mask before 640-iter loop (1280+ allocs → 0); inline Fisher-Yates; remove sample_8 - harris_select: flat Vec<(kp, score, bin)> + single sort + linear pass replaces Vec<Vec<...>> bins (~300 inner allocs/frame → 0) No-magic-number cleanup: - config: add ORB_DESCRIPTOR_WORDS = 8 and GAUSSIAN_STACK_SOLVE_N = 16 - types: OrbDescriptor uses config::ORB_DESCRIPTOR_WORDS - solver: stack buffer size uses config::GAUSSIAN_STACK_SOLVE_N - orb/brief: >> 5 / & 31 replaced by / u32::BITS as usize / % u32::BITS as usize Robust tests: 10 new tests covering trait contracts, static dispatch, buffer reuse, descriptor word count, and per-patch feature count invariant. 61 total (47 unit + 14 integration).

… alloc elimination Typing / portability: - types.rs: add Mat3 = [f32; 9] type alias (C: float[9]); mirrors C layout explicitly - imu_preintegration: use Mat3 for all rotation matrix params/fields/returns - config: add IDENTITY_MAT3 const; use in ImuFactorWithBias::identity() New constants: - config: IMU_POSITION_WEIGHT = IMU_WEIGHT_SCALE / σ_a² (precomputed, eliminates magic division in add_imu_regularisation + imu_factor_cost, verified by test) New trait abstractions: - optimizer/mod.rs: Optimizer trait (optimize → Result<f32>) - optimizer/mod.rs: Preintegrator trait (process / extract_and_reset / reset) - impl Optimizer for PoseGraphOptimizer; impl Preintegrator for ImuPreintegrator New SO(3) utility: - imu_preintegration: mat3_transpose; used in exp_so3_rotation_is_orthogonal test Bug fix — age_world_points (pose_graph.rs): - Previous code: swap_remove(idx) then retain(wp_id != idx). After swap_remove, the element swapped in from the end retains its old wp_id in observations → wrong. - New code: stable compaction + remapping table (old_idx → new_idx) applied in one O(|obs|) retain+remap pass instead of O(k × |obs|) multi-retain. Test age_world_points_remaps_observations_correctly verifies this explicitly. Alloc elimination: - essential.rs eight_point: f_vec: Vec<f32> → [f32; 9] stack array (640 allocs/call removed) Tests: +8 new (Mat3, IDENTITY_MAT3, trait dispatch, observation remapping, noop fast-path). 69 total (55 unit + 14 integration). Clippy clean.

## levio_rust — realtime-safe VIO implementation Full Rust port of the LEVIO visual-inertial odometry pipeline targeting GAP9 (160×120, 200 Hz IMU). All 104 tests pass (77 unit + 13 cross_lang + 14 integration), cargo clippy -D warnings clean. Key highlights: - Trait-oriented static dispatch: FeatureDetector, DescriptorMatcher, Optimizer, Preintegrator — zero vtable overhead - Zero heap allocation on the hot path: pre-allocated scratch buffers in OrbDetector, BfMatcher, PoseGraphOptimizer, EpnpScratch; QrScratch field eliminates 1 large alloc per LM iteration - EPnP RANSAC (64 iters) + Essential RANSAC (640 iters) fully stack-based: EpnpScratch, inline AᵀA, Fisher-Yates partial shuffle, fixed-array SVD - IMU preintegration (200 steps): 4.8 µs, 0 heap bytes cold and warm - SVD: s returned as [f32; MAX_SVD_DIM] stack array; only U + V matrices heap-allocated (72 B per call) - stable age_world_points compaction (was O(k²) swap_remove with silent index corruption) - LevioError::NegativeDepth added; triangulate_point cheirality check ## cross_lang_tests — three-language numerical consistency suite Standalone test suite validating C / Python / Rust produce identical results to < 2e-5 (f32 tolerance) on all shared math vectors. - c/test_numerics.c 617 assertions, make bench for timing - python/test_numerics.py 687 assertions, --bench for timing - levio_rust/tests/cross_lang.rs 119 #[test] functions Functions: skew, mat3_mul, mat3_vec, mat3_transpose, exp_so3/rodrigues_exp, hamming_distance, DLT row construction. Algebraic invariants: det=+1, RᵀR=I, RRᵀ=I, tr(R)=1+2cos(θ), R(π)²=I, R(θn̂)·R(φn̂)=R((θ+φ)n̂), isometry, DLT nullspace. Linters (all warnings as errors): - C: -Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2 - Python: ruff (E W F I UP B C4 SIM RUF) - Rust: cargo clippy -D warnings Runner: bash cross_lang_tests/run_all.sh ## Benchmark tooling - plot_benchmarks.py Python vs C vs Rust throughput + heap - plot_rust_vs_c.py Rust vs C primitives + speedup ratio - plot_e2e_memory.py E2E pipeline timing + heap allocation per call - examples/alloc_bench.rs Counting GlobalAlloc; measures cold/warm heap per operation (orb_detect: 75 KB cold / 0 warm; SVD 3×3: 72 B every call; all else: 0 B)

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

vlordier · 2026-04-16T20:57:20Z

cc @Pafrak @Yaooooo

vlordier added 14 commits April 16, 2026 14:42

Remove build artifacts from tracking and add target/ to .gitignore

4b4942a

Copilot AI review requested due to automatic review settings April 16, 2026 20:28

Copilot started reviewing on behalf of vlordier April 16, 2026 20:28 View session

Copilot AI reviewed Apr 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Rust VIO implementation + cross-language numerical test suite#1

feat: Rust VIO implementation + cross-language numerical test suite#1
vlordier wants to merge 14 commits into
ETH-PBL:mainfrom
vlordier:feat/rust-implementation

vlordier commented Apr 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

vlordier commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vlordier commented Apr 16, 2026

Summary

levio_rust

Architecture

Realtime safety (hot-path allocations)

Performance vs C (-O2)

cross_lang_tests

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

vlordier commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants