Skip to content

feat: Rust VIO implementation + cross-language numerical test suite#1

Open
vlordier wants to merge 14 commits into
ETH-PBL:mainfrom
vlordier:feat/rust-implementation
Open

feat: Rust VIO implementation + cross-language numerical test suite#1
vlordier wants to merge 14 commits into
ETH-PBL:mainfrom
vlordier:feat/rust-implementation

Conversation

@vlordier
Copy link
Copy Markdown

Summary

  • Full Rust port of the LEVIO visual-inertial odometry pipeline targeting GAP9 (160×120 image, 200 Hz IMU), with realtime-safety as a first-class constraint
  • Cross-language test suite (cross_lang_tests/) validating that C, Python, and Rust produce numerically identical results (< 2e-5, f32 tolerance) on all shared math vectors
  • Benchmark tooling comparing all three languages for throughput and heap allocation

levio_rust

Architecture

  • Trait-oriented static dispatch: FeatureDetector, DescriptorMatcher, Optimizer, Preintegrator
  • All core types #[repr(C)]: Mat3=[f32;9], Vec3=[f32;3], Pose=[f32;16], OrbDescriptor=[u32;8]
  • 104 tests: 77 unit + 13 cross_lang + 14 integration; cargo clippy -D warnings clean

Realtime safety (hot-path allocations)

Operation Cold heap Warm heap
exp_so3, mat3_mul, skew, triangulate, imu_preintegrate (200 steps), gaussian solver 0 B 0 B
SVD 3×3 72 B 72 B (U + V Matrix)
OrbDetector::detect (160×120) 75.5 KB (4 scratch Vec) 0 B
BfMatcher::match (200×200) 3.8 KB (scratch Vec) 0 B

Performance vs C (-O2)

Primitive C Rust Ratio
rodrigues_exp 17.5 ns 8.5 ns 0.49× (Rust 2× faster)
mat3_mul 4.4 ns 4.6 ns ≈1×
skew 3.9 ns 2.9 ns 0.74×
hamming (256-bit) 1.0 ns 1.1 ns ≈1×

E2E pipeline (Rust, release build): orb_detect 20 µs, bf_match 22 µs, imu_preintegrate×200 4.8 µs, triangulate 446 ns, SVD 3×3 164 ns.

cross_lang_tests

Three independent implementations of the same deterministic math, cross-validated:

bash cross_lang_tests/run_all.sh
Suite Checks Linter
c/test_numerics.c 617 assertions -Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2
python/test_numerics.py 687 assertions ruff (E W F I UP B C4 SIM RUF)
levio_rust/tests/cross_lang.rs 119 #[test] functions cargo clippy -D warnings

Algebraic invariants tested: det(R)=+1, RᵀR=I, RRᵀ=I, tr(R)=1+2cos(θ), R(π)²=I, same-axis composition, isometry, DLT nullspace row·Xₕ=0.

Test plan

  • bash cross_lang_tests/run_all.sh — all lint + test steps pass
  • cargo test in levio_rust/ — 104 tests green
  • cargo clippy -- -D warnings — zero warnings
  • make bench in cross_lang_tests/c/ — timing output matches documented values

vlordier added 14 commits April 16, 2026 14:42
Complete Rust port of the LEVIO Visual-Inertial Odometry system with
military-grade code quality: #![forbid(unsafe_code)], #![deny(clippy::all,
clippy::pedantic)], zero clippy warnings, 54 passing tests (40 unit +
14 integration), and 10 Criterion benchmarks.

Algorithms implemented:
- ORB detection: FAST-9 + Harris scoring + BRIEF-256 descriptors
- Brute-force Hamming matcher with flow filtering and cross-check
- DLT triangulation (normalised coords + bare [R|t])
- EPnP RANSAC pose estimation with Sampson-distance essential matrix
- Jacobi SVD, Gaussian elimination with partial pivoting, Householder QR
- IMU preintegration with SO(3) exp map and bias-correction Jacobians
- Windowed Gauss-Newton / LM pose graph optimizer with Huber weighting
- Sliding-window pose graph with keyframe eviction and landmark ageing
ORB pipeline (orb.rs):
- gaussian_blur: replace single 9-tap pass with separable two 3-tap passes
  (identical integer output, ~2× fewer multiply-add ops per pixel)
- detect: remove two unnecessary Vec clones of the keypoints list
- fast_detect: replace closure+.count() quick-rejection with direct pixel
  reads and integer sum — removes iterator overhead in the innermost loop
- is_fast_corner: use a doubled [i8;32] ring so the consecutive-9 scan
  never touches % 16 modulo; all 8 comparisons explicit for auto-unroll
- brief_descriptor: drop the per-bit bounds guard (FAST border 18 > max
  BRIEF offset 13 makes it unreachable); replace /32+%32 with >>5 and &31

Matcher (matcher.rs):
- hamming_distance: fully unroll 8 XOR+popcount ops so LLVM can vectorise
- match_features: break on dist==0 (perfect match) to skip remaining train

IMU preintegration (imu_preintegration.rs):
- integrate_step: hoist skew(omega) out of the 3×3 Jacobian loop (was
  recomputed 9× per step); hoist r_row and d_row out of the inner col loop

54/54 tests pass; cargo clippy -- -D warnings clean.
Performance:
- Matrix::ata() — direct AᵀA in one pass exploiting symmetry (~half the
  multiplications vs full matmul). No transpose allocation required.
- Matrix::ata_data() — returns the flat Vec directly for callers that need a
  mutable buffer (svd.rs, epnp.rs), eliminating an extra clone.
- Matrix::matmul() — add i-k-j fast path for non-transposed matrices:
  stride-1 inner access on all three arrays; avoids the get_raw branch.
  Falls back to the general loop for transposed inputs.
- svd: use ata_data() — 3 allocations (t+matmul+to_vec) → 1
- epnp: use ata_data() — same 3 → 1 saving for the 2n×12 normal matrix
- essential: use ata() for the 8×9 → 9×9 null-space step (saves t() clone)
- sampson_distance_sq: accept &[f32; 9] instead of &Matrix; caller extracts
  the matrix once per RANSAC hypothesis, not 18 get_raw calls per match

Fixes:
- count_positive_depth: d2_z = p2*0 + p2*0 + 1.0 is always 1.0 → dead code;
  removed the unreachable && branch
- mat3_to_arr: replace core::array::from_fn with i/3, i%3 with explicit
  literal indices (9 direct get_raw calls, no divide/modulo)
- sampson_distance_sq test: construct flat array directly, not a Matrix

Chores:
- Deduplicate SimpleRng (Xorshift32): was copy-pasted into both epnp.rs and
  essential.rs; now lives once in visual_odometry/mod.rs as pub(crate)
- Matrix::t() doc: remove false "zero-copy" claim; data buffer is cloned

54/54 tests pass; cargo clippy -- -D warnings clean.
Hot path:
- mat3_mul: replace core::array::from_fn + div/mod with explicit 27-FMA
  literal expansion, eliminating 18 integer divisions per call
- exp_so3: CSE — precompute omc, kxy, kxz, kyz, skx, sky, skz to
  eliminate 6 redundant multiply chains in the Rodrigues formula
- harris_response: accumulate ixx/iyy/ixy in i32 (max ~1.6M << 2^31),
  cast to i64 only for the final det/trace products; enables 32-bit SIMD

Chores / fixes:
- matrix: extract ata_inner() shared by ata() and ata_data(), removing
  the duplicated loop body
- solver: stack-allocate the rhs buffer for n ≤ 16 (covers all VIO use
  cases, n=3), falling back to heap only for larger systems
- pose_graph_optimizer: remove unused _state_dim parameter from
  accumulate_hessian and add_imu_regularisation and their call sites
- svd: simplify V=I init to diagonal-only loop (Vec is already zeroed)
Hot path:
- epnp: precompute dc⁻¹ once in compute_barycentrics; replace n × 3×3
  Gaussian eliminations with n × 9-FMA mat-vec products (removes the
  per-point dc_copy allocation and solve_linear_gaussian call loop)
- epnp: hoist mean_reprojection_error inside the best-hypothesis block
  so it is computed at most once (for the winner), not per iteration
- epnp: replace mean_reprojection_error Vec<f32> collection with a
  direct accumulation, eliminating the intermediate heap allocation
- imu: simplify trapezoidal bias correction to 0.5*(a+b)-bias,
  reducing 6 redundant subtractions per integration step

Chores / fixes:
- matrix: add #[inline] to ata_inner so it is inlined in release builds
- epnp: drop now-unused solver import (compute_barycentrics no longer
  calls solve_linear_gaussian)
- bench: remove unused CameraIntrinsics import from levio_bench.rs
Hot path:
- orb: replace is_fast_corner ring-doubling ([i8;16]+[i8;32]+loop) with
  two u16 bitmasks + has_9_consecutive() branchless bit-rotation check;
  eliminates 48 bytes of stack and the memcopy; drop unused _center param
- orb: replace saturating_sub in harris_response inner loop with direct
  subtraction (both accesses are guarded by the outer border check)
- epnp: single-pass centroid in compute_control_points (one loop instead
  of three iterator passes), replace 1/n division with * inv_n
- epnp: symmetric covariance (upper triangle + mirror: 6 muls/point vs 9)

Typing / portability:
- Add #[repr(C)] to Point2D, Point2Df, Point3D, ImuMeasurement, ImuBias,
  CameraIntrinsics, FeatureMatch, Observation, ImuFactor, ImuFactorWithBias
  — guarantees C-compatible layout for FFI and embedded interop
- Change ORB_PATCH_RADIUS: i16 → u8; HARRIS_BLOCK_SIZE: i16 → u8
  (both are always non-negative counts; u8 matches the C uint8_t type)

Magic numbers → named constants (config.rs):
- ORB_FAST_DETECT_BORDER = ORB_PATCH_RADIUS + 3  (was 18_usize literal)
- BRIEF_MAX_OFFSET = 13                          (was 13 literal)
- EPNP_SAMPLE_SIZE = 6                           (EPnP hypothesis sample)
- ESSENTIAL_SAMPLE_SIZE = 8                      (8-point algorithm sample)
- SVD_JACOBI_N_FACTOR = 20                       (was n * 20 literal)
- OPTIMIZER_MAX_ITERS = 8                        (was 8 literal)
- OPTIMIZER_CONVERGENCE_THRESHOLD = 1e-6         (was 1e-6 literal)

Module-level constants (epnp.rs):
- EPNP_CTRL_PTS = 4, EPNP_DOF = 12              (replace all 4/12/144 literals)

Other:
- types: add NO_WP_ID: i16 = -1 sentinel constant; use in pose_graph.rs
- essential: RansacFailed needed field now uses ESSENTIAL_SAMPLE_SIZE
Hot-path:
- svd: merge two `for r in 0..n` loops in jacobi_eigen into one pass,
  updating V rotation and the symmetric matrix in the same iteration
  (halves loop overhead and improves cache locality on a/v buffers)
- optimizer: eliminate 6 zero-multiplications in j_pt chain rule
  (dpx_dx[1]=0, dpy_dy[0]=0 are structural zeros, not runtime values)
- optimizer: inline skew_xcam as direct cross-product expressions in
  j_pose_row closure; removes [[f32;3];3] stack allocation and all
  indirect index reads
- accumulate_hessian: replace bounds-checked get/set with get_raw/set_raw
  throughout; caller already guards pose_idx<n_poses and wp_idx<n_pts,
  so per-element bounds checks in the hot per-observation loop are redundant;
  function now returns () instead of Result<(),LevioError>

Typing / safety:
- pose_graph: add debug_assert!(u16::try_from(...).is_ok()) before
  new_observations.len() as u16 cast in add_keyframe
- pose_graph: add debug_assert!(u32::try_from(...).is_ok()) before
  n_new as u32 cast in add_keyframe
- pose_graph: add debug_assert!(u16::try_from(idx).is_ok()) before
  idx as u16 cast in age_world_points
Hot-path:
- imu_preintegration: replace 2×exp_so3 per integration step with the
  half-angle squaring trick: compute exp_so3(ω·dt/2) once, then
  d_rot = mat3_mul(half_rot, half_rot) and r_mid = mat3_mul(r_prev, half_rot);
  saves one sqrt+sin+cos (~50 ops) at the cost of one extra mat3_mul (27 FMAs)
- triangulation: eliminate pose_to_projection and its two 3×4 Matrix heap
  allocations per call; DLT rows now index the Pose array directly
  (pose[8+j] for row 2, pose[4+j] for row 1, pose[j] for row 0)
- matrix/ata_inner: add non-transposed fast path that bypasses the
  transposed branch in get_raw on every inner iteration; direct
  self.data[k*nc+i] * self.data[k*nc+j] indexing for the common case
- imu_preintegration: add #[inline] to exp_so3, mat3_mul, mat3_vec, skew;
  these are small pure functions called in tight integration loops

Magic numbers / constants:
- config: add IMU_MAX_DT = 1.0  (previously hardcoded in integrate_step)
- config: add EPNP_RNG_SEED = 42  (previously hardcoded in ransac_epnp_from_matches)
- imu_preintegration: dt > 1.0  →  dt > config::IMU_MAX_DT
- epnp: 1e-10 tolerance → config::SVD_TOL_JACOBI (both jacobi_eigen calls)
- epnp: 100 max-iter for 3×3 PCA → (SVD_MAX_ITER_JACOBI as usize).max(3*SVD_JACOBI_N_FACTOR)
- epnp: 42 RNG seed → config::EPNP_RNG_SEED
Hot-path:
- orb/harris_response: replace i64 loop counters with direct usize arithmetic;
  border guard guarantees y >= hb+1 and x >= hb+1, so y-hb+dy and x-hb+dx
  never underflow; removes 2 signed casts on every iteration of the 7×7 block
- solver/solve_linear_qr: pre-allocate hh_buf (Vec<f32; m>) once before the
  Householder loop rather than allocating a fresh Vec per column; eliminates
  n heap allocations per QR call (n = state_dim for the normal equations)

Correctness fix:
- matcher/pixel_distance_sq: previous i32 path could panic in debug (overflow)
  when both keypoints are near u16::MAX: 65535² = 4_294_836_225 > i32::MAX;
  new path uses u16::abs_diff → u32, then (dx*dx).saturating_add(dy*dy)
  which is correct for all u16 inputs without any truncation

Typing / magic numbers:
- config: add HARRIS_SCORE_SCALE = 100 and HARRIS_K_INT = 4 documenting the
  integer-scaled form of Harris k = 0.04 (k × 100 = 4)
- orb/harris_response: replace literals 100 and 4 with HARRIS_SCORE_SCALE
  and HARRIS_K_INT
- orb/fast_detect: remove roundabout u16::from(FAST_THRESHOLD) → as u8;
  use config::FAST_THRESHOLD (u8) directly in saturating_sub/add
- matcher: add debug_assert!(u16::try_from(qi/ti).is_ok()) before the
  index → u16 casts for feat_idx0 / feat_idx1
- matcher: replace best_dist.min(255) with best_dist.min(u32::from(u8::MAX))
  to avoid the magic literal
matrix/to_contiguous: fast-path the non-transposed case as self.clone();
  previously performed element-by-element get_raw/set_raw even when the
  data was already in contiguous row-major order; this eliminates a full
  double-loop copy on every call from QR and Gaussian solvers

matrix/matvec: add non-transposed fast path using direct slice indexing
  data[i*k+j], bypassing the get_raw branch per element; the transposed
  path (rare) keeps the existing get_raw loop

matrix/matmul: add !self.transposed && rhs.transposed fast path (ABᵀ);
  covers v.matmul(&u.t()), u.matmul(&w.t()), b_mat.matmul(&a.t()) in SVD,
  essential matrix, and EPnP; both rows (a[i*k+l] and b[j*bk+l]) are
  stride-1 in the inner l loop, avoiding all get_raw branches

solver/solve_linear_qr: replace (0..m).map(|i| b.get_raw(i,0)).collect()
  with b.as_raw_slice().to_vec(); safe because a single-column (or its
  transpose) matrix has identical physical and logical element ordering

orb/fast_detect: pre-allocate candidates Vec with capacity MAX_KEYPOINTS
  to avoid repeated reallocs during the per-frame pixel scan
…umber cleanup

Trait abstractions:
- Add FeatureDetector and DescriptorMatcher traits in features/mod.rs (static dispatch)
- impl FeatureDetector for OrbDetector
- impl DescriptorMatcher for BfMatcher; add match_features_into (pre-allocated output buffer, RT-safe)

Realtime allocation elimination:
- epnp: pre-alloc pool/s_world/s_img/mask/best_mask before 64-iter loop (192+ allocs → 0); inline Fisher-Yates; remove sample_k
- essential: pre-alloc pool/mask/best_mask before 640-iter loop (1280+ allocs → 0); inline Fisher-Yates; remove sample_8
- harris_select: flat Vec<(kp, score, bin)> + single sort + linear pass replaces Vec<Vec<...>> bins (~300 inner allocs/frame → 0)

No-magic-number cleanup:
- config: add ORB_DESCRIPTOR_WORDS = 8 and GAUSSIAN_STACK_SOLVE_N = 16
- types: OrbDescriptor uses config::ORB_DESCRIPTOR_WORDS
- solver: stack buffer size uses config::GAUSSIAN_STACK_SOLVE_N
- orb/brief: >> 5 / & 31 replaced by / u32::BITS as usize / % u32::BITS as usize

Robust tests: 10 new tests covering trait contracts, static dispatch, buffer reuse,
descriptor word count, and per-patch feature count invariant. 61 total (47 unit + 14 integration).
… alloc elimination

Typing / portability:
- types.rs: add Mat3 = [f32; 9] type alias (C: float[9]); mirrors C layout explicitly
- imu_preintegration: use Mat3 for all rotation matrix params/fields/returns
- config: add IDENTITY_MAT3 const; use in ImuFactorWithBias::identity()

New constants:
- config: IMU_POSITION_WEIGHT = IMU_WEIGHT_SCALE / σ_a² (precomputed, eliminates
  magic division in add_imu_regularisation + imu_factor_cost, verified by test)

New trait abstractions:
- optimizer/mod.rs: Optimizer trait (optimize → Result<f32>)
- optimizer/mod.rs: Preintegrator trait (process / extract_and_reset / reset)
- impl Optimizer for PoseGraphOptimizer; impl Preintegrator for ImuPreintegrator

New SO(3) utility:
- imu_preintegration: mat3_transpose; used in exp_so3_rotation_is_orthogonal test

Bug fix — age_world_points (pose_graph.rs):
- Previous code: swap_remove(idx) then retain(wp_id != idx). After swap_remove,
  the element swapped in from the end retains its old wp_id in observations → wrong.
- New code: stable compaction + remapping table (old_idx → new_idx) applied in
  one O(|obs|) retain+remap pass instead of O(k × |obs|) multi-retain.
  Test age_world_points_remaps_observations_correctly verifies this explicitly.

Alloc elimination:
- essential.rs eight_point: f_vec: Vec<f32> → [f32; 9] stack array (640 allocs/call removed)

Tests: +8 new (Mat3, IDENTITY_MAT3, trait dispatch, observation remapping, noop fast-path).
69 total (55 unit + 14 integration). Clippy clean.
## levio_rust — realtime-safe VIO implementation

Full Rust port of the LEVIO visual-inertial odometry pipeline targeting
GAP9 (160×120, 200 Hz IMU).  All 104 tests pass (77 unit + 13 cross_lang
+ 14 integration), cargo clippy -D warnings clean.

Key highlights:
- Trait-oriented static dispatch: FeatureDetector, DescriptorMatcher,
  Optimizer, Preintegrator — zero vtable overhead
- Zero heap allocation on the hot path: pre-allocated scratch buffers in
  OrbDetector, BfMatcher, PoseGraphOptimizer, EpnpScratch; QrScratch
  field eliminates 1 large alloc per LM iteration
- EPnP RANSAC (64 iters) + Essential RANSAC (640 iters) fully stack-based:
  EpnpScratch, inline AᵀA, Fisher-Yates partial shuffle, fixed-array SVD
- IMU preintegration (200 steps): 4.8 µs, 0 heap bytes cold and warm
- SVD: s returned as [f32; MAX_SVD_DIM] stack array; only U + V matrices
  heap-allocated (72 B per call)
- stable age_world_points compaction (was O(k²) swap_remove with silent
  index corruption)
- LevioError::NegativeDepth added; triangulate_point cheirality check

## cross_lang_tests — three-language numerical consistency suite

Standalone test suite validating C / Python / Rust produce identical
results to < 2e-5 (f32 tolerance) on all shared math vectors.

- c/test_numerics.c       617 assertions, make bench for timing
- python/test_numerics.py 687 assertions, --bench for timing
- levio_rust/tests/cross_lang.rs  119 #[test] functions

Functions: skew, mat3_mul, mat3_vec, mat3_transpose, exp_so3/rodrigues_exp,
hamming_distance, DLT row construction.

Algebraic invariants: det=+1, RᵀR=I, RRᵀ=I, tr(R)=1+2cos(θ),
R(π)²=I, R(θn̂)·R(φn̂)=R((θ+φ)n̂), isometry, DLT nullspace.

Linters (all warnings as errors):
- C: -Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2
- Python: ruff (E W F I UP B C4 SIM RUF)
- Rust: cargo clippy -D warnings

Runner: bash cross_lang_tests/run_all.sh

## Benchmark tooling

- plot_benchmarks.py      Python vs C vs Rust throughput + heap
- plot_rust_vs_c.py       Rust vs C primitives + speedup ratio
- plot_e2e_memory.py      E2E pipeline timing + heap allocation per call
- examples/alloc_bench.rs Counting GlobalAlloc; measures cold/warm heap
                          per operation (orb_detect: 75 KB cold / 0 warm;
                          SVD 3×3: 72 B every call; all else: 0 B)
Copilot AI review requested due to automatic review settings April 16, 2026 20:28
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@vlordier
Copy link
Copy Markdown
Author

cc @Pafrak @Yaooooo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants