feat: Rust VIO implementation + cross-language numerical test suite#1
Open
vlordier wants to merge 14 commits into
Open
feat: Rust VIO implementation + cross-language numerical test suite#1vlordier wants to merge 14 commits into
vlordier wants to merge 14 commits into
Conversation
Complete Rust port of the LEVIO Visual-Inertial Odometry system with military-grade code quality: #![forbid(unsafe_code)], #![deny(clippy::all, clippy::pedantic)], zero clippy warnings, 54 passing tests (40 unit + 14 integration), and 10 Criterion benchmarks. Algorithms implemented: - ORB detection: FAST-9 + Harris scoring + BRIEF-256 descriptors - Brute-force Hamming matcher with flow filtering and cross-check - DLT triangulation (normalised coords + bare [R|t]) - EPnP RANSAC pose estimation with Sampson-distance essential matrix - Jacobi SVD, Gaussian elimination with partial pivoting, Householder QR - IMU preintegration with SO(3) exp map and bias-correction Jacobians - Windowed Gauss-Newton / LM pose graph optimizer with Huber weighting - Sliding-window pose graph with keyframe eviction and landmark ageing
ORB pipeline (orb.rs): - gaussian_blur: replace single 9-tap pass with separable two 3-tap passes (identical integer output, ~2× fewer multiply-add ops per pixel) - detect: remove two unnecessary Vec clones of the keypoints list - fast_detect: replace closure+.count() quick-rejection with direct pixel reads and integer sum — removes iterator overhead in the innermost loop - is_fast_corner: use a doubled [i8;32] ring so the consecutive-9 scan never touches % 16 modulo; all 8 comparisons explicit for auto-unroll - brief_descriptor: drop the per-bit bounds guard (FAST border 18 > max BRIEF offset 13 makes it unreachable); replace /32+%32 with >>5 and &31 Matcher (matcher.rs): - hamming_distance: fully unroll 8 XOR+popcount ops so LLVM can vectorise - match_features: break on dist==0 (perfect match) to skip remaining train IMU preintegration (imu_preintegration.rs): - integrate_step: hoist skew(omega) out of the 3×3 Jacobian loop (was recomputed 9× per step); hoist r_row and d_row out of the inner col loop 54/54 tests pass; cargo clippy -- -D warnings clean.
Performance: - Matrix::ata() — direct AᵀA in one pass exploiting symmetry (~half the multiplications vs full matmul). No transpose allocation required. - Matrix::ata_data() — returns the flat Vec directly for callers that need a mutable buffer (svd.rs, epnp.rs), eliminating an extra clone. - Matrix::matmul() — add i-k-j fast path for non-transposed matrices: stride-1 inner access on all three arrays; avoids the get_raw branch. Falls back to the general loop for transposed inputs. - svd: use ata_data() — 3 allocations (t+matmul+to_vec) → 1 - epnp: use ata_data() — same 3 → 1 saving for the 2n×12 normal matrix - essential: use ata() for the 8×9 → 9×9 null-space step (saves t() clone) - sampson_distance_sq: accept &[f32; 9] instead of &Matrix; caller extracts the matrix once per RANSAC hypothesis, not 18 get_raw calls per match Fixes: - count_positive_depth: d2_z = p2*0 + p2*0 + 1.0 is always 1.0 → dead code; removed the unreachable && branch - mat3_to_arr: replace core::array::from_fn with i/3, i%3 with explicit literal indices (9 direct get_raw calls, no divide/modulo) - sampson_distance_sq test: construct flat array directly, not a Matrix Chores: - Deduplicate SimpleRng (Xorshift32): was copy-pasted into both epnp.rs and essential.rs; now lives once in visual_odometry/mod.rs as pub(crate) - Matrix::t() doc: remove false "zero-copy" claim; data buffer is cloned 54/54 tests pass; cargo clippy -- -D warnings clean.
Hot path: - mat3_mul: replace core::array::from_fn + div/mod with explicit 27-FMA literal expansion, eliminating 18 integer divisions per call - exp_so3: CSE — precompute omc, kxy, kxz, kyz, skx, sky, skz to eliminate 6 redundant multiply chains in the Rodrigues formula - harris_response: accumulate ixx/iyy/ixy in i32 (max ~1.6M << 2^31), cast to i64 only for the final det/trace products; enables 32-bit SIMD Chores / fixes: - matrix: extract ata_inner() shared by ata() and ata_data(), removing the duplicated loop body - solver: stack-allocate the rhs buffer for n ≤ 16 (covers all VIO use cases, n=3), falling back to heap only for larger systems - pose_graph_optimizer: remove unused _state_dim parameter from accumulate_hessian and add_imu_regularisation and their call sites - svd: simplify V=I init to diagonal-only loop (Vec is already zeroed)
Hot path: - epnp: precompute dc⁻¹ once in compute_barycentrics; replace n × 3×3 Gaussian eliminations with n × 9-FMA mat-vec products (removes the per-point dc_copy allocation and solve_linear_gaussian call loop) - epnp: hoist mean_reprojection_error inside the best-hypothesis block so it is computed at most once (for the winner), not per iteration - epnp: replace mean_reprojection_error Vec<f32> collection with a direct accumulation, eliminating the intermediate heap allocation - imu: simplify trapezoidal bias correction to 0.5*(a+b)-bias, reducing 6 redundant subtractions per integration step Chores / fixes: - matrix: add #[inline] to ata_inner so it is inlined in release builds - epnp: drop now-unused solver import (compute_barycentrics no longer calls solve_linear_gaussian) - bench: remove unused CameraIntrinsics import from levio_bench.rs
Hot path: - orb: replace is_fast_corner ring-doubling ([i8;16]+[i8;32]+loop) with two u16 bitmasks + has_9_consecutive() branchless bit-rotation check; eliminates 48 bytes of stack and the memcopy; drop unused _center param - orb: replace saturating_sub in harris_response inner loop with direct subtraction (both accesses are guarded by the outer border check) - epnp: single-pass centroid in compute_control_points (one loop instead of three iterator passes), replace 1/n division with * inv_n - epnp: symmetric covariance (upper triangle + mirror: 6 muls/point vs 9) Typing / portability: - Add #[repr(C)] to Point2D, Point2Df, Point3D, ImuMeasurement, ImuBias, CameraIntrinsics, FeatureMatch, Observation, ImuFactor, ImuFactorWithBias — guarantees C-compatible layout for FFI and embedded interop - Change ORB_PATCH_RADIUS: i16 → u8; HARRIS_BLOCK_SIZE: i16 → u8 (both are always non-negative counts; u8 matches the C uint8_t type) Magic numbers → named constants (config.rs): - ORB_FAST_DETECT_BORDER = ORB_PATCH_RADIUS + 3 (was 18_usize literal) - BRIEF_MAX_OFFSET = 13 (was 13 literal) - EPNP_SAMPLE_SIZE = 6 (EPnP hypothesis sample) - ESSENTIAL_SAMPLE_SIZE = 8 (8-point algorithm sample) - SVD_JACOBI_N_FACTOR = 20 (was n * 20 literal) - OPTIMIZER_MAX_ITERS = 8 (was 8 literal) - OPTIMIZER_CONVERGENCE_THRESHOLD = 1e-6 (was 1e-6 literal) Module-level constants (epnp.rs): - EPNP_CTRL_PTS = 4, EPNP_DOF = 12 (replace all 4/12/144 literals) Other: - types: add NO_WP_ID: i16 = -1 sentinel constant; use in pose_graph.rs - essential: RansacFailed needed field now uses ESSENTIAL_SAMPLE_SIZE
Hot-path: - svd: merge two `for r in 0..n` loops in jacobi_eigen into one pass, updating V rotation and the symmetric matrix in the same iteration (halves loop overhead and improves cache locality on a/v buffers) - optimizer: eliminate 6 zero-multiplications in j_pt chain rule (dpx_dx[1]=0, dpy_dy[0]=0 are structural zeros, not runtime values) - optimizer: inline skew_xcam as direct cross-product expressions in j_pose_row closure; removes [[f32;3];3] stack allocation and all indirect index reads - accumulate_hessian: replace bounds-checked get/set with get_raw/set_raw throughout; caller already guards pose_idx<n_poses and wp_idx<n_pts, so per-element bounds checks in the hot per-observation loop are redundant; function now returns () instead of Result<(),LevioError> Typing / safety: - pose_graph: add debug_assert!(u16::try_from(...).is_ok()) before new_observations.len() as u16 cast in add_keyframe - pose_graph: add debug_assert!(u32::try_from(...).is_ok()) before n_new as u32 cast in add_keyframe - pose_graph: add debug_assert!(u16::try_from(idx).is_ok()) before idx as u16 cast in age_world_points
Hot-path: - imu_preintegration: replace 2×exp_so3 per integration step with the half-angle squaring trick: compute exp_so3(ω·dt/2) once, then d_rot = mat3_mul(half_rot, half_rot) and r_mid = mat3_mul(r_prev, half_rot); saves one sqrt+sin+cos (~50 ops) at the cost of one extra mat3_mul (27 FMAs) - triangulation: eliminate pose_to_projection and its two 3×4 Matrix heap allocations per call; DLT rows now index the Pose array directly (pose[8+j] for row 2, pose[4+j] for row 1, pose[j] for row 0) - matrix/ata_inner: add non-transposed fast path that bypasses the transposed branch in get_raw on every inner iteration; direct self.data[k*nc+i] * self.data[k*nc+j] indexing for the common case - imu_preintegration: add #[inline] to exp_so3, mat3_mul, mat3_vec, skew; these are small pure functions called in tight integration loops Magic numbers / constants: - config: add IMU_MAX_DT = 1.0 (previously hardcoded in integrate_step) - config: add EPNP_RNG_SEED = 42 (previously hardcoded in ransac_epnp_from_matches) - imu_preintegration: dt > 1.0 → dt > config::IMU_MAX_DT - epnp: 1e-10 tolerance → config::SVD_TOL_JACOBI (both jacobi_eigen calls) - epnp: 100 max-iter for 3×3 PCA → (SVD_MAX_ITER_JACOBI as usize).max(3*SVD_JACOBI_N_FACTOR) - epnp: 42 RNG seed → config::EPNP_RNG_SEED
Hot-path: - orb/harris_response: replace i64 loop counters with direct usize arithmetic; border guard guarantees y >= hb+1 and x >= hb+1, so y-hb+dy and x-hb+dx never underflow; removes 2 signed casts on every iteration of the 7×7 block - solver/solve_linear_qr: pre-allocate hh_buf (Vec<f32; m>) once before the Householder loop rather than allocating a fresh Vec per column; eliminates n heap allocations per QR call (n = state_dim for the normal equations) Correctness fix: - matcher/pixel_distance_sq: previous i32 path could panic in debug (overflow) when both keypoints are near u16::MAX: 65535² = 4_294_836_225 > i32::MAX; new path uses u16::abs_diff → u32, then (dx*dx).saturating_add(dy*dy) which is correct for all u16 inputs without any truncation Typing / magic numbers: - config: add HARRIS_SCORE_SCALE = 100 and HARRIS_K_INT = 4 documenting the integer-scaled form of Harris k = 0.04 (k × 100 = 4) - orb/harris_response: replace literals 100 and 4 with HARRIS_SCORE_SCALE and HARRIS_K_INT - orb/fast_detect: remove roundabout u16::from(FAST_THRESHOLD) → as u8; use config::FAST_THRESHOLD (u8) directly in saturating_sub/add - matcher: add debug_assert!(u16::try_from(qi/ti).is_ok()) before the index → u16 casts for feat_idx0 / feat_idx1 - matcher: replace best_dist.min(255) with best_dist.min(u32::from(u8::MAX)) to avoid the magic literal
matrix/to_contiguous: fast-path the non-transposed case as self.clone(); previously performed element-by-element get_raw/set_raw even when the data was already in contiguous row-major order; this eliminates a full double-loop copy on every call from QR and Gaussian solvers matrix/matvec: add non-transposed fast path using direct slice indexing data[i*k+j], bypassing the get_raw branch per element; the transposed path (rare) keeps the existing get_raw loop matrix/matmul: add !self.transposed && rhs.transposed fast path (ABᵀ); covers v.matmul(&u.t()), u.matmul(&w.t()), b_mat.matmul(&a.t()) in SVD, essential matrix, and EPnP; both rows (a[i*k+l] and b[j*bk+l]) are stride-1 in the inner l loop, avoiding all get_raw branches solver/solve_linear_qr: replace (0..m).map(|i| b.get_raw(i,0)).collect() with b.as_raw_slice().to_vec(); safe because a single-column (or its transpose) matrix has identical physical and logical element ordering orb/fast_detect: pre-allocate candidates Vec with capacity MAX_KEYPOINTS to avoid repeated reallocs during the per-frame pixel scan
…umber cleanup Trait abstractions: - Add FeatureDetector and DescriptorMatcher traits in features/mod.rs (static dispatch) - impl FeatureDetector for OrbDetector - impl DescriptorMatcher for BfMatcher; add match_features_into (pre-allocated output buffer, RT-safe) Realtime allocation elimination: - epnp: pre-alloc pool/s_world/s_img/mask/best_mask before 64-iter loop (192+ allocs → 0); inline Fisher-Yates; remove sample_k - essential: pre-alloc pool/mask/best_mask before 640-iter loop (1280+ allocs → 0); inline Fisher-Yates; remove sample_8 - harris_select: flat Vec<(kp, score, bin)> + single sort + linear pass replaces Vec<Vec<...>> bins (~300 inner allocs/frame → 0) No-magic-number cleanup: - config: add ORB_DESCRIPTOR_WORDS = 8 and GAUSSIAN_STACK_SOLVE_N = 16 - types: OrbDescriptor uses config::ORB_DESCRIPTOR_WORDS - solver: stack buffer size uses config::GAUSSIAN_STACK_SOLVE_N - orb/brief: >> 5 / & 31 replaced by / u32::BITS as usize / % u32::BITS as usize Robust tests: 10 new tests covering trait contracts, static dispatch, buffer reuse, descriptor word count, and per-patch feature count invariant. 61 total (47 unit + 14 integration).
… alloc elimination Typing / portability: - types.rs: add Mat3 = [f32; 9] type alias (C: float[9]); mirrors C layout explicitly - imu_preintegration: use Mat3 for all rotation matrix params/fields/returns - config: add IDENTITY_MAT3 const; use in ImuFactorWithBias::identity() New constants: - config: IMU_POSITION_WEIGHT = IMU_WEIGHT_SCALE / σ_a² (precomputed, eliminates magic division in add_imu_regularisation + imu_factor_cost, verified by test) New trait abstractions: - optimizer/mod.rs: Optimizer trait (optimize → Result<f32>) - optimizer/mod.rs: Preintegrator trait (process / extract_and_reset / reset) - impl Optimizer for PoseGraphOptimizer; impl Preintegrator for ImuPreintegrator New SO(3) utility: - imu_preintegration: mat3_transpose; used in exp_so3_rotation_is_orthogonal test Bug fix — age_world_points (pose_graph.rs): - Previous code: swap_remove(idx) then retain(wp_id != idx). After swap_remove, the element swapped in from the end retains its old wp_id in observations → wrong. - New code: stable compaction + remapping table (old_idx → new_idx) applied in one O(|obs|) retain+remap pass instead of O(k × |obs|) multi-retain. Test age_world_points_remaps_observations_correctly verifies this explicitly. Alloc elimination: - essential.rs eight_point: f_vec: Vec<f32> → [f32; 9] stack array (640 allocs/call removed) Tests: +8 new (Mat3, IDENTITY_MAT3, trait dispatch, observation remapping, noop fast-path). 69 total (55 unit + 14 integration). Clippy clean.
## levio_rust — realtime-safe VIO implementation
Full Rust port of the LEVIO visual-inertial odometry pipeline targeting
GAP9 (160×120, 200 Hz IMU). All 104 tests pass (77 unit + 13 cross_lang
+ 14 integration), cargo clippy -D warnings clean.
Key highlights:
- Trait-oriented static dispatch: FeatureDetector, DescriptorMatcher,
Optimizer, Preintegrator — zero vtable overhead
- Zero heap allocation on the hot path: pre-allocated scratch buffers in
OrbDetector, BfMatcher, PoseGraphOptimizer, EpnpScratch; QrScratch
field eliminates 1 large alloc per LM iteration
- EPnP RANSAC (64 iters) + Essential RANSAC (640 iters) fully stack-based:
EpnpScratch, inline AᵀA, Fisher-Yates partial shuffle, fixed-array SVD
- IMU preintegration (200 steps): 4.8 µs, 0 heap bytes cold and warm
- SVD: s returned as [f32; MAX_SVD_DIM] stack array; only U + V matrices
heap-allocated (72 B per call)
- stable age_world_points compaction (was O(k²) swap_remove with silent
index corruption)
- LevioError::NegativeDepth added; triangulate_point cheirality check
## cross_lang_tests — three-language numerical consistency suite
Standalone test suite validating C / Python / Rust produce identical
results to < 2e-5 (f32 tolerance) on all shared math vectors.
- c/test_numerics.c 617 assertions, make bench for timing
- python/test_numerics.py 687 assertions, --bench for timing
- levio_rust/tests/cross_lang.rs 119 #[test] functions
Functions: skew, mat3_mul, mat3_vec, mat3_transpose, exp_so3/rodrigues_exp,
hamming_distance, DLT row construction.
Algebraic invariants: det=+1, RᵀR=I, RRᵀ=I, tr(R)=1+2cos(θ),
R(π)²=I, R(θn̂)·R(φn̂)=R((θ+φ)n̂), isometry, DLT nullspace.
Linters (all warnings as errors):
- C: -Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2
- Python: ruff (E W F I UP B C4 SIM RUF)
- Rust: cargo clippy -D warnings
Runner: bash cross_lang_tests/run_all.sh
## Benchmark tooling
- plot_benchmarks.py Python vs C vs Rust throughput + heap
- plot_rust_vs_c.py Rust vs C primitives + speedup ratio
- plot_e2e_memory.py E2E pipeline timing + heap allocation per call
- examples/alloc_bench.rs Counting GlobalAlloc; measures cold/warm heap
per operation (orb_detect: 75 KB cold / 0 warm;
SVD 3×3: 72 B every call; all else: 0 B)
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cross_lang_tests/) validating that C, Python, and Rust produce numerically identical results (< 2e-5, f32 tolerance) on all shared math vectorslevio_rust
Architecture
FeatureDetector,DescriptorMatcher,Optimizer,Preintegrator#[repr(C)]:Mat3=[f32;9],Vec3=[f32;3],Pose=[f32;16],OrbDescriptor=[u32;8]cargo clippy -D warningscleanRealtime safety (hot-path allocations)
Performance vs C (-O2)
E2E pipeline (Rust, release build): orb_detect 20 µs, bf_match 22 µs, imu_preintegrate×200 4.8 µs, triangulate 446 ns, SVD 3×3 164 ns.
cross_lang_tests
Three independent implementations of the same deterministic math, cross-validated:
c/test_numerics.c-Wall -Wextra -Werror -Wshadow -Wdouble-promotion -Wformat=2python/test_numerics.pylevio_rust/tests/cross_lang.rs#[test]functionscargo clippy -D warningsAlgebraic invariants tested:
det(R)=+1,RᵀR=I,RRᵀ=I,tr(R)=1+2cos(θ),R(π)²=I, same-axis composition, isometry, DLT nullspacerow·Xₕ=0.Test plan
bash cross_lang_tests/run_all.sh— all lint + test steps passcargo testinlevio_rust/— 104 tests greencargo clippy -- -D warnings— zero warningsmake benchincross_lang_tests/c/— timing output matches documented values