Skip to content

Full measurements in reconfigurator#9877

Open
labbott wants to merge 39 commits intomainfrom
labbott/full_measurement_reconfigurator
Open

Full measurements in reconfigurator#9877
labbott wants to merge 39 commits intomainfrom
labbott/full_measurement_reconfigurator

Conversation

@labbott
Copy link
Contributor

@labbott labbott commented Feb 18, 2026

This is the full stack of changes to support reference measurements. This is described more fully in RFD 512 but briefly: the goal here is to be able to distribute a set of reference measurements (hashes of what software we expect to be running on the rack) to sprockets so each end of a sprockets connection can appraise the measurements of the remote peer (compare what's actually running on the rack to the expected reference measurements).

Blueprints now have knowledge of measurements (see omicron#9718) and this PR stack is responsible for having reconfigurator update the measurements. Measurements are an artifact in the TUF repo. The high level goals here are

  • Handle going from a blueprint with Unknown measurements (the default when pulling out old blueprints) to a known set of measurements
  • Support reading measurements from the install dataset during a MUPdate override situation
  • Making sure the measurements included on each sled include all the measurements from the old TUF repo and the new TUF repo. This is the only set of software we ever expect to be running.
  • Updating the measurements before performing any other changes to the sled

Big points to check

  • MUPdate override must always work as this will be the recovery path. This will be even more important when measurements are enforced (right now it's logged but the sprockets connection continues)
  • Is this the right place for measurements to come first in reconfigurator? Right now we're only measuring RoT and SP but Host OS is coming soon. Will we need changes when that happens?
  • Are there enough test cases with the reconfigurator-cli?
  • The edit counts for measurements with the sled-editor never felt quite right. Is there a better way to check?

Automated test output to check

  • cmds-missing-measurement-manifest.txt (new test)
  • cmds-unknown-measurements.txt (new test)
  • cmds-mupdate-update-flow.txt (had to tweak output)

Tests added

  • nexus/reconfigurator/planning/tests/integration_tests/planner.rs
  • nexus/db-queries/src/db/datastore/deployment.rs

Most exciting files to review

  • nexus/reconfigurator/planning/src/planner.rs (this changes some of the logic of reconfigurator to have measurements come first)
  • nexus/reconfigurator/planning/src/planner/image_source.rs (MUPdate override logic, also needs to work with unknown measurements)
  • nexus/reconfigurator/planning/src/measurements.rs (actually very concise logic for planning measurements)
  • nexus/reconfigurator/planning/src/blueprint_editor/sled_editor/measurements.rs (logic for editing measurements)

@labbott labbott marked this pull request as draft February 18, 2026 13:35
@labbott labbott force-pushed the labbott/full_measurement_reconfigurator branch 2 times, most recently from 59f8771 to a1faf54 Compare February 24, 2026 17:38
@labbott labbott changed the base branch from main to labbott/measurement_blueprints February 24, 2026 20:46
@labbott labbott changed the base branch from labbott/measurement_blueprints to main February 24, 2026 20:47
@labbott labbott force-pushed the labbott/full_measurement_reconfigurator branch from d32f8be to b6ece6d Compare February 25, 2026 20:10
@labbott labbott changed the base branch from main to labbott/measurement_blueprints February 25, 2026 20:10
@labbott labbott changed the base branch from labbott/measurement_blueprints to main February 25, 2026 20:15
@labbott labbott force-pushed the labbott/full_measurement_reconfigurator branch 2 times, most recently from 664dfa8 to 072d8db Compare February 27, 2026 13:53
@labbott labbott changed the title WIP: full measurements in reconfigurator Full measurements in reconfigurator Feb 27, 2026
@labbott labbott marked this pull request as ready for review February 27, 2026 13:53
@labbott labbott added this to the 19 milestone Feb 27, 2026
)?;
sim_source.simulate_zone_errors(&source.with_zone_error)?;
Ok(SimTufRepoDescription::new(sim_source))
if source.with_measurement_manifest_error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gotten to the changes to SimTufRepoDescription yet, but looking at the surrounding code here I'm wondering if we could attach this to SimTufRepoSource instead? So both these branches would become something like

     sim_source.simulate_zone_errors(&source.with_zone_error)?;
+    sim_source.simulate_measurement_error("...error message...")?;

instead of needing to call different SimTufRepoDescription constructors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I tried at first. simulate_zone_errors works to simulate an error on individual zones in the manifest. What with_measurement_manifest_error is simulating is no manifest present at all (which is very important for us to simulate as that's what we're going to see in R18 -> R19)

panic!("did not converge after {MAX_PLANNING_ITERATIONS} iterations");
}

// This test case was based on an error I hit while developing!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... what was the error? 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having one set of measurements be a subset of another wasn't detected correctly and we'd generate infinite blueprints. I 'll clarify that

}

#[test]
fn test_multiple_measurements() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test cover everything test_subset_measurements() covers (but with more measurements)? At a glance that look very similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_subset_measurements was based on a specific error case I ran into. test_multiple_measurements was designed to check the expected case but I do think they have a lot of overlap. I think I'd like to keep both with some clarification.

@jgallagher
Copy link
Contributor

Will give this another look soon - also would be good to get eyes from @sunshowers since this touches a lot of the mupdate override / recovery bits.

pub use planning_report::PlanningCockroachdbSettingsStepReport;
pub use planning_report::PlanningDecommissionStepReport;
pub use planning_report::PlanningExpungeStepReport;
pub use planning_report::PlanningMeasurementUpdatesStepReport;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has a few instances of measurment -- could you fix the typos in a followup?

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more comments, going to sleep now :)

Comment on lines 201 to 203
Err("reconfigurator-sim: simulated error \
validating zone image"
.to_owned())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "simulated error validating measurement" or similar?

let mut sim = ReconfiguratorCliTestState::new(TEST_NAME, &logctx.log);
sim.load_example_customized(|builder| builder.with_target_release_0_0_1())
.expect("loaded example system");
let blueprint1 = sim.assert_latest_blueprint_is_blippy_clean();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add blippy checks for measurements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never looked at blippy before! I'll take a pass and see what makes sense.

Comment on lines +28 to +30
# We expect to see all sleds successfully show measurements in the artifact
# state including the sled with the measurement manifest error
blueprint-diff latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# We expect to see all sleds successfully show measurements in the artifact
# state including the sled with the measurement manifest error
blueprint-diff latest
# We expect to see all sleds successfully show measurements in the artifact
# state, including the sled with the measurement manifest error.
blueprint-diff latest

Also, what happens if:

  • the mupdate override is cleared, setting measurements to InstallDataset
  • then noop conversion finds that the manifest is missing
  • which causes do_plan_measurements to be blocked?

I don't fully understand the state machine here I guess -- I don't think zone images have a missing state the way this does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Measurements do behave differently than zones because we have to account for the upgrade case. Online update is now available to customers so we cannot guarantee a MUPdate to produce a manifest or measurements in the install dataset. This is okay and the code handles this update path appropriately.

It's impossible to tell the difference between a missing manifest because it never got one (expected) and a missing manifest because of another reason (unexpected). In the example given, the expectation is that if we have a mupdate override we must have done a mupdate which should install the proper measurements. If we are in a case where we're trying to boot with measurements set to InstallDataset but can't get the manifest this is going to be a hard error that requires a new MUPdate to fix. The nightmare scenario is something where a MUPdate cannot fix our measurement problems.

Co-authored-by: Rain <rain@oxide.computer>
/// This is the same as `set_value` but if the internal value is
/// still `Original` and `value` matches we will leave the
/// value as `Original`
pub(crate) fn set_value_if_unchanged(&mut self, value: T) -> Cow<'_, T>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this again and I think this function is actually unnecessary and I had accounting bugs elsewhere? Going to see about re-testing.

Comment on lines +1100 to +1101
Measurement updates:
Waiting on zone add/update blockers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgallagher is this clearer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants