Clean up and standardize IRQ behavior in sequencer by nathanaelhuffman · Pull Request #467 · oxidecomputer/quartz

nathanaelhuffman · 2026-02-16T17:45:43Z

Adds a new common block to support standardized IRQ behavior.
Adds test cases exercising this new block.
Adds test cases around NICMAPO recovery from the SP5.
Update sequencer sm's to use RDL representations.

This has been integrated with hubris and tested on a cosmo successfully.

Change IFR to be W1C as part of this work.. Add ILR and IGR registers to see the live IRQ inputs and generate edge interrupts respectively. Wire up nic fault to the PCIe I/O expander. Allow PCIe-induced NIC MAPO recovery. Increase test coverage to cover checking MAPO clear and the various recovery cases.

@nathanaelhuffman

…2390) Depends on #2392 There are several things wrong with the current handling of sequencer FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To wit: 1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO` interrupt from the sequencer FPGA. When the NIC MAPOs, we should be transitioning from A0+HP back to A0, rather than just ignoring the NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper thermal shutdown (see #2384), as the `thermal` task continues to poll the NIC's temperature sensor despite it no longer being powered, and extrapolates a worst-case temperature trend in the face of I2C NACKs from the temperature sensor, which eventually result in a thermal shutdown. This isn't great. 2. Both the Cosmo sequencer FPGA interrupt and the Gimlet Vcore VRM PMBus alert handling are implemented using the STM32 external interrupt (EXTI) peripheral, which only supports edge-triggered interrupts (not level-triggered). On Cosmo, where the FPGA IRQ line can be triggered by multiple bits in the FPGA's Interrupt Flags Register (IFR) being set, this means that we can miss interrupts in a situation where one IFR bit is set and then, *after* Hubris reads the IFR, a second IFR bit is *also* set while the `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits which are set after the initial IFR read are completely missed by Hubris. Sadly, it turns out that this is often what happens in the event of a voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the voltage dips enough to both trigger the RAA229620A's Vin undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to occur while we are in the interrupt handler but after the initial IFR read, so the NIC MAPO interrupt tends to be missed. Cool. 3. Our code that attempts to clear PMBus faults in the RAA229620As (on Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see #2389), due to multiple _hilarious_ bugs: - We are currently sending the PMBus `CLEAR_FAULTS` command without a PMBus page (rail) associated with it. It turns out that this does _not_ actually result in clearing faults on all pages. Instead, it results in, uh, nothing happening. Quoth the PMBus standard: > Commands to clear a bit are gated by the PAGE command. The > CLEAR_FAULTS can be made to clear all faults on all pages by > setting the page command to FFh. > > --- _PMBus Power System Mgt Protocol Specification – Part II – > Revision 1.3.1_; §10.3 (p. 44) So, sending the command without a page doesn't actually do anything. That's great. - The second thing that stops this from working is _technically_ our fault but is due to a decision on the part of the vendor which is both profoundly bone-headed and somewhat poorly documented (as is traditional). As @nathanaelhuffman discovered, the RAA229620A documentation for the `VIN_UV_FAULT_RESPONSE` register, states the following: > Configures the input undervoltage fault response. **For a fault > to be considered cleared, the input voltage must rise by 1/16th > of the UV fault threshold value.** > > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p. > 57) (emphasis mine) When we configure the undervoltage warning, [we set the threshold to 11.75V][1]. If you remember fractions from...elementary school?[^1] you may have already realized that 1/16 * 11.75V = 0.734375V. This means that for the VRM to clear the fault, the input voltage must reach 12.484375V (since 11.75V + 0.734375V = 12.484375V). Yes, that is *almost half a volt above the nominal voltage for the 12V rail*. It isn't supposed to do that! It's supposed to be 12V!! So, even if a voltage sag has ended and the input voltage to the VRM has returned to 12V, the fault will not clear because the voltage has not increased by the requisite 1/16th of the undervolt threshold. Isn't that wonderful? 5. There was also a FPGA bug in which the `NIC_MAPO` bit did not actually get unset as expected. This branch unwinds the entire Litany of Sadness and Pain described above, by doing the following things: - Integrating with @nathanaelhuffman's changes in oxidecomputer/quartz#467, which fixes the bug with clearing the `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be write-1-to-clear. Now, we can clear them properly in the Cosmo sequencer's routine for handling FPGA IRQs. This required #2392 as the new FPGA register map wouldn't build without it. - Actually making the NIC MAPO interrupt transition the system state from A0+HP to A0, which (in conjunction with #2391) fixes #2384. - Changing the code for clearing RAA22960A and RAA229618 PMBus faults to do paged writes rather than unpaged writes, fixing #2389. - Changing the Vin undervolt warning threshold from 11.75V to 11V, so that the fault actually clears when the input voltage returns to normal. - On Cosmo, changing the code for handling the EXTI notification to continue reading the FPGA IFR register and clearing any set interrupt bits until the IRQ line is actually deasserted in a loop, to compensate for the possibility of additional bits being set whilst we are already handling an interrupt but after we have read the IFR. This does not apply to the PMBus alerts from the RAA229620As, as they are set in the IFR as long as the corresponding PMBus alert line to the FPGA is asserted, and cannot be cleared by writing back a 1 to the IFR. Instead, we mask them out by setting the corresponding bits in the FPGA Interrupt Enable Register (IER), so that they can remain asserted without the FPGA continuing to assert the IRQ. - Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but it doesn't reset the IER when the PMBus fault actually does clear, so we wouldn't be notified of a _subsequent_ fault condition. Thus, we must also set a flag that tells the sequencer task to continue trying to clear faults in the VRM(s) that asserted the PMBus alert periodically until the fault actually goes away. If it does, we then re-enable IRQs for the corresponding `PMALERT_L` pin(s). - A slightly simpler version of that also had to happen on Gimlet. It's simpler because Gimlet only monitors a single RAA229618. Whew okay yeah, I think that's everything. This was a whole thing. Fixes #2384 Fixes #2389 [1]: https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91 [^1]: Or middle school? When do children learn fractions?

@nathanaelhuffman

…xidecomputer#2390) Depends on oxidecomputer#2392 There are several things wrong with the current handling of sequencer FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To wit: 1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO` interrupt from the sequencer FPGA. When the NIC MAPOs, we should be transitioning from A0+HP back to A0, rather than just ignoring the NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper thermal shutdown (see oxidecomputer#2384), as the `thermal` task continues to poll the NIC's temperature sensor despite it no longer being powered, and extrapolates a worst-case temperature trend in the face of I2C NACKs from the temperature sensor, which eventually result in a thermal shutdown. This isn't great. 2. Both the Cosmo sequencer FPGA interrupt and the Gimlet Vcore VRM PMBus alert handling are implemented using the STM32 external interrupt (EXTI) peripheral, which only supports edge-triggered interrupts (not level-triggered). On Cosmo, where the FPGA IRQ line can be triggered by multiple bits in the FPGA's Interrupt Flags Register (IFR) being set, this means that we can miss interrupts in a situation where one IFR bit is set and then, *after* Hubris reads the IFR, a second IFR bit is *also* set while the `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits which are set after the initial IFR read are completely missed by Hubris. Sadly, it turns out that this is often what happens in the event of a voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the voltage dips enough to both trigger the RAA229620A's Vin undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to occur while we are in the interrupt handler but after the initial IFR read, so the NIC MAPO interrupt tends to be missed. Cool. 3. Our code that attempts to clear PMBus faults in the RAA229620As (on Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see oxidecomputer#2389), due to multiple _hilarious_ bugs: - We are currently sending the PMBus `CLEAR_FAULTS` command without a PMBus page (rail) associated with it. It turns out that this does _not_ actually result in clearing faults on all pages. Instead, it results in, uh, nothing happening. Quoth the PMBus standard: > Commands to clear a bit are gated by the PAGE command. The > CLEAR_FAULTS can be made to clear all faults on all pages by > setting the page command to FFh. > > --- _PMBus Power System Mgt Protocol Specification – Part II – > Revision 1.3.1_; §10.3 (p. 44) So, sending the command without a page doesn't actually do anything. That's great. - The second thing that stops this from working is _technically_ our fault but is due to a decision on the part of the vendor which is both profoundly bone-headed and somewhat poorly documented (as is traditional). As @nathanaelhuffman discovered, the RAA229620A documentation for the `VIN_UV_FAULT_RESPONSE` register, states the following: > Configures the input undervoltage fault response. **For a fault > to be considered cleared, the input voltage must rise by 1/16th > of the UV fault threshold value.** > > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p. > 57) (emphasis mine) When we configure the undervoltage warning, [we set the threshold to 11.75V][1]. If you remember fractions from...elementary school?[^1] you may have already realized that 1/16 * 11.75V = 0.734375V. This means that for the VRM to clear the fault, the input voltage must reach 12.484375V (since 11.75V + 0.734375V = 12.484375V). Yes, that is *almost half a volt above the nominal voltage for the 12V rail*. It isn't supposed to do that! It's supposed to be 12V!! So, even if a voltage sag has ended and the input voltage to the VRM has returned to 12V, the fault will not clear because the voltage has not increased by the requisite 1/16th of the undervolt threshold. Isn't that wonderful? 5. There was also a FPGA bug in which the `NIC_MAPO` bit did not actually get unset as expected. This branch unwinds the entire Litany of Sadness and Pain described above, by doing the following things: - Integrating with @nathanaelhuffman's changes in oxidecomputer/quartz#467, which fixes the bug with clearing the `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be write-1-to-clear. Now, we can clear them properly in the Cosmo sequencer's routine for handling FPGA IRQs. This required oxidecomputer#2392 as the new FPGA register map wouldn't build without it. - Actually making the NIC MAPO interrupt transition the system state from A0+HP to A0, which (in conjunction with oxidecomputer#2391) fixes oxidecomputer#2384. - Changing the code for clearing RAA22960A and RAA229618 PMBus faults to do paged writes rather than unpaged writes, fixing oxidecomputer#2389. - Changing the Vin undervolt warning threshold from 11.75V to 11V, so that the fault actually clears when the input voltage returns to normal. - On Cosmo, changing the code for handling the EXTI notification to continue reading the FPGA IFR register and clearing any set interrupt bits until the IRQ line is actually deasserted in a loop, to compensate for the possibility of additional bits being set whilst we are already handling an interrupt but after we have read the IFR. This does not apply to the PMBus alerts from the RAA229620As, as they are set in the IFR as long as the corresponding PMBus alert line to the FPGA is asserted, and cannot be cleared by writing back a 1 to the IFR. Instead, we mask them out by setting the corresponding bits in the FPGA Interrupt Enable Register (IER), so that they can remain asserted without the FPGA continuing to assert the IRQ. - Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but it doesn't reset the IER when the PMBus fault actually does clear, so we wouldn't be notified of a _subsequent_ fault condition. Thus, we must also set a flag that tells the sequencer task to continue trying to clear faults in the VRM(s) that asserted the PMBus alert periodically until the fault actually goes away. If it does, we then re-enable IRQs for the corresponding `PMALERT_L` pin(s). - A slightly simpler version of that also had to happen on Gimlet. It's simpler because Gimlet only monitors a single RAA229618. Whew okay yeah, I think that's everything. This was a whole thing. Fixes oxidecomputer#2384 Fixes oxidecomputer#2389 [1]: https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91 [^1]: Or middle school? When do children learn fractions?

nathanaelhuffman added 2 commits February 16, 2026 12:43

get nic mapo connected to pcie power fault. adjust irq strategy

74bf8e2

nathanaelhuffman requested a review from Aaron-Hartwig February 16, 2026 17:45

nathanaelhuffman merged commit 2783ae8 into main Feb 17, 2026
11 of 13 checks passed

nathanaelhuffman deleted the ndh/nic-mapo-irqs-pcie branch February 17, 2026 15:07

hawkw mentioned this pull request Feb 17, 2026

cpu_seq: Fix a variety of hilarious bugs in sequencer IRQ handling oxidecomputer/hubris#2390

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up and standardize IRQ behavior in sequencer#467

Clean up and standardize IRQ behavior in sequencer#467
nathanaelhuffman merged 2 commits intomainfrom
ndh/nic-mapo-irqs-pcie

nathanaelhuffman commented Feb 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nathanaelhuffman commented Feb 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant