Skip to content

Clean up and standardize IRQ behavior in sequencer#467

Merged
nathanaelhuffman merged 2 commits intomainfrom
ndh/nic-mapo-irqs-pcie
Feb 17, 2026
Merged

Clean up and standardize IRQ behavior in sequencer#467
nathanaelhuffman merged 2 commits intomainfrom
ndh/nic-mapo-irqs-pcie

Conversation

@nathanaelhuffman
Copy link
Collaborator

Adds a new common block to support standardized IRQ behavior.
Adds test cases exercising this new block.
Adds test cases around NICMAPO recovery from the SP5.
Update sequencer sm's to use RDL representations.

This has been integrated with hubris and tested on a cosmo successfully.

Change IFR to be W1C as part of this work..
Add ILR and IGR registers to see the live IRQ inputs and generate edge interrupts respectively.
Wire up nic fault to the PCIe I/O expander.
Allow PCIe-induced NIC MAPO recovery.
Increase test coverage to cover checking MAPO clear and the various recovery cases.
@nathanaelhuffman nathanaelhuffman merged commit 2783ae8 into main Feb 17, 2026
11 of 13 checks passed
@nathanaelhuffman nathanaelhuffman deleted the ndh/nic-mapo-irqs-pcie branch February 17, 2026 15:07
hawkw added a commit to oxidecomputer/hubris that referenced this pull request Feb 20, 2026
…2390)

Depends on #2392

There are several things wrong with the current handling of sequencer
FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To
wit:

1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO`
   interrupt from the sequencer FPGA. When the NIC MAPOs, we should be
   transitioning from A0+HP back to A0, rather than just ignoring the
   NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper
   thermal shutdown (see #2384), as the `thermal` task continues to poll
   the NIC's temperature sensor despite it no longer being powered, and
   extrapolates a worst-case temperature trend in the face of I2C NACKs
   from the temperature sensor, which eventually result in a thermal
   shutdown. This isn't great.
2. Both the Cosmo sequencer FPGA interrupt and the Gimlet
   V<sub>core</sub> VRM PMBus alert handling are implemented using the
   STM32 external interrupt (EXTI) peripheral, which only supports
   edge-triggered interrupts (not level-triggered). On Cosmo, where the
   FPGA IRQ line can be triggered by multiple bits in the FPGA's
   Interrupt Flags Register (IFR) being set, this means that we can miss
   interrupts in a situation where one IFR bit is set and then, *after*
   Hubris reads the IFR, a second IFR bit is *also* set while the
   `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits
   which are set after the initial IFR read are completely missed by
   Hubris.

   Sadly, it turns out that this is often what happens in the event of a
   voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the
   voltage dips enough to both trigger the RAA229620A's V<sub>in</sub>
   undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to
   occur while we are in the interrupt handler but after the initial IFR
   read, so the NIC MAPO interrupt tends to be missed. Cool.
3. Our code that attempts to clear PMBus faults in the RAA229620As (on
   Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see #2389), due to
   multiple _hilarious_ bugs:
   - We are currently sending the PMBus `CLEAR_FAULTS` command without a
     PMBus page (rail) associated with it. It turns out that this does
     _not_ actually result in clearing faults on all pages. Instead, it
     results in, uh, nothing happening.

     Quoth the PMBus standard:
     > Commands to clear a bit are gated by the PAGE command. The
     > CLEAR_FAULTS can be made to clear all faults on all pages by
     > setting the page command to FFh.
     >
     > --- _PMBus Power System Mgt Protocol Specification – Part II –
     > Revision 1.3.1_; §10.3 (p. 44)

     So, sending the command without a page doesn't actually do
     anything. That's great.
   - The second thing that stops this from working is _technically_ our
     fault but is due to a decision on the part of the vendor which is
     both profoundly bone-headed and somewhat poorly documented (as is
     traditional). As @nathanaelhuffman discovered, the RAA229620A
     documentation for the `VIN_UV_FAULT_RESPONSE` register, states the
     following:

     > Configures the input undervoltage fault response. **For a fault
     > to be considered cleared, the input voltage must rise by 1/16th
     > of the UV fault threshold value.**
     >
     > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p.
     > 57) (emphasis mine)

     When we configure the undervoltage warning, [we set the threshold
     to 11.75V][1]. If you remember fractions from...elementary
     school?[^1] you may have already realized that 1/16 * 11.75V =
     0.734375V. This means that for the VRM to clear the fault, the
     input voltage must reach 12.484375V (since 11.75V + 0.734375V =
     12.484375V).

     Yes, that is *almost half a volt above the nominal voltage for the
     12V rail*. It isn't supposed to do that! It's supposed to be 12V!!
     So, even if a voltage sag has ended and the input voltage to the
     VRM has returned to 12V, the fault will not clear because the
     voltage has not increased by the requisite 1/16th of the undervolt
     threshold. Isn't that wonderful?
5. There was also a FPGA bug in which the `NIC_MAPO` bit did not
   actually get unset as expected.

This branch unwinds the entire Litany of Sadness and Pain described
above, by doing the following things:

- Integrating with @nathanaelhuffman's changes in
  oxidecomputer/quartz#467, which fixes the bug with clearing the
  `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be
  write-1-to-clear. Now, we can clear them properly in the Cosmo
  sequencer's routine for handling FPGA IRQs. This required #2392 as the
  new FPGA register map wouldn't build without it.
- Actually making the NIC MAPO interrupt transition the system state
  from A0+HP to A0, which (in conjunction with #2391) fixes #2384.
- Changing the code for clearing RAA22960A and RAA229618 PMBus faults to
  do paged writes rather than unpaged writes, fixing #2389.
- Changing the V<sub>in</sub> undervolt warning threshold from 11.75V to
  11V, so that the fault actually clears when the input voltage returns
  to normal.
- On Cosmo, changing the code for handling the EXTI notification to
  continue reading the FPGA IFR register and clearing any set interrupt
  bits until the IRQ line is actually deasserted in a loop, to
  compensate for the possibility of additional bits being set whilst we
  are already handling an interrupt but after we have read the IFR.

  This does not apply to the PMBus alerts from the RAA229620As, as they
  are set in the IFR as long as the corresponding PMBus alert line to
  the FPGA is asserted, and cannot be cleared by writing back a 1 to the
  IFR. Instead, we mask them out by setting the corresponding bits in
  the FPGA Interrupt Enable Register (IER), so that they can remain
  asserted without the FPGA continuing to assert the IRQ.
- Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but
  it doesn't reset the IER when the PMBus fault actually does clear, so
  we wouldn't be notified of a _subsequent_ fault condition. Thus, we
  must also set a flag that tells the sequencer task to continue trying
  to clear faults in the VRM(s) that asserted the PMBus alert
  periodically until the fault actually goes away. If it does, we then
  re-enable IRQs for the corresponding `PMALERT_L` pin(s).
- A slightly simpler version of that also had to happen on Gimlet. It's
  simpler because Gimlet only monitors a single RAA229618.

Whew okay yeah, I think that's everything. This was a whole thing.

Fixes #2384
Fixes #2389

[1]:
https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91
[^1]: Or middle school? When do children learn fractions?
brittonr pushed a commit to brittonr/hubris that referenced this pull request Feb 25, 2026
…xidecomputer#2390)

Depends on oxidecomputer#2392

There are several things wrong with the current handling of sequencer
FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To
wit:

1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO`
   interrupt from the sequencer FPGA. When the NIC MAPOs, we should be
   transitioning from A0+HP back to A0, rather than just ignoring the
   NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper
   thermal shutdown (see oxidecomputer#2384), as the `thermal` task continues to poll
   the NIC's temperature sensor despite it no longer being powered, and
   extrapolates a worst-case temperature trend in the face of I2C NACKs
   from the temperature sensor, which eventually result in a thermal
   shutdown. This isn't great.
2. Both the Cosmo sequencer FPGA interrupt and the Gimlet
   V<sub>core</sub> VRM PMBus alert handling are implemented using the
   STM32 external interrupt (EXTI) peripheral, which only supports
   edge-triggered interrupts (not level-triggered). On Cosmo, where the
   FPGA IRQ line can be triggered by multiple bits in the FPGA's
   Interrupt Flags Register (IFR) being set, this means that we can miss
   interrupts in a situation where one IFR bit is set and then, *after*
   Hubris reads the IFR, a second IFR bit is *also* set while the
   `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits
   which are set after the initial IFR read are completely missed by
   Hubris.

   Sadly, it turns out that this is often what happens in the event of a
   voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the
   voltage dips enough to both trigger the RAA229620A's V<sub>in</sub>
   undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to
   occur while we are in the interrupt handler but after the initial IFR
   read, so the NIC MAPO interrupt tends to be missed. Cool.
3. Our code that attempts to clear PMBus faults in the RAA229620As (on
   Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see oxidecomputer#2389), due to
   multiple _hilarious_ bugs:
   - We are currently sending the PMBus `CLEAR_FAULTS` command without a
     PMBus page (rail) associated with it. It turns out that this does
     _not_ actually result in clearing faults on all pages. Instead, it
     results in, uh, nothing happening.

     Quoth the PMBus standard:
     > Commands to clear a bit are gated by the PAGE command. The
     > CLEAR_FAULTS can be made to clear all faults on all pages by
     > setting the page command to FFh.
     >
     > --- _PMBus Power System Mgt Protocol Specification – Part II –
     > Revision 1.3.1_; §10.3 (p. 44)

     So, sending the command without a page doesn't actually do
     anything. That's great.
   - The second thing that stops this from working is _technically_ our
     fault but is due to a decision on the part of the vendor which is
     both profoundly bone-headed and somewhat poorly documented (as is
     traditional). As @nathanaelhuffman discovered, the RAA229620A
     documentation for the `VIN_UV_FAULT_RESPONSE` register, states the
     following:

     > Configures the input undervoltage fault response. **For a fault
     > to be considered cleared, the input voltage must rise by 1/16th
     > of the UV fault threshold value.**
     >
     > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p.
     > 57) (emphasis mine)

     When we configure the undervoltage warning, [we set the threshold
     to 11.75V][1]. If you remember fractions from...elementary
     school?[^1] you may have already realized that 1/16 * 11.75V =
     0.734375V. This means that for the VRM to clear the fault, the
     input voltage must reach 12.484375V (since 11.75V + 0.734375V =
     12.484375V).

     Yes, that is *almost half a volt above the nominal voltage for the
     12V rail*. It isn't supposed to do that! It's supposed to be 12V!!
     So, even if a voltage sag has ended and the input voltage to the
     VRM has returned to 12V, the fault will not clear because the
     voltage has not increased by the requisite 1/16th of the undervolt
     threshold. Isn't that wonderful?
5. There was also a FPGA bug in which the `NIC_MAPO` bit did not
   actually get unset as expected.

This branch unwinds the entire Litany of Sadness and Pain described
above, by doing the following things:

- Integrating with @nathanaelhuffman's changes in
  oxidecomputer/quartz#467, which fixes the bug with clearing the
  `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be
  write-1-to-clear. Now, we can clear them properly in the Cosmo
  sequencer's routine for handling FPGA IRQs. This required oxidecomputer#2392 as the
  new FPGA register map wouldn't build without it.
- Actually making the NIC MAPO interrupt transition the system state
  from A0+HP to A0, which (in conjunction with oxidecomputer#2391) fixes oxidecomputer#2384.
- Changing the code for clearing RAA22960A and RAA229618 PMBus faults to
  do paged writes rather than unpaged writes, fixing oxidecomputer#2389.
- Changing the V<sub>in</sub> undervolt warning threshold from 11.75V to
  11V, so that the fault actually clears when the input voltage returns
  to normal.
- On Cosmo, changing the code for handling the EXTI notification to
  continue reading the FPGA IFR register and clearing any set interrupt
  bits until the IRQ line is actually deasserted in a loop, to
  compensate for the possibility of additional bits being set whilst we
  are already handling an interrupt but after we have read the IFR.

  This does not apply to the PMBus alerts from the RAA229620As, as they
  are set in the IFR as long as the corresponding PMBus alert line to
  the FPGA is asserted, and cannot be cleared by writing back a 1 to the
  IFR. Instead, we mask them out by setting the corresponding bits in
  the FPGA Interrupt Enable Register (IER), so that they can remain
  asserted without the FPGA continuing to assert the IRQ.
- Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but
  it doesn't reset the IER when the PMBus fault actually does clear, so
  we wouldn't be notified of a _subsequent_ fault condition. Thus, we
  must also set a flag that tells the sequencer task to continue trying
  to clear faults in the VRM(s) that asserted the PMBus alert
  periodically until the fault actually goes away. If it does, we then
  re-enable IRQs for the corresponding `PMALERT_L` pin(s).
- A slightly simpler version of that also had to happen on Gimlet. It's
  simpler because Gimlet only monitors a single RAA229618.

Whew okay yeah, I think that's everything. This was a whole thing.

Fixes oxidecomputer#2384
Fixes oxidecomputer#2389

[1]:
https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91
[^1]: Or middle school? When do children learn fractions?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant