Clean up and standardize IRQ behavior in sequencer#467
Merged
nathanaelhuffman merged 2 commits intomainfrom Feb 17, 2026
Merged
Clean up and standardize IRQ behavior in sequencer#467nathanaelhuffman merged 2 commits intomainfrom
nathanaelhuffman merged 2 commits intomainfrom
Conversation
Change IFR to be W1C as part of this work.. Add ILR and IGR registers to see the live IRQ inputs and generate edge interrupts respectively. Wire up nic fault to the PCIe I/O expander. Allow PCIe-induced NIC MAPO recovery. Increase test coverage to cover checking MAPO clear and the various recovery cases.
hawkw
added a commit
to oxidecomputer/hubris
that referenced
this pull request
Feb 20, 2026
…2390) Depends on #2392 There are several things wrong with the current handling of sequencer FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To wit: 1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO` interrupt from the sequencer FPGA. When the NIC MAPOs, we should be transitioning from A0+HP back to A0, rather than just ignoring the NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper thermal shutdown (see #2384), as the `thermal` task continues to poll the NIC's temperature sensor despite it no longer being powered, and extrapolates a worst-case temperature trend in the face of I2C NACKs from the temperature sensor, which eventually result in a thermal shutdown. This isn't great. 2. Both the Cosmo sequencer FPGA interrupt and the Gimlet V<sub>core</sub> VRM PMBus alert handling are implemented using the STM32 external interrupt (EXTI) peripheral, which only supports edge-triggered interrupts (not level-triggered). On Cosmo, where the FPGA IRQ line can be triggered by multiple bits in the FPGA's Interrupt Flags Register (IFR) being set, this means that we can miss interrupts in a situation where one IFR bit is set and then, *after* Hubris reads the IFR, a second IFR bit is *also* set while the `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits which are set after the initial IFR read are completely missed by Hubris. Sadly, it turns out that this is often what happens in the event of a voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the voltage dips enough to both trigger the RAA229620A's V<sub>in</sub> undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to occur while we are in the interrupt handler but after the initial IFR read, so the NIC MAPO interrupt tends to be missed. Cool. 3. Our code that attempts to clear PMBus faults in the RAA229620As (on Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see #2389), due to multiple _hilarious_ bugs: - We are currently sending the PMBus `CLEAR_FAULTS` command without a PMBus page (rail) associated with it. It turns out that this does _not_ actually result in clearing faults on all pages. Instead, it results in, uh, nothing happening. Quoth the PMBus standard: > Commands to clear a bit are gated by the PAGE command. The > CLEAR_FAULTS can be made to clear all faults on all pages by > setting the page command to FFh. > > --- _PMBus Power System Mgt Protocol Specification – Part II – > Revision 1.3.1_; §10.3 (p. 44) So, sending the command without a page doesn't actually do anything. That's great. - The second thing that stops this from working is _technically_ our fault but is due to a decision on the part of the vendor which is both profoundly bone-headed and somewhat poorly documented (as is traditional). As @nathanaelhuffman discovered, the RAA229620A documentation for the `VIN_UV_FAULT_RESPONSE` register, states the following: > Configures the input undervoltage fault response. **For a fault > to be considered cleared, the input voltage must rise by 1/16th > of the UV fault threshold value.** > > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p. > 57) (emphasis mine) When we configure the undervoltage warning, [we set the threshold to 11.75V][1]. If you remember fractions from...elementary school?[^1] you may have already realized that 1/16 * 11.75V = 0.734375V. This means that for the VRM to clear the fault, the input voltage must reach 12.484375V (since 11.75V + 0.734375V = 12.484375V). Yes, that is *almost half a volt above the nominal voltage for the 12V rail*. It isn't supposed to do that! It's supposed to be 12V!! So, even if a voltage sag has ended and the input voltage to the VRM has returned to 12V, the fault will not clear because the voltage has not increased by the requisite 1/16th of the undervolt threshold. Isn't that wonderful? 5. There was also a FPGA bug in which the `NIC_MAPO` bit did not actually get unset as expected. This branch unwinds the entire Litany of Sadness and Pain described above, by doing the following things: - Integrating with @nathanaelhuffman's changes in oxidecomputer/quartz#467, which fixes the bug with clearing the `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be write-1-to-clear. Now, we can clear them properly in the Cosmo sequencer's routine for handling FPGA IRQs. This required #2392 as the new FPGA register map wouldn't build without it. - Actually making the NIC MAPO interrupt transition the system state from A0+HP to A0, which (in conjunction with #2391) fixes #2384. - Changing the code for clearing RAA22960A and RAA229618 PMBus faults to do paged writes rather than unpaged writes, fixing #2389. - Changing the V<sub>in</sub> undervolt warning threshold from 11.75V to 11V, so that the fault actually clears when the input voltage returns to normal. - On Cosmo, changing the code for handling the EXTI notification to continue reading the FPGA IFR register and clearing any set interrupt bits until the IRQ line is actually deasserted in a loop, to compensate for the possibility of additional bits being set whilst we are already handling an interrupt but after we have read the IFR. This does not apply to the PMBus alerts from the RAA229620As, as they are set in the IFR as long as the corresponding PMBus alert line to the FPGA is asserted, and cannot be cleared by writing back a 1 to the IFR. Instead, we mask them out by setting the corresponding bits in the FPGA Interrupt Enable Register (IER), so that they can remain asserted without the FPGA continuing to assert the IRQ. - Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but it doesn't reset the IER when the PMBus fault actually does clear, so we wouldn't be notified of a _subsequent_ fault condition. Thus, we must also set a flag that tells the sequencer task to continue trying to clear faults in the VRM(s) that asserted the PMBus alert periodically until the fault actually goes away. If it does, we then re-enable IRQs for the corresponding `PMALERT_L` pin(s). - A slightly simpler version of that also had to happen on Gimlet. It's simpler because Gimlet only monitors a single RAA229618. Whew okay yeah, I think that's everything. This was a whole thing. Fixes #2384 Fixes #2389 [1]: https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91 [^1]: Or middle school? When do children learn fractions?
brittonr
pushed a commit
to brittonr/hubris
that referenced
this pull request
Feb 25, 2026
…xidecomputer#2390) Depends on oxidecomputer#2392 There are several things wrong with the current handling of sequencer FPGA interrupts (on Cosmo) and PMBus alerts in the Gimlet sequencer. To wit: 1. On Cosmo, the `cosmo_seq_server` task does not handle the `NIC_MAPO` interrupt from the sequencer FPGA. When the NIC MAPOs, we should be transitioning from A0+HP back to A0, rather than just ignoring the NIC MAPO event. Presently, ignoring NIC MAPOs result in an improper thermal shutdown (see oxidecomputer#2384), as the `thermal` task continues to poll the NIC's temperature sensor despite it no longer being powered, and extrapolates a worst-case temperature trend in the face of I2C NACKs from the temperature sensor, which eventually result in a thermal shutdown. This isn't great. 2. Both the Cosmo sequencer FPGA interrupt and the Gimlet V<sub>core</sub> VRM PMBus alert handling are implemented using the STM32 external interrupt (EXTI) peripheral, which only supports edge-triggered interrupts (not level-triggered). On Cosmo, where the FPGA IRQ line can be triggered by multiple bits in the FPGA's Interrupt Flags Register (IFR) being set, this means that we can miss interrupts in a situation where one IFR bit is set and then, *after* Hubris reads the IFR, a second IFR bit is *also* set while the `FPGA1_TO_SP_IRQ_L` line remains asserted. In this case, any bits which are set after the initial IFR read are completely missed by Hubris. Sadly, it turns out that this is often what happens in the event of a voltage sag on `V12_SYS_A2` similar to mfg-quality#140. If the voltage dips enough to both trigger the RAA229620A's V<sub>in</sub> undervolt warning **and** enough to MAPO the T6, the T6 MAPO tends to occur while we are in the interrupt handler but after the initial IFR read, so the NIC MAPO interrupt tends to be missed. Cool. 3. Our code that attempts to clear PMBus faults in the RAA229620As (on Cosmo)/RAA229618 (on Gimlet)...doesn't due that (see oxidecomputer#2389), due to multiple _hilarious_ bugs: - We are currently sending the PMBus `CLEAR_FAULTS` command without a PMBus page (rail) associated with it. It turns out that this does _not_ actually result in clearing faults on all pages. Instead, it results in, uh, nothing happening. Quoth the PMBus standard: > Commands to clear a bit are gated by the PAGE command. The > CLEAR_FAULTS can be made to clear all faults on all pages by > setting the page command to FFh. > > --- _PMBus Power System Mgt Protocol Specification – Part II – > Revision 1.3.1_; §10.3 (p. 44) So, sending the command without a page doesn't actually do anything. That's great. - The second thing that stops this from working is _technically_ our fault but is due to a decision on the part of the vendor which is both profoundly bone-headed and somewhat poorly documented (as is traditional). As @nathanaelhuffman discovered, the RAA229620A documentation for the `VIN_UV_FAULT_RESPONSE` register, states the following: > Configures the input undervoltage fault response. **For a fault > to be considered cleared, the input voltage must rise by 1/16th > of the UV fault threshold value.** > > --- _R16DS0309EU0100 Rev.1.00_ §11.36 "VIN_OV_FAULT_RESPONSE" (p. > 57) (emphasis mine) When we configure the undervoltage warning, [we set the threshold to 11.75V][1]. If you remember fractions from...elementary school?[^1] you may have already realized that 1/16 * 11.75V = 0.734375V. This means that for the VRM to clear the fault, the input voltage must reach 12.484375V (since 11.75V + 0.734375V = 12.484375V). Yes, that is *almost half a volt above the nominal voltage for the 12V rail*. It isn't supposed to do that! It's supposed to be 12V!! So, even if a voltage sag has ended and the input voltage to the VRM has returned to 12V, the fault will not clear because the voltage has not increased by the requisite 1/16th of the undervolt threshold. Isn't that wonderful? 5. There was also a FPGA bug in which the `NIC_MAPO` bit did not actually get unset as expected. This branch unwinds the entire Litany of Sadness and Pain described above, by doing the following things: - Integrating with @nathanaelhuffman's changes in oxidecomputer/quartz#467, which fixes the bug with clearing the `IFR[NIC_MAPO]` bit, and also changes all the IFR flags to be write-1-to-clear. Now, we can clear them properly in the Cosmo sequencer's routine for handling FPGA IRQs. This required oxidecomputer#2392 as the new FPGA register map wouldn't build without it. - Actually making the NIC MAPO interrupt transition the system state from A0+HP to A0, which (in conjunction with oxidecomputer#2391) fixes oxidecomputer#2384. - Changing the code for clearing RAA22960A and RAA229618 PMBus faults to do paged writes rather than unpaged writes, fixing oxidecomputer#2389. - Changing the V<sub>in</sub> undervolt warning threshold from 11.75V to 11V, so that the fault actually clears when the input voltage returns to normal. - On Cosmo, changing the code for handling the EXTI notification to continue reading the FPGA IFR register and clearing any set interrupt bits until the IRQ line is actually deasserted in a loop, to compensate for the possibility of additional bits being set whilst we are already handling an interrupt but after we have read the IFR. This does not apply to the PMBus alerts from the RAA229620As, as they are set in the IFR as long as the corresponding PMBus alert line to the FPGA is asserted, and cannot be cleared by writing back a 1 to the IFR. Instead, we mask them out by setting the corresponding bits in the FPGA Interrupt Enable Register (IER), so that they can remain asserted without the FPGA continuing to assert the IRQ. - Masking out the RAA229620A PMBus alerts _half-solves_ the problem, but it doesn't reset the IER when the PMBus fault actually does clear, so we wouldn't be notified of a _subsequent_ fault condition. Thus, we must also set a flag that tells the sequencer task to continue trying to clear faults in the VRM(s) that asserted the PMBus alert periodically until the fault actually goes away. If it does, we then re-enable IRQs for the corresponding `PMALERT_L` pin(s). - A slightly simpler version of that also had to happen on Gimlet. It's simpler because Gimlet only monitors a single RAA229618. Whew okay yeah, I think that's everything. This was a whole thing. Fixes oxidecomputer#2384 Fixes oxidecomputer#2389 [1]: https://github.com/oxidecomputer/hubris/blob/939c48a1ff8e3ab059d8813f6a58b82fdaccc562/drv/cosmo-seq-server/src/vcore.rs#L85-L91 [^1]: Or middle school? When do children learn fractions?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a new common block to support standardized IRQ behavior.
Adds test cases exercising this new block.
Adds test cases around NICMAPO recovery from the SP5.
Update sequencer sm's to use RDL representations.
This has been integrated with hubris and tested on a cosmo successfully.