Skip to content

Draft: Improved p-values for smooth terms (Wood 2013b). For #163.#586

Open
vsl366 wants to merge 8 commits into
dswah:mainfrom
vsl366:gsoc-vsl366
Open

Draft: Improved p-values for smooth terms (Wood 2013b). For #163.#586
vsl366 wants to merge 8 commits into
dswah:mainfrom
vsl366:gsoc-vsl366

Conversation

@vsl366

@vsl366 vsl366 commented Jun 6, 2026

Copy link
Copy Markdown

Draft PR. Replaces the current method of computing p values in PyGAM with the method mentioned in Simon Wood's "On p-values for smooth components of an extended generalized additive model" (Biometrika 100:221-228), addressing issue #163.

This is work in progress. A basic implementation has been finished with minimal optimization and limited testing, which includes seperate testing using monte carlo, and empirical validation against mgcv's reference, with many different categories of test cases.

Changes:

  1. Added edf1_per_coef to model statistics: the effective degrees of freedom per coefficient, computed as 2 * diag(F) - diag(F^2) per Wood (2013) eq. 4. The existing 'edof' (sum of diag(F)) is unchanged and still used for AIC, GCV, CIs.
  2. Added two helper methods on the GAM class:
    • _liu2: chi-squared mixture tail probability using the moment matching approximation from Liu, Tang, Zhang (2009).
    • _liu2_scaled_quadrature: integrates _liu2 over scaled chi-squared quantiles for the estimated-scale case.
  3. Added _woodteststat, which implements the test statistic and reference distribution from the paper.
    • Get R from Xj
    • Stage A: eigendecompose R V_β R.T to get its principal directions and their variances. mgcv's sign convention is applied to the eigenvectors so the results match the reference implementation exactly.
    • Stage B: split τ into an integer part k = floor(τ) and a fractional part ν = τ - k.
    • Stage C: discard near-zero eigenvalues. If τ exceeds the usable rank, trim it down. If no usable directions remain (degenerate V_β), default with p = 1.
    • Stage D: build the rank-τ pseudoinverse in factored form. The top k-1 directions get fully inverted (divided by sqrt of their eigenvalue). The last two directions are handled together as a boundary pair, partially inverted via the B-tilde block from the paper (a 2x2 matrix mixing them by a factor that depends on ν).
    • Stage E: compute T as the squared norm of the projection of β̂ R onto these scaled directions. Because the matrix square-root of B-tilde has a sign ambiguity, two versions of T are computed (with the two sign orientations of the boundary block) and the resulting p-values are averaged, as in the paper.
    • Stage F: compute the p-value. For fractional τ with known scale, the reference distribution is a weighted chi-squared mixture (weights 1, 1, ..., ν₁, ν₂ where ν₁ + ν₂ = ν + 1) and the tail is evaluated by _liu2. For estimated scale, _liu2_scaled_quadrature is used. For integer τ, the function falls through to the simpler chi-squared (known scale) or F (estimated scale) reference, since the boundary block isn't needed.
  4. Modified _compute_p_value to use the above helper function and implement the changes.
  5. Added new tests in the test_pvalues file, validating all created functions
  6. Other small adjustments
  7. Removed known bug warning

References:

  • Wood (2013) Biometrika 100:221-228
  • Liu, Tang, Zhang (2009) Comp Stat Data Analysis 53:853-856

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant