Site Health: Cache the full results and a timestamp in the status transient#12181
Site Health: Cache the full results and a timestamp in the status transient#12181gziolo wants to merge 16 commits into
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
Hi there! 👋 Thank you for your contribution to WordPress! 💖 It looks like this is your first pull request to No one monitors this repository for new pull requests. Pull requests must be attached to a Trac ticket to be considered for inclusion in WordPress Core. To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description. Pull requests are never merged on GitHub. The WordPress codebase continues to be managed through the SVN repository that this GitHub repository mirrors. Please feel free to open pull requests to work on any contribution you are making. More information about how GitHub pull requests can be used to contribute to WordPress can be found in the Core Handbook. Please include automated tests. Including tests in your pull request is one way to help your patch be considered faster. To learn about WordPress' test suites, visit the Automated Testing page in the handbook. If you have not had a chance, please review the Contribute with Code page in the WordPress Core Handbook. The Developer Hub also documents the various coding standards that are followed:
Thank you, |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
westonruter
left a comment
There was a problem hiding this comment.
In addition to my comment about letting the ajax-stored result be preferred, here's a critical observation that Claude made: the lack of an $expiration means that the transient will be autoloaded.
Since it now includes all of the results which could potentially be large, this would run the risk of increasing the bucket of autoloaded options beyond the 1MB limit imposed by Memcached, right?
Full review from Claude
Confirmed the critical detail: set_transient() is called with no expiration ($expiration = 0), so without an external object cache the option is stored with $autoload = true (option.php:1547-1552). The full results blob is therefore autoloaded on every page load site-wide, not just on Site Health screens. That makes the size concern much more severe than the finders estimated.
I have enough verified context. Here are the findings, ranked most-severe first.
Code Review: PR #12181 — Cache Site Health full results + timestamp
Findings
1. Full results blob is autoloaded on every page request (perf regression) — class-wp-site-health.php:3492
set_transient( 'health-check-site-status-result', wp_json_encode( $site_status ) ) is called with no expiration. With no external object cache, set_transient() stores the option with autoload = true (option.php:1547-1552). Previously the value was ~60 bytes (three ints); now it's the entire $results array including every test's HTML description, actions, and badge. That payload is now loaded into memory on every front-end and admin request for every visitor, via wp_load_alloptions() — a site-wide cost paid to store data only the Site Health screen reads (and only the 3 counts from it). Passing an expiration, or storing under a non-autoloaded option, would avoid this.
2. Large results payload can exceed object-cache item limits → counter silently reverts to 0 — class-wp-site-health.php:3489
On installs with a persistent object cache (e.g. Memcached, default 1MB item limit), a site with many site_status_tests (plugins add them, with arbitrary HTML) can produce a JSON blob exceeding the backend's per-item limit. wp_cache_set() then silently fails, get_transient() returns false, and the admin-menu/dashboard counters fall back to 0/0/0 even though valid counts existed. No size cap or truncation is applied.
3. Race between scheduled cron and the AJAX counts update can discard freshly collected results — src/wp-admin/includes/ajax-actions.php:5488
The AJAX handler does a non-atomic read-modify-write: it reads the cached transient, splices in the POSTed counts, and re-writes. If wp_cron_scheduled_check() writes fresh results between the handler's get_transient() and set_transient() (admin loads Site Health while cron runs), the handler overwrites with the older (or absent) results it read earlier — silently dropping the detailed results the ticket exists to preserve, until the next weekly cron.
4. New status guard miscounts unknown/empty statuses and diverges from results count — class-wp-site-health.php:3471
The guard if ( ! is_array( $result ) || ! isset( $result['status'] ) ) continue; only checks that status is set. A result with status => '' or a custom value like 'info' still falls through to the final else and increments good, mislabeling it as a passing check. Separately, a result missing status was previously counted as good (old loop had no guard) but is now skipped entirely — so the cached results array length (e.g. 3) can no longer equal good + recommended + critical, which will surprise any consumer that assumes they match.
5. timestamp freshness mechanism is half-built — class-wp-site-health.php:3490
The comment says the timestamp lets freshness "be evaluated," but: the transient is set with no expiry, nothing in core ever compares the timestamp to now, and the AJAX handler preserves the old cron timestamp on every counts update (ajax-actions.php:5495-5497). So the stored timestamp reflects the last full cron run, not the last update — any future freshness check built on it would be subtly wrong, and the promised mechanism doesn't actually exist yet.
6. Payload schema duplicated across writers/readers with no canonical accessor — src/wp-admin/includes/ajax-actions.php:5477
The count-normalization array( 'good' => (int) ( $x['good'] ?? 0 ), 'recommended' => …, 'critical' => … ) is copied verbatim into ajax-actions.php:5477-5481 and class-wp-site-health.php:126-130, and the get_transient() + json_decode( …, true ) idiom is duplicated too. The transient's shape is now defined by convention across five sites (cron writer, ajax merge, enqueue reader, menu.php, dashboard.php). Adding a status bucket or renaming a key requires editing several hand-maintained literals; missing one yields counts that silently disagree between the menu counter, dashboard widget, and Site Health screen. A single read/write/normalize helper would centralize it.
7. Enqueue decodes the entire results blob just to read three integers — class-wp-site-health.php:117
On every Site Health and Dashboard page render, json_decode() parses the full cached payload (all HTML results) only to extract good/recommended/critical — the comment itself notes "only the aggregate counts are needed." The decode cost scales with the size/number of test results, for no benefit. (Mitigated if findings 1 & 2 are addressed by not storing results in the same hot transient.)
8. AJAX handler re-encodes the full results blob on every counts POST — src/wp-admin/includes/ajax-actions.php:5501
Each time an admin loads the Site Health Status tab, the JS posts the counts and this handler reads, decodes, and re-wp_json_encode()s the entire results payload back into the transient — repeated large serialization writes per page view, not just per weekly cron run (write amplification).
Note on a dropped candidate: finders flagged wp_json_encode( $site_status ) returning false (and storing false) on invalid UTF-8 in result HTML. I refuted it — wp_json_encode() runs _wp_json_sanity_check()/_wp_json_convert_string() to repair invalid UTF-8 and re-encode (functions.php:4411-4417), so the common case recovers; only a >512-depth structure would fail, which Site Health results never reach.
Findings 1–4 are the ones I'd block on; finding 1 in particular looks like a meaningful site-wide performance regression worth raising on the PR. Want me to post these as inline PR comments (/code-review --comment) or open a discussion on the specific lines?
|
Thanks for the review! Reworked this substantially in 5bb9fee and updated the PR description. Key change: counts stay in the small autoloaded transient, full results move to a separate non-autoloaded detail cache — which also clears the race and write-amplification points. The Site Health screen is now authoritative for the detailed results (incl. async tests) with cron as a backstop, plus per-result timestamps. Two design calls I'd value your take on before going further: the merge policy (1-week grace window vs. browser strictly outranking cron) and whether to cap the detail-cache size. Both noted in the description. Disclosure: I used Claude Code and Codex to develop and validate these changes (reviewed and verified by me). 🙏 |
5bb9fee to
70b0566
Compare
| /* | ||
| * When not authoritative, keep a recent existing entry rather than overwriting | ||
| * it. This preserves results collected from the Site Health screen (including | ||
| * the asynchronous tests) when the scheduled check runs afterwards. | ||
| */ | ||
| if ( ! $authoritative | ||
| && isset( $stored[ $test ]['timestamp'] ) | ||
| && ( $now - (int) $stored[ $test ]['timestamp'] ) < WEEK_IN_SECONDS | ||
| ) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
I'm wondering if we need this. Wouldn't it always be better to store whatever is the most fresh?
There was a problem hiding this comment.
Good question. We could store the freshest result, with one caveat.
Cron runs the async tests over loopback HTTP, which fails on some hosts. When that happens, cron falls back to a recommended result labeled "A test is unavailable". If cron always overwrote, that fallback would replace an accurate result the browser just collected.
So if we want cron to override recent browser results, we first need to check whether cron could actually verify the test. That part is detectable, since the fallback is built in one place when the loopback request fails. Then cron would overwrite only when it has a real result, and skip the update when it could not verify the test.
Related to this, my suggestion to reduce the number of stored keys to just |
…nsient. Previously the `health-check-site-status-result` transient stored only the aggregate `good`, `recommended`, and `critical` counts, so any consumer that needed the underlying results had to re-run the Site Health tests synchronously. The scheduled check now caches the complete results array and the time they were collected alongside the counts. The Site Health screen AJAX handler validates the submitted counts and preserves the cached results and timestamp while refreshing the counts, and `enqueue_scripts()` only localizes the counts to avoid embedding the full results in every Site Health and Dashboard page. This provides a reusable cached source of detailed Site Health data for the dashboard, the admin menu, and future REST/Abilities consumers without triggering synchronous tests. Adds unit tests covering the scheduled-check caching and the AJAX result handler. See #65232. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…|string`, but `mixed` given.
…ansient. Following review feedback, store only the aggregate counts in the autoloaded `health-check-site-status-result` transient and cache the full per-test results separately in `health-check-site-status-detail`, which has an expiration so the larger payload is not autoloaded on every request. The Site Health screen submits the full results — including the asynchronous tests that require JavaScript to run — and they are sanitized and cached as the authoritative detailed results. The scheduled check refreshes only missing or stale entries so it does not discard fresher results collected from the screen. Each result carries its own timestamp, and entries that have not been refreshed within a month are dropped. Adds WP_Site_Health::get_site_status_counts(), get_site_status_detail(), and merge_site_status_detail() as the canonical accessors, and updates the unit and Ajax tests accordingly. See #65232. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
70b0566 to
3ca1452
Compare
Co-authored-by: Weston Ruter <westonruter@gmail.com>
Address review feedback on the detailed results cache. - Describe the cached result shape instead of `array<string, mixed>`, so consumers and static analysis know each field. - Drop the dead `status` guard in the counts helper. It only runs on already sanitized results, so the status key is always set. - Rename `$authoritative` to `$includes_async`. The Site Health screen takes precedence because it carries the JavaScript-only tests, not because its results are inherently more trustworthy. - Add the test name back when reading the cache. It is stored as the array key to avoid duplication, so consumers still get it on read. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Initialize `site_status.results` in the localized data, next to `direct`, `async`, and `issues`. With the array always defined, the client no longer needs to create it on first use, and the upload guard can simply check for collected results. - Add `'results' => array()` to the localized `SiteHealth` data. - Drop the lazy `typeof` init before the first push. - Send the detailed results only when there are some, using `.length`. This also avoids posting an empty array that would clear the cache. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Store just the test name, status, and timestamp for each result. Drop the label and the HTML fields (description, actions, badge). Those strings are translated for the current user's locale, and the cache is a single site-wide option. Storing them could show one user's locale to another. This also keeps the cached option small. Consumers should resolve labels at read time in the current locale. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review! I have applied all the feedback in the latest commits. A quick summary:
One open question remains: the merge policy when the scheduled check runs after a screen visit. Storing the freshest result is simplest, but cron runs the async tests over loopback, which fails on some hosts and falls back to "unavailable". One option is to let cron overwrite only when it could actually verify the test, instead of the current one-week grace window. Happy to go either way. I also revised the PR description so it reflects the current state of the code. |
Pull the two repeated patterns in the status cache helpers into small private methods. - `read_status_cache()` reads a transient and decodes it to an array or null, replacing the get_transient plus json_decode pair in three readers. - `normalize_status_counts()` returns the good, recommended, and critical counts as non-negative integers, replacing the same shaping copied across the counts getter and setter. Behavior is unchanged. Each cache shape now lives in one place. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trac ticket: https://core.trac.wordpress.org/ticket/65232
What & why
Trac #65232 wants a fast, cache-only Site Health summary that never runs the synchronous tests. Today the
health-check-site-status-resulttransient stores only the aggregategood/recommended/criticalcounts, so any consumer that needs per-test results has to re-run the tests.This PR adds a second cache that records, per test, its status and a timestamp, without re-running anything. The Abilities
site_healthfield is a follow-up that reads this cache and resolves human-readable labels at read time, in the current locale.Design
Two caches, with separate jobs:
health-check-site-status-result— unchanged shape: only{ good, recommended, critical }. Small, still autoloaded, backward-compatible for the admin-menu counter and Dashboard widget.health-check-site-status-detail(new) — per-test entries keyed by test name, each with astatusand its owntimestamp. Stored with an expiration so it is not autoloaded, since only the counts are needed on the hot path.get_site_status_detail()also returnscountsderived from those same entries, so a consumer reads one internally consistent view.Only locale-independent data is cached
Each entry stores just the test name, status, and timestamp. The label, description, actions, and badge are left out on purpose. Those strings are translated for the current user's locale, and this cache is a single site-wide option, so storing them could show one user's locale to another. It also keeps the option small. Consumers resolve labels at read time in the current locale.
Who writes what
Detail cache shape
{ "results": { "<test>": { "status": "good|recommended|critical", "timestamp": 1715714399 } }, "counts": { "good": 12, "recommended": 3, "critical": 0 }, "timestamp": 1715714399 }get_site_status_detail()returnsresultsas a list and re-injects the test name from the key.Canonical accessors (on
WP_Site_Health)get_site_status_counts()/set_site_status_counts()— the legacy UI counts (the small autoloaded transient).get_site_status_detail()—{ results, counts, timestamp }, internally consistent.update_site_status_detail( $results, $includes_async )— sanitizes and merges into the detail cache.Backward compatibility
The counts transient keeps its existing shape, so
menu.php, the Dashboard widget, and the localizedSiteHealthdata are unchanged. NoWP_Site_Healthload is forced on the admin-menu path.Open question for reviewers
Merge policy. When the scheduled check runs after a screen visit, should it overwrite the browser-collected results? Storing the freshest result is simplest, but cron runs the async tests over loopback HTTP, which fails on some hosts and falls back to "unavailable". One option under discussion: let cron overwrite only when it could actually verify the test, instead of the current one-week grace window.
Testing
tests/phpunit/tests/admin/wpSiteHealth.php— the scheduled-check counts vs. detail split, fresh-preserve / stale-refresh / stale-drop behavior, derived detail counts, that only locale-independent fields are stored, and the unavailable-async fallback cached under its test identifier.tests/phpunit/tests/ajax/wpAjaxHealthCheckSiteStatusResult.php— independent counts-only and results-only requests, that translated and HTML fields are dropped, and capability enforcement.phpcs, PHPStan, and the full
site-healthandajaxtest groups pass locally.Use of AI Tools
AI assistance: Yes
Tool(s): Claude Code (Claude Opus 4.8), Codex
Used for: Development and validation; reviewed and verified by me.
🤖 Generated with Claude Code