trsh: drop checkfile on SCF/Unconverged/BasisSet failures #880
Pull request overview
Updates Gaussian ESS troubleshooting so ARC discards an inherited checkpoint file not only on CheckFile errors, but also on wavefunction-quality failures (SCF, Unconverged) and basis projection failures (BasisSet), preventing reruns from auto-triggering guess=read on unusable MOs.
Changes:
- Extend `trsh_keyword_checkfile()` to drop the chk for `SCF`, `Unconverged`, and `BasisSet` keywords (in addition to `CheckFile`).
- Add a regression test ensuring chk-dropping occurs for wavefunction/basis-related failures and does not occur for unrelated failures.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| arc/job/trsh.py | Expand chk-dropping guard to cover additional Gaussian failure keywords associated with unusable wavefunctions / basis projection. |
| arc/job/trsh_test.py | Add targeted regression test to validate chk-dropping behavior per keyword and prevent shared-list masking. |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #880 +/- ##
==========================================
- Coverage 60.44% 60.40% -0.04%
==========================================
Files 102 102
Lines 31096 31111 +15
Branches 8103 8108 +5
==========================================
- Hits 18796 18794 -2
- Misses 9957 9971 +14
- Partials 2343 2346 +3
Pull request overview
This PR hardens ARC’s Gaussian troubleshooting flow by ensuring corrupted/unsafe checkpoint files are not reused after wavefunction-quality failures, and by adding detection for Gaussian L601 chk/rwf collisions so reruns can self-heal without wasting extra invocations.
Changes:
- Expand `trsh_keyword_checkfile` to drop the checkfile on additional wavefunction-quality failure keywords (SCF/Unconverged/BasisSet).
- Improve troubleshooting log messages to reflect the triggering keyword when dropping the checkpoint.
- Add Gaussian L601 (`l601.exe`) detection and a new fixture/test to verify correct keyword routing.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| arc/job/trsh.py | Adds L601 detection, widens chk-drop conditions, and refines chk-drop logging. |
| arc/job/trsh_test.py | Adds regression coverage for chk-drop behavior and L601 routing. |
| arc/testing/trsh/gaussian/l601.out | New Gaussian output fixture for L601 chk/rwf collision detection. |
    if remove_checkfile:
        logger_info.append('that failed with "Basis set data is not on the checkpoint file" by removing the checkfile.')
        chk_drop_keywords = job_status.get('keywords', []) or []
        if 'CheckFile' in chk_drop_keywords:
    Drops the checkfile when the prior job either could not read it
    ('CheckFile' from L301/L401), produced a non-converged wavefunction
    ('SCF' from L502, 'Unconverged' from L508), or reported a failed
    basis projection ('BasisSet' from L401). Reusing MOs from a
Previously trsh_keyword_checkfile only fired on the 'CheckFile' keyword,
so a job that died with unconverged SCF (L502) had its partial check.chk
inherited by the rerun. The next job's gaussian adapter then auto-emitted
guess=read against MOs from a non-converged iteration, typically failing
with L401 ("Basis set data is not on the checkpoint file") and wasting a
full Gaussian invocation before self-healing back to guess=mix.
Extend the guard to also drop the chk on 'SCF', 'Unconverged', and
'BasisSet' keywords. MaxOptCycles, InternalCoordinateError, DiskSpace,
OptOrientation and other non-wavefunction failures remain unaffected, so
the warm-start cycle savings on those paths are preserved.
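
A minimal sketch of the widened guard described above, not the actual `trsh_keyword_checkfile` body. The helper name `should_drop_checkfile` and the constant `CHK_DROP_KEYWORDS` are hypothetical; only the four keyword strings and the `job_status['keywords']` lookup come from this PR.

```python
# Keywords this PR names as grounds for discarding the inherited chk.
CHK_DROP_KEYWORDS = frozenset({'CheckFile', 'SCF', 'Unconverged', 'BasisSet'})


def should_drop_checkfile(job_status: dict) -> bool:
    """Hypothetical helper: True if any status keyword marks the chk as unusable."""
    # Tolerate a missing or None 'keywords' entry.
    keywords = job_status.get('keywords', []) or []
    return any(keyword in CHK_DROP_KEYWORDS for keyword in keywords)


# Wavefunction-quality failure: the chk should be dropped.
drop_on_scf = should_drop_checkfile({'keywords': ['SCF']})
# Non-wavefunction failure: the chk is kept so the rerun can warm-start.
drop_on_max_opt = should_drop_checkfile({'keywords': ['MaxOptCycles']})
```

Keeping the drop set in one place makes the "keep" behavior for everything else the default, which is what preserves the warm-start path.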
Adds a regression test that exercises a fresh ess_trsh_methods list for
each drop-keyword (the existing test_trsh_ess_job masked the bug because
its sub-cases mutated a shared list) and pins the keep-keyword behavior
so the warm-start path can't regress silently.
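
The shared-list masking can be illustrated with a toy guard. This is a sketch under stated assumptions, not ARC's test code: `toy_guard`, `NARROW`, `WIDE`, and the marker string are illustrative stand-ins. It shows how state left behind by an earlier sub-case can make a later assertion pass even when the guard never fired, and how a fresh list per keyword exposes the difference.

```python
from typing import Iterable, List, Tuple

MARKER = 'checkfile=None'  # assumed marker string, for illustration only


def toy_guard(keywords: Iterable[str], methods: List[str],
              drop_on: frozenset) -> Tuple[List[str], bool]:
    """Toy stand-in for a chk-drop guard: record the marker once, report if the chk is dropped."""
    if MARKER not in methods and any(k in drop_on for k in keywords):
        methods.append(MARKER)
        return methods, True
    return methods, False


NARROW = frozenset({'CheckFile'})                                  # guard before the change
WIDE = frozenset({'CheckFile', 'SCF', 'Unconverged', 'BasisSet'})  # guard after the change

# Shared list: the 'CheckFile' sub-case leaves the marker behind, so a
# membership check after the 'SCF' sub-case passes even with the narrow guard.
shared: List[str] = []
toy_guard(['CheckFile'], shared, NARROW)
toy_guard(['SCF'], shared, NARROW)
masked = MARKER in shared

# Fresh list per keyword: every sub-case stands on its own.
fresh_narrow = {k: toy_guard([k], [], NARROW)[1] for k in sorted(WIDE)}
fresh_wide = {k: toy_guard([k], [], WIDE)[1] for k in sorted(WIDE)}

# Keep-keywords must not trigger a drop, even with the wide guard.
kept = {k: toy_guard([k], [], WIDE)[1] for k in ('MaxOptCycles', 'DiskSpace')}
```

With the fresh lists, the narrow guard visibly fails for `SCF`, `Unconverged`, and `BasisSet`, which is the regression the shared list was hiding.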
After widening trsh_keyword_checkfile to fire on SCF/Unconverged/BasisSet in addition to CheckFile, the hardcoded log line "that failed with 'Basis set data is not on the checkpoint file' by removing the checkfile." falsely claimed a chk-data failure for unrelated wavefunction failures. Pick the explanatory phrase from job_status['keywords'] so the log reflects the actual failure mode (chk-read error, basis projection, unconverged SCF, or generic unconverged wavefunction).
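
One way to pick the phrase from `job_status['keywords']` is a small priority table. The sketch below is illustrative: the helper name `chk_drop_reason`, the ordering, and the phrase wording are assumptions, and the text ARC actually logs may differ.

```python
# Ordered from most to least specific; wording is illustrative only.
CHK_DROP_REASONS = (
    ('CheckFile', 'could not read the checkpoint file'),
    ('BasisSet', 'reported a failed basis projection'),
    ('SCF', 'did not converge the SCF'),
    ('Unconverged', 'left an unconverged wavefunction'),
)


def chk_drop_reason(job_status: dict) -> str:
    """Hypothetical helper: return the phrase for the first matching keyword."""
    keywords = job_status.get('keywords', []) or []
    for keyword, phrase in CHK_DROP_REASONS:
        if keyword in keywords:
            return phrase
    return 'failed for an unspecified reason'


message = f"Troubleshooting a job that {chk_drop_reason({'keywords': ['SCF']})} by removing the checkfile."
```

A tuple of pairs rather than a dict keeps the priority explicit when a job reports more than one of these keywords.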
Concurrent Gaussian jobs sharing a checkpoint or read-write file path can trigger L601 with an RdWrB1 garbage-pointer error. ARC currently has no l601.exe handler in determine_ess_status, so these failures fall through to the generic 'Unknown' bucket and bypass the chk-drop logic even though the chk is exactly what's at fault. Route L601 to the 'CheckFile' keyword so the existing chk-drop plumbing clears the chk before the rerun, and add a small fixture + regression test covering the new detection path.
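
The routing can be sketched as a scan for the failing link in the error-termination line. This is a simplified stand-in, not the `determine_ess_status` implementation: the function name `detect_l601` and the sample output lines (including the install path) are hypothetical, while the returned `['CheckFile', 'GL601']` keywords match what the new fixture is described as verifying.

```python
from typing import Iterable, List


def detect_l601(output_lines: Iterable[str]) -> List[str]:
    """Hypothetical scan: route an L601 error termination to the 'CheckFile' keyword."""
    for line in output_lines:
        if 'Error termination' in line and 'l601.exe' in line:
            return ['CheckFile', 'GL601']
    return []


# Illustrative lines only; the path and timestamp are made up.
failed_output = [
    ' Error termination via Lnk1e in /opt/g16/l601.exe at Mon Jan  1 00:00:00 2024.',
]
clean_output = [' Normal termination of Gaussian 16.']

failed_keywords = detect_l601(failed_output)
clean_keywords = detect_l601(clean_output)
```

Because the result includes `CheckFile`, a guard keyed on that keyword clears the chk before the rerun without any L601-specific handling downstream.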
01cd60e to
582c012
Summary
Make ARC's Gaussian checkpoint handling robust to wavefunction-quality failures.
`trsh_keyword_checkfile` only fired on the `CheckFile` keyword, so a job that died with unconverged SCF (L502) had its partial `check.chk` inherited by the rerun. The next job's Gaussian adapter then auto-emitted `guess=read` against MOs from a non-converged iteration, typically failing with L401 ("Basis set data is not on the checkpoint file") and wasting a full Gaussian invocation before self-healing back to `guess=mix`.

This PR widens the chk-drop guard, fixes the misleading log line that resulted, and adds a missing detector for L601 chk/rwf collisions.
Observed in the wild
`Paper_ROTORS_TS/rmg_rxn_1142` (Zeus): opt_a6355 → SCF unconverged → opt_a6358 ran with `guess=read` against the partial chk → L401 → opt_a6359 finally fell back to `guess=mix` after the existing `CheckFile`-keyword self-heal. After this PR, opt_a6358 goes straight to `guess=mix` with `scf=(qc) nosymm`, skipping the L401 detour.

Commits in this PR

1. `trsh: drop checkfile on SCF/Unconverged/BasisSet failures`

   Extend `trsh_keyword_checkfile` to drop the chk on four wavefunction-quality keywords:

   - `CheckFile`
   - `SCF`
   - `Unconverged`
   - `BasisSet`

   Non-wavefunction failures (`MaxOptCycles`, `InternalCoordinateError`, `DiskSpace`, `OptOrientation`, etc.) remain unaffected, so the warm-start cycle savings on those paths are preserved. `level_of_theory` is read-only inside `trsh.py`, so there's no method/charge/multiplicity-change path during trsh that needs guarding.

2. `trsh: branch the chk-drop log message by triggering keyword`

   After widening the guard, the hardcoded log line "that failed with 'Basis set data is not on the checkpoint file' by removing the checkfile." falsely claimed a chk-data failure for unrelated wavefunction failures. Pick the explanatory phrase from `job_status['keywords']` so the log reflects the actual failure mode (chk-read error, basis projection, unconverged SCF, or generic unconverged wavefunction).

3. `trsh: detect Gaussian L601 (RdWrB1) chk/rwf collision`

   Concurrent Gaussian jobs sharing a checkpoint or read-write file path can trigger L601 with an RdWrB1 garbage-pointer error. ARC had no `l601.exe` handler in `determine_ess_status`, so these failures fell through to the generic `Unknown` bucket and bypassed the chk-drop logic even though the chk is exactly what's at fault. Routes L601 to the `CheckFile` keyword so the existing plumbing clears the chk before the rerun.

Test plan

- `pytest arc/job/trsh_test.py` passes (9 tests, including new regressions)
- `test_trsh_keyword_checkfile_drops_on_bad_wavefunction` covers a fresh `ess_trsh_methods` list per drop-keyword (the existing `test_trsh_ess_job` masked the bug because sub-cases mutated a shared list) and pins keep-keyword behavior so the warm-start path can't regress silently
- `test_determine_ess_status` extended with a new `l601.out` fixture verifying L601 routes to `['CheckFile', 'GL601']`

Out of scope (tracking separately)

Related cleanups surfaced while reading wang/leng Gaussian-error references but kept out of this PR to preserve scope:

- `no_xqc` (L508 keyword) vs `no_qc` (ess_trsh_method) wiring mismatch in `trsh_keyword_scf` (would let ARC de-escalate `scf=(qc)` when L508 says QC was the problem)
- `stable=opt` for TS first-opt to catch saddle-point SCF convergences (Class 4 wrong-solution)
- `species.checkfile` in the first place