trsh: drop checkfile on SCF/Unconverged/BasisSet failures#880

Open
calvinp0 wants to merge 3 commits into main from trsh/drop-chk-on-bad-wavefunction

calvinp0 (Member) commented Apr 26, 2026

Summary

Make ARC's Gaussian checkpoint handling robust to wavefunction-quality failures.

trsh_keyword_checkfile only fired on the CheckFile keyword, so a job that died with unconverged SCF (L502) had its partial check.chk inherited by the rerun. The next job's Gaussian adapter then auto-emitted guess=read against MOs from a non-converged iteration, typically failing with L401 ("Basis set data is not on the checkpoint file") and wasting a full Gaussian invocation before self-healing back to guess=mix.

This PR widens the chk-drop guard, fixes the misleading log line that resulted, and adds a missing detector for L601 chk/rwf collisions.

Observed in the wild

Paper_ROTORS_TS/rmg_rxn_1142 (Zeus): opt_a6355 → SCF unconverged → opt_a6358 ran with guess=read against the partial chk → L401 → opt_a6359 finally fell back to guess=mix after the existing CheckFile-keyword self-heal. After this PR, opt_a6358 goes straight to guess=mix with scf=(qc) nosymm, skipping the L401 detour.

Commits in this PR

1. trsh: drop checkfile on SCF/Unconverged/BasisSet failures

Extend trsh_keyword_checkfile to drop the chk on four wavefunction-quality keywords:

| Keyword | Source | Why drop chk |
|---|---|---|
| CheckFile | L301/L401 chk-data missing | Existing: chk is structurally broken |
| SCF | L502 unconverged SCF | New: MOs are mid-iteration garbage |
| Unconverged | L508 / GL9999 generic | New: wavefunction quality is the issue |
| BasisSet | L401 basis projection failed | New: old MOs incompatible with new basis |

Non-wavefunction failures (MaxOptCycles, InternalCoordinateError, DiskSpace, OptOrientation, etc.) remain unaffected, so the warm-start cycle savings on those paths are preserved. level_of_theory is read-only inside trsh.py, so there's no method/charge/multiplicity-change path during trsh that needs guarding.
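
The widened guard can be sketched as a set-membership check. This is a minimal illustration rather than ARC's actual code: `CHK_DROP_KEYWORDS` and `should_drop_checkfile` are hypothetical names, and only the keyword strings are taken from the table above.

```python
from typing import Iterable

# Keywords after which the inherited chk should not be reused (from the table above).
# The constant and function names are hypothetical, for illustration only.
CHK_DROP_KEYWORDS = frozenset({'CheckFile', 'SCF', 'Unconverged', 'BasisSet'})


def should_drop_checkfile(keywords: Iterable[str]) -> bool:
    """Return True if any reported keyword marks the wavefunction in the chk as unusable."""
    return any(keyword in CHK_DROP_KEYWORDS for keyword in keywords)


print(should_drop_checkfile(['SCF']))           # True: wavefunction failure, drop the chk
print(should_drop_checkfile(['MaxOptCycles']))  # False: keep the warm start
```

Keeping the drop-keywords in one constant makes the keep-keyword behavior explicit: anything not listed preserves the warm start.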

2. trsh: branch the chk-drop log message by triggering keyword

After widening the guard, the hardcoded log line "that failed with 'Basis set data is not on the checkpoint file' by removing the checkfile." falsely claimed a chk-data failure for unrelated wavefunction failures. Pick the explanatory phrase from job_status['keywords'] so the log reflects the actual failure mode (chk-read error, basis projection, unconverged SCF, or generic unconverged wavefunction).
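
One way to branch the message is a small priority-ordered lookup. The sketch below is an assumption about the shape of such code, not the PR's implementation: `CHK_DROP_REASONS`, `chk_drop_reason`, and every phrase are hypothetical, and the real log wording may differ.

```python
from typing import Sequence

# Hypothetical phrases; the actual wording in ARC's log may differ.
# Tuple order sets the priority when several keywords are present.
CHK_DROP_REASONS = (
    ('CheckFile', 'could not read the checkpoint file'),
    ('BasisSet', 'failed to project the basis set from the checkpoint file'),
    ('SCF', 'did not converge the SCF'),
    ('Unconverged', 'ended with an unconverged wavefunction'),
)


def chk_drop_reason(keywords: Sequence[str]) -> str:
    """Pick the first matching explanation, with a generic fallback."""
    for keyword, reason in CHK_DROP_REASONS:
        if keyword in keywords:
            return reason
    return 'left an unusable checkpoint file'


job_status = {'keywords': ['SCF']}
keywords = job_status.get('keywords', []) or []  # tolerate a missing or None entry
print(f'Troubleshooting a job that {chk_drop_reason(keywords)} by removing the checkfile.')
```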

3. trsh: detect Gaussian L601 (RdWrB1) chk/rwf collision

Concurrent Gaussian jobs sharing a checkpoint or read-write file path can trigger L601 with an RdWrB1 garbage-pointer error. ARC had no l601.exe handler in determine_ess_status, so these failures fell through to the generic Unknown bucket and bypassed the chk-drop logic even though the chk is exactly what's at fault. This commit routes L601 to the CheckFile keyword so the existing plumbing clears the chk before the rerun.
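
The detection idea can be sketched as a scan for the failing link in Gaussian's error-termination line. This is a simplified stand-in for the logic in determine_ess_status, not a copy of it: `keywords_for_gaussian_error` is a hypothetical name, and the sample line uses a made-up install path and timestamp.

```python
from typing import List


def keywords_for_gaussian_error(output_lines: List[str]) -> List[str]:
    """Map a Gaussian error-termination line to troubleshooting keywords (L601 only)."""
    # Scan from the end, since the termination line is near the bottom of the output.
    for line in reversed(output_lines):
        if 'Error termination' in line and 'l601.exe' in line:
            # Treat a chk/rwf collision as a checkfile problem so the chk is dropped.
            return ['CheckFile', 'GL601']
    return ['Unknown']


# Illustrative line only; the path and timestamp are invented for this sketch.
sample = [' Error termination via Lnk1e in /opt/g16/l601.exe at Sun Apr 26 12:00:00 2026.']
print(keywords_for_gaussian_error(sample))  # ['CheckFile', 'GL601']
```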

Test plan

  • pytest arc/job/trsh_test.py passes (9 tests, including new regressions)
  • New test_trsh_keyword_checkfile_drops_on_bad_wavefunction covers a fresh ess_trsh_methods list per drop-keyword (the existing test_trsh_ess_job masked the bug because sub-cases mutated a shared list) and pins keep-keyword behavior so the warm-start path can't regress silently
  • test_determine_ess_status extended with a new l601.out fixture verifying L601 routes to ['CheckFile', 'GL601']
  • Re-run a known SCF-failure case end-to-end on Zeus to confirm the trsh chain skips the L401 detour
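
The toy example below shows one way a shared ess_trsh_methods list can hide per-keyword behavior. The exact masking mechanism in test_trsh_ess_job may differ. `toy_trsh_checkfile` is a hypothetical stand-in, and the 'checkfile=None' marker is assumed here for illustration.

```python
from typing import List, Tuple

DROP_KEYWORDS = {'CheckFile', 'SCF', 'Unconverged', 'BasisSet'}


def toy_trsh_checkfile(keywords: List[str], ess_trsh_methods: List[str]) -> Tuple[List[str], bool]:
    """Toy stand-in: drop the chk once per job, recording the attempt in ess_trsh_methods."""
    if DROP_KEYWORDS.intersection(keywords) and 'checkfile=None' not in ess_trsh_methods:
        ess_trsh_methods.append('checkfile=None')  # mutates the caller's list
        return ess_trsh_methods, True
    return ess_trsh_methods, False


# Shared list: state from the first sub-case leaks into the second.
shared: List[str] = []
shared_results = [toy_trsh_checkfile([kw], shared)[1] for kw in ('CheckFile', 'SCF')]

# Fresh list per keyword: every sub-case is exercised independently.
fresh_results = [toy_trsh_checkfile([kw], [])[1] for kw in ('CheckFile', 'SCF')]

print(shared_results)  # [True, False]
print(fresh_results)   # [True, True]
```

With the shared list, the 'SCF' result depends on what the 'CheckFile' sub-case already recorded, so an assertion on it says little about the 'SCF' path itself.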

Out of scope (tracking separately)

Related cleanups surfaced while reading wang/leng Gaussian-error references but kept out of this PR to preserve scope:

  • Possible no_xqc (L508 keyword) vs no_qc (ess_trsh_method) wiring mismatch in trsh_keyword_scf — would let ARC de-escalate scf=(qc) when L508 says QC was the problem
  • stable=opt for TS first-opt to catch saddle-point SCF convergences (Class 4 wrong-solution)
  • Energy-trajectory parsing to distinguish SCF oscillation vs drift vs divergence (lets us keep the chk on Class 2 drift)
  • Scheduler-side guard preventing failed jobs from promoting their chk to species.checkfile in the first place

Copilot AI review requested due to automatic review settings April 26, 2026 13:25
@github-actions github-actions Bot added the Module: trsh Troubleshooting label Apr 26, 2026
Copilot AI left a comment

Pull request overview

Updates Gaussian ESS troubleshooting so ARC discards an inherited checkpoint file not only on CheckFile errors, but also on wavefunction-quality failures (SCF, Unconverged) and basis projection failures (BasisSet), preventing reruns from auto-triggering guess=read on unusable MOs.

Changes:

  • Extend trsh_keyword_checkfile() to drop the chk for SCF, Unconverged, and BasisSet keywords (in addition to CheckFile).
  • Add a regression test ensuring chk-dropping occurs for wavefunction/basis-related failures and does not occur for unrelated failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| arc/job/trsh.py | Expand chk-dropping guard to cover additional Gaussian failure keywords associated with unusable wavefunctions / basis projection. |
| arc/job/trsh_test.py | Add targeted regression test to validate chk-dropping behavior per keyword and prevent shared-list masking. |

codecov Bot commented Apr 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.40%. Comparing base (0f5d3ad) to head (01cd60e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #880      +/-   ##
==========================================
- Coverage   60.44%   60.40%   -0.04%     
==========================================
  Files         102      102              
  Lines       31096    31111      +15     
  Branches     8103     8108       +5     
==========================================
- Hits        18796    18794       -2     
- Misses       9957     9971      +14     
- Partials     2343     2346       +3     

| Flag | Coverage Δ |
|---|---|
| functionaltests | 60.40% <ø> (-0.04%) ⬇️ |
| unittests | 60.40% <ø> (-0.04%) ⬇️ |

Copilot AI left a comment

Pull request overview

This PR hardens ARC’s Gaussian troubleshooting flow by ensuring corrupted/unsafe checkpoint files are not reused after wavefunction-quality failures, and by adding detection for Gaussian L601 chk/rwf collisions so reruns can self-heal without wasting extra invocations.

Changes:

  • Expand trsh_keyword_checkfile to drop the checkfile on additional wavefunction-quality failure keywords (SCF/Unconverged/BasisSet).
  • Improve troubleshooting log messages to reflect the triggering keyword when dropping the checkpoint.
  • Add Gaussian L601 (l601.exe) detection and a new fixture/test to verify correct keyword routing.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| arc/job/trsh.py | Adds L601 detection, widens chk-drop conditions, and refines chk-drop logging. |
| arc/job/trsh_test.py | Adds regression coverage for chk-drop behavior and L601 routing. |
| arc/testing/trsh/gaussian/l601.out | New Gaussian output fixture for L601 chk/rwf collision detection. |

Comment thread arc/job/trsh.py
    if remove_checkfile:
        logger_info.append('that failed with "Basis set data is not on the checkpoint file" by removing the checkfile.')
        chk_drop_keywords = job_status.get('keywords', []) or []
        if 'CheckFile' in chk_drop_keywords:
Comment thread arc/job/trsh.py
Comment on lines +1776 to +1779
Drops the checkfile when the prior job either could not read it
('CheckFile' from L301/L401), produced a non-converged wavefunction
('SCF' from L502, 'Unconverged' from L508), or reported a failed
basis projection ('BasisSet' from L401). Reusing MOs from a

1. trsh: drop checkfile on SCF/Unconverged/BasisSet failures

Previously trsh_keyword_checkfile only fired on the 'CheckFile' keyword,
so a job that died with unconverged SCF (L502) had its partial check.chk
inherited by the rerun. The next job's gaussian adapter then auto-emitted
guess=read against MOs from a non-converged iteration, typically failing
with L401 ("Basis set data is not on the checkpoint file") and wasting a
full Gaussian invocation before self-healing back to guess=mix.

Extend the guard to also drop the chk on 'SCF', 'Unconverged', and
'BasisSet' keywords. MaxOptCycles, InternalCoordinateError, DiskSpace,
OptOrientation and other non-wavefunction failures remain unaffected, so
the warm-start cycle savings on those paths are preserved.

Adds a regression test that exercises a fresh ess_trsh_methods list for
each drop-keyword (the existing test_trsh_ess_job masked the bug because
its sub-cases mutated a shared list) and pins the keep-keyword behavior
so the warm-start path can't regress silently.

2. trsh: branch the chk-drop log message by triggering keyword

After widening trsh_keyword_checkfile to fire on SCF/Unconverged/BasisSet
in addition to CheckFile, the hardcoded log line "that failed with
'Basis set data is not on the checkpoint file' by removing the checkfile."
falsely claimed a chk-data failure for unrelated wavefunction failures.

Pick the explanatory phrase from job_status['keywords'] so the log
reflects the actual failure mode (chk-read error, basis projection,
unconverged SCF, or generic unconverged wavefunction).

3. trsh: detect Gaussian L601 (RdWrB1) chk/rwf collision

Concurrent Gaussian jobs sharing a checkpoint or read-write file path
can trigger L601 with an RdWrB1 garbage-pointer error. ARC currently
has no l601.exe handler in determine_ess_status, so these failures fall
through to the generic 'Unknown' bucket and bypass the chk-drop logic
even though the chk is exactly what's at fault.

Route L601 to the 'CheckFile' keyword so the existing chk-drop plumbing
clears the chk before the rerun, and add a small fixture + regression
test covering the new detection path.

calvinp0 force-pushed the trsh/drop-chk-on-bad-wavefunction branch from 01cd60e to 582c012 on April 28, 2026 11:14