ci: remove nick-fields/retry wrapper and add shared sinfo-based GPU partition selection #1299
Merged
+419 −476
Commits (29)
9956b4f ci: replace nick-fields/retry with plain run step; deprioritize v100 …
2fb93db ci: remove unused RETRY_VALIDATE_CMD from retry_build
24185f8 ci: shared sinfo-based GPU partition selection for tests and benchmarks
946161d bench: update Phoenix tmpbuild path to project storage
2384287 bench: require 2 idle/mix nodes for parallel benchmark GPU partition …
60bcfaa ci: restore RETRY_VALIDATE_CMD support in retry_build
a608a9f ci: exclude dead GPU node atl1-1-03-002-29-0 (cuInit error 999)
72c9c86 ci: unify job submission, test, and bench scripts across clusters
49d3a7b ci: remove dead rm -rf build, add gpu_partition default
8825dff ci: use strict shell mode in common/test.sh and common/bench.sh
c68d6d6 ci: validate existing Phoenix build against node ISA before reuse
7f70c2e ci: always nuke build/ on Phoenix to avoid stale ISA mismatches
a3c37ce ci: nuke stale Phoenix bench builds; trap TMPDIR cleanup on exit
5efa827 ci: widen TMPDIR random range to reduce collision risk
b7e8e75 ci: remove no-op -j flag from mfc.sh bench invocations
5a46a3e ci: add space after SBATCH -o for consistency with SLURM docs
61b84cf ci: fix bench master job failing to find common/bench.sh
df60f82 ci: drop gpu-rtx6000 from partition list (too slow for test time limit) (sbryngelson)
1c4bd73 ci: deprioritize gpu-v100 to last in partition selection (sbryngelson)
b51f214 ci: pass build_opts (GPU interface flag) to live test command (sbryngelson)
698bd2e ci: fix wait + set -e race that orphans parallel jobs (sbryngelson)
1d4d79f ci: fix stale comments from incremental refactoring (sbryngelson)
6cba935 ci: fix duplicate --gpu flag on Frontier GPU test commands (sbryngelson)
4915884 ci: remove stale .out file before SLURM submission (sbryngelson)
e1e0f42 ci: cap case-optimization build jobs at 8 to match prebuild (sbryngelson)
97ecc23 ci: submit case-opt prebuild to CPU partition (no GPU needed for comp… (sbryngelson)
1793498 ci: add CPU architecture to build cache key to prevent SIGILL (sbryngelson)
dbe6a5e ci: remove build cache from GitHub runners (prevents SIGILL, negligib… (sbryngelson)
fde5fc4 ci: switch Frontier to service partition and develop QOS (sbryngelson)
This file was deleted.
.github/scripts/select-gpu-partition.sh (new file)

```bash
#!/bin/bash
# Select the best available Phoenix GPU partition using sinfo.
# Sources into caller: exports SELECTED_GPU_PARTITION.
#
# Priority order prefers partitions most likely to have availability.
# V100 is last due to slower performance near the test time limit.
# Falls back to gpu-l40s if no partition meets the idle node threshold.
# RTX 6000 nodes are excluded (too slow for the test suite time limit).
#
# Optional: set GPU_PARTITION_MIN_NODES before sourcing to require a minimum
# number of idle/mix nodes (e.g. GPU_PARTITION_MIN_NODES=2 for parallel bench jobs).
#
# Usage: source .github/scripts/select-gpu-partition.sh

_GPU_PARTITION_PRIORITY="gpu-l40s gpu-h200 gpu-h100 gpu-a100 gpu-v100"
_GPU_PARTITION_FALLBACK="gpu-l40s"
_GPU_PARTITION_MIN_NODES="${GPU_PARTITION_MIN_NODES:-1}"

SELECTED_GPU_PARTITION=""
for _part in $_GPU_PARTITION_PRIORITY; do
    _idle=$(sinfo -p "$_part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)
    if [ "${_idle:-0}" -ge "$_GPU_PARTITION_MIN_NODES" ]; then
        SELECTED_GPU_PARTITION="$_part"
        echo "Selected GPU partition: $SELECTED_GPU_PARTITION ($_idle idle/mix nodes)"
        break
    fi
done

if [ -z "$SELECTED_GPU_PARTITION" ]; then
    echo "WARNING: No idle GPU partition found; falling back to $_GPU_PARTITION_FALLBACK (may queue)"
    SELECTED_GPU_PARTITION="$_GPU_PARTITION_FALLBACK"
fi

export SELECTED_GPU_PARTITION
unset _GPU_PARTITION_PRIORITY _GPU_PARTITION_FALLBACK _GPU_PARTITION_MIN_NODES _part _idle
```
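The selection loop can be exercised without a live SLURM cluster by shadowing `sinfo` with a shell function; a minimal sketch, where the partition names, their states, and the priority list are invented test data rather than real cluster output:

```shell
# Stub sinfo so the selection logic runs off-cluster. When invoked as
# "sinfo -p <partition> --noheader -o %t", $2 is the partition name.
sinfo() {
    case "$2" in
        gpu-l40s) printf 'alloc\nalloc\n' ;;  # fully allocated, no capacity
        gpu-h200) printf 'idle\nmix\n'    ;;  # two available nodes
        *)        printf 'down\n'         ;;  # everything else is down
    esac
}

priority="gpu-l40s gpu-h200 gpu-h100"
min_nodes="${GPU_PARTITION_MIN_NODES:-1}"

# Same shape as the loop in select-gpu-partition.sh: count idle/mix
# nodes per partition and take the first one meeting the threshold.
SELECTED_GPU_PARTITION=""
for part in $priority; do
    idle=$(sinfo -p "$part" --noheader -o "%t" 2>/dev/null | grep -cE "^(idle|mix)" || true)
    if [ "${idle:-0}" -ge "$min_nodes" ]; then
        SELECTED_GPU_PARTITION="$part"
        break
    fi
done
echo "$SELECTED_GPU_PARTITION"   # gpu-l40s is busy, so gpu-h200 wins
```

Note that `grep -c` prints `0` and exits nonzero when nothing matches, so the `|| true` keeps the count capture from tripping a `set -e` caller.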
submit-slurm-job.sh (new file)

```bash
#!/bin/bash
# Unified SLURM job submission and monitoring for all clusters.
# Submits a script as a SLURM batch job, then monitors it until completion.
# Rerun-safe: cancels stale jobs from previous runs before resubmission.
#
# Usage: submit-slurm-job.sh <script.sh> <cpu|gpu> <none|acc|omp> <cluster> [shard]

set -euo pipefail

# Ignore SIGHUP to survive login node session drops
trap '' HUP

usage() {
    echo "Usage: $0 <script.sh> <cpu|gpu> <none|acc|omp> <cluster> [shard]"
}

script_path="${1:-}"
device="${2:-}"
interface="${3:-}"
cluster="${4:-}"
shard="${5:-}"

if [ -z "$script_path" ] || [ -z "$device" ] || [ -z "$interface" ] || [ -z "$cluster" ]; then
    usage
    exit 1
fi

sbatch_script_contents=$(cat "$script_path")
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Detect job type from submitted script basename
script_basename="$(basename "$script_path" .sh)"
case "$script_basename" in
    bench*) job_type="bench" ;;
    *)      job_type="test"  ;;
esac

# --- Cluster configuration ---
case "$cluster" in
    phoenix)
        compiler_flag="p"
        account="gts-sbryngelson3"
        job_prefix="shb"
        qos="embers"
        extra_sbatch="#SBATCH --requeue"
        test_time="03:00:00"
        bench_time="04:00:00"
        gpu_partition_dynamic=true
        ;;
    frontier)
        compiler_flag="f"
        account="CFD154"
        job_prefix="MFC"
        qos="develop"
        extra_sbatch=""
        test_time="01:59:00"
        bench_time="01:59:00"
        gpu_partition_dynamic=false
        ;;
    frontier_amd)
        compiler_flag="famd"
        account="CFD154"
        job_prefix="MFC"
        qos="develop"
        extra_sbatch=""
        test_time="01:59:00"
        bench_time="01:59:00"
        gpu_partition_dynamic=false
        ;;
    *)
        echo "ERROR: Unknown cluster '$cluster'"
        exit 1
        ;;
esac

# --- Time limit ---
if [ "$job_type" = "bench" ]; then
    sbatch_time="#SBATCH -t $bench_time"
else
    sbatch_time="#SBATCH -t $test_time"
fi

# --- Device-specific SBATCH options ---
if [ "$device" = "cpu" ]; then
    case "$cluster" in
        phoenix)
            sbatch_device_opts="\
#SBATCH -p cpu-small
#SBATCH --ntasks-per-node=24
#SBATCH --mem-per-cpu=2G"
            ;;
        frontier|frontier_amd)
            sbatch_device_opts="\
#SBATCH -n 32
#SBATCH -p service"
            ;;
    esac
elif [ "$device" = "gpu" ]; then
    # Determine GPU partition
    gpu_partition="batch"
    if [ "$gpu_partition_dynamic" = "true" ]; then
        # Use pre-selected bench partition if available, otherwise query sinfo
        if [ -n "${BENCH_GPU_PARTITION:-}" ]; then
            gpu_partition="$BENCH_GPU_PARTITION"
            echo "Using pre-selected bench partition: $gpu_partition (PR/master consistency)"
        else
            source "${SCRIPT_DIR}/select-gpu-partition.sh"
            gpu_partition="$SELECTED_GPU_PARTITION"
        fi
    fi

    case "$cluster" in
        phoenix)
            sbatch_device_opts="\
#SBATCH -p $gpu_partition
#SBATCH --ntasks-per-node=4
#SBATCH -G2
#SBATCH --exclude=atl1-1-03-002-29-0"
            ;;
        frontier|frontier_amd)
            sbatch_device_opts="\
#SBATCH -n 8
#SBATCH -p service"
            ;;
    esac
else
    usage
    exit 1
fi

# --- Job slug ---
shard_suffix=""
if [ -n "$shard" ]; then
    shard_suffix="-$(echo "$shard" | sed 's|/|-of-|')"
fi
job_slug="$(basename "$script_path" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g')-${device}-${interface}${shard_suffix}"
output_file="$job_slug.out"
id_file="${job_slug}.slurm_job_id"

# --- Idempotency: cancel stale jobs from previous runs ---
if [ -f "$id_file" ]; then
    existing_id=$(cat "$id_file")
    state=$(sacct -j "$existing_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 | tr -d ' ' || true)
    case "${state:-UNKNOWN}" in
        RUNNING|PENDING|REQUEUED|COMPLETING)
            echo "Cancelling stale SLURM job $existing_id (state=$state) before resubmission"
            scancel "$existing_id" 2>/dev/null || true
            ;;
        *)
            echo "Stale job $existing_id (state=${state:-UNKNOWN}) — submitting fresh"
            ;;
    esac
    rm -f "$id_file"
fi

# Remove stale output file so the monitor doesn't pick up old content
# (a previous SLURM job's epilog can write to the .out file after our
# stale-job check, polluting the new job's output stream).
rm -f "$output_file"

# --- Module load mode (short form) ---
module_mode=$([ "$device" = "gpu" ] && echo "g" || echo "c")

# --- Submit ---
submit_output=$(sbatch <<EOT
#!/bin/bash
#SBATCH -J ${job_prefix}-${job_slug}
#SBATCH --account=${account}
#SBATCH -N 1
${sbatch_device_opts}
${sbatch_time}
#SBATCH --qos=${qos}
${extra_sbatch}
#SBATCH -o ${output_file}

set -e
set -x

cd "\$SLURM_SUBMIT_DIR"
echo "Running in \$(pwd):"

job_slug="$job_slug"
job_device="$device"
job_interface="$interface"
job_shard="$shard"
job_cluster="$cluster"

. ./mfc.sh load -c $compiler_flag -m $module_mode

$sbatch_script_contents

EOT
)

job_id=$(echo "$submit_output" | grep -oE '[0-9]+')
if [ -z "$job_id" ]; then
    echo "ERROR: Failed to submit job. sbatch output:"
    echo "$submit_output"
    exit 1
fi

echo "Submitted batch job $job_id"
echo "$job_id" > "$id_file"
echo "Job ID written to $id_file"

# --- Monitor ---
bash "$SCRIPT_DIR/run_monitored_slurm_job.sh" "$job_id" "$output_file"
```
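The job-slug derivation above is pure string manipulation, so it can be checked in isolation; in this sketch the script name, device, interface, and shard values are invented inputs standing in for the script's positional arguments:

```shell
# Hypothetical inputs; the real script receives these as $1..$5.
script_path="bench-phoenix.sh"
device="gpu"
interface="acc"
shard="1/4"

# Same derivation as the script: a "1/4" shard becomes "-1-of-4" so the
# slug contains no slashes and is safe to use in file names.
shard_suffix=""
if [ -n "$shard" ]; then
    shard_suffix="-$(echo "$shard" | sed 's|/|-of-|')"
fi
job_slug="$(basename "$script_path" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g')-${device}-${interface}${shard_suffix}"
echo "$job_slug"   # bench-phoenix-gpu-acc-1-of-4
```

Every non-alphanumeric character in the basename is mapped to `-`, so distinct inputs can in principle collide; here the device, interface, and shard suffixes keep the per-job files (`.out`, `.slurm_job_id`) distinct.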