
Fix gVisor workflow: Add proper health checks for Squid and Envoy#5237

Open
lpcox wants to merge 8 commits into main from fix/gvisor-workflow-healthchecks

@lpcox lpcox commented Jun 18, 2026

Collaborator

Problem

The gVisor firewall comparison workflow failed with spurious connection errors because Squid and Envoy containers weren't fully started before agent containers tried to use them.

Failed run: https://github.com/github/gh-aw-firewall/actions/runs/27737950523

Symptoms:

=== Test 1: Allowed domain via forward proxy (github.com) ===
❌ FAIL: github.com blocked

Even though HTTPS_PROXY was set correctly, curl couldn't connect because Squid wasn't listening yet.

Solution

Replace sleep 3 with proper health check loops:

  • Squid tests: Wait up to 30s for port 3128 to accept proxy requests
  • Envoy tests: Wait up to 30s for admin /ready endpoint to return 200

Each health check:

  • ✅ Uses lightweight curlimages/curl container for network checks
  • ✅ Retries for up to 30 seconds with 1-second intervals
  • ✅ Shows container logs on failure for debugging
  • ✅ Exits with error code if proxy fails to start
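
The retry behavior listed above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not the workflow's actual code: the function name `wait_for`, its argument order, and the `RETRY_INTERVAL` variable are hypothetical, and the network name, Envoy address, and admin port in the usage comment are placeholders.

```shell
# Hypothetical helper: run a probe command until it succeeds or the
# attempt budget is used up. RETRY_INTERVAL (seconds) is an assumed knob
# that defaults to the 1-second interval described above.
wait_for() {
  attempts="$1"
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "${RETRY_INTERVAL:-1}"
    i=$((i + 1))
  done
  return 1
}

# Possible usage for the Envoy admin check (all names are placeholders):
#   wait_for 30 docker run --rm --network test-net curlimages/curl \
#     -sf http://172.30.0.20:9901/ready \
#     || { docker logs envoy-test; exit 1; }
```

Keeping the probe as an argument means the same loop can serve both the Squid port check and the Envoy `/ready` check, with only the probe command changing.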

Changes

  • test-squid-runc: Added Squid health check (proxy port)
  • test-squid-gvisor: Added Squid health check (proxy port)
  • test-envoy-iptables-runc: Added Envoy health check (admin /ready)
  • test-envoy-gvisor: Added Envoy health check (admin /ready)
  • performance-comparison: Added health checks for both benchmarks

Testing

This PR branch will trigger the workflow to verify the fixes work.

lpcox and others added 2 commits June 17, 2026 21:42
- Compare Squid vs Envoy proxy approaches
- Test under both runc and gVisor runtimes
- Verify iptables DNAT/redirect compatibility with gVisor
- Benchmark performance (latency comparison)
- Generate summary report with recommendations

Tests answer key questions:
1. Does gVisor support iptables DNAT for traffic redirection?
2. Which proxy approach works better with gVisor?
3. Can AWF keep its current Squid architecture, or does it need Envoy?

Related to issue #3264 (gVisor compatibility investigation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes all 19 review comments:

1. Remove ineffective setup job (ran on different runner)
2. Pin actions/checkout by SHA for supply-chain hardening
3. Add set -euo pipefail and EXIT traps to all test steps
4. Make test assertions fail with exit 1 instead of just logging
5. Add DNAT fallback tests with proxy env disabled
6. Fix benchmark outputs to write to $GITHUB_OUTPUT
7. Fix benchmark to use explicit proxy (-x flag)
8. Fix Envoy gVisor config to match runc (add dynamic_forward_proxy)
9. Clarify HTTPS expectations for Envoy (known limitation)
10. Add job outputs for performance comparison
11. Add header note explaining defense-in-depth test approach

Key changes:
- All test jobs now properly propagate failures
- DNAT verification tests actually check enforcement (not just rule acceptance)
- Performance benchmarks capture and output latency correctly
- Cleanup happens reliably via EXIT traps
- Envoy configs consistent between runc and gVisor tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
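
As an illustration of the strict-mode and EXIT-trap pattern mentioned in this commit, here is a minimal sketch. The function name `run_with_cleanup` and the temporary-directory cleanup are assumptions for illustration; the workflow's real steps presumably clean up containers and networks instead, and under bash they would use `set -euo pipefail` rather than the portable `set -eu` shown here.

```shell
# Hypothetical wrapper: the body runs in a subshell so the EXIT trap is
# scoped to this one step.
run_with_cleanup() (
  set -eu
  workdir="$(mktemp -d)"
  # The trap fires whether the step succeeds or fails, so the
  # temporary directory never outlives the step.
  trap 'rm -rf "$workdir"' EXIT
  printf '%s\n' "$workdir"
  "$@"
)

# Possible usage: run_with_cleanup some-test-command arg1 arg2
```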
Copilot AI review requested due to automatic review settings June 18, 2026 05:09
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

Workflow triggered on PR branch: https://github.com/github/gh-aw-firewall/actions/runs/27738138136

This run will test the health check fixes. Expected improvements:

  1. Squid tests should wait for proxy port to be ready (no more premature connection attempts)
  2. Envoy tests should wait for admin /ready endpoint (proper startup validation)
  3. All tests will show container logs if startup fails (better debugging)

The workflow will provide empirical results for the gVisor compatibility question: Does gVisor's userspace network stack support iptables DNAT in a way compatible with AWF's architecture?

@github-actions

Contributor

✅ Coverage Check Passed

Overall Coverage

| Metric | Base | PR | Delta |
|---|---|---|---|
| Lines | 97.57% | 97.61% | 📈 +0.04% |
| Statements | 97.50% | 97.54% | 📈 +0.04% |
| Functions | 98.84% | 98.84% | ➡️ +0.00% |
| Branches | 92.95% | 92.98% | 📈 +0.03% |

📁 Per-file Coverage Changes (1 files)

| File | Lines (Before → After) | Statements (Before → After) |
|---|---|---|
| src/workdir-setup.ts | 92.7% → 94.5% (+1.82%) | 92.7% → 94.5% (+1.82%) |

Coverage comparison generated by scripts/ci/compare-coverage.ts

Copilot AI left a comment

Contributor

Pull request overview

Adds a new GitHub Actions workflow to exercise and compare two proxy/firewall approaches (Squid forward proxy + iptables DNAT vs Envoy transparent proxy + iptables redirect) under both standard runc and gVisor (runsc) runtimes, plus a simple latency benchmark and summary report.

Changes:

  • Introduces end-to-end test jobs for Squid (runc + gVisor) and Envoy (runc + gVisor) using Docker networks and in-container iptables rules.
  • Adds a performance comparison job that runs 100 HTTP requests through each proxy and exports average latency as job outputs.
  • Adds a summary job that prints consolidated results and recommendations.

| File | Description |
|---|---|
| .github/workflows/test-gvisor-firewall-comparison.yml | New workflow implementing Squid/Envoy comparison tests and a latency benchmark across runc and gVisor. |
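
The latency benchmark described in this overview boils down to averaging per-request timings. A minimal sketch of that arithmetic follows; the function name `avg_latency_ms` and the output key `squid_avg_ms` are hypothetical, and the workflow's actual implementation may differ.

```shell
# Read one latency in seconds per line on stdin and print the mean in
# milliseconds, or "n/a" when there is no input.
avg_latency_ms() {
  awk '{ sum += $1; n++ }
       END { if (n > 0) printf "%.1f\n", (sum / n) * 1000; else print "n/a" }'
}

# Possible usage (proxy address and output key are placeholders):
#   for i in $(seq 1 100); do
#     curl -s -o /dev/null -w '%{time_total}\n' -x http://172.30.0.10:3128 http://example.com
#   done | avg_latency_ms > avg.txt
#   echo "squid_avg_ms=$(cat avg.txt)" >> "$GITHUB_OUTPUT"
```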

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 7

Comment on lines +60 to +62
ubuntu/squid:latest

sleep 3
Comment on lines +183 to +185
ubuntu/squid:latest

sleep 3
Comment on lines +319 to +321
-c /etc/envoy/envoy.yaml

sleep 3
Comment on lines +444 to +446
envoyproxy/envoy:v1.28-latest -c /etc/envoy/envoy.yaml

sleep 3
Comment on lines +508 to +510
ubuntu/squid:latest

sleep 3
Comment on lines +589 to +591
envoyproxy/envoy:v1.28-latest -c /etc/envoy/envoy.yaml

sleep 3

echo ""
echo "=== Testing HTTPS through Envoy (expected to fail) ==="
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 https://github.com 2>&1 || echo "000")
Replaces 'sleep 3' with proper health check loops that wait up to 30 seconds
for proxies to be ready before running tests.

Root cause: Squid/Envoy containers were not fully initialized before agent
containers tried to connect, causing spurious test failures.

Changes:
- Squid runc test: Wait for proxy port 3128 to respond
- Squid gVisor test: Wait for proxy port 3128 to respond
- Envoy runc test: Wait for admin /ready endpoint
- Envoy gVisor test: Wait for admin /ready endpoint
- Squid perf test: Wait for proxy port 3128 to respond
- Envoy perf test: Wait for admin /ready endpoint

Each health check:
- Retries for up to 30 seconds
- Uses lightweight curl container for network checks
- Shows container logs on failure
- Exits with error if proxy fails to start

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

The previous run used the wrong commit (the PR head from before the health check fixes were added).

New workflow run with health check fixes: https://github.com/github/gh-aw-firewall/actions/runs/27738319710

Commit 0454937 now includes proper Squid and Envoy health checks with 30-second timeout.

Squid v6.13 rejects configs with both '.github.com' and 'github.com' in the same ACL:
  ERROR: '.github.com' is a subdomain of 'github.com'
  FATAL: Bungled /etc/squid/squid.conf

Solution: Use only '.github.com' which matches both github.com and all subdomains.

This fixes the 30-second timeout where Squid failed to start due to config error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

Fixed Squid configuration error that caused startup failures.

Root cause: Squid v6.13 rejects ACLs with both '.github.com' and 'github.com':

ERROR: '.github.com' is a subdomain of 'github.com'
FATAL: Bungled /etc/squid/squid.conf

Solution: Use only '.github.com' which matches both github.com and all subdomains.

New workflow run: https://github.com/github/gh-aw-firewall/actions/runs/27763948134

This should fix the 30-second timeouts where Squid failed to start.
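
For reference, a minimal sketch of a Squid configuration that avoids the overlapping entries might look like the following. The helper name `write_squid_conf` and the ACL name `allowed_domains` are assumptions; only the single `.github.com` entry reflects the fix described above.

```shell
# Emit a minimal squid.conf with one dstdomain entry. The leading dot in
# ".github.com" covers github.com itself and every subdomain, so listing
# "github.com" as well is redundant.
write_squid_conf() {
  cat <<'EOF'
http_port 3128
acl allowed_domains dstdomain .github.com
http_access allow allowed_domains
http_access deny all
EOF
}

# Possible usage: write the file, then let Squid validate it before start.
#   write_squid_conf > /tmp/squid.conf
#   squid -k parse -f /tmp/squid.conf
```

Running a parse step before starting the container would surface a "Bungled" config immediately instead of after a 30-second health check timeout.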

Problem: Health checks were timing out after 30 seconds because they tried
to test the full proxy path (curl -> Squid -> example.com), which is slow
and unreliable due to:
- DNS resolution delays
- External dependency (example.com)
- Network latency
- Curl container startup overhead

Each attempt could take 5-10 seconds, eating up the 30-second budget.

Solution: Replace with simple TCP port check using busybox that just verifies
Squid port 3128 is listening. This is:
- Fast (< 1 second per attempt)
- Reliable (no external dependencies)
- Accurate (tests exactly what we need: is Squid accepting connections)

Changed health checks for:
- Squid runc test
- Squid gVisor test
- Squid performance test

Envoy tests already use /ready endpoint which is fast and reliable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

Failure Analysis Summary

Root Cause: Health checks were using a complex end-to-end proxy test that was too slow and unreliable.

Previous Health Check (Flawed):

docker run curlimages/curl curl -x http://172.30.0.10:3128 http://example.com

Problems:

  1. External dependency on example.com (DNS, latency, availability)
  2. Each attempt took 5-10 seconds
  3. At 5-10 seconds per attempt, the 30-second budget ran out after only a few attempts
  4. Complex failure modes (DNS, proxy, target site)

New Health Check (Simple):

docker run busybox timeout 2 sh -c 'cat < /dev/null > /dev/tcp/172.30.0.10/3128'

Benefits:

  • Fast: <1 second per attempt
  • Reliable: No external dependencies
  • Accurate: Tests exactly what we need (port 3128 listening)

New Run:

https://github.com/github/gh-aw-firewall/actions/runs/27764269059

SECURITY RESEARCHER PERSPECTIVE: Test must verify gVisor+Envoy replicates
ALL security guarantees of runc+Squid. Any gaps = security vulnerability.

New tests added (10 security scenarios):
1. ✅ Allowed domain (github.com) - baseline functionality
2. ✅ Blocked domain (google.com) - core firewall feature
3. ✅ DNAT fallback - defense-in-depth when proxy env ignored
4. ✅ Port blocking (SSH 22) - prevent lateral movement
5. 🔒 IP address bypass - prevent ACL bypass via IPs (e.g., curl 8.8.8.8)
6. 🔒 Subdomain verification - verify .github.com includes api.github.com
7. 🔒 Similar domain blocked - githubstatus.com should not work
8. 🔒 Dangerous ports - comprehensive blocklist (SSH, DB, Redis, etc.)
9. 🔒 Local network isolation - RFC1918, container gateway blocked
10. 🔒 Protocol bypass - ICMP/UDP blocked

Enhanced iptables rules to match AWF production security model:
- DNS restricted to approved resolvers only (8.8.8.8, 8.8.4.4)
- All RFC1918 ranges (10/8, 172.16/12, 192.168/16) and link-local (169.254/16) blocked
- Comprehensive dangerous port blocklist (22,23,25,3306,5432,6379,27017,445,1433)
- ICMP completely blocked (no ping)
- UDP blocked except DNS
- Localhost allowed (for MCP stdio servers)

Applied to both:
- Squid + runc (baseline - current AWF)
- Squid + gVisor (validate gVisor doesn't break security)

TODO: Apply same tests to Envoy variants in follow-up.

Success criteria: All 4 configurations (Squid+runc, Squid+gVisor,
Envoy+runc, Envoy+gVisor) must produce IDENTICAL security results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

🔒 Comprehensive Security Tests Added

Expanded the test to validate ALL security guarantees from a security researcher perspective. The goal: prove gVisor+Envoy can replicate the exact security behavior of runc+Squid.

New Security Tests (10 scenarios)

Core Functionality

  1. Allowed domain (github.com) - baseline
  2. Blocked domain (google.com) - firewall works
  3. DNAT fallback - defense-in-depth

Attack Prevention

  4. Port blocking - SSH blocked
  5. 🔒 IP address bypass - Direct IP access blocked (curl 8.8.8.8)
  6. 🔒 Subdomain verification - .github.com includes api.github.com
  7. 🔒 Similar domain blocked - githubstatus.com fails
  8. 🔒 Dangerous ports - Comprehensive blocklist (SSH, DBs, Redis, MongoDB, SMB, MSSQL)
  9. 🔒 Local network isolation - RFC1918 + container gateway blocked
  10. 🔒 Protocol bypass - ICMP/UDP blocked

Enhanced iptables Rules

Now matches AWF production security model:

  • ✅ DNS restricted to approved resolvers only (8.8.8.8, 8.8.4.4)
  • ✅ All RFC1918 ranges (10/8, 172.16/12, 192.168/16) and link-local (169.254/16) blocked
  • ✅ Dangerous ports: 22,23,25,3306,5432,6379,27017,445,1433
  • ✅ ICMP completely blocked (no ping)
  • ✅ UDP blocked except DNS
  • ✅ Localhost allowed (for MCP stdio servers)
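
To make the rule set above concrete, here is a rough sketch that prints (but does not apply) one possible OUTPUT-chain ordering. Everything beyond the listed addresses and ports is an assumption: the function name `emit_rules`, the choice of the OUTPUT chain, the rule order, and the explicit allow rule for the proxy at 172.30.0.10, which is needed in this sketch because that address falls inside 172.16/12. The DNAT rules are omitted.

```shell
# Print an illustrative rule list; nothing is applied to the system.
# $1 is an assumed proxy address that must stay reachable.
emit_rules() {
  proxy_ip="${1:-172.30.0.10}"
  cat <<EOF
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -p tcp -d ${proxy_ip} --dport 3128 -j ACCEPT
iptables -A OUTPUT -p udp -d 8.8.8.8 --dport 53 -j ACCEPT
iptables -A OUTPUT -p udp -d 8.8.4.4 --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp -m multiport --dports 22,23,25,3306,5432,6379,27017,445,1433 -j DROP
iptables -A OUTPUT -d 10.0.0.0/8 -j DROP
iptables -A OUTPUT -d 172.16.0.0/12 -j DROP
iptables -A OUTPUT -d 192.168.0.0/16 -j DROP
iptables -A OUTPUT -d 169.254.0.0/16 -j DROP
iptables -A OUTPUT -p icmp -j DROP
iptables -A OUTPUT -p udp -j DROP
EOF
}
```

Because iptables evaluates rules in order, the ACCEPT rules for localhost, the proxy, and the approved resolvers come before the broad DROP rules in this sketch.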

Success Criteria

All 4 configurations must produce identical security results:

  1. Squid + runc (baseline - current AWF)
  2. Squid + gVisor (validate gVisor doesn't break security)
  3. Envoy + runc (validate Envoy matches Squid behavior)
  4. Envoy + gVisor (target - proposed solution)

Any behavioral difference = security gap = FAIL.
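
One way to enforce the "identical results" criterion is to have each job write its pass/fail lines to a file and compare the files byte for byte. The sketch below assumes that approach; the function name `results_identical` and the file names in the usage comment are hypothetical, and the PR does not say how the comparison is actually done.

```shell
# Return 0 only if every results file matches the first one (the baseline).
results_identical() {
  baseline="$1"
  shift
  for f in "$@"; do
    if ! cmp -s "$baseline" "$f"; then
      echo "MISMATCH: $f differs from $baseline" >&2
      return 1
    fi
  done
  return 0
}

# Possible usage (file names are placeholders):
#   results_identical squid-runc.txt squid-gvisor.txt envoy-runc.txt envoy-gvisor.txt
```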

Next Steps

  • ✅ Squid tests updated (runc + gVisor)
  • ⏳ TODO: Apply same tests to Envoy variants
  • ⏳ TODO: Run workflow and compare all 4 configurations

@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

🔄 Workflow re-triggered with comprehensive security tests

Run: https://github.com/github/gh-aw-firewall/actions/runs/27767923283

This run includes:

  • ✅ Fixed health checks (simple TCP port check)
  • ✅ 10 comprehensive security tests
  • ✅ Enhanced iptables matching AWF production security model

Will validate that both runc and gVisor configurations pass all security tests.

Root Cause: busybox uses ash shell, which doesn't support bash's /dev/tcp
pseudo-device. The health check command was silently failing.

Solution: Use 'nc -zv -w 2' (netcat) which IS available in busybox and
provides reliable TCP port checking.

Changes:
- Squid runc health check: /dev/tcp → nc -zv
- Squid gVisor health check: /dev/tcp → nc -zv
- Performance test health check: /dev/tcp → nc -zv

nc flags:
  -z: Zero-I/O mode (just check if port is open)
  -v: Verbose (output 'open' or 'succeeded')
  -w 2: 2-second timeout

grep for 'open|succeeded' to detect success across different nc versions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

🐛 Root Cause Found: Busybox Shell Incompatibility

The Problem

Health checks were timing out because busybox uses ash shell, which doesn't support bash's /dev/tcp pseudo-device.

# This was SILENTLY FAILING in busybox:
sh -c 'cat < /dev/null > /dev/tcp/172.30.0.10/3128'

The Fix

Use nc (netcat) which is available in busybox:

# Now uses netcat for reliable TCP port checking:
nc -zv -w 2 172.30.0.10 3128 2>&1 | grep -Eq 'open|succeeded'

Flags:

  • -z: Zero-I/O mode (just check if port is open)
  • -v: Verbose (outputs 'open' or 'succeeded')
  • -w 2: 2-second timeout

New Run

https://github.com/github/gh-aw-firewall/actions/runs/27768282438

This should fix the health check timeouts for all 3 Squid tests (runc, gVisor, performance).
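
A small sketch of the output-matching step may help here. In grep's default basic regular expressions an unescaped `|` is a literal character, so matching either word needs `-E` (or two `-e` patterns). The function names `nc_reports_open` and `port_open` are hypothetical, and the sample output strings in the comments are assumptions about what different nc builds print.

```shell
# Succeeds if nc's verbose output (read from stdin) reports an open port.
# -E makes "|" mean alternation instead of a literal pipe character.
nc_reports_open() {
  grep -Eq 'open|succeeded'
}

# Hypothetical wrapper around the probe described above.
port_open() {
  nc -zv -w 2 "$1" "$2" 2>&1 | nc_reports_open
}

# Possible usage: port_open 172.30.0.10 3128 && echo "Squid is listening"
```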

@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

⚠️ Previous Run Used Wrong Commit

Run 27768282438 was triggered before the netcat fix was pushed, so it ran with the old /dev/tcp code.

✅ New Run with Netcat Fix

https://github.com/github/gh-aw-firewall/actions/runs/27768454494

This run uses commit 4eeab32b which has the netcat fix:

nc -zv -w 2 172.30.0.10 3128 2>&1 | grep -Eq 'open|succeeded'

Root Cause: When curl fails, it outputs 000 AND the || echo "000" runs,
resulting in "000000" being captured. This broke the status code comparison.

Solution:
- Use `2>/dev/null || true` instead of `|| echo "000"`
- Check for empty string in addition to "000" and "403"
- Use ${HTTP_CODE:-error} in output to show "error" if empty

This fixes the test logic for:
- google.com (blocked domain)
- githubstatus.com (similar domain)
- 8.8.8.8 (IP address bypass)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

🐛 Fixed HTTP Status Code Bug

The Problem

When curl failed, it output "000" AND the fallback || echo "000" also ran, resulting in "000000" being captured. This broke status code comparison.

The Fix

# Old (broken):
HTTP_CODE=$(curl ... || echo "000")  # Results in "000000" on error

# New (fixed):
HTTP_CODE=$(curl ... 2>/dev/null || true)  # Results in "000" or empty
if [ "$HTTP_CODE" = "403" ] || [ "$HTTP_CODE" = "000" ] || [ -z "$HTTP_CODE" ]; then

New Run

https://github.com/github/gh-aw-firewall/actions/runs/27768657814

This should fix the test failures for blocked domain checks.
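
The double-output behavior can be reproduced without any network access. The sketch below uses a stand-in function, `fake_curl`, that mimics a failing `curl -w "%{http_code}"` by printing `000` and returning a non-zero status. All function names here are hypothetical; only the capture patterns and the three accepted values come from the fix described above.

```shell
# Stand-in for a curl call that cannot connect: it still prints the
# write-out value "000", then exits non-zero (7 is curl's connect failure).
fake_curl() {
  printf '000'
  return 7
}

# Old pattern: the fallback echo appends a second "000".
old_capture() { fake_curl || echo "000"; }

# New pattern: keep whatever curl printed and ignore the exit status.
new_capture() { fake_curl 2>/dev/null || true; }

# Treat 403, 000, and an empty string as "request was blocked".
is_blocked() {
  case "${1:-}" in
    403 | 000 | "") return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage:
#   HTTP_CODE="$(new_capture)"
#   is_blocked "$HTTP_CODE" && echo "PASS: blocked (${HTTP_CODE:-error})"
```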

@lpcox

lpcox commented Jun 18, 2026

Collaborator Author

⚠️ Previous Run Used Wrong Commit (Again)

Run 27768657814 was triggered before the force push completed, so it ran with commit 4eeab32b (netcat fix) instead of caffabd6 (HTTP code fix).

✅ New Run with Both Fixes

https://github.com/github/gh-aw-firewall/actions/runs/27768781761

This run uses commit caffabd6 which has BOTH fixes:

  1. ✅ Netcat instead of /dev/tcp
  2. ✅ HTTP status code handling fixed
