Fix gVisor workflow: Add proper health checks for Squid and Envoy#5237
Fix gVisor workflow: Add proper health checks for Squid and Envoy#5237lpcox wants to merge 8 commits into
Conversation
- Compare Squid vs Envoy proxy approaches - Test under both runc and gVisor runtimes - Verify iptables DNAT/redirect compatibility with gVisor - Benchmark performance (latency comparison) - Generate summary report with recommendations Tests answer key questions: 1. Does gVisor support iptables DNAT for traffic redirection? 2. Which proxy approach works better with gVisor? 3. Can AWF keep current Squid architecture or need Envoy? Related to issue #3264 (gVisor compatibility investigation) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes all 19 review comments: 1. Remove ineffective setup job (ran on different runner) 2. Pin actions/checkout by SHA for supply-chain hardening 3. Add set -euo pipefail and EXIT traps to all test steps 4. Make test assertions fail with exit 1 instead of just logging 5. Add DNAT fallback tests with proxy env disabled 6. Fix benchmark outputs to write to $GITHUB_OUTPUT 7. Fix benchmark to use explicit proxy (-x flag) 8. Fix Envoy gVisor config to match runc (add dynamic_forward_proxy) 9. Clarify HTTPS expectations for Envoy (known limitation) 10. Add job outputs for performance comparison 11. Add header note explaining defense-in-depth test approach Key changes: - All test jobs now properly propagate failures - DNAT verification tests actually check enforcement (not just rule acceptance) - Performance benchmarks capture and output latency correctly - Cleanup happens reliably via EXIT traps - Envoy configs consistent between runc and gVisor tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
✅ Workflow triggered on PR branch: https://github.com/github/gh-aw-firewall/actions/runs/27738138136 This run will test the health check fixes. Expected improvements:
The workflow will provide empirical results for the gVisor compatibility question: Does gVisor's userspace network stack support iptables DNAT in a way compatible with AWF's architecture? |
✅ Coverage Check PassedOverall Coverage
📁 Per-file Coverage Changes (1 files)
Coverage comparison generated by |
There was a problem hiding this comment.
Pull request overview
Adds a new GitHub Actions workflow to exercise and compare two proxy/firewall approaches (Squid forward proxy + iptables DNAT vs Envoy transparent proxy + iptables redirect) under both standard runc and gVisor (runsc) runtimes, plus a simple latency benchmark and summary report.
Changes:
- Introduces end-to-end test jobs for Squid (runc + gVisor) and Envoy (runc + gVisor) using Docker networks and in-container iptables rules.
- Adds a performance comparison job that runs 100 HTTP requests through each proxy and exports average latency as job outputs.
- Adds a summary job that prints consolidated results and recommendations.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/test-gvisor-firewall-comparison.yml | New workflow implementing Squid/Envoy comparison tests and a latency benchmark across runc and gVisor. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 1/1 changed files
- Comments generated: 7
| ubuntu/squid:latest | ||
|
|
||
| sleep 3 |
| ubuntu/squid:latest | ||
|
|
||
| sleep 3 |
| -c /etc/envoy/envoy.yaml | ||
|
|
||
| sleep 3 |
| envoyproxy/envoy:v1.28-latest -c /etc/envoy/envoy.yaml | ||
|
|
||
| sleep 3 |
| ubuntu/squid:latest | ||
|
|
||
| sleep 3 |
| envoyproxy/envoy:v1.28-latest -c /etc/envoy/envoy.yaml | ||
|
|
||
| sleep 3 |
|
|
||
| echo "" | ||
| echo "=== Testing HTTPS through Envoy (expected to fail) ===" | ||
| HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 https://github.com 2>&1 || echo "000") |
Replaces 'sleep 3' with proper health check loops that wait up to 30 seconds for proxies to be ready before running tests. Root cause: Squid/Envoy containers were not fully initialized before agent containers tried to connect, causing spurious test failures. Changes: - Squid runc test: Wait for proxy port 3128 to respond - Squid gVisor test: Wait for proxy port 3128 to respond - Envoy runc test: Wait for admin /ready endpoint - Envoy gVisor test: Wait for admin /ready endpoint - Squid perf test: Wait for proxy port 3128 to respond - Envoy perf test: Wait for admin /ready endpoint Each health check: - Retries for up to 30 seconds - Uses lightweight curl container for network checks - Shows container logs on failure - Exits with error if proxy fails to start Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Previous run used wrong commit (old PR before health check fixes were added). New workflow run with health check fixes: https://github.com/github/gh-aw-firewall/actions/runs/27738319710 Commit 0454937 now includes proper Squid and Envoy health checks with 30-second timeout. |
Squid v6.13 rejects configs with both '.github.com' and 'github.com' in the same ACL: ERROR: '.github.com' is a subdomain of 'github.com' FATAL: Bungled /etc/squid/squid.conf Solution: Use only '.github.com' which matches both github.com and all subdomains. This fixes the 30-second timeout where Squid failed to start due to config error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Fixed Squid configuration error that caused startup failures. Root cause: Squid v6.13 rejects ACLs with both '.github.com' and 'github.com': Solution: Use only '.github.com' which matches both github.com and all subdomains. New workflow run: https://github.com/github/gh-aw-firewall/actions/runs/27763948134 This should fix the 30-second timeouts where Squid failed to start. |
Problem: Health checks were timing out after 30 seconds because they tried to test the full proxy path (curl -> Squid -> example.com), which is slow and unreliable due to: - DNS resolution delays - External dependency (example.com) - Network latency - Curl container startup overhead Each attempt could take 5-10 seconds, eating up the 30-second budget. Solution: Replace with simple TCP port check using busybox that just verifies Squid port 3128 is listening. This is: - Fast (< 1 second per attempt) - Reliable (no external dependencies) - Accurate (tests exactly what we need: is Squid accepting connections) Changed health checks for: - Squid runc test - Squid gVisor test - Squid performance test Envoy tests already use /ready endpoint which is fast and reliable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Failure Analysis SummaryRoot Cause: Health checks were using a complex end-to-end proxy test that was too slow and unreliable. Previous Health Check (Flawed):docker run curlimages/curl curl -x http://172.30.0.10:3128 http://example.comProblems:
New Health Check (Simple):docker run busybox timeout 2 sh -c 'cat < /dev/null > /dev/tcp/172.30.0.10/3128'Benefits:
New Run:https://github.com/github/gh-aw-firewall/actions/runs/27764269059 |
SECURITY RESEARCHER PERSPECTIVE: Test must verify gVisor+Envoy replicates ALL security guarantees of runc+Squid. Any gaps = security vulnerability. New tests added (10 security scenarios): 1. ✅ Allowed domain (github.com) - baseline functionality 2. ✅ Blocked domain (google.com) - core firewall feature 3. ✅ DNAT fallback - defense-in-depth when proxy env ignored 4. ✅ Port blocking (SSH 22) - prevent lateral movement 5. 🔒 IP address bypass - prevent ACL bypass via IPs (e.g., curl 8.8.8.8) 6. 🔒 Subdomain verification - verify .github.com includes api.github.com 7. 🔒 Similar domain blocked - githubstatus.com should not work 8. 🔒 Dangerous ports - comprehensive blocklist (SSH, DB, Redis, etc.) 9. 🔒 Local network isolation - RFC1918, container gateway blocked 10. 🔒 Protocol bypass - ICMP/UDP blocked Enhanced iptables rules to match AWF production security model: - DNS restricted to approved resolvers only (8.8.8.8, 8.8.4.4) - All RFC1918 ranges blocked (10/8, 172.16/12, 192.168/16, 169.254/16) - Comprehensive dangerous port blocklist (22,23,25,3306,5432,6379,27017,445,1433) - ICMP completely blocked (no ping) - UDP blocked except DNS - Localhost allowed (for MCP stdio servers) Applied to both: - Squid + runc (baseline - current AWF) - Squid + gVisor (validate gVisor doesn't break security) TODO: Apply same tests to Envoy variants in follow-up. Success criteria: All 4 configurations (Squid+runc, Squid+gVisor, Envoy+runc, Envoy+gVisor) must produce IDENTICAL security results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔒 Comprehensive Security Tests AddedExpanded the test to validate ALL security guarantees from a security researcher perspective. The goal: prove gVisor+Envoy can replicate the exact security behavior of runc+Squid. New Security Tests (10 scenarios)Core Functionality
Attack Prevention
Enhanced iptables RulesNow matches AWF production security model:
Success CriteriaAll 4 configurations must produce identical security results:
Any behavioral difference = security gap = FAIL. Next Steps
|
|
🔄 Workflow re-triggered with comprehensive security tests Run: https://github.com/github/gh-aw-firewall/actions/runs/27767923283 This run includes:
Will validate that both runc and gVisor configurations pass all security tests. |
Root Cause: busybox uses ash shell, which doesn't support bash's /dev/tcp pseudo-device. The health check command was silently failing. Solution: Use 'nc -zv -w 2' (netcat) which IS available in busybox and provides reliable TCP port checking. Changes: - Squid runc health check: /dev/tcp → nc -zv - Squid gVisor health check: /dev/tcp → nc -zv - Performance test health check: /dev/tcp → nc -zv nc flags: -z: Zero-I/O mode (just check if port is open) -v: Verbose (output 'open' or 'succeeded') -w 2: 2-second timeout grep for 'open|succeeded' to detect success across different nc versions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🐛 Root Cause Found: Busybox Shell IncompatibilityThe ProblemHealth checks were timing out because busybox uses ash shell, which doesn't support bash's # This was SILENTLY FAILING in busybox:
sh -c 'cat < /dev/null > /dev/tcp/172.30.0.10/3128'The FixUse # Now uses netcat for reliable TCP port checking:
nc -zv -w 2 172.30.0.10 3128 2>&1 | grep -q 'open|succeeded'Flags:
New Runhttps://github.com/github/gh-aw-firewall/actions/runs/27768282438 This should fix the health check timeouts for all 3 Squid tests (runc, gVisor, performance). |
|
Root Cause: When curl fails, it outputs 000 AND the || echo "000" runs,
resulting in "000000" being captured. This broke the status code comparison.
Solution:
- Use `2>/dev/null || true` instead of `|| echo "000"`
- Check for empty string in addition to "000" and "403"
- Use ${HTTP_CODE:-error} in output to show "error" if empty
This fixes the test logic for:
- google.com (blocked domain)
- githubstatus.com (similar domain)
- 8.8.8.8 (IP address bypass)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🐛 Fixed HTTP Status Code BugThe ProblemWhen curl failed, it output "000" AND the fallback The Fix# Old (broken):
HTTP_CODE=$(curl ... || echo "000") # Results in "000000" on error
# New (fixed):
HTTP_CODE=$(curl ... 2>/dev/null || true) # Results in "000" or empty
if [ "$HTTP_CODE" = "403" ] || [ "$HTTP_CODE" = "000" ] || [ -z "$HTTP_CODE" ]; thenNew Runhttps://github.com/github/gh-aw-firewall/actions/runs/27768657814 This should fix the test failures for blocked domain checks. |
|
Problem
The gVisor firewall comparison workflow failed with spurious connection errors because Squid and Envoy containers weren't fully started before agent containers tried to use them.
Failed run: https://github.com/github/gh-aw-firewall/actions/runs/27737950523
Symptoms:
Even though
HTTPS_PROXYwas set correctly, curl couldn't connect because Squid wasn't listening yet.Solution
Replace
sleep 3with proper health check loops:/readyendpoint to return 200Each health check:
curlimages/curlcontainer for network checksChanges
test-squid-runc: Added Squid health check (proxy port)test-squid-gvisor: Added Squid health check (proxy port)test-envoy-iptables-runc: Added Envoy health check (admin /ready)test-envoy-gvisor: Added Envoy health check (admin /ready)performance-comparison: Added health checks for both benchmarksTesting
This PR branch will trigger the workflow to verify the fixes work.