Fix: Increase startup probe timeout for influxdb3-enterprise 3.2+ #801
- Increase startup probe initialDelaySeconds from 10s to 30s
- Increase startup probe periodSeconds from 5s to 10s
- Increase startup probe timeoutSeconds from 5s to 10s
- Increase startup probe failureThreshold from 12 to 20
- Total startup window increased from 70s to 230s

This change addresses startup failures in InfluxDB 3.2+ where initialization time increased to ~100 seconds due to:

- Python Processing Engine initialization
- WAL recovery operations
- V2 compaction consumer preloading

The new configuration provides a 2.3x safety buffer for typical 100s startups, compared to the previous 4-5x buffer for 15s startups.

Fixes pod CrashLoopBackOff with exit code 137 (SIGKILL) on 3.2 deployments.

Bump chart version to 0.1.1
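The window totals quoted in the commit message can be reproduced with a small shell sketch. It assumes the worst-case startup window is `initialDelaySeconds + failureThreshold * periodSeconds`, which matches the 70s and 230s figures above; the helper name `startup_window` is hypothetical and not part of the chart.

```shell
#!/bin/sh
# Hypothetical helper: worst-case time a container has to pass its startup probe.
# Assumes window = initialDelaySeconds + failureThreshold * periodSeconds,
# which reproduces the 70s and 230s totals quoted in this PR.
startup_window() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $(( initial_delay + failure_threshold * period ))
}

old=$(startup_window 10 5 12)    # previous chart values
new=$(startup_window 30 10 20)   # values introduced by this PR
echo "old window: ${old}s, new window: ${new}s"
```

With a ~100 second startup, the old 70s window expires before initialization finishes, while the new 230s window leaves headroom.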
Additional Evidence: Compactor Analysis

After the ingester fix was applied, it was reported that the compactor is now crashing. Analysis of the compactor logs reveals the same root cause.

Compactor Log Analysis

File: `influx-enterprise-influxdb3-enterprise-compactor-0-1782237784657590000.log`

Timeline of Normal Operation

Last log entry before termination:

Pattern Confirmation

Why Compactor Needs Longer Startup

The compactor has additional initialization overhead:

Expected compactor startup: 100-120 seconds (longer than ingester's 97s)

Search for Errors

Searched the entire log for:

Conclusion

The compactor is not crashing - it's operating perfectly but being killed by Kubernetes health probes. This confirms the fix in this PR applies to all components, not just the ingester. The global `probes` configuration in `values.yaml` will resolve this issue for ingester, querier, compactor, and processor simultaneously.

**Recommendation**: Apply the probe fix from this PR to resolve both ingester and compactor issues.
PR #801 Update: OOM Issue Discovered

Summary

While investigating the startup probe timeout issue, we discovered that the compactor is also experiencing OOM (Out of Memory) kills. This is a separate issue from the probe timeout, but affects the same customer (VAST).

Evidence

Customer provided:

    lastState:
      terminated:
        exitCode: 137
        reason: OOMKilled  # ← NOT probe failure
        finishedAt: "2026-06-23T22:48:11Z"
        startedAt: "2026-06-23T22:10:31Z"

Key observations:
Root Cause: Memory Overcommitment

The customer's configuration has excessive cache allocation:

    resources:
      limits:
        memory: 95Gi
    env:
      - name: INFLUXDB3_PT_FILE_CACHE_MAX_BYTES
        value: 50%  # 47.5GB
      - name: INFLUXDB3_OBJECT_STORE_CACHE_MAX_SIZE_BYTES
        value: "34359738368"  # 32GB
      - name: INFLUXDB3_PT_COMPACTOR_INPUT_SIZE_BUDGET
        value: 12GB
      - name: INFLUXDB3_FORCE_SNAPSHOT_MEM_THRESHOLD
        value: 70%  # 66.5GB

Memory budget: Caches alone consume 96% of available memory, leaving only 3.5GB for:
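The budget figures above can be checked with a short shell sketch. It works in MiB to stay in integer arithmetic and assumes that "GB" in the customer's settings means GiB, as the PR's own 47.5GB and 32GB annotations imply; the variable names are illustrative only.

```shell
#!/bin/sh
# Sketch of the memory budget described above, in MiB (integer math only).
# Assumption: "GB" in the customer's settings is treated as GiB.
limit_mib=$(( 95 * 1024 ))                          # resources.limits.memory: 95Gi
file_cache_mib=$(( limit_mib * 50 / 100 ))          # INFLUXDB3_PT_FILE_CACHE_MAX_BYTES: 50%
object_cache_mib=$(( 34359738368 / 1024 / 1024 ))   # INFLUXDB3_OBJECT_STORE_CACHE_MAX_SIZE_BYTES
compactor_budget_mib=$(( 12 * 1024 ))               # INFLUXDB3_PT_COMPACTOR_INPUT_SIZE_BUDGET: 12GB

committed_mib=$(( file_cache_mib + object_cache_mib + compactor_budget_mib ))
remaining_mib=$(( limit_mib - committed_mib ))
committed_pct=$(( committed_mib * 100 / limit_mib ))  # truncated to a whole percent

echo "committed: ${committed_mib} MiB (${committed_pct}%), remaining: ${remaining_mib} MiB"
```

The three allocations sum to 91.5 GiB of the 95 GiB limit, which leaves 3584 MiB (3.5 GiB) for everything else in the process.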
Impact on This PR

This PR (startup probe timeout fix) is still valid and should be merged. However:

Documentation Created

I've created comprehensive documentation for the OOM issue:

These documents are separate from the probe timeout fix and should be provided to customers experiencing OOM issues.

Recommended Actions

For This PR (#801)

For OOM Issue (Separate)
How to Distinguish the Issues

Startup Probe Timeout:

    exitCode: 137
    reason: Error  # or "Unhealthy"
    startedAt: "..."
    finishedAt: "..."  # Within 2 minutes of start

OOM Kill:

    exitCode: 137
    reason: OOMKilled  # Explicit OOM
    startedAt: "..."
    finishedAt: "..."  # 20+ minutes after start

Check:

    # jq cannot subtract timestamp strings, so convert both to epoch seconds first
    kubectl get pod <pod-name> -o json | \
      jq '.status.containerStatuses[0].lastState.terminated | {exitCode, reason, duration: ((.finishedAt | fromdateiso8601) - (.startedAt | fromdateiso8601))}'

Customer Communication

When customers report exit code 137:
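As a rough triage aid for exit code 137 reports, the two patterns described above can be sketched as a shell function. The function name `classify_exit_137`, its output labels, and the 120-second cutoff (taken from the "within 2 minutes of start" note) are assumptions for illustration; the cutoff should track whatever startup window is actually configured.

```shell
#!/bin/sh
# Hypothetical triage helper for exit code 137, following the two patterns above.
# Arguments: the terminated "reason" string and the container lifetime in seconds
# (finishedAt minus startedAt).
# The 120-second cutoff mirrors the "within 2 minutes of start" note and is an assumption.
classify_exit_137() {
  reason=$1
  lifetime_seconds=$2
  if [ "$reason" = "OOMKilled" ]; then
    echo "oom-kill"
  elif [ "$lifetime_seconds" -le 120 ]; then
    echo "startup-probe-timeout"
  else
    echo "unknown"
  fi
}

# The compactor case from this thread: 22:10:31 to 22:48:11 is 2260 seconds.
classify_exit_137 OOMKilled 2260
# A container killed 70 seconds after start with a generic reason.
classify_exit_137 Error 70
```

The `reason` field is checked first because an explicit `OOMKilled` is conclusive regardless of how long the container ran; lifetime is only used as a heuristic when the reason is generic.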
Files Available

Both documentation files are available in the

Questions?

Please let me know if:

Note: The probe timeout fix in this PR is orthogonal to the OOM issue and should be merged independently. Both issues can occur, but they have different root causes and different solutions.
Description
This PR fixes CrashLoopBackOff issues across all components (ingester, querier, compactor, processor) in InfluxDB 3.2+ deployments by increasing the startup probe timeout to accommodate longer initialization times.
Fixes #800
Changes
Chart Version