Fix: Increase startup probe timeout for influxdb3-enterprise 3.2+ by dburton-influxdata · Pull Request #801 · influxdata/helm-charts

dburton-influxdata · 2026-06-18T19:47:48Z

Description

This PR fixes CrashLoopBackOff issues across all components (ingester, querier, compactor, processor) in InfluxDB 3.2+ deployments by increasing the startup probe timeout to accommodate longer initialization times.

Fixes #800

Changes

Chart Version

Bump version from \

- Increase startup probe initialDelaySeconds from 10s to 30s
- Increase startup probe periodSeconds from 5s to 10s
- Increase startup probe timeoutSeconds from 5s to 10s
- Increase startup probe failureThreshold from 12 to 20
- Total startup window increased from 70s to 230s

This change addresses startup failures in InfluxDB 3.2+ where initialization time increased to ~100 seconds due to:

- Python Processing Engine initialization
- WAL recovery operations
- V2 compaction consumer preloading

The new configuration provides a 2.3x safety buffer for typical 100s startups, compared to the previous 4-5x buffer for 15s startups.

Fixes pod CrashLoopBackOff with exit code 137 (SIGKILL) on 3.2 deployments.

Bump chart version to 0.1.1
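The window figures above can be reproduced with a short calculation. This is a minimal sketch that assumes the window is approximated as `initialDelaySeconds + periodSeconds * failureThreshold`, which is the arithmetic the 70s and 230s totals imply. The helper name `startup_window` is illustrative and not part of the chart.

```python
def startup_window(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate seconds a container gets before the startup probe gives up."""
    return initial_delay_s + period_s * failure_threshold


# Old and new probe settings, taken from the change list above.
old = startup_window(initial_delay_s=10, period_s=5, failure_threshold=12)
new = startup_window(initial_delay_s=30, period_s=10, failure_threshold=20)

print(old, new)             # 70 230
print(round(new / 100, 1))  # 2.3 (buffer over a ~100s startup)
print(round(old / 15, 1))   # 4.7 (buffer over a ~15s startup)
```

The approximation ignores `timeoutSeconds` and scheduling jitter, so treat the result as a rough budget rather than an exact deadline.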

dburton-influxdata · 2026-06-23T18:53:26Z

Additional Evidence: Compactor Analysis

After the ingester fix was applied, the compactor was reported to be crashing. Analysis of the compactor logs reveals the same root cause.

Compactor Log Analysis

File: `influx-enterprise-influxdb3-enterprise-compactor-0-1782237784657590000.log`
Total lines: 5,000 lines of debug output
Errors found: ZERO

Timeline of Normal Operation

```
17:47:31 - 17:48:00 (29 seconds of logged activity)

✅ Bloom filter creation for 200+ tags
✅ Run set merges executing (48 active jobs)
✅ Processing tens of thousands of rows
✅ Slice completions with peak memory 2-36 MB
✅ Snapshot discovery from multiple nodes
✅ Primary lease management succeeding
✅ Compaction plans completing without failures
```

Last log entry before termination:
```
2026-06-23T17:48:00.382842Z DEBUG influxdb3_pacha_tree::file_layout::pt_merger: Merge processed 10636 total rows
stream closed EOF for influx-enterprise/influxdb3-enterprise-compactor-0 (influxdb3)
```

Pattern Confirmation

| Component | Log Lines | Errors | Termination | Root Cause |
| --- | --- | --- | --- | --- |
| Ingester | Normal startup | 0 | Abrupt EOF | Probe timeout |
| Compactor | 5,000 lines normal ops | 0 | Abrupt EOF | Probe timeout |

Why Compactor Needs Longer Startup

The compactor has additional initialization overhead:

Primary Lease Acquisition
- Only ONE compactor allowed per cluster (single-node constraint)
- Must acquire exclusive lease from catalog
- Coordination delays in multi-node environment
Snapshot Discovery
- Polls all ingester nodes for snapshots
- Builds initial compaction plan state
- Log shows discovering snapshots from nodes 2, 3, 5
Run Set Index Building
- Extensive bloom filter creation (seen in logs: 200+ tag_ids)
- Table index construction with multiple size groups
- File ID range: 1.3M+ (mature deployment)
Checkpoint Recovery
- Loading compaction checkpoint
- Preloading warm run sets
- File cache initialization

Expected compactor startup: 100-120 seconds (longer than ingester's 97s)

Search for Errors

Searched the entire log for:

`ERROR`: 0 matches
`panic`: 0 matches
`fatal`: 0 matches
`crash`: 0 matches
`killed`: 0 matches
`WARN`: Only TLS `CloseNotify` (normal connection closures)
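A search like the one above can be scripted so it is repeatable across log files. This is a minimal sketch that assumes case-sensitive substring matching, since the thread does not say how the search was run. The function name `count_matches` is illustrative.

```python
PATTERNS = ("ERROR", "panic", "fatal", "crash", "killed", "WARN")


def count_matches(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern (case-sensitive substring match)."""
    counts = {p: 0 for p in patterns}
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


# The last compactor log line quoted in this comment, used as sample input.
sample = [
    "2026-06-23T17:48:00.382842Z DEBUG influxdb3_pacha_tree::file_layout::pt_merger: "
    "Merge processed 10636 total rows",
]
print(count_matches(sample))
```

For a real log, pass an open file object instead of the sample list, for example `count_matches(open(path))`.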

Conclusion

The compactor is not crashing - it's operating perfectly but being killed by Kubernetes health probes. This confirms the fix in this PR applies to all components, not just the ingester.

The global `probes` configuration in `values.yaml` will resolve this issue for ingester, querier, compactor, and processor simultaneously.

**Recommendation**: Apply the probe fix from this PR to resolve both ingester and compactor issues.

dburton-influxdata · 2026-06-24T16:51:44Z

PR #801 Update: OOM Issue Discovered

Summary

While investigating the startup probe timeout issue, we discovered that the compactor is also experiencing OOM (Out of Memory) kills. This is a separate issue from the probe timeout, but affects the same customer (VAST).

Evidence

Customer provided compactor2.yaml showing:

lastState:
  terminated:
    exitCode: 137
    reason: OOMKilled      # ← NOT probe failure
    finishedAt: "2026-06-23T22:48:11Z"
    startedAt: "2026-06-23T22:10:31Z"

Key observations:

Pod starts successfully (under 1 minute)
Runs for ~38 minutes before OOMKill
Exit code 137 with explicit OOMKilled reason
Customer events show "Liveness probe failed" AFTER the OOM kill (symptom, not cause)

Root Cause: Memory Overcommitment

The customer's configuration has excessive cache allocation:

resources:
  limits:
    memory: 95Gi

env:
  - name: INFLUXDB3_PT_FILE_CACHE_MAX_BYTES
    value: 50%                              # 47.5GB

  - name: INFLUXDB3_OBJECT_STORE_CACHE_MAX_SIZE_BYTES
    value: "34359738368"                    # 32GB

  - name: INFLUXDB3_PT_COMPACTOR_INPUT_SIZE_BUDGET
    value: 12GB

  - name: INFLUXDB3_FORCE_SNAPSHOT_MEM_THRESHOLD
    value: 70%                              # 66.5GB

Memory budget:

File cache:              47.5 GB
Object store cache:      32.0 GB
Compactor input budget:  12.0 GB
                        ---------
Total caches:            91.5 GB
Container limit:         95.0 GB
Available for ops:        3.5 GB  ❌ INSUFFICIENT

Caches alone consume 96% of available memory, leaving only 3.5GB for:

Compaction operations (sorting, merging, deduplicating)
Bloom filter construction
Index building
DataFusion query execution
Runtime overhead
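The budget above can be re-derived from the customer's settings. This is a sketch under two assumptions: percentage values resolve against the 95Gi container limit (as the annotations in the config imply), and the `12GB` input budget is treated as the same unit as the other figures. Variable names are illustrative.

```python
GIB = 1024 ** 3

limit_gib = 95.0                       # resources.limits.memory: 95Gi
file_cache_gib = 0.50 * limit_gib      # INFLUXDB3_PT_FILE_CACHE_MAX_BYTES: 50%
object_store_gib = 34359738368 / GIB   # INFLUXDB3_OBJECT_STORE_CACHE_MAX_SIZE_BYTES
compactor_input_gib = 12.0             # INFLUXDB3_PT_COMPACTOR_INPUT_SIZE_BUDGET: 12GB

caches_gib = file_cache_gib + object_store_gib + compactor_input_gib
headroom_gib = limit_gib - caches_gib

print(caches_gib, headroom_gib)         # 91.5 3.5
print(f"{caches_gib / limit_gib:.0%}")  # 96%
```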

Impact on This PR

This PR (startup probe timeout fix) is still valid and should be merged. However:

The startup probe fix addresses the initial startup issue (pods killed before becoming ready)
The OOM issue is a separate configuration problem that needs its own fix
Both issues may affect the same customer simultaneously

Documentation Created

I've created comprehensive documentation for the OOM issue:

COMPACTOR-OOM-ANALYSIS.md (23KB)
- Deep technical analysis
- Memory budget calculations
- Configuration validation
- Root cause breakdown
COMPACTOR-OOM-QUICK-FIX.md (19KB)
- Customer-facing quick fix guide
- Two fix options (conservative vs increased memory)
- Step-by-step instructions
- Verification procedures
- Troubleshooting guide

These documents are separate from the probe timeout fix and should be provided to customers experiencing OOM issues.

Recommended Actions

For This PR (#801)

✅ Merge as planned (probe timeout fix is valid)
✅ Note in PR description that OOM is a separate issue
✅ Link to OOM documentation when customers report memory issues

For OOM Issue (Separate)

Create new GitHub issue for memory configuration guidance
Add memory configuration validation to Helm chart
Update Helm chart documentation with memory sizing guidelines
Consider adding memory safety checks to the chart (warn if caches > 70% of limit)
Provide default memory configurations that are safe
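The proposed safety check (warn if caches exceed 70% of the limit) could follow logic like the sketch below. It is written in Python only to illustrate the rule; a real chart would need the equivalent in Helm templating, and the function name `cache_budget_warning` and the dictionary keys are hypothetical.

```python
def cache_budget_warning(limit_gib, cache_sizes_gib, max_fraction=0.70):
    """Return a warning string if the caches exceed max_fraction of the limit, else None."""
    total = sum(cache_sizes_gib.values())
    fraction = total / limit_gib
    if fraction > max_fraction:
        return (f"caches use {fraction:.0%} of the {limit_gib:g}Gi limit "
                f"(threshold {max_fraction:.0%})")
    return None


# Cache sizes from the customer configuration discussed in this comment.
customer = {"file_cache": 47.5, "object_store_cache": 32.0, "compactor_input": 12.0}
print(cache_budget_warning(95, customer))
# caches use 96% of the 95Gi limit (threshold 70%)
```

Returning a message instead of raising keeps the check advisory, which matches the "warn" wording of the proposal.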

How to Distinguish the Issues

Startup Probe Timeout:

exitCode: 137
reason: Error              # or "Unhealthy"
startedAt: "..."
finishedAt: "..."          # Within 2 minutes of start

OOM Kill:

exitCode: 137
reason: OOMKilled          # Explicit OOM
startedAt: "..."
finishedAt: "..."          # 20+ minutes after start

Check:

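# duration is reported in seconds; the timestamps are strings, so convert them before subtracting
kubectl get pod <pod-name> -o json | \
  jq '.status.containerStatuses[0].lastState.terminated | {exitCode, reason, duration: ((.finishedAt | fromdateiso8601) - (.startedAt | fromdateiso8601))}'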

Customer Communication

When customers report exit code 137:

First check: reason field in pod status
- OOMKilled → Point to OOM documentation
- Error or Unhealthy → Point to probe timeout fix (this PR)
Second check: Time between start and termination
- < 5 minutes → Likely probe timeout
- > 20 minutes → Likely OOM

Third check: Events

kubectl get events --field-selector reason=OOMKilled
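The first two checks can be combined into one triage helper. This is a minimal sketch that takes the `lastState.terminated` mapping from the pod status and applies the reason and runtime rules listed above. The function name `classify_exit_137` and the returned labels are illustrative, and the 5-minute cutoff is the one from this comment.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp shape shown in the pod status excerpts


def classify_exit_137(terminated: dict) -> str:
    """Apply the reason check, then the runtime check, to a terminated-state mapping."""
    started = datetime.strptime(terminated["startedAt"], TS_FORMAT)
    finished = datetime.strptime(terminated["finishedAt"], TS_FORMAT)
    runtime_s = (finished - started).total_seconds()

    if terminated.get("reason") == "OOMKilled":
        return "oom"
    if terminated.get("reason") in ("Error", "Unhealthy") and runtime_s < 5 * 60:
        return "probe-timeout"
    return "inconclusive"


# Values from the customer's compactor2.yaml quoted earlier in this comment.
customer = {
    "exitCode": 137,
    "reason": "OOMKilled",
    "startedAt": "2026-06-23T22:10:31Z",
    "finishedAt": "2026-06-23T22:48:11Z",
}
print(classify_exit_137(customer))  # oom
```

Anything that matches neither rule is labeled inconclusive, so the events check still has to be done by hand for those pods.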

Files Available

Both documentation files are available in the vast/helm_32_issue/ directory:

COMPACTOR-OOM-ANALYSIS.md - Technical deep dive
COMPACTOR-OOM-QUICK-FIX.md - Customer quick fix guide

Questions?

Please let me know if:

This PR should reference the OOM issue in its description
We should create a separate tracking issue for OOM configuration guidance
The Helm chart should include memory validation/warnings
Default memory configurations should be updated

Note: The probe timeout fix in this PR is orthogonal to the OOM issue and should be merged independently. Both issues can occur, but they have different root causes and different solutions.

dburton-influxdata mentioned this pull request Jun 24, 2026

Improve memory configuration documentation and validation for InfluxDB3 Enterprise #804

Open

20 tasks

dburton-influxdata wants to merge 1 commit into influxdata:master from dburton-influxdata:fix/influxdb3-enterprise-probe-timeouts
