feat(server,k8s): implement pause/resume with rootfs snapshot support#668
feat(server,k8s): implement pause/resume with rootfs snapshot support#668fengcone wants to merge 15 commits intoalibaba:mainfrom
Conversation
- Add PausePolicy schema and API endpoints (/pause, /resume) - Add image_pull_policy configuration for sandbox containers - Support pausePolicy in pool-based BatchSandbox creation - Add SandboxSnapshot CRD and controller for snapshot lifecycle - Add image-committer Job for container rootfs snapshot/commit - Update AGENTS.md and Makefile for new components Implements OSEP-0008 pause/resume functionality using SandboxSnapshot CRD and nerdctl-based image committer for Kubernetes runtime.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f6a1f854c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
- Rename snapshotPushSecretName -> snapshotPushSecret - Rename resumeImagePullSecretName -> resumeImagePullSecret - Fix ruff lint errors (F541, F841, F401) - Regenerate CRDs from Go types Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd use generation-based reconciliation - Move controller-managed fields (sourcePodName, sourceNodeName resumeTemplate, containerSnapshots, pause/resume versions) from spec to status so spec is purely user input - Replace version-based dispatch with Kubernetes generation pattern spec.action (Pause/Resume) signals intent, observedGeneration tracks controller progress - Remove PausePolicy from BatchSandbox CRD; pause config now comes from server-level configuration - Add fail-fast spec validation before starting pause cycles - Add RBAC permission for reading secrets - Improve delete_sandbox error handling (proper 404/500 responses - Expand E2E tests with error-case coverage
# Conflicts: # server/tests/k8s/test_batchsandbox_provider.py # server/tests/k8s/test_kubernetes_service.py
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a13c3a542
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 486658b3b4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e302bce73
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
b2cc1dc to
7fa689d
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7fa689d222
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f86fd2a96b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…n pause resume lifecycle
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7474fa0343
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| creation-date: 2026-03-11 | ||
| last-updated: 2026-03-13 | ||
| status: implementing | ||
| status: implemented |
There was a problem hiding this comment.
can you add an user example
hittyt
left a comment
There was a problem hiding this comment.
Implements OSEP-0008 pause/resume functionality using a new SandboxSnapshot CRD and nerdctl-based image committer. The implementation is well-structured but has some security and reliability concerns that need addressing before merging. One notable architectural gap is that renew_expiration currently only works on active BatchSandbox resources and will fail with 404 for paused sandboxes; it should be updated to handle snapshots as well.
| spec := fmt.Sprintf("%s:%s", cs.ContainerName, cs.ImageURI) | ||
| containerSpecs = append(containerSpecs, spec) | ||
| } | ||
| fullCommand := fmt.Sprintf("/usr/local/bin/image-committer %s %s %s", |
There was a problem hiding this comment.
[P1] Potential Shell Injection Vulnerability
Building fullCommand using fmt.Sprintf and then executing it via /bin/sh -c is risky, as fields like SourcePodName or Namespace could potentially contain shell-special characters.
Since image-committer is a compiled binary, it's safer to run it directly by passing arguments to the Args list and avoiding a shell altogether.
| fullCommand := fmt.Sprintf("/usr/local/bin/image-committer %s %s %s", | |
| var containerSpecs []string | |
| for _, cs := range containerSnapshots { | |
| spec := fmt.Sprintf("%s:%s", cs.ContainerName, cs.ImageURI) | |
| containerSpecs = append(containerSpecs, spec) | |
| } | |
| // Build the job |
| Name: CommitJobContainerName, | ||
| Image: imageCommitterImage, | ||
| ImagePullPolicy: corev1.PullIfNotPresent, | ||
| Command: []string{"/bin/sh", "-c"}, |
There was a problem hiding this comment.
[P1] Avoid using shell for executing image-committer
Directly execute the binary without a shell wrapper to mitigate injection risks.
| Command: []string{"/bin/sh", "-c"}, | |
| Command: []string{"/usr/local/bin/image-committer"}, | |
| Args: append([]string{snapshot.Status.SourcePodName, snapshot.Namespace}, containerSpecs...), |
|
|
||
| // Set up signal handler to ensure all paused containers are resumed on exit | ||
| c := make(chan os.Signal, 1) | ||
| signal.Notify(c, os.Interrupt, syscall.SIGTERM) |
There was a problem hiding this comment.
[P2] Unreliable container unpause on SIGKILL
resumeAllPausedContainers is registered for SIGTERM and Interrupt, but not SIGKILL. If the Job Pod is killed abruptly (e.g., node shutdown or OOM), containers in the source Pod will remain paused forever. This could break the sandbox state.
Consider adding a fallback mechanism in the controller to ensure containers are unpaused if the Job fails or times out.
| } | ||
| registryHost := imageParts[0] | ||
|
|
||
| isInsecure := strings.Contains(registryHost, "local") || |
There was a problem hiding this comment.
[P2] Limited insecure registry detection
The current logic for isInsecure only checks for specific keywords like local or certain IP prefixes. This may fail for internal corporate registries using private domain names or other private IP ranges like 172.16.x.x.
Consider making "insecure" a flag passed from the server config to the controller, and then to image-committer to make it more robust.
| Value: ContainerdSocketPath, | ||
| }, | ||
| }, | ||
| SecurityContext: &corev1.SecurityContext{ |
There was a problem hiding this comment.
[P1] Highly Privileged Job Execution
This Job runs as root and mounts the host's containerd.sock, which effectively gives the Job Pod root access to the entire node.
It is critical that the ImageCommitterImage is strictly controlled and sourced only from a trusted repository to prevent privilege escalation.
| } | ||
|
|
||
| // handleDeletion handles the deletion of a SandboxSnapshot | ||
| func (r *SandboxSnapshotReconciler) handleDeletion(ctx context.Context, snapshot *sandboxv1alpha1.SandboxSnapshot) (ctrl.Result, error) { |
There was a problem hiding this comment.
[P2] Resource Leak - Snapshot Images
The handleDeletion logic cleans up the K8s Job, but it doesn't clean up the OCI images in the registry. This will lead to storage leakage in the registry over time.
While cleaning up registry images can be complex, this limitation should at least be documented, and perhaps a periodic GC mechanism should be considered in the future.
| ) | ||
|
|
||
| if not workload: | ||
| snapshot = self.snapshot_provider.get_snapshot(sandbox_id, self.namespace) |
There was a problem hiding this comment.
这里应该不是所有的get sandbox请求都需要去获取snapshot这个对象吧?pause和resume的场景比例应该比较小。
| "message": "Pause operation is not supported in Kubernetes runtime", | ||
| }, | ||
| # 1. Get BatchSandbox | ||
| batchsandbox = self.workload_provider.get_workload(sandbox_id, self.namespace) |
There was a problem hiding this comment.
这里不应该有batch sandbox的概念,batch sandbox只是sandbox service的一种实现
| // Spec only contains user input fields (filled by Server), Controller never writes to spec. | ||
| type SandboxSnapshotSpec struct { | ||
| // SandboxID is the stable public identifier for the sandbox. | ||
| SandboxID string `json:"sandboxId"` |
There was a problem hiding this comment.
这个直接叫SnapshotID怎么样?这样Snapshot独立于某个Sandbox之外,sandbox的pause和resume只是利用snapshot能力实现的一种在sandbox上的功能。
Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
CRD and nerdctl-based image committer for Kubernetes runtime.
Summary
Testing
Breaking Changes
Checklist