Skip to content

feat(server,k8s): implement pause/resume with rootfs snapshot support#668

Open
fengcone wants to merge 15 commits intoalibaba:mainfrom
fengcone:feature/public-k8s-pause-resume
Open

feat(server,k8s): implement pause/resume with rootfs snapshot support#668
fengcone wants to merge 15 commits intoalibaba:mainfrom
fengcone:feature/public-k8s-pause-resume

Conversation

@fengcone
Copy link
Copy Markdown
Collaborator

@fengcone fengcone commented Apr 8, 2026

  • Add PausePolicy schema and API endpoints (/pause, /resume)
  • Add image_pull_policy configuration for sandbox containers
  • Support pausePolicy in pool-based BatchSandbox creation
  • Add SandboxSnapshot CRD and controller for snapshot lifecycle
  • Add image-committer Job for container rootfs snapshot/commit
  • Update AGENTS.md and Makefile for new components

Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
CRD and nerdctl-based image committer for Kubernetes runtime.

Summary

  • What is changing and why?

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

  - Add PausePolicy schema and API endpoints (/pause, /resume)
  - Add image_pull_policy configuration for sandbox containers
  - Support pausePolicy in pool-based BatchSandbox creation
  - Add SandboxSnapshot CRD and controller for snapshot lifecycle
  - Add image-committer Job for container rootfs snapshot/commit
  - Update AGENTS.md and Makefile for new components

  Implements OSEP-0008 pause/resume functionality using SandboxSnapshot
  CRD and nerdctl-based image committer for Kubernetes runtime.
@fengcone
Copy link
Copy Markdown
Collaborator Author

fengcone commented Apr 8, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f6a1f854c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread kubernetes/apis/sandbox/v1alpha1/batchsandbox_types.go Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
fengcone and others added 4 commits April 8, 2026 08:49
- Rename snapshotPushSecretName -> snapshotPushSecret
- Rename resumeImagePullSecretName -> resumeImagePullSecret
- Fix ruff lint errors (F541, F841, F401)
- Regenerate CRDs from Go types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd use generation-based reconciliation

- Move controller-managed fields (sourcePodName, sourceNodeName
  resumeTemplate, containerSnapshots, pause/resume versions) from
  spec to status so spec is purely user input
- Replace version-based dispatch with Kubernetes generation pattern
  spec.action (Pause/Resume) signals intent, observedGeneration
  tracks controller progress
- Remove PausePolicy from BatchSandbox CRD; pause config now comes
  from server-level configuration
- Add fail-fast spec validation before starting pause cycles
- Add RBAC permission for reading secrets
- Improve delete_sandbox error handling (proper 404/500 responses
- Expand E2E tests with error-case coverage
# Conflicts:
#	server/tests/k8s/test_batchsandbox_provider.py
#	server/tests/k8s/test_kubernetes_service.py
@fengcone
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a13c3a542

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/sandboxsnapshot_controller.go Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 486658b3b4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fengcone
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e302bce73

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread kubernetes/internal/controller/sandboxsnapshot_controller.go Outdated
@fengcone fengcone force-pushed the feature/public-k8s-pause-resume branch from b2cc1dc to 7fa689d Compare April 12, 2026 14:50
@fengcone
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7fa689d222

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/sandboxsnapshot_controller.go Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@fengcone
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f86fd2a96b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@fengcone
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7474fa0343

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/sandboxsnapshot_controller.go
Comment thread kubernetes/internal/controller/sandboxsnapshot_controller.go
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py
Comment thread kubernetes/README.md
creation-date: 2026-03-11
last-updated: 2026-03-13
status: implementing
status: implemented
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an user example

Copy link
Copy Markdown
Collaborator

@hittyt hittyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implements OSEP-0008 pause/resume functionality using a new SandboxSnapshot CRD and nerdctl-based image committer. The implementation is well-structured but has some security and reliability concerns that need addressing before merging. One notable architectural gap is that renew_expiration currently only works on active BatchSandbox resources and will fail with 404 for paused sandboxes; it should be updated to handle snapshots as well.

spec := fmt.Sprintf("%s:%s", cs.ContainerName, cs.ImageURI)
containerSpecs = append(containerSpecs, spec)
}
fullCommand := fmt.Sprintf("/usr/local/bin/image-committer %s %s %s",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Potential Shell Injection Vulnerability

Building fullCommand using fmt.Sprintf and then executing it via /bin/sh -c is risky, as fields like SourcePodName or Namespace could potentially contain shell-special characters.

Since image-committer is a compiled binary, it's safer to run it directly by passing arguments to the Args list and avoiding a shell altogether.

Suggested change
fullCommand := fmt.Sprintf("/usr/local/bin/image-committer %s %s %s",
var containerSpecs []string
for _, cs := range containerSnapshots {
spec := fmt.Sprintf("%s:%s", cs.ContainerName, cs.ImageURI)
containerSpecs = append(containerSpecs, spec)
}
// Build the job

Name: CommitJobContainerName,
Image: imageCommitterImage,
ImagePullPolicy: corev1.PullIfNotPresent,
Command: []string{"/bin/sh", "-c"},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Avoid using shell for executing image-committer

Directly execute the binary without a shell wrapper to mitigate injection risks.

Suggested change
Command: []string{"/bin/sh", "-c"},
Command: []string{"/usr/local/bin/image-committer"},
Args: append([]string{snapshot.Status.SourcePodName, snapshot.Namespace}, containerSpecs...),


// Set up signal handler to ensure all paused containers are resumed on exit
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Unreliable container unpause on SIGKILL

resumeAllPausedContainers is registered for SIGTERM and Interrupt, but not SIGKILL. If the Job Pod is killed abruptly (e.g., node shutdown or OOM), containers in the source Pod will remain paused forever. This could break the sandbox state.

Consider adding a fallback mechanism in the controller to ensure containers are unpaused if the Job fails or times out.

}
registryHost := imageParts[0]

isInsecure := strings.Contains(registryHost, "local") ||
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Limited insecure registry detection

The current logic for isInsecure only checks for specific keywords like local or certain IP prefixes. This may fail for internal corporate registries using private domain names or other private IP ranges like 172.16.x.x.

Consider making "insecure" a flag passed from the server config to the controller, and then to image-committer to make it more robust.

Value: ContainerdSocketPath,
},
},
SecurityContext: &corev1.SecurityContext{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Highly Privileged Job Execution

This Job runs as root and mounts the host's containerd.sock, which effectively gives the Job Pod root access to the entire node.

It is critical that the ImageCommitterImage is strictly controlled and sourced only from a trusted repository to prevent privilege escalation.

}

// handleDeletion handles the deletion of a SandboxSnapshot
func (r *SandboxSnapshotReconciler) handleDeletion(ctx context.Context, snapshot *sandboxv1alpha1.SandboxSnapshot) (ctrl.Result, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Resource Leak - Snapshot Images

The handleDeletion logic cleans up the K8s Job, but it doesn't clean up the OCI images in the registry. This will lead to storage leakage in the registry over time.

While cleaning up registry images can be complex, this limitation should at least be documented, and perhaps a periodic GC mechanism should be considered in the future.

)

if not workload:
snapshot = self.snapshot_provider.get_snapshot(sandbox_id, self.namespace)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该不是所有的get sandbox请求都需要去获取snapshot这个对象吧?pause和resume的场景比例应该比较小。

"message": "Pause operation is not supported in Kubernetes runtime",
},
# 1. Get BatchSandbox
batchsandbox = self.workload_provider.get_workload(sandbox_id, self.namespace)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里不应该有batch sandbox的概念,batch sandbox只是sandbox service的一种实现

// Spec only contains user input fields (filled by Server), Controller never writes to spec.
type SandboxSnapshotSpec struct {
// SandboxID is the stable public identifier for the sandbox.
SandboxID string `json:"sandboxId"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个直接叫SnapshotID怎么样?这样Snapshot独立于某个Sandbox之外,sandbox的pause和resume只是利用snapshot能力实现的一种在sandbox上的功能。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants