fix(jobs): align executor kind/provider with AIRCORE-397 execution model#335
fix(jobs): align executor kind/provider with AIRCORE-397 execution model#335matthewgrossman wants to merge 20 commits into
Conversation
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
|
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…-plan/mgrossman Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
| profile: str | None = None | ||
| options: dict | None = None |
There was a problem hiding this comment.
why do we need options; how is that relevant to this PR?
| # execution profiles. | ||
|
|
||
| _CACHE_TTL_SECONDS = 300 # 5 minutes | ||
| _cache: dict[str, Any] = {"profiles": None, "fetched_at": 0.0} |
There was a problem hiding this comment.
why not use a typeddict here?
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
- Fix local.yaml configs: provider=subprocess → provider=cpu, profile=subprocess - Remove unused `options` field from BaseJobRequest - Type the execution profiles cache properly (no more dict[str, Any]) - Rename stale FactoryCPUExecutionProviderSpec alias in conftest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…iles - Fix import sorting (I001) in test files - Remove unused SubprocessExecutionProvider import (F401) in test_jobs_api - Re-export FilesetMetadata from types.files (moved to types.shared by SDK regen) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
SDK type regeneration changed type signatures, making some existing `# type: ignore` comments unnecessary. Add the correct ty rule name to the CI ignore list alongside the existing unused-ignore-comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…n agent tests - Add kind="container" to all 15 executor dicts in e2e/test_jobs.py - Add type: ignore[invalid-key] for subprocess command assertions in agent tests (ty can't narrow the Executor union from dict key access) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…tegration/search tests - e2e/test_jobs.py: convert all executor blocks from container to subprocess format (these run host commands, not containers) - integration tests: add kind="container" to executor dicts - test_job_search.py: add kind="container" to all 25+ executor dicts - test_jobs_api.py: add kind="container" to secrets test executor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
…-plan/mgrossman Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
The framework now resolves the executor kind ("container" or "subprocess")
from the submitter's profile before calling compile(). Compilers receive
both `kind` and `profile` as parameters, so they can emit the correct
executor shape without querying execution profiles themselves.
- _compile_platform_spec resolves kind via resolve_profile_kind() once
- Compiler signature: added kind and profile as last two positional args
- NemoJob.compile() receives kind and profile from the framework
- stamp_profile() moved from _adapt_compile into _compile_platform_spec
- Data designer compiler uses kind to support both container/subprocess
- All plugin compilers and test compilers updated for new signature
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
Merge main's try/finally secret cleanup pattern with our subprocess executor format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
kind should never be None — container is the natural default. Updated all compile() signatures and compiler functions to use kind: str = "container" instead of kind: str | None = None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Grossman <mgrossman@nvidia.com>
📝 WalkthroughWalkthroughReplaces the ChangesExecutor kind discriminator refactor
Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py (1)
73-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate executor
kindagainst the selected execution profile.Current validation only checks
provider/profile. After this refactor, that allows shape/profile mismatches (e.g., subprocess payload validated against container profile), which can reach the wrong backend and fail at runtime.Proposed fix
def validate_job_spec( job_spec: PlatformJobSpec, execution_profiles: list[ExecutionProfileT], ) -> None: @@ if not execution_profile: raise HTTPException( status_code=status.HTTP_422_UNPROCESSABLE_CONTENT, detail=f"The execution profile '{step.executor.provider}/{profile_name}' specified in step '{step.name}' does not exist.", ) + if step.executor.kind != execution_profile.kind: + raise HTTPException( + status_code=status.HTTP_422_UNPROCESSABLE_CONTENT, + detail=( + f"Executor kind '{step.executor.kind}' in step '{step.name}' " + f"does not match execution profile kind '{execution_profile.kind}' " + f"for '{step.executor.provider}/{profile_name}'." + ), + )Also applies to: 123-123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py` around lines 73 - 94, The validate_job_spec function currently only validates that the execution profile exists by checking provider and profile name, but does not validate that the executor kind matches the execution profile. Add a validation check after confirming the execution_profile exists to ensure that step.executor.kind is compatible with the selected execution_profile. This check should prevent mismatches (such as a subprocess executor validated against a container profile) from passing validation and causing runtime failures on the wrong backend.packages/nmp_platform/config/local.yaml (1)
51-63:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale
jobs.executorscomment block.This comment still describes removed CPU→subprocess translation and
profile: default. It now conflicts with current behavior (cpu/subprocessprofile and no endpoint rewrite).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nmp_platform/config/local.yaml` around lines 51 - 63, The multi-line comment block describing the subprocess executor registration is outdated and references removed functionality. Update this comment to accurately reflect the current behavior: remove references to the old "cpu/default" plugin steps translation and the removed `translate_cpu_container_steps_to_subprocess` endpoint, and instead describe the current "cpu/subprocess" profile usage and how it works without the endpoint rewrite. Ensure the updated comment clearly explains why this configuration exists and how deployers can extend it with additional profiles for Docker or Kubernetes setups.services/core/jobs/tests/conftest.py (1)
483-501:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStop ignoring
kindandprofileinhello_world_job_config.The function accepts
kind/profilebut always emitskind="container"andprofile="default", so non-default inputs are silently dropped and compile-path propagation is not actually exercised.Suggested fix
def hello_world_job_config( @@ sdk, kind: str = "container", profile: str | None = None, ) -> FactoryPlatformJobSpec: + if kind != "container": + raise ValueError(f"hello_world_job_config only supports kind='container', got '{kind}'") + return FactoryPlatformJobSpec( steps=[ FactoryPlatformJobStep( name="hello-world-step-1", executor=FactoryContainerExecutionProviderSpec( - kind="container", + kind=kind, provider="cpu", - profile="default", + profile=profile or "default", container=FactoryContainerSpec(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/jobs/tests/conftest.py` around lines 483 - 501, The hello_world_job_config function accepts kind and profile parameters but ignores them by using hardcoded values (kind="container" and profile="default") when constructing the FactoryContainerExecutionProviderSpec. Replace the hardcoded kind="container" with the kind parameter passed to the function, and replace the hardcoded profile="default" with the profile parameter (using a sensible default like "default" if profile is None). This ensures the function actually uses the provided parameters instead of silently dropping them.plugins/nemo-anonymizer/src/nemo_anonymizer_plugin/jobs/run.py (1)
102-112:⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
RunJob.compileis missing the newkindkwarg and will crash.The route adapter now calls
NemoJob.compile(..., kind=..., profile=...). This method does not acceptkind, so requests can fail withTypeError.Suggested fix
async def compile( cls, *, workspace: str, spec: BaseModel, # AnonymizerStepConfig entity_client: object, job_name: str | None, async_sdk: AsyncNeMoPlatform, + kind: str = "container", profile: str | None = None, options: dict | None = None, ) -> PlatformJobSpec: + if kind != "container": + raise PlatformJobCompilationError(f"Unsupported executor kind for RunJob: {kind}") return PlatformJobSpec(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-anonymizer/src/nemo_anonymizer_plugin/jobs/run.py` around lines 102 - 112, The `RunJob.compile` classmethod is missing the `kind` parameter that the route adapter is now passing to it. Add `kind` as a new kwarg parameter to the `compile` method signature (you can place it alongside the existing `profile` parameter in the method definition), and ensure it accepts a string value. This will prevent the TypeError that occurs when the route adapter attempts to pass the `kind` argument to this method.plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py (1)
172-183:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
kindis accepted but ignored, so non-container profiles compile incorrectly.This method always calls
compile_evaluate_job(..., profile=...), which emits a container executor. If the resolved kind issubprocess, the generated spec is inconsistent with the selected execution profile.Suggested fix
from nemo_platform_plugin.jobs.api_factory import PlatformJobSpec +from nemo_platform_plugin.jobs.exceptions import PlatformJobCompilationError @@ ) -> PlatformJobSpec: """Compile canonical spec to a plugin-native evaluator job.""" - del workspace, entity_client, job_name, async_sdk, options + del workspace, entity_client, job_name, async_sdk, options + if kind != "container": + raise PlatformJobCompilationError(f"Unsupported executor kind for EvaluateJob: {kind}") from nemo_evaluator.jobs.compiler import compile_evaluate_job🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py` around lines 172 - 183, The `kind` parameter is accepted in the method signature but never used when calling `compile_evaluate_job`, causing non-container execution profiles to be compiled incorrectly with a container executor regardless of the specified kind. Pass the `kind` parameter to the `compile_evaluate_job` function call alongside the `profile` parameter so that the function can generate the appropriate executor type (subprocess or container) based on the requested kind value.packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.py (1)
696-707:⚠️ Potential issue | 🟠 Major | ⚡ Quick winForward the route’s compute provider into kind resolution.
create_jobleavesdefault_provideras"cpu", so GPU jobs resolverequest.profileagainst CPU profiles. Expose the provider on the factory and have the NemoJob route adapter passjob_cls.execution_provider.Proposed fix
def job_route_factory( service_name: str, job_type: str, job_input: JobSchemaLike, platform_job_config_compiler: PlatformJobSpecCompiler[JobInputT, JobOutputT] @@ job_output: JobSchemaLike | None = None, input_to_output: InputToOutputTransformer | InputToOutputTransformerAsync | None = None, generate_job_name: JobNameGenerator | None = None, + default_provider: str = "cpu", ) -> APIRouter: @@ generate_job_name: Called when user doesn't provide a job name. Returns the auto-generated name to use. + default_provider: Compute provider used to resolve request.profile to executor kind. @@ service_name, sdk, profile=request.profile, + default_provider=default_provider, )Also applies to: 841-851
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.py` around lines 696 - 707, The job_route_factory function needs to accept the compute provider as a parameter so that GPU jobs can properly resolve profiles against GPU profiles instead of CPU profiles. Add a new parameter to the job_route_factory function signature (such as execution_provider or compute_provider) to accept the provider information, then update all callers of job_route_factory (particularly the NemoJob route adapter mentioned in the comment range 841-851) to pass job_cls.execution_provider when invoking the factory. Finally, use this provider parameter in the kind resolution logic instead of leaving default_provider as "cpu".plugins/nemo-safe-synthesizer/src/nemo_safe_synthesizer_plugin/api/v2/jobs/endpoints.py (1)
55-67:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
kind/profileare accepted but not applied to emitted executors.Line 111-Line 112 adds compiler inputs, but Line 67/Line 92 still force config-driven profile and Line 185 never forwards
kind/profileinto step construction. That makes request-level executor selection non-functional.Suggested patch
-def _create_job_step(job_config: SafeSynthesizerJobConfig, environment: list[EnvironmentVariable]) -> PlatformJobStep: +def _create_job_step( + job_config: SafeSynthesizerJobConfig, + environment: list[EnvironmentVariable], + *, + kind: str, + profile: str | None, +) -> PlatformJobStep: + expected_kind = "subprocess" if config.job_mode == "subprocess-local" else "container" + if kind != expected_kind: + raise PlatformJobCompilationError( + f"Incompatible executor kind {kind!r} for job_mode {config.job_mode!r}" + ) + resolved_profile = profile or config.job_executor_profile + if config.job_mode == "subprocess-local": ... return PlatformJobStep( name="safe-synthesizer", executor=SubprocessExecutionProviderSpec( kind="subprocess", provider="cpu", - profile=config.job_executor_profile, + profile=resolved_profile, command=command, ), ... ) ... return PlatformJobStep( name="safe-synthesizer", executor=ContainerExecutionProviderSpec( kind="container", provider="gpu", - profile=config.job_executor_profile, + profile=resolved_profile, container=ContainerSpec( image=get_qualified_image(config.container_image), entrypoint=config.entrypoint, ), resources=resources, ), ... ) - steps.append(_create_job_step(job_config=transformed_spec, environment=environment)) + steps.append( + _create_job_step( + job_config=transformed_spec, + environment=environment, + kind=kind, + profile=profile, + ) + )Also applies to: 89-93, 111-113, 185-185
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-safe-synthesizer/src/nemo_safe_synthesizer_plugin/api/v2/jobs/endpoints.py` around lines 55 - 67, The _create_job_step function is not using the request-level executor configuration parameters (kind and profile) when constructing the SubprocessExecutionProviderSpec. Currently it hardcodes kind as "subprocess" and derives profile from the config object instead of using values provided in the request. You need to modify _create_job_step to accept kind and profile as parameters (or extract them from the job_config), and then use these request-level values to populate the kind and profile fields when instantiating the SubprocessExecutionProviderSpec in the return statement, ensuring that caller code also passes these parameters when invoking _create_job_step.services/core/jobs/src/nmp/core/jobs/app/providers.py (1)
134-138: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUse
Selfinstead of string-based forward reference.
schemas.pyalready usesSelffor the same pattern. Import it and replace the string annotation.-from typing import Annotated, Literal, Union +from typing import Annotated, Literal, Self, Union`@model_validator`(mode="after") - def validate_command(self) -> "SubprocessExecutionProvider": + def validate_command(self) -> Self:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/jobs/src/nmp/core/jobs/app/providers.py` around lines 134 - 138, The validate_command method in the SubprocessExecutionProvider class uses a string-based forward reference ("SubprocessExecutionProvider") in its return type annotation instead of the Self type that is already used elsewhere in schemas.py. Import Self from typing_extensions or typing at the top of the file, then replace the string annotation "SubprocessExecutionProvider" with Self in the return type of the validate_command method to align with the codebase pattern.Source: Coding guidelines
🧹 Nitpick comments (3)
plugins/nemo-agents/tests/unit/test_improvement_jobs.py (1)
89-90: ⚡ Quick winAlso assert executor
kindin these subprocess compile tests.These cases now validate
provider, but they should also validate the discriminator (kind="subprocess") to protect the new kind/provider split.Proposed assertion additions
- assert step["executor"]["provider"] == "cpu" + assert step["executor"]["kind"] == "subprocess" + assert step["executor"]["provider"] == "cpu"executor = step["executor"] + assert executor.get("kind") == "subprocess" assert executor.get("provider") == "cpu"Also applies to: 202-203, 282-283, 331-332
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-agents/tests/unit/test_improvement_jobs.py` around lines 89 - 90, The subprocess compile tests in test_improvement_jobs.py are missing assertions to validate the executor kind field. Add assertions that verify step["executor"]["kind"] == "subprocess" in all subprocess-related test cases (at lines 89-90, 202-203, 282-283, and 331-332) to ensure the discriminator is properly set after the kind/provider split, protecting against regressions in the executor configuration structure.packages/nemo_platform_plugin/tests/test_jobs_routes.py (1)
279-285: ⚡ Quick winAssert adapter forwards
kindandprofile.This test only checks return type, so a regression in
_adapt_compileargument forwarding can pass.Proposed test hardening
`@pytest.mark.asyncio` async def test_compile_adapter_invokes_nemo_compile() -> None: - adapter = _adapt_compile(_WidgetJob, default_profile="research") + seen: dict[str, object] = {} + + class CaptureJob(_WidgetJob): + `@classmethod` + async def compile(cls, **kwargs): + seen.update(kwargs) + return _FakePlatformSpec(steps=[_FakeStep(profile=None)]) + + adapter = _adapt_compile(CaptureJob, default_profile="research") spec = _WidgetSpec(name="w") # Adapter receives kind and profile from _compile_platform_spec. # Profile stamping is now done by _compile_platform_spec, not the adapter. platform_spec = await adapter("ws", spec, spec, "entity_client", "job-1", "sdk", "container", "research") assert isinstance(platform_spec, _FakePlatformSpec) + assert seen["kind"] == "container" + assert seen["profile"] == "research"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_plugin/tests/test_jobs_routes.py` around lines 279 - 285, The test_compile_adapter_invokes_nemo_compile test only verifies the return type of the adapter but does not validate that the _adapt_compile adapter correctly forwards the kind and profile arguments to the underlying compile function. Add assertions that verify the adapter properly forwards both the kind argument ("ws") and the profile argument ("research") to the compile function being wrapped, either by mocking the underlying compile function and checking the calls it receives, or by other verification means that ensure a regression in argument forwarding would be caught by this test.packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.py (1)
50-75: ⚡ Quick winReplace
TYPE_CHECKING/string annotations with concrete runtime types.This adapter path was touched and still uses deferred string hints plus
TYPE_CHECKINGimports for runtime-available types; alsokindis typed as broadstrinstead of a concrete executor-kind type.Suggested fix
-from typing import TYPE_CHECKING, Any +from collections.abc import Callable +from typing import Any @@ -from nemo_platform_plugin.job import job_collection_path_for +from nemo_platform_plugin.job import NemoJob, job_collection_path_for @@ -if TYPE_CHECKING: - from collections.abc import Callable - - from nemo_platform_plugin.job import NemoJob @@ -def add_job_routes( - job_cls: type["NemoJob"], +def add_job_routes( + job_cls: type[NemoJob], @@ - generate_job_name: "Callable[..., str] | None" = None, + generate_job_name: Callable[..., str] | None = None, @@ -def _adapt_compile( - job_cls: type["NemoJob"], +def _adapt_compile( + job_cls: type[NemoJob], @@ -) -> "Callable[..., Any]": +) -> Callable[..., Any]: @@ - kind: str = "container", + kind: ExecutorKind = "container",As per coding guidelines,
**/*.pyrequires concrete type hints (not string-based) and avoidingTYPE_CHECKINGtype imports when possible, andpackages/nemo_platform_plugin/**/*.pyrepeats the same rule for runtime-available same-package types.Also applies to: 253-257, 279-281
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.py` around lines 50 - 75, The code uses TYPE_CHECKING with string annotations for runtime-available types, which violates the coding guidelines requiring concrete type hints. Remove the TYPE_CHECKING import block entirely and import Callable directly from collections.abc at the top of the file alongside other imports. Import NemoJob directly from nemo_platform_plugin.job instead of using it as a string annotation. Replace all string-quoted type hints in the add_job_routes function signature (such as "NemoJob", "Callable[..., str] | None", and "str") with concrete runtime type references by removing the quotes. Additionally, replace any broad str type hints with a concrete executor-kind type instead. Apply these same changes to the other affected locations mentioned at lines 253-257 and 279-281.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e/test_jobs.py`:
- Around line 3-4: The tests in test_jobs.py that use
SubprocessExecutionProviderSpec depend on the cpu/subprocess execution profile
which does not exist in the CI environment, causing create-job operations to
fail with a 422 error. Fix this by adding a preflight check before the
subprocess-dependent tests (including those at lines 57-60) that verifies the
cpu/subprocess profile is available in the execution environment, and skip those
tests with an appropriate skip marker if the profile is absent. Alternatively,
provision the cpu/subprocess profile in the e2e runtime configuration so that it
is available when tests execute.
In `@openapi/ga/individual/platform.openapi.yaml`:
- Around line 9557-9565: The kind enum field for container backend profiles
currently allows both container and subprocess options, but since Docker,
Kubernetes, and Volcano backends are container-only in this refactor, the enum
should only permit the container value. Remove subprocess from the enum list in
the kind field definition to prevent subprocess executors from being routed to
container backends. Apply this same change at all locations mentioned in the
comment (lines 12106-12114 and 18062-18070).
- Around line 16974-16989: The schema currently constrains the provider field to
only "cpu" using a const constraint, but the registry supports both cpu and gpu
providers with subprocess kind. Remove the const constraint on the provider
field and replace it with an enum that includes both "cpu" and "gpu" values to
allow the schema to properly represent all supported provider/subprocess
combinations. Additionally, consider whether the profile default value of
"default" is appropriate for all provider configurations or if it should be made
more flexible.
In `@openapi/ga/openapi.yaml`:
- Around line 9557-9565: The `kind` field in `DockerJobExecutionProfile`,
`KubernetesJobExecutionProfile`, and `VolcanoJobExecutionProfile` currently
allows both `container` and `subprocess` as enum options, but these backends
only support `ContainerExecutionProvider` which requires `container`. Replace
the `enum: [container, subprocess]` and `default: container` with `const:
container` in all three profile definitions (at lines 9557-9565, 12106-12114,
and 18062-18070) to constrain these backends to only accept the container
executor shape they can handle.
- Around line 16974-16978: The SubprocessExecutionProvider provider field
currently uses const: cpu which restricts the value to only CPU, but
SubprocessExecutionProvider should allow both cpu and gpu values to support GPU
subprocess execution profiles. Replace the const: cpu constraint in the
SubprocessExecutionProvider schema with an enum that includes both cpu and gpu
values to allow both provider types.
In `@openapi/openapi.yaml`:
- Around line 9557-9565: The kind field in DockerJobExecutionProfile,
KubernetesJobExecutionProfile, and VolcanoJobExecutionProfile currently allows
both "container" and "subprocess" options, but these container-based backends
should only support the "container" executor type. Remove "subprocess" from the
enum list in the kind field definition for all three profiles (found at the
locations specified in the comment), leaving only "container" as a valid option
while keeping it as the default value.
- Around line 16946-16952: The SubprocessJobExecutionProfile schema's provider
field enum is missing the gpu option, which means the /execution-profiles
endpoint cannot represent GPU subprocess routes even though
SubprocessExecutionProvider supports it. Add gpu to the enum list for the
provider field in the SubprocessJobExecutionProfile schema (around the provider
type definition that currently only lists cpu), and apply the same fix to the
other provider enum definition mentioned at lines 16974-16978 to ensure both
schemas are consistent and support gpu as a valid provider option.
- Around line 8689-8717: The `kind` field is used as a discriminator for the
executor variants but is not included in the required fields for the schema
definitions. This allows ambiguous payloads that could match multiple oneOf
branches without an explicit kind value. Add `kind` to the `required` array in
the ContainerExecutionProviderInput, ContainerExecutionProviderOutput, and
SubprocessExecutionProvider schema definitions in the OpenAPI specification to
ensure the discriminator field is always present and prevents ambiguous matches.
In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.py`:
- Around line 655-662: The exception handling block that catches
PlatformJobCompilationError when resolve_profile_kind fails is silently falling
back to "container" mode, which causes incorrect payload shapes to be sent
downstream when a user explicitly selects a profile that cannot be resolved.
Instead of catching and swallowing this error, allow the
PlatformJobCompilationError to propagate through the exception handler so it
properly returns a 422 error to the client. Alternatively, restrict the fallback
to "container" to only apply when using legacy default-only behavior (when no
explicit profile was selected), and let errors from explicitly selected profiles
raise the compilation error.
In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/profiles.py`:
- Around line 66-69: The call to sdk.jobs.list_execution_profiles() can raise
exceptions that are not caught by the fallback error handler in
api_factory._compile_platform_spec, causing submission to fail instead of
falling back safely. Wrap the sdk.jobs.list_execution_profiles() call in a
try-except block and catch any exceptions that occur, then re-raise them as
PlatformJobCompilationError so they are properly handled by the fallback
mechanism in the caller.
In `@packages/nemo_platform_plugin/tests/test_jobs_filter.py`:
- Around line 45-46: The _fake_compiler function signature is too long and
violates code formatting rules. Reformat the function definition by breaking the
parameters across multiple lines, grouping related parameters together or
placing each parameter on its own line to comply with the style guidelines and
allow the lint CI to pass.
In `@packages/nmp_common/tests/api_factory/test_api_factory.py`:
- Line 1700: The function signatures for the compiler function and related stub
functions (at lines 1712, 1723, and 1738-1740) are exceeding the maximum line
length allowed by the style formatter, causing lint failures. Reformat these
function definitions by breaking the long parameter lists across multiple lines,
with each parameter or logical group of parameters on its own line, ensuring
proper indentation alignment. This applies to the compiler function definition
and all other similar stub function signatures mentioned in the comment.
In `@packages/nmp_testing/src/nmp/testing/jobs.py`:
- Around line 33-37: The predicate in the any() check for detecting an existing
subprocess profile is incomplete and may match GPU subprocess executors when
looking for CPU ones. Add an additional condition to verify that
getattr(executor, "provider", None) == "cpu" alongside the existing backend and
profile checks within the any() function. This ensures that
SubprocessJobExecutionProfile insertion is only skipped when a CPU subprocess
executor with the matching profile already exists, not when a GPU subprocess
executor exists.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/analyze_batch.py`:
- Around line 67-69: The compile path accepts kind and profile parameters at
lines 67-69 but does not use them when creating the executor at lines 110-114,
causing request-level profile selection to be lost and invalid kind values to be
silently ignored. Update the executor instantiation to pass the profile
parameter to maintain the request-level profile selection, and add validation
logic to reject invalid kind values before the executor is created. Ensure both
parameters are properly propagated from the function signature to the executor
creation code.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_agent.py`:
- Around line 161-162: The function parameter kind is accepted at line 161 but
ignored when creating the executor in lines 185-188, where subprocess executor
is always used regardless of the kind value. Add validation early in the
function to check if the kind parameter matches what is actually supported
(subprocess), and raise an appropriate error if a different kind like
"container" is provided, preventing a mismatch between the requested execution
kind and the compiled output that could cause failures in backend routing and
validation.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_suite.py`:
- Around line 140-141: The function accepts a kind parameter but always compiles
the step as subprocess regardless of the kind value provided, causing runtime
failures for invalid kind values. Add an explicit validation guard during the
compilation phase (around lines 205-207) that ensures the kind parameter is
either not provided or explicitly matches "subprocess". If an invalid kind value
is passed that doesn't align with the subprocess compilation behavior, raise an
error immediately during compilation rather than at execution time.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_agent.py`:
- Around line 141-142: The `kind` parameter in the function signature can be set
to "container" but the code at lines 168-170 always executes as subprocess,
creating a mismatch between the parameter value and actual behavior. Add a guard
clause at the beginning of the function that validates `kind == "subprocess"`,
raising an appropriate error or exception if this condition is not met. This
compile-time check prevents invalid executor and profile combinations where the
caller expects container execution but subprocess is always used instead.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_skills.py`:
- Around line 90-91: The function accepts a kind parameter with a default value
of "container" on line 90, but the implementation at lines 151-153 hardcodes
"subprocess" usage, creating a mismatch. Add early validation in the compile
function to check that kind equals "subprocess" and raise a compile error (4xx)
if it does not, ensuring profile configuration mistakes are caught immediately
rather than causing runtime issues downstream.
In `@services/hello-world/src/nmp/hello_world/api/v2/jobs/endpoints.py`:
- Around line 27-29: The function accepts kind and profile parameters but
ignores them, instead hardcoding kind="container" and profile="default" when
creating the PlatformJobSpec (around lines 46-50). Replace the hardcoded values
with the actual kind and profile parameters that were passed into the function
to ensure proper profile routing is respected.
---
Outside diff comments:
In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.py`:
- Around line 696-707: The job_route_factory function needs to accept the
compute provider as a parameter so that GPU jobs can properly resolve profiles
against GPU profiles instead of CPU profiles. Add a new parameter to the
job_route_factory function signature (such as execution_provider or
compute_provider) to accept the provider information, then update all callers of
job_route_factory (particularly the NemoJob route adapter mentioned in the
comment range 841-851) to pass job_cls.execution_provider when invoking the
factory. Finally, use this provider parameter in the kind resolution logic
instead of leaving default_provider as "cpu".
In `@packages/nmp_platform/config/local.yaml`:
- Around line 51-63: The multi-line comment block describing the subprocess
executor registration is outdated and references removed functionality. Update
this comment to accurately reflect the current behavior: remove references to
the old "cpu/default" plugin steps translation and the removed
`translate_cpu_container_steps_to_subprocess` endpoint, and instead describe the
current "cpu/subprocess" profile usage and how it works without the endpoint
rewrite. Ensure the updated comment clearly explains why this configuration
exists and how deployers can extend it with additional profiles for Docker or
Kubernetes setups.
In `@plugins/nemo-anonymizer/src/nemo_anonymizer_plugin/jobs/run.py`:
- Around line 102-112: The `RunJob.compile` classmethod is missing the `kind`
parameter that the route adapter is now passing to it. Add `kind` as a new kwarg
parameter to the `compile` method signature (you can place it alongside the
existing `profile` parameter in the method definition), and ensure it accepts a
string value. This will prevent the TypeError that occurs when the route adapter
attempts to pass the `kind` argument to this method.
In `@plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py`:
- Around line 172-183: The `kind` parameter is accepted in the method signature
but never used when calling `compile_evaluate_job`, causing non-container
execution profiles to be compiled incorrectly with a container executor
regardless of the specified kind. Pass the `kind` parameter to the
`compile_evaluate_job` function call alongside the `profile` parameter so that
the function can generate the appropriate executor type (subprocess or
container) based on the requested kind value.
In
`@plugins/nemo-safe-synthesizer/src/nemo_safe_synthesizer_plugin/api/v2/jobs/endpoints.py`:
- Around line 55-67: The _create_job_step function is not using the
request-level executor configuration parameters (kind and profile) when
constructing the SubprocessExecutionProviderSpec. Currently it hardcodes kind as
"subprocess" and derives profile from the config object instead of using values
provided in the request. You need to modify _create_job_step to accept kind and
profile as parameters (or extract them from the job_config), and then use these
request-level values to populate the kind and profile fields when instantiating
the SubprocessExecutionProviderSpec in the return statement, ensuring that
caller code also passes these parameters when invoking _create_job_step.
In `@services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py`:
- Around line 73-94: The validate_job_spec function currently only validates
that the execution profile exists by checking provider and profile name, but
does not validate that the executor kind matches the execution profile. Add a
validation check after confirming the execution_profile exists to ensure that
step.executor.kind is compatible with the selected execution_profile. This check
should prevent mismatches (such as a subprocess executor validated against a
container profile) from passing validation and causing runtime failures on the
wrong backend.
In `@services/core/jobs/src/nmp/core/jobs/app/providers.py`:
- Around line 134-138: The validate_command method in the
SubprocessExecutionProvider class uses a string-based forward reference
("SubprocessExecutionProvider") in its return type annotation instead of the
Self type that is already used elsewhere in schemas.py. Import Self from
typing_extensions or typing at the top of the file, then replace the string
annotation "SubprocessExecutionProvider" with Self in the return type of the
validate_command method to align with the codebase pattern.
In `@services/core/jobs/tests/conftest.py`:
- Around line 483-501: The hello_world_job_config function accepts kind and
profile parameters but ignores them by using hardcoded values (kind="container"
and profile="default") when constructing the
FactoryContainerExecutionProviderSpec. Replace the hardcoded kind="container"
with the kind parameter passed to the function, and replace the hardcoded
profile="default" with the profile parameter (using a sensible default like
"default" if profile is None). This ensures the function actually uses the
provided parameters instead of silently dropping them.
---
Nitpick comments:
In `@packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.py`:
- Around line 50-75: The code uses TYPE_CHECKING with string annotations for
runtime-available types, which violates the coding guidelines requiring concrete
type hints. Remove the TYPE_CHECKING import block entirely and import Callable
directly from collections.abc at the top of the file alongside other imports.
Import NemoJob directly from nemo_platform_plugin.job instead of using it as a
string annotation. Replace all string-quoted type hints in the add_job_routes
function signature (such as "NemoJob", "Callable[..., str] | None", and "str")
with concrete runtime type references by removing the quotes. Additionally,
replace any broad str type hints with a concrete executor-kind type instead.
Apply these same changes to the other affected locations mentioned at lines
253-257 and 279-281.
In `@packages/nemo_platform_plugin/tests/test_jobs_routes.py`:
- Around line 279-285: The test_compile_adapter_invokes_nemo_compile test only
verifies the return type of the adapter but does not validate that the
_adapt_compile adapter correctly forwards the kind and profile arguments to the
underlying compile function. Add assertions that verify the adapter properly
forwards both the kind argument ("ws") and the profile argument ("research") to
the compile function being wrapped, either by mocking the underlying compile
function and checking the calls it receives, or by other verification means that
ensure a regression in argument forwarding would be caught by this test.
In `@plugins/nemo-agents/tests/unit/test_improvement_jobs.py`:
- Around line 89-90: The subprocess compile tests in test_improvement_jobs.py
are missing assertions to validate the executor kind field. Add assertions that
verify step["executor"]["kind"] == "subprocess" in all subprocess-related test
cases (at lines 89-90, 202-203, 282-283, and 331-332) to ensure the
discriminator is properly set after the kind/provider split, protecting against
regressions in the executor configuration structure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 151b02c2-83c7-4b5c-bf8f-c3e7989ea468
⛔ Files ignored due to path filters (31)
sdk/python/nemo-platform/.nmpcontext/openapi.yamlis excluded by!sdk/**sdk/python/nemo-platform/.nmpcontext/stainless.yamlis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/resources/files/api.mdis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/resources/files/filesets.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/resources/jobs/api.mdis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/files/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/files/fileset.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/files/fileset_create_params.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/files/fileset_metadata_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/container_execution_provider.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/container_execution_provider_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/cpu_execution_provider.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/cpu_execution_provider_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/distributed_gpu_execution_provider.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/docker_job_execution_profile.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/e2e_job_execution_profile.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/gpu_execution_provider_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/kubernetes_job_execution_profile.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/platform_job_step_spec.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/platform_job_step_spec_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/subprocess_execution_provider.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/subprocess_execution_provider_param.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/subprocess_job_execution_profile.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/jobs/volcano_job_execution_profile.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/shared/__init__.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/shared/fileset_metadata.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/types/shared_params/fileset_metadata_param.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/api_resources/test_jobs.pyis excluded by!sdk/**sdk/stainless.yamlis excluded by!sdk/**
📒 Files selected for processing (63)
e2e/test_jobs.pyopenapi/ga/individual/platform.openapi.yamlopenapi/ga/openapi.yamlopenapi/openapi.yamlpackages/nemo_platform_plugin/src/nemo_platform_plugin/job.pypackages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.pypackages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/profile.pypackages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/profiles.pypackages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.pypackages/nemo_platform_plugin/tests/test_jobs_filter.pypackages/nemo_platform_plugin/tests/test_jobs_routes.pypackages/nmp_common/tests/api_factory/test_api_factory.pypackages/nmp_common/tests/jobs/test_docker.pypackages/nmp_platform/config/local.yamlpackages/nmp_platform_runner/src/nmp/platform_runner/config/local.yamlpackages/nmp_testing/src/nmp/testing/jobs.pyplugins/nemo-agents/openapi/openapi.yamlplugins/nemo-agents/src/nemo_agents_plugin/jobs/analyze_batch.pyplugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_agent.pyplugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_suite.pyplugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_agent.pyplugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_skills.pyplugins/nemo-agents/tests/unit/test_evaluate_agent_job.pyplugins/nemo-agents/tests/unit/test_improvement_jobs.pyplugins/nemo-anonymizer/src/nemo_anonymizer_plugin/jobs/run.pyplugins/nemo-data-designer/openapi/openapi.yamlplugins/nemo-data-designer/src/nemo_data_designer_plugin/jobs/create.pyplugins/nemo-evaluator/openapi/openapi.yamlplugins/nemo-evaluator/src/nemo_evaluator/jobs/compiler.pyplugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.pyplugins/nemo-safe-synthesizer/openapi/openapi.yamlplugins/nemo-safe-synthesizer/src/nemo_safe_synthesizer_plugin/api/v2/jobs/endpoints.pyplugins/nemo-safe-synthesizer/tests/unit/test_jobs.pyservices/automodel/src/nmp/automodel/app/jobs/compiler.pyservices/automodel/src/nmp/automodel/app/jobs/training/compiler.pyservices/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.pyservices/core/jobs/src/nmp/core/jobs/app/providers.pyservices/core/jobs/src/nmp/core/jobs/app/schemas.pyservices/core/jobs/src/nmp/core/jobs/app/test_helpers.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/config.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/docker.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/kubernetes/kubernetes_job.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/kubernetes/volcano_job.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/registry.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.pyservices/core/jobs/src/nmp/core/jobs/controllers/backends/test.pyservices/core/jobs/tests/conftest.pyservices/core/jobs/tests/controllers/test_base.pyservices/core/jobs/tests/controllers/test_docker_backend.pyservices/core/jobs/tests/controllers/test_kubernetes_backend.pyservices/core/jobs/tests/controllers/test_subprocess_backend.pyservices/core/jobs/tests/controllers/test_volcano_backend.pyservices/core/jobs/tests/integration/test_jobs_auth_propagation.pyservices/core/jobs/tests/integration/test_jobs_secrets_access.pyservices/core/jobs/tests/test_config.pyservices/core/jobs/tests/test_job_search.pyservices/core/jobs/tests/test_jobs_api.pyservices/core/jobs/tests/test_jobs_endpoint_translation.pyservices/core/models/src/nmp/core/models/api/v2/models.pyservices/hello-world/src/nmp/hello_world/api/v2/jobs/endpoints.pyservices/unsloth/src/nmp/unsloth/app/jobs/compiler.pyservices/unsloth/src/nmp/unsloth/app/jobs/training/compiler.pytools/lint/lint-python-types.sh
| These tests submit jobs with SubprocessExecutionProviderSpec (host command). | ||
| The e2e test environment runs against the subprocess backend. |
There was a problem hiding this comment.
Subprocess-profile precondition is unmet in CI, so these tests fail at create-job time.
This suite now hard-depends on cpu/subprocess, but CI currently returns 422 (The execution profile 'cpu/subprocess' ... does not exist). Provision that profile in the e2e runtime config, or preflight-skip subprocess-only tests when it is absent.
Also applies to: 57-60
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@e2e/test_jobs.py` around lines 3 - 4, The tests in test_jobs.py that use
SubprocessExecutionProviderSpec depend on the cpu/subprocess execution profile
which does not exist in the CI environment, causing create-job operations to
fail with a 422 error. Fix this by adding a preflight check before the
subprocess-dependent tests (including those at lines 57-60) that verifies the
cpu/subprocess profile is available in the execution environment, and skip those
tests with an appropriate skip marker if the profile is absent. Alternatively,
provision the cpu/subprocess profile in the e2e runtime configuration so that it
is available when tests execute.
Source: Pipeline failures
| kind: | ||
| type: string | ||
| enum: | ||
| - container | ||
| - subprocess | ||
| title: Kind | ||
| description: 'The executor payload shape this profile expects: ''container'' | ||
| or ''subprocess''.' | ||
| default: container |
There was a problem hiding this comment.
Constrain container backend profiles to kind: container.
Docker, Kubernetes, and Volcano backends are container-only in this refactor; allowing subprocess here can route a subprocess executor to a container backend.
Proposed OpenAPI shape
kind:
type: string
- enum:
- - container
- - subprocess
+ const: container
title: Kind
- description: 'The executor payload shape this profile expects: ''container''
- or ''subprocess''.'
+ description: The executor payload shape this profile expects.
default: containerAlso applies to: 12106-12114, 18062-18070
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openapi/ga/individual/platform.openapi.yaml` around lines 9557 - 9565, The
kind enum field for container backend profiles currently allows both container
and subprocess options, but since Docker, Kubernetes, and Volcano backends are
container-only in this refactor, the enum should only permit the container
value. Remove subprocess from the enum list in the kind field definition to
prevent subprocess executors from being routed to container backends. Apply this
same change at all locations mentioned in the comment (lines 12106-12114 and
18062-18070).
| provider: | ||
| type: string | ||
| const: subprocess | ||
| const: cpu | ||
| title: Provider | ||
| default: subprocess | ||
| default: cpu | ||
| profile: | ||
| type: string | ||
| title: Profile | ||
| description: The profile name for the executor, e.g., high_priority_a100, | ||
| low_priority, etc. | ||
| default: default | ||
| kind: | ||
| type: string | ||
| const: subprocess | ||
| title: Kind | ||
| default: subprocess |
There was a problem hiding this comment.
Keep subprocess profile keys representable.
The registry supports (cpu, subprocess) and (gpu, subprocess), but this schema fixes provider to cpu and defaults profile to default.
Proposed fix
provider:
type: string
- const: cpu
+ enum:
+ - cpu
+ - gpu
title: Provider
default: cpu
profile:
type: string
title: Profile
description: The profile name for the executor, e.g., high_priority_a100,
low_priority, etc.
- default: default
+ default: subprocess🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openapi/ga/individual/platform.openapi.yaml` around lines 16974 - 16989, The
schema currently constrains the provider field to only "cpu" using a const
constraint, but the registry supports both cpu and gpu providers with subprocess
kind. Remove the const constraint on the provider field and replace it with an
enum that includes both "cpu" and "gpu" values to allow the schema to properly
represent all supported provider/subprocess combinations. Additionally, consider
whether the profile default value of "default" is appropriate for all provider
configurations or if it should be made more flexible.
| kind: | ||
| type: string | ||
| enum: | ||
| - container | ||
| - subprocess | ||
| title: Kind | ||
| description: 'The executor payload shape this profile expects: ''container'' | ||
| or ''subprocess''.' | ||
| default: container |
There was a problem hiding this comment.
Constrain container backend profiles to kind: container.
Docker, Kubernetes, and Volcano backends accept ContainerExecutionProvider; allowing subprocess here lets /execution-profiles drive compilers to emit an executor shape these backends cannot handle.
Proposed fix
kind:
type: string
- enum:
- - container
- - subprocess
+ const: container
title: Kind
description: 'The executor payload shape this profile expects: ''container''
or ''subprocess''.'
default: containerApply the same const: container change to DockerJobExecutionProfile, KubernetesJobExecutionProfile, and VolcanoJobExecutionProfile.
Also applies to: 12106-12114, 18062-18070
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openapi/ga/openapi.yaml` around lines 9557 - 9565, The `kind` field in
`DockerJobExecutionProfile`, `KubernetesJobExecutionProfile`, and
`VolcanoJobExecutionProfile` currently allows both `container` and `subprocess`
as enum options, but these backends only support `ContainerExecutionProvider`
which requires `container`. Replace the `enum: [container, subprocess]` and
`default: container` with `const: container` in all three profile definitions
(at lines 9557-9565, 12106-12114, and 18062-18070) to constrain these backends
to only accept the container executor shape they can handle.
| provider: | ||
| type: string | ||
| const: subprocess | ||
| const: cpu | ||
| title: Provider | ||
| default: subprocess | ||
| default: cpu |
There was a problem hiding this comment.
Allow GPU subprocess execution profiles.
SubprocessExecutionProvider allows provider: gpu, and routing includes (gpu, subprocess); const: cpu makes GPU subprocess profiles invalid in the /execution-profiles response schema.
Proposed fix
provider:
type: string
- const: cpu
+ enum:
+ - cpu
+ - gpu
title: Provider
default: cpu📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| provider: | |
| type: string | |
| const: subprocess | |
| const: cpu | |
| title: Provider | |
| default: subprocess | |
| default: cpu | |
| provider: | |
| type: string | |
| enum: | |
| - cpu | |
| - gpu | |
| title: Provider | |
| default: cpu |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@openapi/ga/openapi.yaml` around lines 16974 - 16978, The
SubprocessExecutionProvider provider field currently uses const: cpu which
restricts the value to only CPU, but SubprocessExecutionProvider should allow
both cpu and gpu values to support GPU subprocess execution profiles. Replace
the const: cpu constraint in the SubprocessExecutionProvider schema with an enum
that includes both cpu and gpu values to allow both provider types.
| kind: str = "container", | ||
| profile: str | None = None, |
There was a problem hiding this comment.
Fail fast when the selected execution kind is not subprocess.
Line 161 accepts kind, but Lines 185-188 ignore it and always compile a subprocess executor. That can desync compile output from the resolved profile kind and fail later in backend routing/validation.
Proposed fix
@@
- async def compile( # type: ignore[override]
+ async def compile( # type: ignore[override]
@@
- from nemo_platform_plugin.jobs.api_factory import (
+ from nemo_platform_plugin.jobs.api_factory import (
EnvironmentVariable,
PlatformJobStep,
SubprocessExecutionProviderSpec,
)
+ from nemo_platform_plugin.jobs.exceptions import PlatformJobCompilationError
@@
+ if kind != "subprocess":
+ raise PlatformJobCompilationError(
+ f"agents.evaluate requires subprocess execution, got kind={kind!r}, profile={profile!r}"
+ )
+
return PlatformJobSpec(Also applies to: 185-188
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_agent.py` around
lines 161 - 162, The function parameter kind is accepted at line 161 but ignored
when creating the executor in lines 185-188, where subprocess executor is always
used regardless of the kind value. Add validation early in the function to check
if the kind parameter matches what is actually supported (subprocess), and raise
an appropriate error if a different kind like "container" is provided,
preventing a mismatch between the requested execution kind and the compiled
output that could cause failures in backend routing and validation.
| kind: str = "container", | ||
| profile: str | None = None, |
There was a problem hiding this comment.
Reject non-subprocess kind in this compile path.
Line 140 takes kind, but the step is always compiled as subprocess at Lines 205-207. Add an explicit guard so invalid profile/kind combinations fail during compilation, not at execution time.
Proposed fix
@@
async def compile( # type: ignore[override]
@@
) -> PlatformJobSpec:
@@
+ if kind != "subprocess":
+ raise PlatformJobCompilationError(
+ f"agents.evaluate-suite requires subprocess execution, got kind={kind!r}, profile={profile!r}"
+ )
+
from nemo_platform_plugin.jobs.api_factory import (Also applies to: 205-207
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/evaluate_suite.py` around
lines 140 - 141, The function accepts a kind parameter but always compiles the
step as subprocess regardless of the kind value provided, causing runtime
failures for invalid kind values. Add an explicit validation guard during the
compilation phase (around lines 205-207) that ensures the kind parameter is
either not provided or explicitly matches "subprocess". If an invalid kind value
is passed that doesn't align with the subprocess compilation behavior, raise an
error immediately during compilation rather than at execution time.
| kind: str = "container", | ||
| profile: str | None = None, |
There was a problem hiding this comment.
Guard kind to prevent invalid executor/profile combinations.
Line 141 receives kind, but Lines 168-170 always emit subprocess. Add a compile-time check for kind == "subprocess" to avoid mismatched routing behavior.
Proposed fix
@@
- from nemo_platform_plugin.jobs.api_factory import (
+ from nemo_platform_plugin.jobs.api_factory import (
EnvironmentVariable,
PlatformJobStep,
SubprocessExecutionProviderSpec,
)
+ from nemo_platform_plugin.jobs.exceptions import PlatformJobCompilationError
@@
+ if kind != "subprocess":
+ raise PlatformJobCompilationError(
+ f"agents.optimize requires subprocess execution, got kind={kind!r}, profile={profile!r}"
+ )
+
_require_absolute(spec.optimize_config, "optimize_config")Also applies to: 168-170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_agent.py` around
lines 141 - 142, The `kind` parameter in the function signature can be set to
"container" but the code at lines 168-170 always executes as subprocess,
creating a mismatch between the parameter value and actual behavior. Add a guard
clause at the beginning of the function that validates `kind == "subprocess"`,
raising an appropriate error or exception if this condition is not met. This
compile-time check prevents invalid executor and profile combinations where the
caller expects container execution but subprocess is always used instead.
| kind: str = "container", | ||
| profile: str | None = None, |
There was a problem hiding this comment.
Enforce subprocess kind in compile.
Line 90 takes kind, but Lines 151-153 hardcode subprocess. Fail fast when kind != "subprocess" so profile mistakes are surfaced as 4xx compile errors.
Proposed fix
@@
async def compile( # type: ignore[override]
@@
) -> PlatformJobSpec:
@@
+ if kind != "subprocess":
+ raise PlatformJobCompilationError(
+ f"agents.optimize-skills requires subprocess execution, got kind={kind!r}, profile={profile!r}"
+ )
+
from nemo_platform_plugin.jobs.api_factory import (Also applies to: 151-153
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-agents/src/nemo_agents_plugin/jobs/optimize_skills.py` around
lines 90 - 91, The function accepts a kind parameter with a default value of
"container" on line 90, but the implementation at lines 151-153 hardcodes
"subprocess" usage, creating a mismatch. Add early validation in the compile
function to check that kind equals "subprocess" and raise a compile error (4xx)
if it does not, ensuring profile configuration mistakes are caught immediately
rather than causing runtime issues downstream.
| kind: str = "container", | ||
| profile: str | None = None, | ||
| ) -> PlatformJobSpec: |
There was a problem hiding this comment.
Stop hardcoding executor profile; use compile inputs.
Lines 27-28 accept kind/profile, but Lines 46-50 ignore them and force kind="container", profile="default". That can break non-default profile routing.
Proposed fix
@@
from nemo_platform_plugin.jobs.api_factory import (
@@
)
+from nemo_platform_plugin.jobs.exceptions import PlatformJobCompilationError
@@
def compile_hello_world_job(
@@
) -> PlatformJobSpec:
@@
+ if kind != "container":
+ raise PlatformJobCompilationError(
+ f"hello-world requires container execution, got kind={kind!r}, profile={profile!r}"
+ )
+
return PlatformJobSpec(
@@
executor=ContainerExecutionProviderSpec(
kind="container",
provider="cpu",
- profile="default",
+ profile=profile or "default",
container=ContainerSpec(Also applies to: 46-50
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/hello-world/src/nmp/hello_world/api/v2/jobs/endpoints.py` around
lines 27 - 29, The function accepts kind and profile parameters but ignores
them, instead hardcoding kind="container" and profile="default" when creating
the PlatformJobSpec (around lines 46-50). Replace the hardcoded values with the
actual kind and profile parameters that were passed into the function to ensure
proper profile routing is respected.
Summary
Separates two concerns that were conflated in the executor model: what kind of compute a job needs (
provider) vs how the job payload is shaped (kind). Then threads the resolvedkindthrough the compilation pipeline so plugin compilers can emit the right executor shape without hardcoding profile names.Why this change
Before this PR, the executor used
provideras a discriminator with valuescpu,gpu,gpu_distributed, andsubprocess. This conflated two ideas:subprocessis a payload shape, not a compute type. Havingprovider="subprocess"made it impossible to express "run this as a subprocess on GPU hardware." It also meant that CPU container jobs submitted against a subprocess backend required a workaround in the Jobs API:translate_cpu_container_steps_to_subprocesswould rewriteCPUExecutionProvidersteps intoSubprocessExecutionProvidersteps at request time, coupling the API layer to backend selection. Plugin compilers had to provide a container image even for subprocess execution, and the image was silently discarded.What we decided
Following the design in AIRCORE-397 and AIRCORE-513:
kindis the executor payload shape discriminator:"container"or"subprocess"provideris purely compute intent:"cpu","gpu","gpu_distributed"profilemaps(provider, profile)to a backend:("cpu", "default")→ Docker/K8s,("cpu", "subprocess")→ SubprocessJobBackendkindkindfrom the profile and passes it to compilers, so no plugin ever needs to query execution profiles itselfHow it works
Why this is incremental (not the final state)
This PR is a stepping stone toward the
compile_default()design from AIRCORE-397. The end-state moves compilation to the Jobs service backend itself — the backend knows its ownkindand constructs the executor server-side. Plugins would post aPluginJobSpec(just domain payload + metadata) and never construct executors at all.What this PR does is the prerequisite plumbing:
kindas discriminator,provideras compute intent (required for any future work)kindso the platform can resolve it (required forcompile_default)kindto compilers so they can emit the right shape today (bridge untilcompile_defaultexists)The
resolve_profile_kind()helper and thekindparameter oncompile()are explicitly marked as bridges — whencompile_default()lands, plugins stop constructing executors and this resolution moves server-side.What changed
Core model (
providers.py)CPUExecutionProvider,GPUExecutionProvider,DistributedGPUExecutionProviderContainerExecutionProviderwithkind="container",providerasLiteral["cpu", "gpu", "gpu_distributed"]SubprocessExecutionProvider:provider→Literal["cpu", "gpu"], addedkind="subprocess"providertokindExecutorKind = Literal["container", "subprocess"]Execution profiles
kind: ExecutorKindtoBaseExecutionProfile(defaults to"container")SubprocessJobExecutionProfileoverrides withkind="subprocess"GET /v2/execution-profilesreturnskindfor each profileBackend registration
("subprocess", "default")→("cpu", "subprocess")BackendKey("gpu", "subprocess")for GPU subprocesstranslate_cpu_container_steps_to_subprocessand its call increate_job()Compilation pipeline
_compile_platform_specresolveskindfrom profile viaresolve_profile_kind()before calling the compiler(workspace, ..., sdk, kind, profile)— two new positional argsNemoJob.compile()receiveskindandprofilefrom the frameworkstamp_profile()moved from_adapt_compileinto_compile_platform_specBaseJobRequest: addedprofilefield for submit-envelope metadataPlugin compilers
kindto emit container or subprocess executorSDK + OpenAPI
Executorunion isUnion[ContainerExecutionProviderParam, SubprocessExecutionProviderParam]kindVerified end-to-end
subprocesskind=subprocess, provider=cpudefaultkind=container, provider=cpuDesign lineage
SUPPORTED_PROVIDERS = ("cpu", "gpu"),kinddiscriminatorTest plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Tests