LLM Trajectory Failure Mode Identification and Clustering#261

Draft
nianjunz wants to merge 68 commits into main from trajectory_refactor

@nianjunz
Collaborator

This PR introduces an automated system that analyzes LLM agent trajectories to identify failure patterns (14 predefined modes + LLM-discovered ones), with optional clustering to group similar failures together.

The system uses LLM-based analysis (Claude Sonnet 4.6) to detect agent behavior issues like "Step Repetition", "Task Derailment", or "Premature Termination", and provides comprehensive documentation, diagnostic tools, and three usage modes.

All documentation has been updated with proper directory context, path fixes, and verified functionality. Eight LLM trajectories are provided as samples.

@DhavalRepo18
Collaborator

@nianjunz @ShuxinLin please consolidate the terminology with respect to the new Agent definition.

#239

@DhavalRepo18 DhavalRepo18 requested a review from ShuxinLin April 15, 2026 12:19
nianjunz added 26 commits April 15, 2026 10:36
Add a comprehensive MIGRATION_PLAN.md under src/trajectory_analysis/failure_mode documenting a six‑phase migration of TrajFM from src/tmp/TrajFM to the new failure_mode package. The plan records Phase 1 as complete (directory structure, core files, docs, tests, and env handling), and details Phase 2–6 tasks including LLM integration with src/llm (LiteLLM/WatsonX), refactors for utils/generator/pipeline, remaining component migrations (extractor/reducer/visualizer), tests/docs updates, integration/validation, and cleanup/deprecation. Also specifies required environment variables for LiteLLM and WatsonX, model selection guidance, testing expectations, timeline, risk mitigation, and current status (Phase 1 complete).

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add pytest test support for trajectory_analysis/failure_mode: new conftest.py provides an autouse load_env fixture that loads .env and skips tests if required environment variables are missing, plus LLM backend fixtures (llm_claude, llm_llama, llm_granite), a sample_trajectory payload, and a temp_output_dir fixture. Also add a README with setup, environment variable guidance, example test usage, and security notes about .env and API keys.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Introduce src/trajectory_analysis package and a failure_mode subpackage. Adds a pipeline wrapper (pipeline.py) to run generation + reduction, LLM system prompts (prompts.py), package __init__.py files, comprehensive README and detailed README_detail documentation, and a tests/__init__.py placeholder. Scaffolds an LLM-based failure-mode extraction and clustering workflow (generation, reduction, visualization) and supplies usage/testing guidance.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Update migration plan to mark Phase 2 (LLM integration) as complete and document the refactor work. Summarizes completed changes: removal of ReactXen, integration with src/llm, new/refactored files (utils.py, generator.py, updated pipeline.py), defaulting to Claude 4 Sonnet backend, added temperature control, improved type hints, docstrings, error handling and testing examples. Notes files created and modified, progress update (2/6 phases complete), and next steps to begin Phase 3.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Expand the failure_mode tests README to document Phase 2 work: add a summary of new test files (test_utils.py, test_generator.py, test_pipeline.py) and total lines/coverage, clarify that Phase 2 tests use mocks and do not require API keys, and provide detailed setup and run instructions for both pip and the recommended uv workflow. Also add sections for running tests (including coverage, parallel runs, and examples), test fixtures, coverage summary, CI example, troubleshooting, security notes, and contributing guidelines.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add comprehensive unit and integration tests for the failure_mode package. New tests cover generator (process_trajectories, JSON loading, normalization of additional failure modes), pipeline (run_failure_mode_pipeline behavior, defaults and parameter passing), and utils (LLM prompt formatting and JSON extraction). Tests exercise default LLM creation, temperature propagation, error handling, output directory creation, and end-to-end flows using mocks. These additions increase test coverage and validate expected behaviors for LLM interactions and file processing.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add modules for failure mode extraction from trajectory data and update the high-level pipeline. New src/trajectory_analysis/failure_mode/generator.py implements process_trajectories to load JSON trajectory files, call an LLM backend (defaults to LiteLLM/Claude-4-Sonnet), normalize additional failure modes, populate a DataFrame of predefined failure_mode keys, and save per-timestamp and combined pickles. New src/trajectory_analysis/failure_mode/utils.py provides helpers get_llm_answer_from_json (formats trajectory into an LLM prompt) and extract_json_from_response (robustly parses JSON from LLM output). Update src/trajectory_analysis/failure_mode/pipeline.py to accept an LLMBackend and temperature, default the backend when omitted, call process_trajectories, and comment out the reduction step pending migration (Phase 3).

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
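The commit above describes `extract_json_from_response` as robustly parsing JSON from LLM output, but the conversation does not show how. The following is a minimal sketch of one way such a helper could work, not the PR's implementation: the three-stage fallback order (whole reply, fenced block, first decodable object) is an assumption.

```python
import json
import re
from typing import Any, Optional

# Built at runtime so this example does not contain a literal code fence.
FENCE = "`" * 3


def extract_json_from_response(text: str) -> Optional[Any]:
    """Return the first JSON value found in an LLM reply, or None if there is none."""
    # Stage 1: the whole reply may already be valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Stage 2: look inside a fenced block, optionally tagged "json".
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass

    # Stage 3: try to decode an object starting at each "{" in turn.
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
            return obj
        except json.JSONDecodeError:
            continue
    return None
```

Returning `None` rather than raising lets a caller record an unparseable reply and continue with the next trajectory; whether the PR's helper behaves this way is not stated.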
Introduce a new example CLI script (src/trajectory_analysis/failure_mode/analyze_trajectories.py) to run the failure mode analysis pipeline. The script adds argument parsing for trajectory path, output dir, temperature and model selection (claude, llama, granite), validates input directory and JSON files, configures optional LiteLLM backends for selected models, invokes run_failure_mode_pipeline, and prints/saves a summarized failure-mode report. Includes basic error handling, usage examples and exit codes.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Uncomment and enable Phase 2 exports for the failure_mode package: import run_failure_mode_pipeline and process_trajectories and add them to __all__. Add Phase 2/Phase 3 comments and placeholder commented imports for future reducer/extractor/visualizer modules. This exposes the pipeline and generator API now that LLM integration is complete, while leaving Phase 3 items as TODOs.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Change the default output directory for process_trajectories from "processed_trajectories" to "./src/trajectory_analysis/failure_mode/processed_trajectories" so generated pickles are stored inside the package tree. Note: this changes the runtime location where files are written — callers or tests that relied on the previous default may need updates.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Change shebang to use uv runner and add uv manifest (requires-python and dependencies: pandas, python-dotenv). Update usage docs to recommend `uv run`, add executable instructions and a fallback to traditional python. Adjust default input/output paths to repo-local sample_trajectories and processed_trajectories to make the script runnable out-of-the-box.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Refactor failure_mode into a core subpackage and add clustering and CLI tooling. Changes include:

- Create src/trajectory_analysis/failure_mode/core package (__init__.py).
- Add CLI extractor (core/extractor.py) to run end-to-end pipeline and configure LLM backends.
- Add failure_mode_reduction (core/reducer.py) to explode additional failure modes, embed titles, cluster with KMeans (auto-select K by silhouette), choose representative titles, and export CSVs (preserves addtional_fm.csv name).
- Rename/move generator, pipeline, prompts, utils into core/ (updated imports) and update generator to save per-timestamp and combined results as both pickle and CSV (returns combined_csv_path).
- Update pipeline to call failure_mode_reduction and return both generation and reduction results.
- Update tests to import from core package, adjust mock targets, and normalize boolean assertions.

These changes modularize core analysis logic, add CSV outputs for easier inspection, and expose a CLI for running the full failure mode extraction and clustering flow.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
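The reducer commit above mentions clustering embedded titles with KMeans and auto-selecting K by silhouette. As a rough illustration of that selection rule only, here is a dependency-free sketch. It is an assumption-laden stand-in, not the reducer's code: the small k-means with farthest-point initialisation, the function names, and the 2-D points standing in for title embeddings are all hypothetical.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence

Point = Sequence[float]


def _init_centroids(points: List[Point], k: int) -> List[List[float]]:
    """Deterministic farthest-point initialisation (an assumption, chosen for repeatability)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans_labels(points: List[Point], k: int, max_iter: int = 100) -> List[int]:
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    centroids = _init_centroids(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


def silhouette_score(points: List[Point], labels: List[int]) -> float:
    """Mean of (b - a) / max(a, b) over all points; singleton clusters contribute 0."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i, p in enumerate(points):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            continue
        a = fmean(math.dist(p, points[j]) for j in own)
        b = min(
            fmean(math.dist(p, points[j]) for j in members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points: List[Point], k_min: int = 2, k_max: int = 0) -> int:
    """Return the K in [k_min, k_max] with the highest silhouette (smallest K wins ties)."""
    k_max = k_max or len(points) - 1
    best_k, best_score = k_min, float("-inf")
    for k in range(k_min, k_max + 1):
        score = silhouette_score(points, kmeans_labels(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

With three well-separated groups of points, `choose_k` selects 3, because merging or splitting a group lowers the average silhouette. A production reducer would more likely rely on a library implementation of both steps.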
Rework failure_mode package exports to import and re-export APIs from a new core module (run_failure_mode_pipeline, process_trajectories, failure_mode_reduction, run_extraction_pipeline). Update the CLI (analyze_trajectories.py) to add clustering support and related flags (--cluster, -k/--num-clusters, --embedding-model), improve usage examples, and make imports robust for script vs module execution. The script now uses rglob to detect files, distinguishes detection-only vs full pipeline (calls process_trajectories for detection-only, run_failure_mode_pipeline for clustering), and prints clearer output paths and clustering summary information.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
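To make the CLI behaviour in the commit above concrete, here is a hedged sketch of the argument handling. The flags `--cluster`, `-k/--num-clusters`, and `--embedding-model`, the use of `rglob`, and the two entry-point names come from the commit message; the positional `trajectory_dir` argument, the defaults, and the helper names are assumptions for illustration.

```python
import argparse
import tempfile
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Failure mode analysis (illustrative sketch)")
    # Hypothetical positional argument; the real script may name it differently.
    parser.add_argument("trajectory_dir", type=Path, help="folder containing trajectory JSON files")
    # The three flags below are the ones named in the commit message.
    parser.add_argument("--cluster", action="store_true",
                        help="cluster additional failure modes after detection")
    parser.add_argument("-k", "--num-clusters", type=int, default=None,
                        help="fixed number of clusters; omit to auto-select")
    parser.add_argument("--embedding-model", default=None,
                        help="embedding model used for clustering")
    return parser


def find_trajectory_files(root: Path) -> List[Path]:
    """Recursively collect JSON files so nested per-model folders are included."""
    return sorted(root.rglob("*.json"))


def choose_entry_point(args: argparse.Namespace) -> str:
    """Name the function the commit says each mode calls."""
    return "run_failure_mode_pipeline" if args.cluster else "process_trajectories"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nested = Path(tmp) / "model_a" / "0001"
        nested.mkdir(parents=True)
        (nested / "trajectory.json").write_text("{}", encoding="utf-8")
        args = build_parser().parse_args([tmp, "--cluster", "-k", "5"])
        print(choose_entry_point(args), len(find_trajectory_files(args.trajectory_dir)))
```

Keeping the mode decision in a small function of the parsed arguments makes the detection-only versus full-pipeline split easy to unit test without invoking an LLM.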
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Adjust user-facing log messages in analyze_trajectories.py to reflect the correct LLM versions and cloud providers. The startup notice now references "Claude Sonnet 4.6 (AWS)" and the default backend message now reports "Claude 3.7 Sonnet (GCP) (default)", clarifying which models/providers are being used during failure mode analysis.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Introduce a new timeout utility module to protect LLM backend calls without modifying backend code. Adds a custom TimeoutError and call_with_timeout(func, timeout_seconds=30, ...) which runs the call in a daemon thread, joins with a timeout, raises TimeoutError if the call exceeds the timeout, re-raises any exception from the wrapped function, and returns the function result on success.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
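The timeout utility described in the commit above can be sketched roughly as follows. This follows the stated behaviour (daemon thread, join with a timeout, re-raise the wrapped function's exception, return its result), but the details are assumptions: the exception is named `CallTimeoutError` here to avoid shadowing Python's built-in `TimeoutError`, and forwarding extra arguments through `*args, **kwargs` is a guess at the elided part of the signature.

```python
import threading
import time
from typing import Any, Callable, Dict


class CallTimeoutError(Exception):
    """Raised when the wrapped call does not finish in time (hypothetical name)."""


def call_with_timeout(func: Callable[..., Any], timeout_seconds: float = 30,
                      *args: Any, **kwargs: Any) -> Any:
    outcome: Dict[str, Any] = {}

    def runner() -> None:
        # Capture either the result or the exception so the caller can see it.
        try:
            outcome["result"] = func(*args, **kwargs)
        except Exception as exc:
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if the call hangs.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_seconds)

    if worker.is_alive():
        raise CallTimeoutError(f"call exceeded {timeout_seconds} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]


def slow_backend(delay: float) -> str:
    """Stand-in for a slow LLM call."""
    time.sleep(delay)
    return "done"
```

One consequence of this thread-based design is that a timed-out call is abandoned rather than cancelled: the worker keeps running in the background until the wrapped function returns, so it suits calls without side effects that must be rolled back.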
Add two test scripts to verify LiteLLM proxy models: src/trajectory_analysis/failure_mode/test_llm_model_connection.py and src/trajectory_analysis/failure_mode/test_all_litellm_models.py. test_llm_model_connection.py provides a CLI to initialize LiteLLMBackend, send a simple prompt, print response/troubleshooting, and return proper exit codes; it includes a timeout wrapper and checks credentials (LITELLM_API_KEY/LITELLM_BASE_URL or WATSONX_*). test_all_litellm_models.py iterates a curated list of proxy model IDs, invokes the single-model tester via subprocess (using `uv run`), parses results/errors, enforces timeouts, and prints per-provider and overall summaries with recommendations. Both scripts are intended for diagnosing connectivity and availability of proxy-hosted models.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add a new test module covering timeout protection for trajectory analysis. Tests exercise call_with_timeout (fast, slow, exception, return values, None), integration with get_llm_answer_from_json using a mock LLM backend (normal response, slow backend triggering timeout, default timeout behavior), and edge cases (very short and very large timeouts, multiple consecutive calls). These tests ensure LLM calls and wrapped functions respect configured timeouts and propagate errors appropriately.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Introduce a CLI test utility (verify_trajectory_import.py) to validate trajectory JSON import, format detection, and extraction using the failure_mode registry. The script loads a trajectory file, detects the appropriate handler, formats the trajectory into question/steps/final_answer, prints summaries, and optionally prints a preview of the formatted LLM prompt (using system_prompt). Useful for debugging and verifying handlers and prompt construction.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Change default LiteLLM backend to the AWS Claude Sonnet 4.6 proxy and add notes about GCP model issues (referencing litellm_models_report.md). Refactor get_llm_answer_from_json to use a pluggable format handler registry, accept a timeout_seconds parameter, and build prompts from normalized question/steps/final_answer fields. Add timeout-protected LLM calls via call_with_timeout and return a clear timeout error message; also adjust prompt joining/formatting and import the new utilities.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Delete obsolete LiteLLM proxy models test report (src/trajectory_analysis/failure_mode/litellm_models_report.md). The file contained a snapshot of model test results and recommendations; it was removed to avoid stale documentation in the repository. Test scripts in src/trajectory_analysis/failure_mode (e.g. test_all_litellm_models.py) can be used to regenerate an up-to-date report if needed.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Replace the hardcoded LITELLM_MODELS list with runtime discovery: add get_available_models() which queries the LiteLLM proxy (/models) using LITELLM_BASE_URL and optional LITELLM_API_KEY, organizes results by provider, and falls back to get_fallback_models() on error or empty responses. Add imports (requests, os, defaultdict), update main() to call the dynamic loader, and improve test output formatting and diagnostics (show full model IDs, clearer status labels, and print detailed errors for failed models). The change enables dynamic testing of currently available proxy models and better failure visibility.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
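The runtime discovery described in the commit above might look something like the sketch below. Several things are assumptions rather than facts from the PR: it uses the standard library's `urllib` instead of `requests` so the example has no third-party dependency, it assumes the `/models` endpoint returns an OpenAI-style `{"data": [{"id": ...}]}` payload, it treats the text before the first `/` in a model ID as the provider, and the fallback entry is a placeholder rather than a real model ID.

```python
import json
import os
import urllib.request
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(model_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Group IDs such as 'provider/model' by their prefix; IDs without '/' go to 'unknown'."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for model_id in model_ids:
        provider, sep, _rest = model_id.partition("/")
        grouped[provider if sep else "unknown"].append(model_id)
    return {provider: sorted(ids) for provider, ids in grouped.items()}


def get_fallback_models() -> Dict[str, List[str]]:
    # Placeholder content only; the PR's real fallback list is not shown in this conversation.
    return {"example-provider": ["example-provider/example-model"]}


def get_available_models(timeout: float = 10.0) -> Dict[str, List[str]]:
    base_url = os.environ.get("LITELLM_BASE_URL", "").rstrip("/")
    if not base_url:
        return get_fallback_models()

    headers = {}
    api_key = os.environ.get("LITELLM_API_KEY")  # optional, per the commit message
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    try:
        request = urllib.request.Request(f"{base_url}/models", headers=headers)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
        model_ids = [item["id"] for item in payload.get("data", [])]
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Network failure, malformed URL, bad JSON, or an unexpected payload shape.
        return get_fallback_models()

    # An empty response also falls back, as the commit describes.
    return group_by_provider(model_ids) or get_fallback_models()
```

Separating the pure grouping step from the network call keeps the part most likely to need adjustment (the ID convention) testable offline.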
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
nianjunz added 25 commits April 15, 2026 10:36
Change get_llm_answer_from_json to return a tuple (response_text, handler_name) instead of just the response string. Update the function signature, docstring, and exception handling to propagate the handler name on success and include it (or "Unknown") on errors/timeouts. This surfaces which format handler processed the input and provides better context for error reporting.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Update default out_dir from processed_trajectories/summary to results/summary and adjust docstrings to reflect the new export location. Rename the misspelled addtional_fm.csv variable/file to additional_fm.csv (including printed messages and returned paths keys) and simplify the CSV save comment. Also update the docstring return entry for the cluster count to reference n_clusters.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Expand README for trajectory_analysis/failure_mode to replace the brief "Columns" list with a detailed "CSV Column Reference" table. The table adds types and examples for each field (e.g., model_id, trajectory_path, format_handler, counter, ut_id, addi_fm_cnt, addi_fm_list) and enumerates the 14 predefined boolean failure-mode flags. Notes about additional failure modes and the format handler (examples: ThoughtActionFormatHandler, TaskDescriptionFormatHandler) were also added to clarify parsing and output structure.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Delete several .gitignore entries for trajectory_analysis: trajectory_analysis_copy, sample_trajectories, and processed_trajectories. These were duplicate/backup and sample/processed data ignores; removing them cleans up the ignore file and allows those directories/files to be tracked or managed differently.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Revamp logging behavior: always use INFO level and add file handlers while only attaching a console handler when --verbose. For --cluster-only, create summary/clustering.log and print its path; for regular runs create runs/<timestamp>/analysis.log (generate run_id from timestamp) and attach a file handler for an audit trail. Pass the pre-generated run_id into the pipeline invocation, ensure file handlers are closed/removed in a finally block, and print concise verbose/quiet status messages. Also reordered the output/temperature variable setup and simplified initial logging configuration.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
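The logging behaviour in the commit above (a file handler on every run, a console handler only with `--verbose`, handlers closed and removed in a `finally` block) can be illustrated with the standard `logging` module. This is a sketch under assumptions: the logger name, the `run_with_logging` helper, the `work` callback, and the exact timestamp format are hypothetical; only the `runs/<timestamp>/analysis.log` layout and the INFO level come from the commit message.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Callable


def run_with_logging(output_dir: Path, verbose: bool,
                     work: Callable[[logging.Logger, str], None]) -> Path:
    """Run `work` with a per-run log file; return the path of that file."""
    run_id = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    log_path = output_dir / "runs" / run_id / "analysis.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("failure_mode_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep quiet runs off the root logger's console output

    handlers = [logging.FileHandler(log_path, encoding="utf-8")]
    if verbose:
        handlers.append(logging.StreamHandler())
    for handler in handlers:
        logger.addHandler(handler)

    try:
        work(logger, run_id)
    finally:
        # Always detach and close, even if the analysis raised.
        for handler in handlers:
            logger.removeHandler(handler)
            handler.close()
    return log_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = run_with_logging(Path(tmp), True,
                                lambda log, rid: log.info("run %s started", rid))
        print(path.read_text(encoding="utf-8"))
```

Closing the handlers in `finally` matters when the function is called repeatedly in one process: otherwise each run would keep writing to every earlier run's log file.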
Introduce an optional run_id parameter to run_failure_mode_pipeline and forward it to the internal generation/run call. This allows callers to provide a run identifier that is propagated into the downstream generation logic without changing other behavior.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Preserve trajectory_path when extracting additional failure modes so failures can be traced back to their source. Updated column selection to optionally keep trajectory_path, adjusted validation to require title/description presence, and included trajectory_path in the clustered CSV and final DataFrame column ordering when available.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Clarify logging and clustering behavior in the failure_mode README: verbose now saves timestamped log files and log files are always created (analysis.log per run and summary/clustering.log). Add run folder naming (run_YYYYMMDD_HHMMSS), document clustered CSV columns and clustering details (embeddings, auto K selection, representative titles), and provide CLI examples and log locations. Small reordering/formatting updates to make outputs and traceability clearer.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Adapt tests to recent API and behavior changes: handle get_llm_answer_from_json returning a (response, handler_name) tuple; update process_trajectories expectations to return df/run_dir/run_id instead of combined_df/combined_path/per_timestamp_paths; add pandas.DataFrame.to_pickle patch to avoid serializing Mock objects and assert pickle calls where appropriate; adjust default model string assertion (removed 'aws/' prefix); remove/annotate timestamp-based pipeline tests because the pipeline now auto-discovers trajectories. Minor assertion and comment updates to match the new outputs and error formats.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Replace internal imports from src.llm.* to llm.* and centralize pipeline outputs under ./results (and ./results/summary). Update default CLI flags, example paths, and generator/pipeline/reducer/utils to use the new results directory. Also add usage note and fix the diagnostic test invocation path to diagnostics/test_llm_model_connection.py. These changes standardize package imports and consolidate generated artifacts into a single results folder.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Refactor two diagnostic scripts to use package imports instead of manipulating sys.path and to standardize CLI flags and examples. Changes include: switch imports from src.* to package-style llm.litellm and trajectory_analysis.*, remove manual sys.path insertion, update test_llm_model_connection to accept --model-id and adjust example model IDs/paths, and enhance verify_trajectory_import docstring, default trajectory path, and usage examples (add --show-prompt option). These edits improve packaging cleanliness and CLI consistency for diagnostics.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Enhance the failure_mode README with a Mermaid system architecture diagram and a detailed pipeline flow (input, format detection, generation, optional clustering, and output). Expand and clarify diagnostic tooling: update example commands to use "uv run python", add descriptions for testing all models, verifying specific model connections, and improved trajectory format verification examples (including --show-prompt usage and a realistic sample path). Documentation-only changes to src/trajectory_analysis/failure_mode/README.md to improve onboarding and debugging.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Update README for trajectory_analysis failure_mode: rename format handlers and format names (StepFormatHandler -> AgentResponseFormatHandler; Old/New format → Agent Response / ReAct), adjust example JSON keys (result -> final_answer), and change default sample/output paths and model id. Add a new "Adding Custom Formats" section with usage steps and a code example to register custom TrajectoryFormatHandler implementations. Clarify output layout (remove "run_" prefix from timestamp folders) and note analysis.log is always created. Also update CLI examples to invoke the script path directly.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
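The README example for registering custom handlers is not reproduced in this conversation, so the following is only a sketch of how a pluggable format-handler registry of this kind could work. `TrajectoryFormatHandler` is the class name used in the commit message, and the normalized `question` / `steps` / `final_answer` fields and the returned handler name are described in earlier commits; the method names `can_handle` and `extract`, the `register_handler` and `format_trajectory` functions, and the `ChatLogFormatHandler` example format are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class TrajectoryFormatHandler(ABC):
    """Stand-in base class; the real interface in the PR may differ."""

    @abstractmethod
    def can_handle(self, data: Dict[str, Any]) -> bool:
        """Return True if this handler recognises the raw trajectory JSON."""

    @abstractmethod
    def extract(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Return the normalized fields: question, steps, final_answer."""


_REGISTRY: List[TrajectoryFormatHandler] = []


def register_handler(handler: TrajectoryFormatHandler) -> None:
    _REGISTRY.append(handler)


def format_trajectory(data: Dict[str, Any]) -> Tuple[Dict[str, Any], str]:
    """Use the first matching handler; also report which handler was used."""
    for handler in _REGISTRY:
        if handler.can_handle(data):
            return handler.extract(data), type(handler).__name__
    raise ValueError("no registered handler recognises this trajectory format")


class ChatLogFormatHandler(TrajectoryFormatHandler):
    """Hypothetical custom format: {"prompt": ..., "turns": [...], "final_answer": ...}."""

    def can_handle(self, data: Dict[str, Any]) -> bool:
        return "prompt" in data and "turns" in data

    def extract(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return {
            "question": data["prompt"],
            "steps": list(data["turns"]),
            "final_answer": data.get("final_answer", ""),
        }


register_handler(ChatLogFormatHandler())
```

Returning the handler's class name alongside the normalized fields mirrors the `format_handler` column mentioned in the CSV reference, which records which handler processed each trajectory.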
Remove trailing "# Made with Bob" comment lines from test modules to clean up source files. Affected files: src/trajectory_analysis/failure_mode/tests/__init__.py, test_generator.py, test_pipeline.py, test_timeout.py, and test_utils.py. No behavioral changes; purely cosmetic cleanup.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Replace and expand the script usage/help comment with a clearer, structured CLI guide. Adds an explicit note to run commands from the project root, groups examples into three main modes (analysis-only, analysis+clustering, cluster-only), provides concrete example commands and additional options, and removes the older, less-structured examples. This is a documentation-only change (no runtime logic modified).

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Rework Basic Usage into three clear modes (Analysis-Only, Analysis+Clustering, Cluster-Only) with concrete CLI examples for sample data and user-provided trajectory folders. Add a note to run commands from the project root and replace specific sample paths with more general paths (/path/to/...) to make examples easier to reuse. Remove an outdated cluster-only embedding-model snippet.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Rename diagnostic script to verify_trajectory_format.py and update its usage/docs. Refresh README-style examples to include cd commands and multiple sample trajectories (mistral, claude) and add a --show-prompt example. Also change the default trajectory path to sample_trajectories/mistral-large/0001 and update the argparse help text accordingly.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add an explicit `cd /path/to/AssetOpsBench` line to the usage example in test_all_litellm_models.py so users run the script from the repository root. Also remove an extra trailing blank line at EOF.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Prepend a reminder to change into the project root (cd /path/to/AssetOpsBench) in the script usage examples for Claude, Llama, and GPT so users run the command from the correct directory. Also remove an extraneous trailing blank line at the end of the file.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Clarify testing and diagnostic instructions in the failure_mode README: add a note to run tests from the project root, add a verbose pytest command, fix the coverage pytest invocation to point at tests, and rename diagnostic script/path references from verify_trajectory_import.py and trajectories/... to verify_trajectory_format.py and sample_trajectories/.... Also update the error message reference to use verify_trajectory_format.py.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Add note to run diagnostic tools from the project root and prepend 'cd /path/to/AssetOpsBench' to diagnostic command examples. Also clarify that the --show-prompt option only previews the LLM prompt (does not call the LLM) and update the related use-case bullet.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Update the module docstring usage examples in verify_trajectory_format.py to reference verify_trajectory_format.py instead of verify_trajectory_import.py. Also clarify that the "Show full LLM prompt" example does not actually call the LLM.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
Replace 'src.trajectory_analysis...' imports and patch targets with 'trajectory_analysis...' across test files (test_generator.py, test_pipeline.py, test_timeout.py, test_utils.py). Adjust mocked patch paths to match the new import style and perform minor whitespace/cleanup edits. No functional changes—this aligns tests with the package import layout to avoid import path issues when running tests in different environments.

Signed-off-by: Nianjun Zhou <jzhou@us.ibm.com>
@DhavalRepo18 DhavalRepo18 marked this pull request as draft April 16, 2026 19:46
2 participants