diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml
new file mode 100644
index 00000000..4731918b
--- /dev/null
+++ b/.github/workflows/stale.yml
@@ -0,0 +1,24 @@
+name: Close stale issues
+
+on:
+ schedule:
+ - cron: '0 0 * * *'
+ workflow_dispatch:
+
+jobs:
+ stale:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/stale@v10
+ with:
+ stale-issue-message: >
+ This issue has been inactive for 30 days. It will be closed in 7 days
+ unless there is further activity.
+ close-issue-message: >
+ Closed after 7 days with no activity since being marked stale. Please reopen if this is still relevant.
+ days-before-issue-stale: 30
+ days-before-issue-close: 7
+ stale-issue-label: stale
+ exempt-issue-labels: pinned,security,in-progress
+ days-before-pr-stale: -1
+ days-before-pr-close: -1
diff --git a/INSTRUCTIONS.md b/INSTRUCTIONS.md
index 7ceea7dd..76cb7df5 100644
--- a/INSTRUCTIONS.md
+++ b/INSTRUCTIONS.md
@@ -14,13 +14,17 @@ This directory contains the MCP servers and infrastructure for the AssetOpsBench
- [tsfm](#tsfm)
- [wo](#wo)
- [vibration](#vibration)
-- [Plan-Execute Runner](#plan-execute-runner)
+- [Plan-Execute Agent](#plan-execute-agent)
- [How it works](#how-it-works)
- [CLI](#cli)
- [End-to-end example](#end-to-end-example)
- [Python API](#python-api)
- [Bring your own LLM](#bring-your-own-llm)
- [Add more MCP servers](#add-more-mcp-servers)
+- [Claude Agent](#claude-agent)
+ - [How it works](#how-it-works-1)
+ - [CLI](#cli-1)
+ - [Python API](#python-api-1)
- [Connect to Claude Desktop](#connect-to-claude-desktop)
- [Running Tests](#running-tests)
- [Architecture](#architecture)
@@ -211,9 +215,9 @@ uv run vibration-mcp-server
---
-## Plan-Execute Runner
+## Plan-Execute Agent
-`src/workflow/` is a custom MCP client that implements a **plan-and-execute** workflow over the MCP servers. It replaces AgentHive's bespoke orchestration with the standard MCP protocol.
+`src/agent/` is a custom MCP client that implements a **plan-and-execute** workflow over the MCP servers. It replaces AgentHive's bespoke orchestration with the standard MCP protocol.
### How it works
@@ -250,8 +254,8 @@ Flags:
| `--model-id MODEL_ID` | litellm model string with provider prefix (default: `watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8`) |
| `--server NAME=SPEC` | Override MCP servers with `NAME=SPEC` pairs (repeatable); SPEC is an entry-point name or path |
| `--show-plan` | Print the generated plan before execution |
-| `--show-history` | Print each step result after execution |
-| `--json` | Output answer + plan + history as JSON |
+| `--show-trajectory` | Print each step result after execution |
+| `--json` | Output answer + plan + trajectory as JSON |
The provider is encoded in the `--model-id` prefix:
@@ -273,7 +277,7 @@ uv run plan-execute --model-id watsonx/ibm/granite-3-3-8b-instruct --show-plan "
uv run plan-execute --model-id litellm_proxy/GCP/claude-4-sonnet "What are the failure modes for a chiller?"
# Machine-readable output
-uv run plan-execute --show-history --json "How many observations exist for CH-1?" | jq .answer
+uv run plan-execute --show-trajectory --json "How many observations exist for CH-1?" | jq .answer
```
### End-to-end examples
@@ -295,7 +299,7 @@ uv run plan-execute "For equipment CWC04013, how many preventive vs corrective w
uv run plan-execute "What is the probability that alert rule RUL0018 on equipment CWC04009 leads to a work order, and how long does it typically take?"
# Work order distribution + next prediction (multi-step)
-uv run plan-execute --show-plan --show-history \
+uv run plan-execute --show-plan --show-trajectory \
"For equipment CWC04014, show the work order distribution and predict the next maintenance type"
```
@@ -304,8 +308,8 @@ uv run plan-execute --show-plan --show-history \
Run a question that exercises three servers with independent parallel steps:
```bash
-uv run plan-execute --show-plan --show-history \
- "What is the current date and time? Also list assets at site MAIN. Also get failure modes for a chiller."
+uv run plan-execute --show-plan --show-trajectory \
+ "What is the current date and time? Also list assets at site MAIN. Also get sensor list and failure mode list for any of the chiller at site MAIN."
```
Expected plan (3 parallel steps, no dependencies):
@@ -335,7 +339,7 @@ Expected execution output (trimmed):
```python
import asyncio
-from workflow import PlanExecuteRunner
+from agent import PlanExecuteRunner
from llm import LiteLLMBackend
runner = PlanExecuteRunner(llm=LiteLLMBackend("watsonx/meta-llama/llama-3-3-70b-instruct"))
@@ -349,7 +353,7 @@ print(result.answer)
| --------- | ------------------ | --------------------------------- |
| `answer` | `str` | Final synthesised answer |
| `plan` | `Plan` | The generated plan with its steps |
-| `history` | `list[StepResult]` | Per-step execution results |
+| `trajectory` | `list[StepResult]` | Per-step execution results |
### Bring your own LLM
@@ -370,7 +374,7 @@ runner = PlanExecuteRunner(llm=MyLLM())
Pass `server_paths` to register additional servers. Keys must match the server names the planner assigns steps to:
```python
-from workflow import PlanExecuteRunner
+from agent import PlanExecuteRunner
runner = PlanExecuteRunner(
llm=my_llm,
@@ -389,6 +393,104 @@ runner = PlanExecuteRunner(
---
+## Claude Agent
+
+`src/agent/claude_agent/` uses the **claude-agent-sdk** to drive the same MCP servers. Unlike `PlanExecuteRunner`, there is no explicit plan — the SDK's built-in agentic loop handles tool discovery, invocation, and multi-turn reasoning autonomously.
+
+### How it works
+
+```
+ClaudeAgentRunner.run(question)
+ │
+ └─ claude-agent-sdk query loop
+ • connects to each MCP server over stdio
+ • Claude decides which tools to call and in what order
+ • tool calls and results are handled internally by the SDK
+ • final answer is returned as ResultMessage
+```
+
+### CLI
+
+After `uv sync`, the `claude-agent` command is available:
+
+```bash
+uv run claude-agent "What sensors are on Chiller 6?"
+```
+
+Flags:
+
+| Flag | Description |
+| --------------------- | ---------------------------------------------------------------------------- |
+| `--model-id MODEL_ID` | Claude model ID (default: `claude-opus-4-6`) |
+| `--max-turns N` | Maximum agentic loop turns (default: 30) |
+| `--show-trajectory` | Print each turn's text, tool calls, and token usage |
+| `--json` | Output full trajectory (turns, tool calls, token counts) as JSON |
+| `--verbose` | Show INFO-level logs on stderr |
+
+The `--model-id` prefix determines the backend:
+
+| Prefix | Backend | Required env vars |
+| ---------------- | ------------- | ------------------------------------- |
+| _(none)_ | Anthropic API | `LITELLM_API_KEY` |
+| `litellm_proxy/` | LiteLLM proxy | `LITELLM_API_KEY`, `LITELLM_BASE_URL` |
+
+Examples:
+
+```bash
+# Direct Anthropic API
+uv run claude-agent "What assets are at site MAIN?"
+
+# LiteLLM proxy
+uv run claude-agent --model-id litellm_proxy/aws/claude-opus-4-6 "What sensors are on Chiller 6?"
+
+# Show full trajectory (turns, tool calls, token usage)
+uv run claude-agent --show-trajectory "What are the failure modes for a chiller?"
+
+# Machine-readable trajectory
+uv run claude-agent --json "What is the current time?" | jq .turns
+```
+
+### Python API
+
+```python
+import anyio
+from agent.claude_agent import ClaudeAgentRunner
+
+runner = ClaudeAgentRunner(model="litellm_proxy/aws/claude-opus-4-6")
+result = anyio.run(runner.run, "What sensors are on Chiller 6?")
+print(result.answer)
+```
+
+`AgentResult` fields:
+
+| Field | Type | Description |
+| --------- | ------------ | ---------------------------------------------- |
+| `answer` | `str` | Final answer from the agent |
+| `trajectory` | `Trajectory` | Full execution trace (turns, tool calls, tokens) |
+
+`Trajectory` fields:
+
+| Field | Type | Description |
+| --------------------- | ----------------- | ------------------------------------ |
+| `turns` | `list[TurnRecord]`| One record per assistant turn |
+| `total_input_tokens` | `int` | Sum of input tokens across all turns |
+| `total_output_tokens` | `int` | Sum of output tokens across all turns|
+| `all_tool_calls` | `list[ToolCall]` | Flat list of every tool call made |
+
+Each `TurnRecord` has `index`, `text`, `tool_calls`, `input_tokens`, and `output_tokens`.
+Each `ToolCall` has `name`, `input`, `id`, and `output` (the MCP server response, captured via a `PostToolUse` hook).
+
+```python
+traj = result.trajectory
+print(f"{traj.total_input_tokens} input / {traj.total_output_tokens} output tokens")
+for tc in traj.all_tool_calls:
+ print(f" {tc.name}: {tc.input}")
+ if tc.output is not None:
+ print(f" -> {tc.output}")
+```
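
The documented `Trajectory` fields also lend themselves to simple aggregation. The sketch below counts calls per tool name. `ToolCall` and `Trajectory` here are stand-in dataclasses that mirror only the fields listed above, not the real classes from `agent.claude_agent`; `tool_call_counts` is a hypothetical helper and the tool names are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Stand-in mirroring the documented ToolCall fields (not the real class)."""
    name: str
    input: dict[str, Any]
    id: str
    output: Any = None


@dataclass
class Trajectory:
    """Stand-in exposing only the documented `all_tool_calls` field."""
    all_tool_calls: list[ToolCall] = field(default_factory=list)


def tool_call_counts(traj: Trajectory) -> dict[str, int]:
    """Hypothetical helper: number of calls made to each tool."""
    return dict(Counter(tc.name for tc in traj.all_tool_calls))


# Illustrative data only; these tool names are assumptions.
traj = Trajectory(all_tool_calls=[
    ToolCall(name="sites", input={}, id="t1", output=["MAIN"]),
    ToolCall(name="assets", input={"site_name": "MAIN"}, id="t2"),
    ToolCall(name="assets", input={"site_name": "MAIN"}, id="t3"),
])
print(tool_call_counts(traj))
```

With a real `result.trajectory`, the same helper should work as long as `all_tool_calls` yields objects with a `name` attribute.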
+
+---
+
## Connect to Claude Desktop
Add the following to your Claude Desktop `claude_desktop_config.json`:
@@ -460,7 +562,7 @@ uv run pytest src/servers/utilities/tests/
uv run pytest src/servers/fmsr/tests/ -k "not integration"
uv run pytest src/servers/tsfm/tests/ -k "not integration"
uv run pytest src/servers/wo/tests/test_tools.py -k "not integration"
-uv run pytest src/workflow/tests/
+uv run pytest src/agent/tests/
```
### Work order integration tests (requires CouchDB + populated `workorder` db)
@@ -482,22 +584,27 @@ uv run pytest src/ -v
## Architecture
```
-┌──────────────────────────────────────────────────────┐
-│ workflow/ │
-│ │
-│ PlanExecuteRunner.run(question) │
-│ ┌────────────┐ ┌────────────┐ ┌──────────────┐ │
-│ │ Planner │ → │ Executor │ → │ Summariser │ │
-│ │ │ │ │ │ │ │
-│ │ LLM breaks │ │ Routes each│ │ LLM combines │ │
-│ │ question │ │ step to the│ │ step results │ │
-│ │ into steps │ │ right MCP │ │ into answer │ │
-│ └────────────┘ │ server via │ └──────────────┘ │
-│ │ stdio │ │
-└───────────────────┼────────────┼─────────────────────┘
- │ MCP protocol (stdio)
- ┌──────────┼──────────┬──────────┬──────┬───────────┐
- ▼ ▼ ▼ ▼ ▼ ▼
- iot utilities fmsr tsfm wo vibration
- (tools) (tools) (tools) (tools) (tools) (tools)
+┌──────────────────────────────────────────────────────────────┐
+│ agent/ │
+│ │
+│ PlanExecuteRunner.run(question) │
+│ ┌────────────┐ ┌────────────┐ ┌──────────────┐ │
+│ │ Planner │ → │ Executor │ → │ Summariser │ │
+│ │ LLM breaks │ │ Routes each│ │ LLM combines │ │
+│ │ question │ │ step to MCP│ │ step results │ │
+│ │ into steps │ │ via stdio │ │ into answer │ │
+│ └────────────┘ └────────────┘ └──────────────┘ │
+│ │
+│ ClaudeAgentRunner.run(question) │
+│ ┌─────────────────────────────────────────┐ │
+│ │ claude-agent-sdk agentic loop │ │
+│ │ Claude decides tools + order autonomously │
+│ │ Trajectory (turns, tool calls, tokens) collected │
+│ └─────────────────────────────────────────┘ │
+└──────────────────────────┬───────────────────────────────────┘
+ │ MCP protocol (stdio)
+ ┌─────────────────┼───────────┬──────────┬──────┬───────────┐
+ ▼ ▼ ▼ ▼ ▼ ▼
+ iot utilities fmsr tsfm wo vibration
+ (tools) (tools) (tools) (tools) (tools) (tools)
```
diff --git a/README.md b/README.md
index b1549160..b4867f42 100644
--- a/README.md
+++ b/README.md
@@ -31,8 +31,8 @@ We are expanding **AssetOpsBench** to cover a broader range of industrial challe
* **Task Domains:** Prognostics and Health Management, Remaining Useful Life (RUL) estimation, or Root Cause Analysis (RCA), Diagnostic Analysis and Predictive Maintenance.
**How to contribute:**
-1. **Define** your scenario following our [Utterance Guideline](https://github.com/IBM/AssetOpsBench/blob/extra_scenarios/experimental_scenarios/utterance_design_guideline.md),
-[Ground Truth Guideline](https://github.com/IBM/AssetOpsBench/blob/extra_scenarios/experimental_scenarios/ground_truth_creation_best_practice.md)
+1. **Define** your scenario following our [Utterance Guideline](docs/guideline/utterance_design_guideline.md),
+[Ground Truth Guideline](docs/guideline/ground_truth_design_guideline.md)
1. **Explore** the [Hugging Face dataset](https://huggingface.co/datasets/ibm-research/AssetOpsBench) as examples.
3. **Submit** a Pull Request or open an [Issue](https://github.com/IBM/AssetOpsBench/issues) with the tag `new-scenario`.
@@ -200,23 +200,136 @@ Thanks goes to these wonderful people ✨
-
diff --git a/benchmark/basic_requirements.txt b/benchmark/basic_requirements.txt
index 2afa73cc..4da53f76 100644
--- a/benchmark/basic_requirements.txt
+++ b/benchmark/basic_requirements.txt
@@ -3,6 +3,6 @@ poetry
requests
pydantic
openai
-litellm
+litellm==1.81.13
datasets
couchdb3
diff --git a/docs/guideline/case_study_industrial_asset_management.md b/docs/guideline/case_study_industrial_asset_management.md
new file mode 100644
index 00000000..09f97444
--- /dev/null
+++ b/docs/guideline/case_study_industrial_asset_management.md
@@ -0,0 +1,520 @@
+# Industrial Asset Management: Utterance Design Case Study (from AssetOpsBench)
+
+## Overview
+
+This document demonstrates the application of the [Utterance Design Guideline](utterance_design_guideline.md) to the Industrial Asset Management domain, specifically focusing on Facilities Management (HVAC/Chiller systems). This is a **partially completed case study** showing work-in-progress utterance development.
+
+**Status**: This case study represents an ongoing effort to build a comprehensive utterance set for industrial asset management. It demonstrates the incremental development approach recommended in the guideline, with some areas fully developed and others identified for future work.
+
+---
+
+## Domain Context
+
+### Equipment Types
+Chillers, air handling units (AHUs), cooling towers, pumps, compressors, heat exchangers
+
+### Measurement Types
+Temperature, pressure, flow, power consumption, efficiency, tonnage, humidity
+
+### Event Types
+Alerts, anomalies, work orders, failure modes, alarms
+
+### Time Granularity
+Real-time to yearly (typically hourly for HVAC monitoring)
+
+### Location Hierarchy
+Enterprise → Sites → Buildings → HVAC Zones → Equipment → Components
+
+### Action Types
+Preventive maintenance (PM), corrective maintenance (CM), inspections, calibration
+
+---
+
+## Category Mapping for Industrial Asset Management
+
+This section shows how the generic categories from the [main guideline](utterance_design_guideline.md) map to industrial asset management contexts.
+
+### Information Retrieval → Equipment & Failure Knowledge
+- **Industrial Context**: Querying equipment inventory, failure modes, sensor configurations
+- **Examples**: "List all chillers at site MAIN", "What failure modes exist for Chiller 6?"
+
+### Data Extraction → Sensor & Telemetry Data
+- **Industrial Context**: Retrieving time-series sensor data, work order history, operational logs
+- **Examples**: "Download Chiller 6 tonnage data from last week", "Get work orders for 2017"
+
+### Analysis & Inference → Performance Forecasting
+- **Industrial Context**: Time series forecasting, performance prediction, trend analysis
+- **Examples**: "Forecast Chiller 9 energy consumption for next week", "Predict condenser water flow"
+
+### Model Customization → Domain-Specific Model Training
+- **Industrial Context**: Fine-tuning models for specific equipment or site conditions
+- **Examples**: "Finetune forecasting model for Chiller 9 using site-specific data"
+
+### Anomaly & Exception Detection → Equipment Anomaly Detection
+- **Industrial Context**: Detecting abnormal equipment behavior, performance deviations
+- **Examples**: "Detect anomalies in Chiller 6 tonnage", "Identify unusual power consumption patterns"
+
+### Recommendation & Optimization → Maintenance Recommendations
+- **Industrial Context**: Work order recommendations, maintenance scheduling, resource optimization
+- **Examples**: "Recommend work orders for detected anomaly", "Suggest maintenance bundling strategy"
+
+### Future State Prediction → Failure Prediction
+- **Industrial Context**: Predicting equipment failures, work order probability, remaining useful life
+- **Examples**: "Predict next work order probability", "Estimate failure risk over next 30 days"
+
+### Multi-Step Orchestration → Integrated Diagnostics
+- **Industrial Context**: Combined data retrieval, analysis, and recommendation workflows
+- **Examples**: "Retrieve sensor data, detect anomalies, and recommend corrective actions"
+
+---
+
+## Applying the Three Core Categories
+
+### 1. Retrospective (Knowledge Extraction)
+
+**Domain Translation**: Understanding historical equipment performance, current asset state, and operational history.
+
+#### Knowledge Query Examples
+
+**Inventory and Configuration**:
+- "What IoT sites are available?"
+- "Which assets are located at the MAIN facility?"
+- "List all chillers at site MAIN"
+- "What assets can be found at the MAIN site?"
+
+**Failure Mode Analysis**:
+- "List all failure modes of asset Chiller."
+- "List all failure modes of asset Chiller 6."
+- "List all failure modes of asset Wind Turbine." (non-deterministic - knowledge base dependent)
+
+**Sensor and Monitoring**:
+- "List all installed sensors of asset Chiller 6."
+- "Can I list all the metrics monitored by CQPA AHU 2B? use site MAIN"
+- "Provide some sensors of asset Wind Turbine." (non-deterministic)
+
+**Failure Mode-Sensor Mapping**:
+- "List all failure modes of Chiller 6 that can be detected by Chiller 6 Supply Temperature."
+- "List all failure modes of Chiller 6 that can be detected by temperature sensors."
+- "List all failure modes of Chiller 6 that can be detected by temperature sensors and power input sensors."
+- "Get failure modes for Chiller 6 and only include in final response those that can be monitored using the available sensors."
+- "Are there any failure modes of Chiller 6 that can be predicted by monitoring the vibration sensor data?"
+
+**Sensor-Failure Mode Mapping**:
+- "List all sensors of Chiller 6 that are potentially relevant to Compressor Overheating."
+- "If compressor overheating occurs for Chiller 6, which sensor should be prioritized for monitoring this specific failure?"
+- "If Evaporator Water side fouling occurs for Chiller 6, which sensor is most relevant for monitoring this specific failure?"
+
+#### Data Query Examples
+
+**Metadata Retrieval**:
+- "Retrieve metadata for Chiller 6 located at the MAIN site."
+- "Get the asset details for Chiller 9 at the MAIN site."
+- "Download the metadata for Chiller 3 at the MAIN facility."
+
+**Sensor Data Retrieval**:
+- "Download sensor data for Chiller 6's Tonnage from the last week of 2020 at the MAIN site"
+- "Retrieve sensor data for Chiller 6's % Loaded from June 2020 at MAIN."
+- "Get sensor data for both Chiller 6 and Chiller 9's Tonnage from first week of June 2020 at MAIN in a single file."
+- "Download all sensor data for Chiller 6 from the last week of April '20 at the MAIN site."
+- "Retrieve sensor data for Chiller 6 from June 2020."
+
+**Point-in-Time Queries**:
+- "What was the latest supply humidity from CQPA AHU 1 at site MAIN on sept 3 2015? return in a file"
+- "what was the supply temperature from CQPA AHU 2B on sept 19, 2020 at quarter to midnight, at site MAIN? return in a file"
+- "how much power was CQPA AHU 1 (MAIN site) using on 6/14/20?"
+- "What is the power consumption of CQPA AHU 1 on mar 13 '20, site MAIN?"
+- "what was the return temperature from CQPA AHU 2B on sept 19, 2020 at 7pm, at site MAIN? return data in a file"
+
+**Time Range Queries**:
+- "retrieve the supply temperature data recorded last week for Chiller 3 (MAIN site)?"
+
+#### Work Order History Examples
+
+- "Get the work order of equipment CWC04013 for year 2017."
+
+**Status**: Retrospective category is well-developed with 30+ utterances covering inventory, failure modes, sensors, and historical data retrieval.
+
+---
+
+### 2. Predictive
+
+**Domain Translation**: Forecasting equipment performance, predicting failures, and estimating future maintenance needs.
+
+#### Time Series Forecasting Examples
+
+**Model Availability Queries**:
+- "What types of time series analysis are supported?"
+- "What are time series pretrained models are available in system?"
+- "Are any time series forecasting models supported?"
+- "Is TTM (Tiny Time Mixture), a time series model supported?"
+- "Is LSTM model supported in TSFM?" (Answer: No)
+- "Is Chronos model supported in TSFM?" (Answer: No)
+- "Is Anomaly Detection supported in TSFM?" (Answer: Yes)
+- "Is Time Series Classification supported in TSFM?" (Answer: No)
+- "Is any model with context length 96 supported in TSFM?"
+- "Is any model with context length 1024 supported in TSFM?" (Answer: No)
+
+**Forecasting Queries**:
+- "What is the forecast for 'Chiller 9 Condenser Water Flow' in the week of 2020-04-27?"
+- "Forecast 'Chiller 9 Condenser Water Flow' using data in 'chiller9_annotated_small_test.csv'."
+- "Use data in 'chiller9_annotated_small_test.csv' to forecast with 'Timestamp' as a timestamp."
+
+**Model Fine-tuning**:
+- "Finetune a forecasting model for 'Chiller 9 Condenser Water Flow' using data in 'chiller9_finetuning_small.csv'."
+
+#### Anomaly Detection Examples
+
+- "I need to perform Time Series anomaly detection of 'Chiller 9 Condenser Water Flow' using data in chiller9_tsad.csv."
+- "Is there any anomaly detected in Chiller 6's Tonnage in the week of 2020-04-27?"
+
+#### Failure Prediction Examples
+
+- "Can you predict next work order probability for equipment CWC04009?"
+- "I would like to predict the next work order probability for equipment CWC04013."
+- "Build a predictive model from historical alerts and work orders of CWC04009 to forecast failures."
+- "Build a predictive model to forecast failures over the forthcoming 10-year horizon."
+
+**Status**: Predictive category has moderate coverage (~20 utterances) focused on time series forecasting and anomaly detection. **Gap identified**: Need more utterances for remaining useful life (RUL) estimation and degradation trend analysis.
+
+---
+
+### 3. Prescriptive
+
+**Domain Translation**: Recommending maintenance actions, optimizing schedules, and supporting operational decisions.
+
+#### Recommendation Examples
+
+**Anomaly Response**:
+- "When an anomaly happens for equipment CWC04009, can you recommend top three work orders to address this problem?"
+- "How can I analyze anomalies across multiple KPIs to better diagnose the root cause of these issues?"
+
+**Work Order Bundling**:
+- "Which corrective work orders for equipment CWC04009 in year 2017 can be bundled in the next maintenance window?"
+
+**Prioritization**:
+- "Which work orders should I prioritize first for Chiller 9 in July 2020?"
+
+#### Root Cause Analysis Examples
+
+- "When power input of Chiller 6 drops, what is the potential failure that causes it?"
+- "When the Liquid Refrigerant Evaporator Temperature of Chiller 6 drops, what failure is most likely to occur?"
+- "When compressor motor of Chiller 6 fails, what is the temporal behavior of the power input?"
+
+#### Diagnostic Recipe Examples
+
+- "Purge unit of chiller 6 have possibility to excess purge, what is the plan by the maintenance experts to early detect the failure?"
+- "Generate a machine learning recipe for detecting overheating failure for Chiller 6. Result should include feature sensors and target sensor."
+- "I want to build an anomaly model for identifying a chiller trip failure for POKMAIN chiller 6. Provide me a list of sensors that I should use, along with the temporal behavior."
+- "What are the failure modes of Chiller 6 that can be identified by analyzing the data from the available sensors?"
+
+#### Alert Reasoning Examples
+
+- "How can reasoning on operation alerts help in generating significant warning messages?"
+
+**Status**: Prescriptive category is under-developed (~10 utterances). **Major gaps identified**:
+- Maintenance scheduling optimization
+- Resource allocation
+- Cost-benefit analysis
+- Multi-asset prioritization
+- Preventive maintenance planning
+
+---
+
+## Coverage Analysis
+
+### Current Coverage Matrix
+
+| Operational Scenario | Retrospective | Predictive | Prescriptive | Coverage Status |
+|---------------------|---------------|------------|--------------|-----------------|
+| Chiller performance monitoring | ✓ (8 utterances) | ✓ (4 utterances) | ✓ (2 utterances) | **Complete** |
+| Failure mode analysis | ✓ (12 utterances) | ✗ (0 utterances) | ✓ (3 utterances) | **Partial** - Missing predictive |
+| Sensor data retrieval | ✓ (15 utterances) | N/A | N/A | **Complete** |
+| Work order management | ✓ (1 utterance) | ✓ (3 utterances) | ✓ (2 utterances) | **Partial** - Need more prescriptive |
+| Energy optimization | ✗ (0 utterances) | ✓ (1 utterance) | ✗ (0 utterances) | **Incomplete** |
+| Maintenance scheduling | ✗ (0 utterances) | ✗ (0 utterances) | ✓ (1 utterance) | **Incomplete** |
+| Root cause diagnosis | ✓ (2 utterances) | ✗ (0 utterances) | ✓ (3 utterances) | **Partial** |
+| Anomaly detection | ✗ (0 utterances) | ✓ (2 utterances) | ✓ (1 utterance) | **Partial** - Missing retrospective |
+
+### Completeness Metrics
+
+**Category Distribution** (Current):
+- Retrospective: ~55% (38 utterances)
+- Predictive: ~30% (20 utterances)
+- Prescriptive: ~15% (10 utterances)
+
+**Target Distribution**:
+- Retrospective: 40%
+- Predictive: 30%
+- Prescriptive: 30%
+
+**Assessment**: Prescriptive category is significantly under-represented. Need to develop ~10-15 more prescriptive utterances.
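
The percentages and the shortfall above can be reproduced from the raw counts. This is a minimal sketch, assuming the three counts listed under Category Distribution, the 30% prescriptive target, and that only prescriptive utterances are added while the other categories stay fixed. `extra_needed` is a hypothetical helper, not part of the repository.

```python
import math

# Utterance counts taken from the Category Distribution list above.
counts = {"retrospective": 38, "predictive": 20, "prescriptive": 10}
total = sum(counts.values())

# Current share of each category, as a percentage of all utterances.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}


def extra_needed(current: int, total: int, target: float) -> int:
    """Smallest x with (current + x) / (total + x) >= target."""
    if current / total >= target:
        return 0
    return math.ceil((target * total - current) / (1 - target))


print(shares)
print(extra_needed(counts["prescriptive"], total, 0.30))
```

Under these assumptions the helper returns 15, the upper end of the ~10-15 range quoted in the assessment.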
+
+---
+
+## SME Contribution Tracking
+
+### Current Contributors
+
+| SME Role | Expertise Area | Utterances Created | Priority Focus |
+|----------|---------------|-------------------|----------------|
+| Chiller Operations SME | Chiller monitoring & operations | 25 | Retrospective, Data queries |
+| FMSA Expert | Failure modes & sensor analysis | 18 | Retrospective, Knowledge queries |
+| Time Series Analytics SME | Forecasting & anomaly detection | 15 | Predictive |
+| Maintenance Planning SME | Work order management | 5 | Prescriptive |
+| **NEEDED** | Energy optimization | 0 | Prescriptive |
+| **NEEDED** | Maintenance scheduling | 0 | Prescriptive |
+
+### Identified Gaps Requiring SME Input
+
+1. **Energy Optimization SME** (Priority: P1)
+ - Energy consumption forecasting
+ - Load optimization recommendations
+ - Cost-benefit analysis for efficiency improvements
+
+2. **Maintenance Scheduling SME** (Priority: P1)
+ - Preventive maintenance planning
+ - Resource allocation optimization
+ - Multi-asset maintenance coordination
+
+3. **Controls & Automation SME** (Priority: P2)
+ - Control system integration queries
+ - Setpoint optimization
+ - Automated response strategies
+
+---
+
+## Utterance Schema for Industrial Asset Management
+
+The original ALM dataset (`docs/alm/alm_utterance.json`) contains 152 utterances with 6 base fields. This guideline recommends adding 3 additional fields for better organization and metadata tracking.
+
+**Base fields (present in the original dataset)**:
+
+1. **id** (integer): Unique identifier
+2. **text** (string): Natural language utterance
+3. **type** (string): AI agent identifier
+4. **category** (string): Classification
+5. **deterministic** (boolean): Single correct answer or not
+6. **characteristic_form** (string): Expected response description
+
+**Recommended additional fields**:
+
+7. **group** (string or array): RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE
+8. **entity** (string or array): Primary subject(s)
+9. **note** (string): Flexible metadata field
+
+### Complete Enhanced Schema (9 fields):
+
+Each utterance should follow this enhanced schema with 9 required fields in this order:
+
+1. **id** (integer): Unique identifier
+2. **text** (string): Natural language utterance
+3. **type** (string): MCP server or AI agent identifier - which component processes this utterance
+ - **IoT**: IoT data agent (handles sensor data, asset metadata, site information)
+ - **FMSA**: Failure Mode & Sensor Analysis agent (handles failure modes, sensor relationships, diagnostic recipes)
+ - **TSFM**: Time Series Foundation Model agent (handles forecasting, anomaly detection, model fine-tuning)
+ - **Workorder**: Work order management agent (handles maintenance records, recommendations, scheduling)
+ - **multiagent**: Multi-agent coordination (requires collaboration between multiple agents)
+ - Note: In MCP systems, this could also identify MCP servers like "filesystem", "database", "web-search", etc.
+4. **category** (string): Classification (see Category Mapping above)
+5. **deterministic** (boolean): Single correct answer (true) or multiple valid responses (false)
+6. **characteristic_form** (string): Expected response description
+7. **group** (string or array): RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE (use lowercase in JSON: "retrospective", "predictive", "prescriptive")
+8. **entity** (string or array): Primary physical subject(s) - equipment types (e.g., "Chiller", "AHU", "Pump", "Sensor", "Site"). Use physical things, not abstract concepts like "Anomaly" or "WorkOrder"
+9. **note** (string): A flexible field for any information useful to explain or contextualize the utterance, including source, owner/contributor, omitted details, design rationale, implementation notes, or any other relevant metadata
+
+**Important Notes**:
+- **Original ALM Data**: The base dataset in `docs/alm/alm_utterance.json` contains only fields 1-6. Only one utterance (id 409) has a `note` field in the original data.
+- **Recommended Enhancements**: Fields 7-9 (group, entity, note) are proposed additions for better organization, filtering, and metadata tracking.
+- **Migration Path**: When adopting this guideline, existing utterances can be enhanced by adding the three recommended fields.
+- **Flexibility**: The `note` field is intentionally flexible to accommodate diverse documentation needs.
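
A small check can flag which existing records still need the three recommended fields during migration. The sketch below assumes utterances are loaded as plain dictionaries; `validate_utterance`, `REQUIRED_FIELDS`, and `GROUPS` are hypothetical names introduced here, and the sample record is abridged from the site-inventory utterance (id 1).

```python
from typing import Any

# Field names and order follow the 9-field schema listed above.
REQUIRED_FIELDS = [
    "id", "text", "type", "category", "deterministic",
    "characteristic_form", "group", "entity", "note",
]
# Lowercase values, as recommended for JSON.
GROUPS = ("retrospective", "predictive", "prescriptive")


def validate_utterance(u: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in u]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if "id" in u and (not isinstance(u["id"], int) or isinstance(u["id"], bool)):
        problems.append("id must be an integer")
    if "deterministic" in u and not isinstance(u["deterministic"], bool):
        problems.append("deterministic must be a boolean")
    if "group" in u:
        # group may be a single string or an array of strings.
        groups = u["group"] if isinstance(u["group"], list) else [u["group"]]
        bad = [g for g in groups if g not in GROUPS]
        if bad:
            problems.append(f"unknown group value(s): {bad}")
    return problems


# A legacy record carrying only the six base fields (text abridged).
legacy = {
    "id": 1,
    "text": "What IoT sites are available?",
    "type": "IoT",
    "category": "Knowledge Query",
    "deterministic": True,
    "characteristic_form": "The expected response should be the return value of all sites",
}
print(validate_utterance(legacy))

# The same record after adding the three recommended fields.
enhanced = {**legacy, "group": "retrospective", "entity": "Site", "note": ""}
print(validate_utterance(enhanced))
```

The check is deliberately loose about `entity` and `note`: it only requires that they are present, since the guideline describes both as flexible.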
+
+---
+
+## Example Utterance Schemas
+
+### Example 1: RETROSPECTIVE - Site Inventory (IoT Type)
+
+```json
+{
+ "id": 1,
+ "text": "What IoT sites are available?",
+ "type": "IoT",
+ "category": "Knowledge Query",
+ "deterministic": true,
+ "characteristic_form": "The expected response should be the return value of all sites, either as text or as a reference to a file",
+ "group": "retrospective",
+ "entity": "Site",
+ "note": "Source: Initial domain analysis; Owner: Domain SME Team; Basic inventory query pattern for site discovery"
+}
+```
+
+### Example 2: PRESCRIPTIVE - Work Order Recommendation (Workorder Type)
+
+```json
+{
+ "id": 416,
+ "text": "When an anomaly happens for equipment CWC04009, can you recommend top three work orders to address this problem?",
+ "type": "Workorder",
+ "category": "Decision Support",
+ "deterministic": false,
+ "characteristic_form": "It gives a list of work order with a primary failure code. Based on an anomaly in CWC04009, recommend the top three most appropriate work orders for remediation.",
+ "group": "prescriptive",
+ "entity": "Chiller",
+ "note": "Source: Operations team request; Owner: Maintenance SME; Supports proactive maintenance workflow; Requires anomaly detection integration"
+}
+```
+
+### Example 3: RETROSPECTIVE - Event Summary with Note
+
+```json
+{
+ "id": 409,
+ "text": "Get the daily count of alert, anomaly and work order event for the May 2020 for equipment CWC04009.",
+ "type": "Workorder",
+ "category": "Knowledge Query",
+ "deterministic": false,
+ "characteristic_form": "There are 26 days with the records. Depending on the LLM used, the result could be daily total event summary or daily summary for each event type. The expected response should retrieve and summarize daily counts of alerts, anomalies, and work order events for CWC04009 for May 2020, verifying correct aggregation over time (daily) and filtering of asset, event type (alert, anomaly and work order), and time range - May 2020.",
+ "group": "retrospective",
+ "entity": ["Alert", "Anomaly", "WorkOrder"],
+ "note": "Source: Operations dashboard requirements; Owner: Analytics Team; Multi-entity aggregation query; Design note: We have both work order business data object and work order event as a group type in the event file, so we made a change to the utterance; Omitted: Specific aggregation method preference"
+}
+```
+
+### Example 4: PREDICTIVE - Time Series Forecasting (TSFM Type)
+
+```json
+{
+ "id": 217,
+ "text": "Forecast 'Chiller 9 Condenser Water Flow' using data in 'chiller9_annotated_small_test.csv'. Use parameter 'Timestamp' as a timestamp.",
+ "type": "TSFM",
+ "category": "Inference Query",
+ "deterministic": true,
+ "characteristic_form": "The expected response should be: Forecasting results of 'Chiller 9 Condenser Water Flow' using data in 'chiller9_annotated_small_test.csv' are stored in json file",
+ "group": "predictive",
+ "entity": "Chiller",
+ "note": "Source: Predictive maintenance initiative; Owner: Data Science Team; Requires time series forecasting model; Implementation: Uses TSFM framework with timestamp parameter"
+}
+```
+
+### Example 5: Multi-Group - Performance Review & Anomaly Detection
+
+```json
+{
+ "id": 420,
+ "text": "Assume today is early of July 2020, I would like to review the performance of chiller 9 with equipment ID CWC04009 for June 2020 and track any anomalies or operation violations as alerts.",
+ "type": "Workorder",
+ "category": "Decision Support",
+ "deterministic": false,
+ "characteristic_form": "There were 30 alerts for 'Chiller - Evaporator Approach High', and anomalies were observed in Cooling Load (12 instances), Flow Efficiency (9), Delta Setpoint (6), and Delta Temperature (3). The LLM ReAct process is to review the operational performance and detect anomalies or alerts for Chiller 9 (CWC04009) during June 2020, confirming correct equipment ID and timeframe.",
+ "group": ["retrospective", "predictive"],
+ "entity": "Chiller",
+ "note": "Source: Operations dashboard requirements; Owner: Analytics Team; Multi-category query combining historical review with anomaly detection; Cross-reference: Related to anomaly detection queries in category 5; Implementation: Requires ReAct agent pattern"
+}
+```
+
+### Example 6: PRESCRIPTIVE - Failure Prediction Model Building
+
+```json
+{
+ "id": 436,
+ "text": "Build a predictive model from historical alerts and work orders of CWC04009 to forecast failures and replacement needs over the forthcoming 10-year horizon.",
+ "type": "Workorder",
+ "category": "Prediction",
+ "deterministic": false,
+ "characteristic_form": "Leverage ALERT and WORK_ORDER records (features: description, event_group, event_category, event_time) for CWC04009 to train and validate a model that predicts the likelihood and timing of future failures or required replacements.",
+ "group": ["retrospective", "predictive", "prescriptive"],
+ "entity": "Chiller",
+ "note": "Source: Advanced analytics initiative; Owner: Data Science & Operations Teams; Complex multi-phase query spanning all three problem categories; Implementation: Requires ML pipeline with data preparation, model training, prediction, and decision support components; Estimated complexity: High; Timeline: 10-year forecast horizon"
+}
+```
+
+**Note**: All examples above follow the required 9-field schema. The `group` field maps each utterance to one or more of the three core problem categories (RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE), the `entity` field identifies the primary subject(s) of the utterance, and the `note` field captures metadata such as source, owner, design rationale, and implementation details. In the JSON itself, `group` values are written in lowercase, and both `group` and `entity` may be either a single string or a list of strings.
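+
+The 9-field structure described in the note above can be checked mechanically. The following is a minimal sketch, not part of the original dataset tooling: the function name `validate_utterance` and the specific checks are illustrative assumptions, and only the field names and the lowercase group values are taken from the examples above.
+
+```python
+# Hypothetical helper: checks one utterance dict against the 9-field schema.
+REQUIRED_FIELDS = (
+    "id", "text", "type", "category", "deterministic",
+    "characteristic_form", "group", "entity", "note",
+)
+VALID_GROUPS = {"retrospective", "predictive", "prescriptive"}
+
+
+def validate_utterance(utterance: dict) -> list:
+    """Return a list of problems; an empty list means the record looks valid."""
+    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
+                if name not in utterance]
+
+    # `group` may be a single string or a list of strings (see Examples 5 and 6).
+    groups = utterance.get("group", [])
+    if isinstance(groups, str):
+        groups = [groups]
+    problems += [f"unknown group: {g}" for g in groups if g not in VALID_GROUPS]
+
+    # `deterministic` is a JSON boolean in every example above.
+    if "deterministic" in utterance and not isinstance(utterance["deterministic"], bool):
+        problems.append("deterministic must be a boolean")
+    return problems
+```
+
+Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a record at once, which may be convenient when many utterances are contributed by different SMEs.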
+
+---
+
+## Next Steps for Completion
+
+### Phase 1: Fill Critical Gaps (Weeks 1-2)
+
+**Priority P0 - Safety & Operations**:
+1. Add 5 utterances for emergency response scenarios
+2. Add 3 utterances for critical alarm handling
+3. Add 2 utterances for safety interlock verification
+
+**Priority P1 - High Business Impact**:
+1. Develop 8 energy optimization utterances (Prescriptive)
+2. Create 6 maintenance scheduling utterances (Prescriptive)
+3. Add 4 multi-asset coordination utterances (Prescriptive)
+
+### Phase 2: Expand Coverage (Weeks 3-4)
+
+**Predictive Enhancements**:
+1. Add 5 RUL (Remaining Useful Life) estimation utterances
+2. Create 4 degradation trend analysis utterances
+3. Develop 3 seasonal performance prediction utterances
+
+**Prescriptive Enhancements**:
+1. Add 5 resource allocation optimization utterances
+2. Create 4 cost-benefit analysis utterances
+3. Develop 3 policy recommendation utterances
+
+### Phase 3: Refinement (Weeks 5-6)
+
+1. Test all utterances with actual users
+2. Refine characteristic_form descriptions based on feedback
+3. Validate deterministic flags with domain experts
+4. Update priority levels based on usage patterns
+
+### Phase 4: Documentation & Handoff (Week 7)
+
+1. Document all SME contributions
+2. Create usage guidelines for each utterance category
+3. Establish maintenance procedures for utterance updates
+4. Train new SMEs on utterance creation process
+
+---
+
+## Lessons Learned
+
+### What Worked Well
+
+1. **Incremental Development**: Starting with Retrospective queries provided a solid foundation
+2. **SME Specialization**: Having dedicated SMEs for different equipment types improved quality
+3. **Coverage Matrix**: Visual tracking helped identify gaps quickly
+4. **Priority-Based Approach**: Focusing on P0/P1 utterances first delivered immediate value
+
+### Challenges Encountered
+
+1. **Prescriptive Underrepresentation**: Harder to create good prescriptive utterances - requires deep operational knowledge
+2. **SME Availability**: Maintenance planning SME had limited time, slowing prescriptive development
+3. **Characteristic Form Precision**: Initial attempts were too vague; required multiple iterations
+4. **Cross-Equipment Scenarios**: Multi-asset utterances are complex and need careful design
+
+### Recommendations for Other Domains
+
+1. **Start with Retrospective**: Build knowledge and data query foundation first
+2. **Engage Multiple SMEs Early**: Don't rely on single expert for entire domain
+3. **Use Real Scenarios**: Base utterances on actual operational incidents and decisions
+4. **Iterate on Characteristic Forms**: Expect to refine these multiple times
+5. **Track Coverage Explicitly**: Use matrices to visualize gaps and progress
+6. **Prioritize Ruthlessly**: Focus on high-impact scenarios before edge cases
+
+---
+
+## Conclusion
+
+This Industrial Asset Management case study demonstrates the practical application of the Utterance Design Guideline in a real-world context. As a **partially accomplished** example, it shows:
+
+- **Successful areas**: Retrospective and Predictive categories with good coverage
+- **Work in progress**: Prescriptive category needs significant expansion
+- **Incremental approach**: Phased development with clear next steps
+- **SME collaboration**: Multiple experts contributing to their areas of expertise
+- **Continuous improvement**: Ongoing refinement based on usage and feedback
+
+This case study can serve as a template for other domains, showing both the successes and challenges of building a comprehensive utterance set incrementally.
+
+---
+
+## References
+
+- [Main Utterance Design Guideline](utterance_design_guideline.md)
+- Background materials: `background/all_utterance.jsonl`
+- Domain abstraction patterns: `background/asset_agnostic_utterances.docx`
\ No newline at end of file
diff --git a/docs/guideline/case_study_wind_turbine.md b/docs/guideline/case_study_wind_turbine.md
new file mode 100644
index 00000000..1ba84933
--- /dev/null
+++ b/docs/guideline/case_study_wind_turbine.md
@@ -0,0 +1,448 @@
+# Case Study: Wind Turbine Operations
+
+## Overview
+
+This case study demonstrates the application of the domain-agnostic utterance design guideline to **wind turbine operations and renewable energy management**. It provides a complete, end-to-end example of how to develop utterances for a wind farm monitoring and optimization system.
+
+**Domain**: Renewable Energy - Wind Power Generation
+**Primary Assets**: Wind Turbines, Gearboxes, Generators, Blades
+**Operational Focus**: Power generation optimization, predictive maintenance, grid integration
+**Dataset**: 30 complete utterances with all 9 required fields
+
+---
+
+## 1. Domain Context
+
+### 1.1 Wind Farm Components
+
+Wind turbines are complex electromechanical systems consisting of multiple critical components:
+
+- **Wind Turbine (WT)**: Complete turbine assembly including tower, nacelle, and rotor
+- **Rotor Blades**: Aerodynamic surfaces that capture wind energy (typically 3 blades)
+- **Gearbox**: Mechanical transmission system that increases rotational speed
+- **Generator**: Converts mechanical energy to electrical energy
+- **Nacelle**: Housing containing gearbox, generator, and control systems
+- **Tower**: Structural support elevating the turbine to optimal wind conditions
+- **Control Systems**: Pitch control, yaw control, and power electronics
+
+### 1.2 Key Measurements and Sensors
+
+Wind turbines are instrumented with extensive sensor networks:
+
+- **Wind Measurements**: Wind speed (m/s), wind direction (degrees), turbulence intensity
+- **Power Metrics**: Power output (kW/MW), capacity factor (%), energy production (MWh)
+- **Mechanical Sensors**: Vibration (mm/s, g-force), temperature (°C), rotational speed (RPM)
+- **Position Sensors**: Pitch angle (degrees), yaw angle (degrees), blade position
+- **Environmental**: Ambient temperature, humidity, barometric pressure, ice detection
+- **Electrical**: Voltage, current, frequency, power factor, grid connection status
+
+### 1.3 Operational Events and Conditions
+
+Wind farm operations involve various events and conditions:
+
+- **Normal Operations**: Power generation, grid synchronization, automatic control
+- **Alarms and Faults**: Overspeed, overtemperature, vibration exceedance, grid faults
+- **Maintenance Activities**: Preventive maintenance, corrective repairs, component replacements
+- **Environmental Events**: High wind shutdown, ice formation, lightning strikes, storms
+- **Grid Events**: Curtailment requests, frequency regulation, voltage support
+- **Performance Issues**: Power curve degradation, underperformance, availability loss
+
+---
+
+## 2. Category Mapping to Wind Energy
+
+The 8 universal categories from the guideline map naturally to wind turbine operations:
+
+| Universal Category | Wind Energy Application | Example Use Cases |
+|-------------------|------------------------|-------------------|
+| **Information Retrieval** | Asset inventory, sensor configuration, knowledge base queries | "What wind turbines are available?", "List sensors on WT-105" |
+| **Data Extraction** | Historical data retrieval, time-series queries, event logs | "Get power output for January", "Retrieve vibration data" |
+| **Analysis & Inference** | Performance analysis, correlation studies, trend identification | "Analyze wind speed vs power correlation", "Identify degradation trends" |
+| **Anomaly & Exception Detection** | Condition monitoring, fault detection, abnormal pattern recognition | "Detect vibration anomalies", "Identify temperature exceedance" |
+| **Future State Prediction** | Power forecasting, RUL estimation, failure prediction | "Forecast power output", "Predict gearbox remaining life" |
+| **Recommendation & Optimization** | Maintenance planning, control optimization, decision support | "Recommend maintenance actions", "Optimize pitch angle" |
+| **Workflow & Process Execution** | Automated control sequences, maintenance workflows | "Execute startup sequence", "Initiate emergency shutdown" |
+| **Compliance & Validation** | Grid code compliance, safety checks, performance validation | "Verify grid connection requirements", "Validate power curve" |
+
+---
+
+## 3. Three Core Problem Categories Applied
+
+### 3.1 RETROSPECTIVE (Understanding What Happened)
+**Status**: [COMPLETE] Complete - 10 utterances covering all key retrospective needs
+
+**Coverage Areas**:
+- Asset inventory and configuration (IDs 1-2)
+- Historical data retrieval (IDs 3-4, 9-10)
+- Knowledge base queries (ID 5)
+- Performance analysis (ID 6)
+- Event history (ID 7)
+- Operational metrics (ID 8)
+
+**Key Characteristics**:
+- Deterministic responses based on historical data
+- Focus on "what happened" and "what exists"
+- Support for operations reporting, maintenance planning, and performance analysis
+
+### 3.2 PREDICTIVE (Predicting What Will Happen)
+**Status**: [COMPLETE] Complete - 10 utterances covering forecasting and anomaly detection
+
+**Coverage Areas**:
+- Power output forecasting (IDs 11, 18)
+- Component life prediction (IDs 12, 14, 17)
+- Anomaly detection (IDs 13, 16)
+- Weather forecasting (ID 15)
+- Performance degradation (ID 19)
+- Environmental risk prediction (ID 20)
+
+**Key Characteristics**:
+- Non-deterministic predictions with confidence levels
+- Focus on "what will happen" and "what might fail"
+- Support for predictive maintenance, energy trading, and risk management
+
+### 3.3 PRESCRIPTIVE (Helping Humans Make Decisions)
+**Status**: [COMPLETE] Complete - 10 utterances covering optimization and recommendations
+
+**Coverage Areas**:
+- Maintenance recommendations (IDs 21, 23, 24, 28)
+- Control optimization (IDs 22, 27)
+- Operational decisions (IDs 25, 29)
+- Resource planning (ID 26)
+- Lifecycle management (ID 30)
+
+**Key Characteristics**:
+- Action-oriented recommendations with justification
+- Focus on "what should we do" and "how to optimize"
+- Support for decision-making, planning, and optimization
+
+---
+
+## 4. Coverage Analysis Matrix
+
+| Category | Retrospective | Predictive | Prescriptive | Total |
+|----------|--------------|-----------|--------------|-------|
+| Information Retrieval | 3 | 0 | 0 | 3 |
+| Data Extraction | 5 | 0 | 0 | 5 |
+| Analysis & Inference | 2 | 2 | 0 | 4 |
+| Anomaly & Exception Detection | 0 | 3 | 0 | 3 |
+| Future State Prediction | 0 | 7 | 0 | 7 |
+| Recommendation & Optimization | 0 | 0 | 10 | 10 |
+| **Total** | **10** | **10** | **10** | **30** |
+
+**Entity Coverage**:
+- WindTurbine: 15 utterances (50%)
+- WindFarm: 8 utterances (27%)
+- Gearbox: 5 utterances (17%)
+- Generator: 1 utterance (3%)
+- Blade: 1 utterance (3%)
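+
+A matrix like the one above can be tallied from the utterance records instead of being maintained by hand. This is a rough sketch under stated assumptions: the helper names `coverage_matrix` and `format_matrix` are hypothetical, and an utterance that lists several groups is counted once in each of them, which is a choice made here for illustration rather than something the guideline prescribes.
+
+```python
+from collections import Counter
+
+GROUPS = ("retrospective", "predictive", "prescriptive")
+
+
+def coverage_matrix(utterances):
+    """Count utterances per (category, group) pair."""
+    counts = Counter()
+    for utt in utterances:
+        groups = utt["group"]
+        if isinstance(groups, str):
+            groups = [groups]
+        for group in groups:
+            counts[(utt["category"], group)] += 1
+    return counts
+
+
+def format_matrix(counts):
+    """Render the counts as a Markdown table with one row per category."""
+    categories = sorted({category for category, _ in counts})
+    lines = [
+        "| Category | " + " | ".join(g.capitalize() for g in GROUPS) + " | Total |",
+        "|" + "---|" * (len(GROUPS) + 2),
+    ]
+    for category in categories:
+        # A Counter returns 0 for pairs that never occurred.
+        row = [counts[(category, g)] for g in GROUPS]
+        lines.append(f"| {category} | " + " | ".join(map(str, row)) + f" | {sum(row)} |")
+    return "\n".join(lines)
+```
+
+Regenerating the table from the data whenever utterances are added keeps the row and column totals in step with the dataset.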
+
+---
+
+## 5. Utterance Schema
+
+Each utterance in the wind turbine dataset includes all 9 required fields:
+
+1. **id**: Unique identifier (1-30)
+2. **text**: Natural language query or command
+3. **type**: System type (IoT, Analytics, Forecasting, Prognostics, AnomalyDetection, DecisionSupport, Optimization, Maintenance, FMEA)
+4. **category**: Universal category from the 8 defined categories
+5. **deterministic**: Boolean indicating if response is deterministic
+6. **characteristic_form**: Expected response format and content
+7. **group**: Problem category (retrospective, predictive, prescriptive)
+8. **entity**: Physical asset or component (WindTurbine, Gearbox, Generator, Blade, WindFarm)
+9. **note**: Metadata including source, owner, and context
+
+---
+
+## 6. Example Utterances
+
+### Example 1: Retrospective - Asset Inventory
+```json
+{
+ "id": 1,
+ "text": "What wind turbines are available in the wind farm?",
+ "type": "IoT",
+ "category": "Information Retrieval",
+ "deterministic": true,
+ "characteristic_form": "List of all wind turbine assets in the farm with their identifiers and locations",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Asset inventory requirements; Owner: Operations Team; Basic inventory query for wind farm overview"
+}
+```
+
+### Example 2: Retrospective - Historical Data
+```json
+{
+ "id": 3,
+ "text": "Retrieve power output data for Wind Turbine WT-105 from January 2024.",
+ "type": "IoT",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Time series data of power output in kW for the specified turbine and time period",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Performance analysis; Owner: Analytics Team; Historical power generation data retrieval"
+}
+```
+
+### Example 3: Predictive - Power Forecasting
+```json
+{
+ "id": 11,
+ "text": "Forecast power output for Wind Turbine WT-105 for the next 48 hours.",
+ "type": "Forecasting",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Predicted power output time series based on weather forecast and historical performance",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Energy trading requirements; Owner: Energy Management Team; Short-term power forecasting for grid planning"
+}
+```
+
+### Example 4: Predictive - Anomaly Detection
+```json
+{
+ "id": 13,
+ "text": "Is there any anomaly detected in the vibration pattern of Wind Turbine WT-105?",
+ "type": "AnomalyDetection",
+ "category": "Anomaly & Exception Detection",
+ "deterministic": false,
+ "characteristic_form": "Boolean result with anomaly score and description of detected abnormal vibration patterns",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Condition monitoring system; Owner: Predictive Maintenance Team; Real-time anomaly detection query"
+}
+```
+
+### Example 5: Prescriptive - Maintenance Recommendation
+```json
+{
+ "id": 21,
+ "text": "Recommend maintenance actions for Wind Turbine WT-105 based on current condition.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Prioritized list of recommended maintenance actions with justification and urgency level",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Condition-based maintenance; Owner: Maintenance Team; Maintenance recommendation based on condition indicators"
+}
+```
+
+### Example 6: Prescriptive - Control Optimization
+```json
+{
+ "id": 22,
+ "text": "Optimize the pitch angle for Wind Turbine WT-105 to maximize power output at current wind speed.",
+ "type": "Optimization",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Optimal pitch angle in degrees for current wind conditions to maximize power capture",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Performance optimization; Owner: Control Systems Team; Real-time pitch optimization for power maximization"
+}
+```
+
+---
+
+## 7. Development Process
+
+### 7.1 Initial Planning
+1. **Domain Analysis**: Identified key wind turbine components, measurements, and operational scenarios
+2. **Stakeholder Mapping**: Defined user roles (Operations, Maintenance, Engineering, Management)
+3. **Use Case Identification**: Listed critical operational needs across the three problem categories
+4. **Coverage Planning**: Ensured balanced distribution across categories and entities
+
+### 7.2 Utterance Creation
+1. **Retrospective First**: Started with foundational data and knowledge queries
+2. **Predictive Second**: Built on retrospective data to create forecasting and detection utterances
+3. **Prescriptive Last**: Leveraged both retrospective and predictive capabilities for recommendations
+4. **Iterative Refinement**: Reviewed and refined utterances for clarity and completeness
+
+### 7.3 Quality Assurance
+- [COMPLETE] All 9 required fields present in every utterance
+- [COMPLETE] Balanced distribution: 10 utterances per problem category
+- [COMPLETE] Diverse entity coverage: 5 different physical entities
+- [COMPLETE] Realistic use cases based on actual wind farm operations
+- [COMPLETE] Clear characteristic_form descriptions for each utterance
+- [COMPLETE] Comprehensive note fields with source, owner, and context
+
+---
+
+## 8. Key Insights and Lessons Learned
+
+### 8.1 Domain-Specific Considerations
+
+**Wind Energy Unique Aspects**:
+- **Weather Dependency**: Wind forecasting is critical for power prediction and operations planning
+- **Grid Integration**: Curtailment and grid stability considerations are essential
+- **Component Criticality**: Gearbox failures are expensive; predictive maintenance is high-value
+- **Environmental Challenges**: Ice formation, storms, and extreme weather require special handling
+- **Control Optimization**: Real-time pitch and yaw optimization significantly impact performance
+
+### 8.2 Utterance Design Patterns
+
+**Effective Patterns Observed**:
+1. **Specific Asset References**: Using "Wind Turbine WT-105" makes utterances concrete and testable
+2. **Time-Bound Queries**: Including time ranges ("next 48 hours", "last month") clarifies expectations
+3. **Measurable Outcomes**: Specifying units (kW, m/s, degrees) in characteristic_form improves clarity
+4. **Actionable Recommendations**: Prescriptive utterances include "what" and "why" for decisions
+5. **Hierarchical Entities**: Supporting both individual turbines and farm-level queries provides flexibility
+
+### 8.3 Completeness Framework and Rationale
+
+**Definition of "Complete" for This Case Study**:
+
+A case study is considered **complete** when it demonstrates the **minimum viable coverage** needed to validate the guideline's applicability to a new domain. This means:
+
+1. **Structural Completeness**: All three core problem categories represented
+2. **Categorical Completeness**: Multiple universal categories demonstrated
+3. **Entity Completeness**: Key physical assets covered
+4. **Use Case Completeness**: Critical operational scenarios addressed
+5. **Stakeholder Completeness**: Major user roles represented
+
+**Completeness Criteria Applied**:
+
+| Criterion | Target | Achieved | Status |
+|-----------|--------|----------|--------|
+| **Problem Categories** | All 3 (RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE) | 3 categories, 10 each | [COMPLETE] Complete |
+| **Universal Categories** | At least 5 of 8 | 6 of 8 categories used | [COMPLETE] Complete |
+| **Physical Entities** | At least 3 critical assets | 5 entities (WindTurbine, Gearbox, Generator, Blade, WindFarm) | [COMPLETE] Complete |
+| **Operational Lifecycle** | Monitor, Predict, Optimize | All phases covered | [COMPLETE] Complete |
+| **Stakeholder Roles** | At least 3 roles | 6 roles (Operations, Maintenance, Engineering, Analytics, Management, Control Systems) | [COMPLETE] Complete |
+| **Deterministic Mix** | Both types present | 10 deterministic, 20 non-deterministic | [COMPLETE] Complete |
+
+**Why 30 Utterances is Sufficient**:
+
+1. **Demonstration Purpose**: This is a case study to demonstrate guideline application, not a production system
+2. **Balanced Coverage**: Equal representation (10 each) across the three core categories proves the framework works
+3. **Pattern Establishment**: 30 utterances establish clear patterns that can be replicated for additional use cases
+4. **Critical Path Coverage**: All essential wind farm operations are represented
+5. **Scalability Proof**: The structure shows how to expand to 100+ utterances if needed
+
+**What "Complete" Does NOT Mean**:
+
+- [NOT] **Not Exhaustive**: Doesn't cover every possible wind farm query
+- [NOT] **Not Production-Ready**: A real system would need 100-200+ utterances
+- [NOT] **Not All Categories**: Only 6 of 8 universal categories used (Workflow and Compliance not included)
+- [NOT] **Not All Components**: Doesn't cover towers, foundations, substations, SCADA systems
+- [NOT] **Not All Scenarios**: Doesn't include commissioning, decommissioning, or advanced grid services
+
+**Comparison with Industrial Asset Management Case Study**:
+
+| Aspect | Wind Turbine (30) | Industrial ALM (152) | Interpretation |
+|--------|-------------------|---------------------|----------------|
+| **Purpose** | Demonstrate guideline from scratch | Evolved from real project | Wind turbine is a "clean slate" example |
+| **Completeness** | Minimum viable demonstration | Comprehensive production dataset | Both are "complete" for their purposes |
+| **Coverage** | Balanced across categories | Heavily retrospective (77/152) | Wind turbine shows ideal balance |
+| **Development** | Designed complete upfront | Grew incrementally with SME input | Different development approaches |
+
+**Potential Extensions** (deliberately excluded to maintain focus):
+- Workflow execution utterances (startup/shutdown sequences)
+- Compliance validation utterances (grid code requirements)
+- Multi-turbine coordination utterances (wake effect optimization)
+- Financial analysis utterances (revenue optimization, cost analysis)
+- Environmental impact utterances (noise monitoring, wildlife protection)
+- Advanced analytics (fleet benchmarking, cross-site comparison)
+- Integration utterances (weather service, grid operator, maintenance systems)
+
+**Conclusion on Completeness**:
+
+This case study is **complete as a demonstration** because it:
+1. Proves the guideline works for renewable energy
+2. Provides sufficient examples for each problem category
+3. Establishes replicable patterns for expansion
+4. Covers critical operational needs
+5. Validates the 9-field schema in a new domain
+
+It is **intentionally not exhaustive** because:
+1. The goal is demonstration, not production deployment
+2. Keeping it focused (30 vs 150+) makes it more accessible as a learning tool
+3. It shows the minimum needed to validate the framework
+4. Teams can extend it based on their specific needs
+
+**Practical Guidance**: Teams adopting this framework should:
+- Start with 20-30 utterances like this case study
+- Validate the framework works for their domain
+- Incrementally expand based on actual user needs
+- Aim for 100-200 utterances for production systems
+- Continuously refine based on usage patterns
+
+---
+
+## 9. Comparison with Industrial Asset Management Case Study
+
+| Aspect | Wind Turbine (This Study) | Industrial Asset Management |
+|--------|---------------------------|----------------------------|
+| **Domain** | Renewable energy, power generation | HVAC, building automation |
+| **Primary Assets** | Wind turbines, gearboxes, generators | Chillers, AHUs, equipment |
+| **Key Metrics** | Power output, wind speed, capacity factor | Temperature, pressure, efficiency |
+| **Operational Focus** | Energy production, grid integration | Climate control, comfort |
+| **Maintenance Driver** | Component reliability, downtime cost | Service continuity, comfort |
+| **Dataset Size** | 30 utterances (complete) | 152 utterances (comprehensive) |
+| **Development Stage** | Complete from start | Evolved incrementally |
+
+**Key Similarity**: Both demonstrate the universal applicability of the three core problem categories (RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE) across different physical asset domains.
+
+---
+
+## 10. Using This Case Study
+
+### 10.1 As a Template
+This case study can serve as a template for developing utterances in other renewable energy domains:
+- **Solar PV Systems**: Replace wind turbines with solar panels, inverters, and trackers
+- **Hydroelectric Plants**: Adapt to turbines, generators, and water flow management
+- **Battery Energy Storage**: Focus on charge/discharge cycles, state of health, and grid services
+
+### 10.2 As a Reference
+Use this case study to:
+- Understand how to apply the domain-agnostic guideline to a specific domain
+- See examples of complete utterances with all 9 required fields
+- Learn patterns for creating balanced coverage across problem categories
+- Identify best practices for entity selection and note field usage
+
+### 10.3 As a Starting Point
+Teams can:
+1. Clone the JSON dataset and modify utterances for their specific wind farm
+2. Add domain-specific utterances for unique operational requirements
+3. Extend coverage to additional entities (towers, foundations, substations)
+4. Incorporate organization-specific terminology and processes
+
+---
+
+## 11. Dataset Access
+
+**JSON Dataset**: `wind_turbine_utterances.json`
+**Location**: Same directory as this document
+**Format**: JSON array with 30 objects
+**Schema Version**: 9-field schema (6 base + 3 enhanced)
+
+**Quick Statistics**:
+- Total Utterances: 30
+- Retrospective: 10 (33%)
+- Predictive: 10 (33%)
+- Prescriptive: 10 (33%)
+- Entities: WindTurbine (15), WindFarm (8), Gearbox (5), Generator (1), Blade (1)
+- Deterministic: 10 (33%)
+- Non-deterministic: 20 (67%)
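+
+The quick statistics above could be recomputed directly from the dataset file. The sketch below assumes the file is a JSON array of objects with the fields shown in Section 6; the function names `load_utterances` and `quick_statistics` are hypothetical and not part of any published tooling.
+
+```python
+import json
+from collections import Counter
+
+
+def load_utterances(path):
+    """Load a JSON array of utterance objects from `path`."""
+    with open(path, encoding="utf-8") as handle:
+        return json.load(handle)
+
+
+def quick_statistics(utterances):
+    """Summarize group, entity, and deterministic counts for a dataset."""
+    groups, entities = Counter(), Counter()
+    for utt in utterances:
+        # Both fields may hold a single string or a list of strings.
+        for field, counter in (("group", groups), ("entity", entities)):
+            value = utt[field]
+            counter.update([value] if isinstance(value, str) else value)
+    deterministic = sum(1 for utt in utterances if utt["deterministic"])
+    return {
+        "total": len(utterances),
+        "groups": dict(groups),
+        "entities": dict(entities),
+        "deterministic": deterministic,
+        "non_deterministic": len(utterances) - deterministic,
+    }
+```
+
+Assuming the dataset sits in the current working directory, `quick_statistics(load_utterances("wind_turbine_utterances.json"))` would return the counts to compare against the list above.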
+
+---
+
+## 12. Conclusion
+
+This wind turbine case study demonstrates that the domain-agnostic utterance design guideline applies to renewable energy operations. The 30 utterances provide balanced, demonstration-level coverage of wind farm monitoring, prediction, and optimization needs while maintaining the universal structure defined in the guideline.
+
+**Key Takeaways**:
+1. The three core problem categories (RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE) are universally applicable
+2. The 9-field schema provides sufficient structure and flexibility for any domain
+3. Complete case studies can be developed efficiently using the guideline framework
+4. Domain-specific terminology and concepts map naturally to universal categories
+5. Balanced coverage across categories ensures comprehensive system capabilities
+
+This case study, alongside the Industrial Asset Management case study, validates the cross-domain applicability of the utterance design guideline and provides practical examples for teams developing AI-powered operational systems in any physical asset domain.
\ No newline at end of file
diff --git a/docs/guideline/ground_truth_design_guideline.md b/docs/guideline/ground_truth_design_guideline.md
new file mode 100644
index 00000000..efa21fdf
--- /dev/null
+++ b/docs/guideline/ground_truth_design_guideline.md
@@ -0,0 +1,2636 @@
+# Ground Truth Creation Best Practice
+
+## Overview
+
+This document provides best practices for creating ground truth for an utterance. Ground truth can be viewed as the trajectory of the utterance's execution under ideal conditions, without hallucination or execution failure. It is the cornerstone for evaluating AI agent systems, irrespective of domain or application. Ground truth represents the **ideal trajectory**: the expected sequence of planning decisions and execution steps that a system should follow to correctly answer a user's natural language query (the utterance).
+
+### What is Ground Truth?
+
+Ground truth is a structured specification that defines:
+1. **What** the AI agent system should accomplish (expected output)
+2. **How** the AI agent system should accomplish it (execution trajectory)
+3. **Why** each step is necessary (planning rationale)
+
+It serves as the **soft ground truth** for automated evaluation, enabling systematic comparison of actual agent behavior against expert-defined expected behavior. "Soft" means the framework accommodates both deterministic and non-deterministic tasks: it provides exact answers for the former and acceptance criteria for the latter. For example, in a deterministic task like "How many work orders in 2017?", the ground truth would specify the exact number of work orders. In a non-deterministic task like "Recommend maintenance actions", the ground truth would specify the criteria for a valid recommendation, such as "The recommended actions should address the top 3 failure modes with the highest risk scores."
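+
+The distinction between exact answers and acceptance criteria can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not the evaluation code used by AssetOpsBench: the function `score_response`, its parameters, and the sample values in the comments are all hypothetical.
+
+```python
+from typing import Any, Callable, Iterable
+
+
+def score_response(response: Any, deterministic: bool, expected: Any = None,
+                   criteria: Iterable[Callable[[Any], bool]] = ()) -> bool:
+    """Return True if `response` satisfies the ground truth.
+
+    Deterministic tasks are compared against one exact expected value.
+    Non-deterministic tasks must satisfy every acceptance criterion.
+    """
+    if deterministic:
+        return response == expected
+    checks = list(criteria)
+    # With no criteria defined there is nothing to accept against, so fail.
+    return bool(checks) and all(check(response) for check in checks)
+
+
+def covers_top_failure_modes(top_modes):
+    """Build a criterion: every mode in `top_modes` must be addressed."""
+    def check(recommendations):
+        addressed = {rec["failure_mode"] for rec in recommendations}
+        return set(top_modes) <= addressed
+    return check
+
+
+# Hypothetical usage (the count and mode names are made up for illustration):
+#   score_response(42, deterministic=True, expected=42)
+#   score_response(recs, deterministic=False,
+#                  criteria=[covers_top_failure_modes(["A", "B", "C"])])
+```
+
+Expressing each acceptance criterion as a small predicate keeps the non-deterministic case open-ended: several different recommendations can pass as long as each one meets every stated criterion.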
+
+### Purpose
+
+Enable rigorous, reproducible benchmarking of AI agent systems by providing:
+- Objective evaluation criteria
+- Automated scoring mechanisms
+- Failure mode discovery
+- Architectural comparison
+- Model selection guidance
+
+### Use Cases
+
+While this document uses industrial asset management examples (from AssetOpsBench), the principles apply to any domain requiring agent-based automation, including energy systems, healthcare, finance, and customer service.
+
+### Two System Paradigms
+
+Ground truth supports evaluation of two complementary paradigms:
+
+**1. Multi-Agent Systems**
+- Multiple specialized AI agents collaborating to solve tasks
+ - Each agent has domain expertise (e.g., IoTAgent for sensor data, WOAgent for work orders)
+ - Agents communicate and coordinate through orchestration
+- Example: AssetOpsBench with 4 specialized agents (IoT, FMSR, TSFM, WO)
+
+**2. Multi-MCP Systems**
+- Multiple Model Context Protocol (MCP) servers providing standardized tools
+- MCP servers expose capabilities (e.g., filesystem, database, web-search)
+- LLM orchestrates tool usage to accomplish tasks
+- Example: MCP-Bench with filesystem, database, API servers
+
+**Key Insight**: Both paradigms share the **same ground truth structure**. The difference lies in semantic interpretation:
+- **Multi-Agent**: `type` = agent type (e.g., "IoTAgent"), `agent` = agent name
+- **Multi-MCP**: `type` = MCP server type (e.g., "filesystem"), `agent` = server name
+
+### Benchmarking Context
+
+Ground truth enables three critical benchmarking capabilities:
+
+**1. Architectural Comparison**
+- Compare different orchestration strategies (Agent-As-Tool vs Plan-Execute)
+- Evaluate trade-offs between reasoning approaches
+- Example: AssetOpsBench found Agent-As-Tool achieves 60-70% completion vs 46% for Plan-Execute
+
+**2. Model Selection**
+- Objectively compare LLM performance across scenarios
+- Identify model-specific strengths and weaknesses
+- Example: AssetOpsBench evaluates models across multi-agent coordination, tool selection, and domain-specific reasoning tasks
+
+**3. Failure Mode Discovery**
+- Automatically identify common failure patterns
+- Diagnose reasoning and recovery bottlenecks
+- Enable systematic debugging and improvement
+
+### Community and Reproducibility
+
+Ground truth serves as shared infrastructure for open source and research communities:
+- **Standardized Evaluation**: All systems measured against same baseline
+- **Fair Comparison**: Eliminates subjective judgment
+- **Progress Tracking**: Quantify improvements over time
+- **Collaboration**: Shared benchmarks accelerate development and research
+
+### Document Scope
+
+This document provides a systematic workflow, structure definitions, best practices, and validation guidance for creating ground truth trajectories. It is based on AssetOpsBench and industry best practices for agent evaluation.
+
+---
+
+## 1. High-Level Ground Truth Creation Workflow
+
+This section provides a systematic, step-by-step process for creating ground truth trajectories. Follow this workflow to ensure completeness and consistency.
+
+### 1.1 Overview: The Five-Phase Process
+
+```
+Phase 1: Understand the Utterance
+ ↓
+Phase 2: Define Expected Behavior
+ ↓
+Phase 3: Design the Ideal Trajectory
+ ↓
+Phase 4: Validate and Refine
+ ↓
+Phase 5: Document and Review
+```
+
+### 1.2 Phase 1: Understand the Utterance
+
+**Objective**: Fully comprehend what the user is asking and what constitutes success.
+
+**Key Questions**:
+1. What is the user's intent?
+2. What information or action does the user need?
+3. Which agents/servers are required?
+4. What data sources must be accessed?
+5. What constraints or conditions apply?
+
+**Actions**:
+- [ ] Read the utterance carefully
+- [ ] Identify key entities (equipment, time ranges, metrics)
+- [ ] Determine the problem category (RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE)
+- [ ] Classify the utterance type (Knowledge Query, Data Query, Inference Query, etc.)
+- [ ] Identify the primary agent/server type
+
+**Example**:
+```
+Utterance: "Get the work order of equipment CWC04013 for year 2017."
+
+Analysis:
+- Intent: Retrieve historical work order records
+- Information needed: Work orders for specific equipment and time period
+- Required agent: WOAgent (Work Order Agent)
+- Data source: Work order database
+- Constraints: Equipment ID = CWC04013, Year = 2017
+- Group: retrospective (historical data retrieval)
+- Category: Knowledge Query
+- Type: Workorder
+- Entity: Equipment
+- Deterministic: true (33 records expected)
+```
+
+### 1.3 Phase 2: Define Expected Behavior
+
+**Objective**: Specify what the correct output should be and how to validate it.
+
+**Key Decisions**:
+
+**Decision 1: Is this deterministic or non-deterministic?**
+
+Ask: "Is there exactly one correct answer?"
+
+- **Deterministic (true)**:
+ - Data retrieval with specific parameters
+ - Factual queries with verifiable answers
+ - Calculations with exact results
+ - Example: "How many work orders in 2017?" → Exact count
+
+- **Non-deterministic (false)**:
+ - Recommendations with multiple valid options
+ - Predictions with probabilistic outputs
+ - Analysis with subjective interpretations
+ - Example: "What maintenance actions are recommended?" → Multiple valid suggestions
+
+**Decision 2: What is the characteristic form?**
+
+For **deterministic** utterances:
+```
+"There will be 33 records. The expected response should retrieve
+all work orders for equipment CWC04013 in the year 2017, ensuring
+correct equipment ID and time filtering."
+```
+
+For **non-deterministic** utterances:
+```
+"The answer should contain one or more valid work order
+recommendations with rationale. The recommendations should be
+relevant to addressing anomalies in equipment CWC04009."
+```
+
+**Decision 3: What is the expected output structure?**
+
+Define `final_out`:
+- For deterministic: Exact expected values
+- For non-deterministic: Representative example
+
+**Actions**:
+- [ ] Set `deterministic` flag (true/false)
+- [ ] Write clear `characteristic_form` description
+- [ ] Define `final_out` structure
+- [ ] Write `final_out_description` explaining the output
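+
+As an illustration, the four Phase 2 fields for the Phase 1 example utterance might be filled in as follows. This is a sketch only: the `record_count` key inside `final_out` and the wording of `final_out_description` are assumptions made for illustration, not values taken from the AssetOpsBench dataset.
+
+```json
+{
+  "deterministic": true,
+  "characteristic_form": "There will be 33 records. The expected response should retrieve all work orders for equipment CWC04013 in the year 2017, ensuring correct equipment ID and time filtering.",
+  "final_out": {
+    "record_count": 33
+  },
+  "final_out_description": [
+    "Number of work orders found for equipment CWC04013 in 2017"
+  ]
+}
+```
+
+For a non-deterministic utterance, the same four fields would be present, with `deterministic` set to `false` and `final_out` holding one representative answer rather than the only correct one.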
+
+### 1.4 Phase 3: Design the Ideal Trajectory
+
+**Objective**: Map out the step-by-step process the agent should follow.
+
+#### Step 3A: High-Level Planning
+
+**Create `planning_steps`**: Strategic decomposition of the problem
+
+**Template**:
+```json
+"planning_steps": [
+ {
+ "agent": "AgentName",
+ "instruction": "What this agent should accomplish (high-level)"
+ }
+]
+```
+
+**Guidelines**:
+- Use semantic, intent-focused descriptions
+- Each step should be a meaningful sub-goal
+- Order matters - represent logical sequence
+- Use predefined agent/server names
+
+**Example**:
+```json
+"planning_steps": [
+ {
+ "agent": "WOAgent",
+ "instruction": "Create Equipment instance for CWC04013"
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Create DateRange for year 2017"
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Retrieve work orders for equipment and date range"
+ }
+]
+```
+
+#### Step 3B: Detailed Execution Design
+
+**Create `execution_steps`**: Specific tool calls and operations
+
+**For each step, define**:
+1. **name**: Unique identifier (e.g., "create_equipment", "get_work_orders")
+2. **action**: Tool/function name (e.g., "Equipment", "get_work_orders")
+3. **agent**: Which agent/server executes this
+4. **arguments**: Input parameters (can reference previous outputs)
+5. **outputs**: Variable names produced
+6. **deterministic**: Flags for name, action, arguments, outputs
+
+**Template**:
+```json
+{
+ "name": "descriptive_step_name",
+ "action": "tool_or_function_name",
+ "agent": "AgentName",
+ "arguments": {
+ "param1": "value1",
+ "param2": "reference_to_previous_output"
+ },
+ "outputs": ["output_variable_name"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+}
+```
+
+**Key Considerations**:
+- **Data Flow**: Outputs from one step become inputs to later steps
+- **Dependencies**: Identify which steps depend on others
+- **Intermediate Processing**: Include load, parse, filter, transform steps
+- **Special Steps**: Always end with "Finish" step for final output
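+
+To make the data flow concrete, the three planning steps from Step 3A could be expanded into execution steps roughly as sketched here. The argument keys (`equipment_id`, `start_date`, `end_date`, `equipment`, `date_range`, `answer`) and the output variable names are hypothetical, and intermediate load or parse steps are left out to keep the flow of variables visible. A real scenario would normally include them.
+
+```json
+{
+  "execution_steps": [
+    {
+      "name": "create_equipment",
+      "action": "Equipment",
+      "agent": "WOAgent",
+      "arguments": {"equipment_id": "CWC04013"},
+      "outputs": ["equipment"],
+      "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+    },
+    {
+      "name": "create_date_range",
+      "action": "DateRange",
+      "agent": "WOAgent",
+      "arguments": {"start_date": "2017-01-01", "end_date": "2017-12-31"},
+      "outputs": ["date_range"],
+      "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+    },
+    {
+      "name": "get_work_orders",
+      "action": "get_work_orders",
+      "agent": "WOAgent",
+      "arguments": {"equipment": "equipment", "date_range": "date_range"},
+      "outputs": ["work_orders"],
+      "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+    },
+    {
+      "name": "finish",
+      "action": "Finish",
+      "agent": "WOAgent",
+      "arguments": {"answer": "work_orders"},
+      "outputs": ["final_answer"],
+      "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+    }
+  ]
+}
+```
+
+Here `get_work_orders` consumes the `equipment` and `date_range` variables produced by the two earlier steps, and `finish` consumes `work_orders`. Every reference points to an output declared by a preceding step.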
+
+#### Step 3C: Define Execution Links
+
+**Create `execution_links`**: Connect steps to form DAG
+
+**Template**:
+```json
+"execution_links": [
+ {"source": "step1_name", "target": "step2_name"},
+ {"source": "step2_name", "target": "step3_name"}
+]
+```
+
+**Rules**:
+- Every step (except first) must have incoming link
+- Every step (except last) must have outgoing link
+- No cycles allowed (must be a DAG)
+- Multiple sources can target same step (parallel → sequential)
+- One source can target multiple steps (branching)
+
+**Validation**:
+- [ ] All step names in links exist in execution_steps
+- [ ] No orphaned steps
+- [ ] No cycles (use topological sort to verify)
+- [ ] Data dependencies correctly represented
+
+### 1.5 Phase 4: Validate and Refine
+
+**Objective**: Ensure ground truth is correct, complete, and consistent.
+
+#### Validation Checklist
+
+**Structural Validation**:
+- [ ] All required fields present (id, text, type, category, deterministic, characteristic_form)
+- [ ] planning_steps is list of dicts with "agent" and "instruction"
+- [ ] execution_steps has all required fields for each step
+- [ ] execution_links properly connects all steps
+- [ ] final_out and final_out_description present
+
+**Semantic Validation**:
+- [ ] Utterance text is clear and unambiguous
+- [ ] Category correctly classifies the utterance
+- [ ] Deterministic flag matches answer uniqueness
+- [ ] characteristic_form clearly describes expected output
+- [ ] Planning steps use predefined agent names
+- [ ] Execution steps use valid tool/function names
+- [ ] All argument references point to existing outputs
+
+**Logical Validation**:
+- [ ] Planning steps form logical problem-solving sequence
+- [ ] Execution steps accomplish planning goals
+- [ ] Data flows correctly through execution steps
+- [ ] DAG structure is valid (no cycles, no orphans)
+- [ ] Final output matches characteristic_form
+
+**Domain Validation**:
+- [ ] Ground truth reflects actual operational workflow
+- [ ] Tool usage is realistic and efficient
+- [ ] Parameters use correct domain terminology
+- [ ] Expected output is achievable with available tools
+
+#### Refinement Process
+
+1. **Test with Domain Expert**: Walk through trajectory with SME
+2. **Simulate Execution**: Mentally execute each step to verify feasibility
+3. **Check Edge Cases**: Consider what could go wrong
+4. **Optimize Path**: Remove unnecessary steps, combine where possible
+5. **Update Documentation**: Ensure notes explain any non-obvious decisions
+
+### 1.6 Phase 5: Document and Review
+
+**Objective**: Finalize documentation and prepare for use.
+
+#### Documentation Requirements
+
+**Required Documentation**:
+1. **characteristic_form**: Clear, detailed description of expected output
+2. **final_out_description**: Explanation of what the output represents
+3. **note** field: Context, rationale, and metadata
+
+**Recommended Note Content**:
+```json
+"note": "Source: [where this scenario came from];
+ Owner: [SME who created/validated it];
+ Context: [important background information];
+ Design rationale: [why this approach was chosen];
+ Implementation notes: [technical details];
+ Omitted: [what was intentionally left out]"
+```
+
+#### Final Review
+
+**Review with Stakeholders**:
+- [ ] Domain expert validates correctness
+- [ ] System architect validates technical feasibility
+- [ ] End user validates practical utility
+- [ ] Peer reviewer validates consistency with other scenarios
+
+**Quality Gates**:
+- [ ] Passes all validation checks
+- [ ] Achieves >90% confidence from domain expert
+- [ ] Aligns with existing ground truth patterns
+- [ ] Documentation is complete and clear
+
+### 1.7 Quick Reference: Creation Process Flow
+
+```
+START: New Utterance
+ ↓
+Q1: What is user's intent?
+ → Identify: RETROSPECTIVE, PREDICTIVE, or PRESCRIPTIVE
+ ↓
+Q2: Is there exactly one correct answer?
+ → YES: deterministic = true
+ → NO: deterministic = false
+ ↓
+Q3: Which agents/servers are needed?
+ → Single: type = "AgentName"
+ → Multiple: type = "multiagent"
+ ↓
+Q4: What are the high-level steps?
+ → Create planning_steps (strategic intent)
+ ↓
+Q5: What are the detailed operations?
+ → Create execution_steps (tool calls)
+ ↓
+Q6: How do steps connect?
+ → Create execution_links (DAG)
+ ↓
+Q7: What is the expected output?
+ → Define final_out and characteristic_form
+ ↓
+Q8: Does it pass validation?
+ → YES: Document and finalize
+ → NO: Return to appropriate phase and refine
+ ↓
+DONE: Ground Truth Complete
+```
+
+### 1.8 Common Pitfalls and How to Avoid Them
+
+**Pitfall 1: Vague characteristic_form**
+- Bad: "The response should return the data."
+- Good: "There will be 33 records. The response should retrieve all work orders for equipment CWC04013 in year 2017."
+
+**Pitfall 2: Missing intermediate steps**
+- Bad: get_data → finish
+- Good: get_data → load_data → parse_data → process_data → finish
+
+**Pitfall 3: Incorrect deterministic flag**
+- Bad: Recommendation query marked as deterministic
+- Good: Recommendation query marked as non-deterministic
+
+**Pitfall 4: Circular dependencies**
+- Bad: A → B → C → A (cycle)
+- Good: A → B → C → D (DAG)
+
+**Pitfall 5: Ambiguous variable references**
+- Bad: arguments: {"data": "result"}
+- Good: arguments: {"data": "sensor_readings"} (clear variable name)
+
+### 1.9 Effort Considerations
+
+Ground truth creation effort varies significantly based on scenario complexity:
+
+**Factors Affecting Creation Time**:
+- **Scenario Complexity**: Number of execution steps and dependencies
+- **Domain Familiarity**: Understanding of tools and workflows
+- **Determinism**: Non-deterministic scenarios require more careful specification
+- **Validation Requirements**: Domain expert review adds time
+- **Documentation Quality**: Clear characteristic_form and notes take effort
+
+**Observed Patterns from AssetOpsBench**:
+- **Simple scenarios** (e.g., Scenario 3: 2 steps, linear): Faster to create
+- **Complex scenarios** (e.g., Scenario 421: 9 steps, parallel branches): More time-intensive
+- **Dataset scale**: 141 scenarios created for AssetOpsBench benchmark
+
+**Efficiency Strategies**:
+- Use templates for common patterns (e.g., Equipment + DateRange initialization)
+- Reuse execution step structures from similar scenarios
+- Batch similar scenarios together (e.g., all work order queries)
+- Involve domain experts early to avoid rework
+- Start with deterministic scenarios before tackling non-deterministic ones
+
+**Note**: Specific time estimates depend on team expertise, tooling, and domain complexity. Track your own metrics to establish baselines for your context.
+
+---
+
+## 2. Ground Truth Structure
+
+### 2.1 Core Components
+
+Each ground truth scenario consists of the following key components:
+
+```json
+{
+  "id": <integer>,
+ "uuid": "",
+ "text": "",
+ "type": "",
+ "category": "",
+  "deterministic": <boolean>,
+ "characteristic_form": "",
+ "planning_steps": [...],
+ "execution_steps": [...],
+ "execution_links": [...],
+ "final_out": {...},
+ "final_out_description": [...]
+}
+```
+
+### 2.2 Field Definitions
+
+#### Required Fields
+
+1. **id** (integer): Unique identifier for the scenario
+ - Must be unique across the entire ground truth dataset
+ - Used for tracking and referencing scenarios
+
+2. **text** (string): The natural language utterance/query
+ - Should be clear, unambiguous, and representative of real user queries
+ - Must align with the utterance design guidelines
+
+3. **type** (string): Primary agent or MCP server responsible
+ - Examples: `IoT`, `FMSA`, `TSFM`, `Workorder`, `multiagent`
+ - In MCP context: `filesystem`, `database`, `web-search`, etc.
+ - Use `multiagent` when multiple agents must collaborate
+
+4. **category** (string): Classification of the utterance
+ - Examples: `Knowledge Query`, `Data Query`, `Inference Query`, `Decision Support`, `Prediction`, `Complex Query`
+ - Should align with the 8 universal categories from utterance design guideline
+
+5. **deterministic** (boolean): Whether the utterance has a single correct answer
+ - `true`: Single verifiable correct answer (e.g., data retrieval, factual queries)
+ - `false`: Multiple valid responses possible (e.g., recommendations, predictions)
+
+6. **characteristic_form** (string): Description of expected response format and validation criteria
+ - For deterministic: Specify exact expected output or validation rules
+ - For non-deterministic: Describe acceptable response characteristics
+ - Include data format, file references, or structural requirements
+
+#### Planning Steps
+
+7. **planning_steps** (array of objects): High-level agent-level planning
+ - Each step is a dictionary with keys: `agent` and `instruction`
+ - Represents the strategic decomposition of the problem
+ - Must use predefined agent names from the available agent list
+ - Order matters - represents logical sequence of problem-solving
+
+**Structure**:
+```json
+"planning_steps": [
+ {
+ "agent": "AgentName",
+ "instruction": "High-level description of what this agent should do"
+ }
+]
+```
+
+**Best Practices**:
+- Use semantic descriptions that capture intent, not implementation details
+- Each step should represent a meaningful sub-goal
+- Comparing plans requires semantic matching (same goal, different wording is acceptable)
+
+#### Execution Steps
+
+8. **execution_steps** (array of objects): Detailed tool-level operations
+ - Each step represents a specific tool/function call
+ - Forms nodes in the execution DAG
+
+**Structure**:
+```json
+"execution_steps": [
+ {
+ "name": "unique_step_identifier",
+ "action": "tool_or_function_name",
+ "agent": "AgentName",
+ "arguments": {
+ "param1": "value1",
+ "param2": "reference_to_previous_output"
+ },
+ "outputs": ["output_variable_name"],
+ "deterministic": {
+ "name": false,
+ "action": true/false,
+ "arguments": true/false,
+ "outputs": true/false
+ }
+ }
+]
+```
+
+**Field Details**:
+- **name**: Unique identifier for this step (used in execution_links)
+ - Should be human-readable and descriptive
+ - Examples: `step1`, `create_equipment`, `get_sensor_data`, `filter_anomalies`
+
+- **action**: Name of the tool, function, or operation
+ - Examples: `sensors`, `history`, `get_work_orders`, `pickle.load`, `Finish`
+ - Special actions: `Finish`, `Self-Ask`, `Agent-Ask`
+
+- **agent**: Which agent executes this step
+
+- **arguments**: Dictionary of parameters passed to the action
+ - Can reference outputs from previous steps using variable names
+ - Use `@variable_name` or just `variable_name` for references
+
+- **outputs**: List of variable names produced by this step
+ - These can be referenced by subsequent steps
+ - Empty list `[]` if no output
+
+- **deterministic**: Dictionary indicating determinism at different levels
+ - `name`: Is the step name deterministic? (usually false - can vary)
+ - `action`: Is the action/tool deterministic? (true for data retrieval, false for LLM calls)
+ - `arguments`: Are the arguments deterministic?
+ - `outputs`: Are the outputs deterministic?
+
+#### Execution Links
+
+9. **execution_links** (array of objects): Defines the DAG structure
+ - Connects execution steps to form a directed acyclic graph
+ - Represents dependencies and execution order
+
+**Structure**:
+```json
+"execution_links": [
+ {
+ "source": "step_name_1",
+ "target": "step_name_2"
+ }
+]
+```
+
+**Best Practices**:
+- Every step (except the first) should have at least one incoming link
+- Every step (except the last) should have at least one outgoing link
+- Must form a valid DAG (no cycles)
+- Multiple sources can target the same step (parallel execution converging)
+- One source can target multiple steps (branching execution)
+
+#### Final Output
+
+10. **final_out** (object): Structured representation of the expected final result
+ - JSON object containing the actual expected output
+ - Can be null if not applicable
+ - Should match the format described in `characteristic_form`
+
+11. **final_out_description** (array of strings): Textual description of the final output
+ - Human-readable explanation of what the output represents
+ - Can include validation criteria, interpretation notes, or context
+
+#### Optional Fields
+
+12. **uuid** (string): Globally unique identifier (optional)
+ - Use when scenarios need to be tracked across systems
+ - Format: UUID v4 (e.g., "efc94d35-5236-410c-9e4f-5dcdfee818cc")
+
+13. **expected_result** (any): Legacy field for expected results (optional)
+ - Can be null or omitted
+ - Prefer using `final_out` instead
+
+14. **data** (object): Additional metadata or context (optional)
+ - Can store scenario-specific data
+ - Usually empty `{}`
+
+15. **possible_alternatives** (object): Alternative valid execution paths (optional)
+ - Documents other acceptable ways to solve the problem
+ - Useful for non-deterministic scenarios
+
+---
+
+## 3. Special Execution Steps
+
+### 3.1 Finish Step
+
+The `Finish` step is a special terminal step that marks task completion and provides the final answer.
+
+**Characteristics**:
+- **Action**: Always `"Finish"`
+- **Purpose**: Signals task completion and returns final result
+- **Placement**: Typically the last step in the execution DAG
+- **Output**: The final answer that will be presented to the user
+
+**Usage Patterns**:
+
+**Pattern 1: Direct Finish (IoT Agent)**
+```json
+{
+ "name": "finish",
+ "action": "Finish",
+ "agent": "IoTAgent",
+ "argument": "The assets at the MAIN site are: CQPA AHU 1, CQPA AHU 2B, Chiller 4, Chiller 6, Chiller 9, Chiller 3.",
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+}
+```
+
+**Pattern 2: Finish with Structured Arguments (TSFM Agent)**
+```json
+{
+ "name": "return_final_answer",
+ "action": "Finish",
+ "agent": "TSFMAgent",
+ "arguments": {
+ "answer": true,
+ "evidence_models": ["ttm_96_28", "ttm_512_96"]
+ },
+ "outputs": ["final_answer"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+}
+```
+
+**Pattern 3: Finish without Arguments (FMSA Agent)**
+```json
+{
+ "name": "finish",
+ "action": "Finish",
+ "arguments": "",
+ "outputs": ""
+}
+```
+
+**Best Practices**:
+- FMSA and TSFM agents typically use `Finish` as an anchor/terminal step
+- IoT agent may or may not use explicit `Finish` step (inconsistency noted in original data)
+- The `Finish` step's output is used as the "final answer" for evaluation
+- For deterministic queries, the finish argument should contain the exact expected answer
+- For non-deterministic queries, the finish argument should contain representative valid output
+
+### 3.2 Other Special Steps
+
+**Self-Ask**: Agent asks itself a clarifying question
+- Used when agent needs to decompose complex queries
+- Creates internal reasoning loops
+
+**Agent-Ask**: Agent requests information from another agent
+- Used in multi-agent scenarios
+- Represents inter-agent communication
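+
+This document does not define an argument schema for `Agent-Ask`, so the following is only one plausible shape for such a step. The `target_agent` and `question` keys, the step name, and the output variable are hypothetical and should be adapted to whatever convention your dataset uses.
+
+```json
+{
+  "name": "ask_iot_for_sensor_data",
+  "action": "Agent-Ask",
+  "agent": "TSFMAgent",
+  "arguments": {
+    "target_agent": "IoTAgent",
+    "question": "Retrieve sensor data for the equipment"
+  },
+  "outputs": ["sensor_data"],
+  "deterministic": {
+    "name": false,
+    "action": true,
+    "arguments": false,
+    "outputs": false
+  }
+}
+```
+
+In this sketch the `arguments` and `outputs` flags are `false` because the wording of the request and the content of the reply can both vary between runs.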
+
+---
+
+## 4. Best Practices
+
+### 4.1 Consistency and Standards
+
+1. **Agent Naming**
+ - Use predefined agent names consistently
+ - Common agents: `IoTAgent`, `FMSRAgent`, `TSFMAgent`, `WOAgent`
+ - In MCP context: Use standard MCP server names
+
+2. **Action Naming**
+ - Use consistent tool/function names across scenarios
+ - Document available tools and their signatures
+ - Examples: `sensors`, `history`, `get_work_orders`, `Equipment`, `DateRange`
+
+3. **Variable Naming**
+ - Use descriptive variable names for outputs
+ - Examples: `equipment`, `date_range`, `events`, `sensor_data`, `forecast_results`
+ - Avoid generic names like `result`, `output`, `data` unless context is clear
+
+4. **Step Naming**
+ - Use descriptive step names that indicate purpose
+ - Good: `create_equipment`, `filter_anomalies`, `calculate_statistics`
+ - Avoid: `step1`, `step2`, `step3` (unless no better name exists)
+
+### 4.2 Determinism Marking
+
+**When to mark as deterministic**:
+
+- **Action deterministic**: Tool always produces same output for same input
+ - Data retrieval: `true`
+ - Mathematical calculations: `true`
+ - LLM calls: `false`
+ - Model inference: `false`
+
+- **Arguments deterministic**: Input parameters are fixed
+ - Hardcoded values: `true`
+ - References to deterministic outputs: `true`
+ - User input or dynamic values: `false`
+
+- **Outputs deterministic**: Output is predictable
+ - Fixed data retrieval: `true`
+ - Calculations: `true`
+ - LLM-generated content: `false`
+ - Model predictions: `false`
+
+**Example**:
+```json
+{
+  "name": "load_work_orders",
+  "action": "pickle.load",
+  "deterministic": {
+    "name": false,
+    "action": false,
+    "arguments": false,
+    "outputs": false
+  }
+}
+```
+
+Rationale for each flag:
+- `name`: the step name can vary
+- `action`: `pickle.load` is not a standard action
+- `arguments`: the file path may vary
+- `outputs`: the content depends on the file
+
+### 4.3 Multi-Agent Scenarios
+
+For complex queries requiring multiple agents:
+
+1. **Set type to "multiagent"**
+ ```json
+ "type": "multiagent"
+ ```
+
+2. **Include all agents in planning_steps**
+ - Show agent handoffs explicitly
+ - Document inter-agent communication
+
+3. **Use Agent-Ask for coordination**
+ - When one agent needs information from another
+ - Document the request-response pattern
+
+4. **Consider execution order**
+ - Some agents may need to complete before others start
+ - Use execution_links to enforce ordering
+
+**Example**:
+```json
+"planning_steps": [
+ {
+ "agent": "IoTAgent",
+ "instruction": "Retrieve sensor data for the equipment"
+ },
+ {
+ "agent": "TSFMAgent",
+ "instruction": "Analyze sensor data for anomalies"
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Recommend work orders based on detected anomalies"
+ }
+]
+```
+
+### 4.4 Handling Non-Deterministic Scenarios
+
+For utterances with multiple valid answers:
+
+1. **Set deterministic to false**
+ ```json
+ "deterministic": false
+ ```
+
+2. **Provide representative output**
+ - Show one valid example in `final_out`
+ - Note that other valid outputs exist
+
+3. **Define acceptance criteria**
+ - In `characteristic_form`, describe what makes an answer valid
+ - List required elements
+ - Specify constraints
+
+4. **Document alternatives**
+ - Use `possible_alternatives` to show other valid approaches
+ - Helps evaluators understand acceptable variations
+
+**Example**:
+```json
+"characteristic_form": "The answer should contain one or more valid work order recommendations with rationale. The recommendations should be relevant to addressing anomalies in equipment CWC04009.",
+"deterministic": false,
+"final_out": {
+ "recommendations": ["M005", "M006", "OP004"]
+},
+"final_out_description": [
+ "Recommended work orders based on anomaly analysis",
+ "Other valid recommendations may exist depending on prioritization criteria"
+]
+```
+
+### 4.5 Cross-Validation Checks
+
+Before finalizing ground truth, validate:
+
+1. **Planning Steps Validation**
+ - All agents are from predefined list
+ - Instructions are clear and semantic
+ - Steps form logical sequence
+
+2. **Execution Steps Validation**
+ - All step names are unique
+ - All actions are valid tool/function names
+ - All argument references point to existing outputs
+ - Deterministic flags are correctly set
+
+3. **Execution Links Validation**
+ - All source and target names exist in execution_steps
+ - No cycles in the DAG
+ - All steps (except first) have incoming links
+ - All steps (except last) have outgoing links
+
+4. **Output Validation**
+ - `final_out` matches `characteristic_form` description
+ - `final_out_description` is clear and complete
+ - For deterministic queries, exact expected output is specified
+
+5. **Consistency Validation**
+ - `type` matches primary agent in planning/execution
+ - `category` aligns with utterance characteristics
+ - `deterministic` flag matches output predictability
+
+### 4.6 Quality Assurance Checklist
+
+Before submitting ground truth, verify:
+
+#### Structural Completeness
+- [ ] All required fields present (id, text, type, category, deterministic, characteristic_form)
+- [ ] planning_steps is a list of dicts with "agent" and "instruction" keys
+- [ ] execution_steps is a list with all required fields
+- [ ] execution_links properly connects all steps
+- [ ] final_out and final_out_description are present (if applicable)
+
+#### Semantic Correctness
+- [ ] Utterance text is clear and unambiguous
+- [ ] Category correctly classifies the utterance type
+- [ ] Deterministic flag accurately reflects answer uniqueness
+- [ ] characteristic_form clearly describes expected output
+
+#### Planning Quality
+- [ ] Planning steps use predefined agent names
+- [ ] Instructions are semantic and intent-focused
+- [ ] Steps form logical problem-solving sequence
+- [ ] All necessary agents are included
+
+#### Execution Quality
+- [ ] All step names are unique
+- [ ] All actions are valid tool/function names
+- [ ] All argument references point to existing outputs
+- [ ] Deterministic flags correctly set at all levels
+- [ ] Output variables are properly named and used
+
+#### DAG Validity
+- [ ] No cycles in execution_links
+- [ ] All steps (except first) have incoming links
+- [ ] All steps (except last) have outgoing links
+- [ ] No orphaned steps
+- [ ] Dependencies correctly represent data flow
+
+#### Output Quality
+- [ ] final_out matches characteristic_form description
+- [ ] For deterministic queries, exact expected output specified
+- [ ] For non-deterministic queries, representative example provided
+- [ ] final_out_description is clear and complete
+
+#### Consistency
+- [ ] Type matches primary agent in execution
+- [ ] Category aligns with utterance characteristics
+- [ ] Planning and execution steps are aligned
+- [ ] Terminology is consistent across fields
+
+### 4.7 Common Pitfalls and How to Avoid Them
+
+#### Inconsistent Finish Steps
+
+**Problem**: Different agents use different Finish patterns
+- IoT agent sometimes omits Finish
+- FMSA/TSFM always include Finish
+
+**Solution**: Standardize Finish usage
+- Always include explicit Finish step
+- Use consistent argument structure
+- Document agent-specific conventions
+
+#### Ambiguous Variable References
+
+**Problem**: Unclear how outputs are referenced in arguments
+
+**Solution**: Use consistent reference syntax
+- Document whether to use `@variable` or just `variable`
+- Be explicit about variable scope
+- Validate all references point to existing outputs
+
+#### Incomplete Determinism Marking
+
+**Problem**: Deterministic flags not set at all levels
+
+**Solution**: Always include complete deterministic dict
+```json
+"deterministic": {
+ "name": false,
+ "action": true/false,
+ "arguments": true/false,
+ "outputs": true/false
+}
+```
+
+#### Circular Dependencies
+
+**Problem**: Execution links create cycles
+
+**Solution**: Validate DAG structure
+- Use topological sort to detect cycles
+- Ensure data flows in one direction
+- Break cycles by reordering steps
+
+#### Missing Intermediate Steps
+
+**Problem**: Jumping from data retrieval to final output without processing
+
+**Solution**: Include all necessary transformations
+- Add load/parse steps for file-based data
+- Include filter/transform steps for data processing
+- Show LLM calls explicitly when used
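+
+As a sketch, an intermediate load step could be placed between retrieval and the final answer as shown here. Only the intermediate step is written out; `get_work_orders` and `finish` are assumed to be defined elsewhere in the same scenario, and the `agent`, `arguments`, and `outputs` values are illustrative assumptions.
+
+```json
+{
+  "execution_steps": [
+    {
+      "name": "load_work_orders",
+      "action": "pickle.load",
+      "agent": "WOAgent",
+      "arguments": {"file": "work_orders_file"},
+      "outputs": ["work_orders"],
+      "deterministic": {
+        "name": false,
+        "action": false,
+        "arguments": false,
+        "outputs": false
+      }
+    }
+  ],
+  "execution_links": [
+    {"source": "get_work_orders", "target": "load_work_orders"},
+    {"source": "load_work_orders", "target": "finish"}
+  ]
+}
+```
+
+The links route the retrieved file through the load step before `finish`, so the final answer is built from parsed data rather than from a raw file reference.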
+
+#### Vague Characteristic Forms
+
+**Problem**: Expected output not clearly described
+
+**Solution**: Be specific and detailed
+- For deterministic: Specify exact expected values or validation rules
+- For non-deterministic: List required elements and constraints
+- Include format specifications (JSON, CSV, text, etc.)
+
+### 4.8 Advanced Scenarios
+
+For complex ground truth scenarios:
+
+#### Conditional Execution
+
+For scenarios with conditional logic:
+
+```json
+{
+ "name": "conditional_step",
+ "action": "if_then_else",
+ "agent": "Agent",
+ "arguments": {
+ "condition": "anomaly_detected == true",
+ "then_action": "recommend_maintenance",
+ "else_action": "continue_monitoring"
+ },
+ "outputs": ["action_taken"],
+ "deterministic": {
+ "name": false,
+ "action": false,
+ "arguments": false,
+ "outputs": false
+ }
+}
+```
+
+
+#### Parallel Execution Paths
+
+For scenarios with parallel processing:
+
+```json
+"execution_links": [
+ {"source": "get_data", "target": "analyze_path1"},
+ {"source": "get_data", "target": "analyze_path2"},
+ {"source": "analyze_path1", "target": "merge_results"},
+ {"source": "analyze_path2", "target": "merge_results"}
+]
+```
+
+
+---
+
+## 5. Common Patterns and Templates
+
+This section presents real-world patterns from AssetOpsBench ground truth scenarios, demonstrating practical implementations used in industrial asset management.
+
+### 5.1 Simple Knowledge Query Pattern (ID 103)
+
+**Use Case**: Single-agent query for domain knowledge
+
+**Scenario**: "List all failure modes of asset Wind Turbine."
+
+**Pattern Characteristics**:
+- Single agent (FMSRAgent)
+- Direct knowledge retrieval
+- Non-deterministic outputs (knowledge-based)
+- Simple linear execution
+
+```json
+{
+ "planning_steps": [
+ {
+ "agent": "FMSRAgent",
+ "instruction": "List all failure modes of asset Wind Turbine."
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "get_failure_modes",
+ "action": "Get Failure Modes",
+ "agent": "FMSRAgent",
+ "arguments": "Wind Turbine",
+ "outputs": "The failure modes of asset Wind Turbine: ['Blade Failure', 'Gearbox Failure', 'Generator Failure', 'Bearing Failure', 'Electrical System Failure', 'Control System Failure', 'Foundation Failure', 'Tower Failure', 'Hub Failure', 'Pitch System Failure ', 'Yaw System Failure ', 'Hydraulic System Failure']",
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": false
+ }
+ },
+ {
+ "name": "finish",
+ "action": "Finish",
+ "arguments": "",
+ "outputs": ""
+ }
+ ],
+ "execution_links": [
+ {"source": "get_failure_modes", "target": "finish"}
+ ]
+}
+```
+
+### 5.2 Multi-Step Data Processing Pattern (ID 400)
+
+**Use Case**: Create objects, retrieve data, process, and generate output
+
+**Scenario**: "Get the work order of equipment CWC04013 for year 2017."
+
+**Pattern Characteristics**:
+- Single agent with multiple steps
+- Object creation → data retrieval → processing → output generation
+- Mix of deterministic and non-deterministic steps
+- Parallel execution branches (print and generate_json)
+
+```json
+{
+ "planning_steps": [
+ {
+ "agent": "WOAgent",
+ "instruction": "Create an Equipment instance using the provided equipment_id CWC04013."
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Create a DateRange object spanning 2017-01-01 to 2017-12-31."
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Retrieve the work orders file for the specified equipment and date range."
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Load the work orders data from the retrieved file using pickle."
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Generate a structured JSON file from the work orders using an LLM."
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "create_equipment",
+ "action": "Equipment",
+ "agent": "WOAgent",
+ "arguments": {"equipment_id": "CWC04013"},
+ "outputs": ["equipment"],
+ "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+ },
+ {
+ "name": "create_date_range",
+ "action": "DateRange",
+ "agent": "WOAgent",
+ "arguments": {"start_date": "2017-01-01", "end_date": "2017-12-31"},
+ "outputs": ["date_range"],
+ "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+ },
+ {
+ "name": "get_work_orders_file",
+ "action": "get_work_orders",
+ "agent": "WOAgent",
+ "arguments": {"equipment": "equipment", "date_range": "date_range"},
+ "outputs": ["file"],
+ "deterministic": {"name": false, "action": true, "arguments": true, "outputs": true}
+ },
+ {
+ "name": "load_work_orders",
+ "action": "pickle.load",
+ "agent": "WOAgent",
+ "arguments": {"file": "file"},
+ "outputs": ["work_orders"],
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "print_work_orders",
+ "action": "print",
+ "agent": "WOAgent",
+ "arguments": {"obj": "work_orders"},
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "generate_json",
+ "action": "LLM.generate_json",
+ "agent": "WOAgent",
+ "arguments": {"data": "work_orders"},
+ "outputs": ["json_file"],
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ }
+ ],
+ "execution_links": [
+ {"source": "create_equipment", "target": "get_work_orders_file"},
+ {"source": "create_date_range", "target": "get_work_orders_file"},
+ {"source": "get_work_orders_file", "target": "load_work_orders"},
+ {"source": "load_work_orders", "target": "print_work_orders"},
+ {"source": "load_work_orders", "target": "generate_json"}
+ ]
+}
+```
+
+### 5.3 Multi-Agent Coordination Pattern (ID 601)
+
+**Use Case**: Simple multi-agent coordination for knowledge retrieval
+
+**Scenario**: "List all failure modes of asset Chiller 6 at MAIN site."
+
+**Pattern Characteristics**:
+- Two agents (IoTAgent for verification, FMSRAgent for knowledge)
+- Agent handoff pattern
+- Validation and finalization steps
+- Structured final output
+
+```json
+{
+ "planning_steps": [
+ {
+ "agent": "IoTAgent",
+ "instruction": "Get basic description/identifiers for Chiller 6 at MAIN site (asset verification)."
+ },
+ {
+ "agent": "FMSRAgent",
+ "instruction": "List all failure modes for the verified asset (Chiller 6 at MAIN)."
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "fmsr_list",
+ "agent": "FMSRAgent",
+ "action": "get_failure_modes",
+ "arguments": {"asset_name": "Chiller 6", "site_name": "MAIN"},
+ "outputs": {
+ "status": "success",
+ "failure_modes": [
+ "Compressor Overheating: Failed due to Normal wear, overheating",
+ "Heat Exchangers: Fans: Degraded motor or worn bearing due to Normal use",
+ "Evaporator Water side fouling",
+ "Condenser Water side fouling",
+ "Condenser Improper water side flow rate",
+ "Purge Unit Excessive purge",
+ "Refrigerant Operated Control Valve Failed spring"
+ ]
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "finalize",
+ "agent": "FMSRAgent",
+ "action": "Finish",
+ "arguments": {"validation_checks": ["len(failure_modes)==7"]},
+ "outputs": {"final_state": "completed"},
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ }
+ ],
+ "execution_links": [
+ {"source": "fmsr_list", "target": "finalize"}
+ ],
+ "final_output": {
+ "asset": "Chiller 6",
+ "site": "MAIN",
+ "parameter": "Failure Modes",
+ "failure_modes": [
+ "Compressor Overheating: Failed due to Normal wear, overheating",
+ "Heat Exchangers: Fans: Degraded motor or worn bearing due to Normal use",
+ "Evaporator Water side fouling",
+ "Condenser Water side fouling",
+ "Condenser Improper water side flow rate",
+ "Purge Unit Excessive purge",
+ "Refrigerant Operated Control Valve Failed spring"
+ ]
+ }
+}
+```
+
+### 5.4 Complex Multi-Agent with Error Recovery Pattern (ID 502)
+
+**Use Case**: Multi-agent workflow with retry logic and error handling
+
+**Scenario**: "What is the forecast for 'Chiller 9 Condenser Water Flow' in the week of 2020-04-27 based on data from the MAIN site?"
+
+**Pattern Characteristics**:
+- Two agents (IoTAgent and TSFMAgent) working in coordination
+- Error detection and recovery (failed attempt → successful retry)
+- Model selection and forecasting
+- Comprehensive validation and artifact management
+
+**Key Pattern Elements**:
+1. **Sensor Location**: IoTAgent locates the specific sensor
+2. **Data Retrieval with Retry**: Initial failure due to incorrect asset identifier, followed by successful retry
+3. **Model Selection**: TSFMAgent enumerates available models and selects appropriate checkpoint
+4. **Forecasting**: Run time-series forecasting with validated arguments
+5. **Validation**: Confirm artifacts and provide comprehensive summary
+
+```json
+{
+ "planning_steps": [
+ {
+ "agent": "IoTAgent",
+ "instruction": "Locate 'Condenser Water Flow' sensor for 'Chiller 9' at 'MAIN'."
+ },
+ {
+ "agent": "IoTAgent",
+ "instruction": "Retrieve windowed history (2020-04-27 → 2020-05-03) and save CSV."
+ },
+ {
+ "agent": "TSFMAgent",
+ "instruction": "List available pretrained checkpoints and select 'ttm_96_28'."
+ },
+ {
+ "agent": "TSFMAgent",
+ "instruction": "Run forecast on the CSV and save JSON to declared path."
+ },
+ {
+ "agent": "TSFMAgent",
+ "instruction": "Validate artifacts and finalize summary."
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "sensors",
+ "agent": "IoTAgent",
+ "action": "LocateSensor",
+ "arguments": {
+ "site_name": "MAIN",
+ "assetnum": "Chiller 9",
+ "sensor_name_filter": "Condenser Water Flow"
+ },
+ "outputs": {
+ "status": "success",
+ "site_name": "MAIN",
+ "assetnum": "Chiller 9",
+ "sensor_id": "chiller9_condenser_water_flow",
+ "sensor_name": "Condenser Water Flow",
+ "file_path": "/tmp/cbmdir/chiller9_MAIN_sensors.json",
+ "message": "Located target sensor."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "history",
+ "agent": "IoTAgent",
+ "action": "DownloadSeries_Attempt1",
+ "arguments": {
+ "site_name": "MAIN",
+ "assetnum": "9",
+ "metric": "Condenser Water Flow",
+ "start": "2020-04-27T00:00:00Z",
+ "final": "2020-05-03T23:59:59Z",
+ "save_series_csv": true,
+ "save_path": "data/raw/iot/main/chiller_9_condenser_water_flow_2020-04-27_to_2020-05-03.csv"
+ },
+ "outputs": {
+ "status": "failure",
+ "error": "assetnum '9' not found (expected 'Chiller 9').",
+ "csv_path": null,
+ "message": "History retrieval failed due to asset identifier mismatch."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "history",
+ "agent": "IoTAgent",
+ "action": "DownloadSeries_ReattemptWithFixedAssetnum",
+ "arguments": {
+ "site_name": "MAIN",
+ "assetnum": "Chiller 9",
+ "metric": "Condenser Water Flow",
+ "start": "2020-04-27T00:00:00Z",
+ "final": "2020-05-03T23:59:59Z",
+ "save_series_csv": true,
+ "save_path": "data/raw/iot/main/chiller_9_condenser_water_flow_2020-04-27_to_2020-05-03.csv"
+ },
+ "outputs": {
+ "status": "success",
+ "rows": 581,
+ "sampling_hint": "irregular/partial (expected ~672 @ 15min)",
+ "csv_path": "data/raw/iot/main/chiller_9_condenser_water_flow_2020-04-27_to_2020-05-03.csv",
+ "message": "Series retrieved and saved to CSV."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "jsonreader",
+ "agent": "IoTAgent",
+ "action": "DirectFileRead_Attempt",
+ "arguments": {
+ "file_name": "chiller9_condenser_water_flow_MAIN_2020-04-27.json"
+ },
+ "outputs": {
+ "status": "failure",
+ "error": "[Errno 2] No such file or directory: 'chiller9_condenser_water_flow_MAIN_2020-04-27.json'",
+ "message": "Direct JSON read failed; proceed with CSV path from history step."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "tsfm_list_models",
+ "agent": "TSFMAgent",
+ "action": "EnumeratePretrainedCheckpoints",
+ "arguments": {},
+ "outputs": {
+ "status": "success",
+ "models": [
+ {"model_id": "ttm_96_28", "model_checkpoint": "ttm_96_28", "model_description": "Pretrained forecasting model with context length 96"},
+ {"model_id": "ttm_512_96", "model_checkpoint": "ttm_512_96", "model_description": "Pretrained forecasting model with context length 512"},
+ {"model_id": "ttm_energy_96_28", "model_checkpoint": "ttm_96_28", "model_description": "Energy-tuned model, context length 96"},
+ {"model_id": "ttm_energy_512_96", "model_checkpoint": "ttm_512_96", "model_description": "Energy-tuned model, context length 512"}
+ ],
+ "selected": "ttm_96_28",
+ "message": "Selected checkpoint 'ttm_96_28' for forecasting."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "tsfm_forecast",
+ "agent": "TSFMAgent",
+ "action": "RunForecast_ttm_96_28_WithCorrectArgs",
+ "arguments": {
+ "input_csv": "data/raw/iot/main/chiller_9_condenser_water_flow_2020-04-27_to_2020-05-03.csv",
+ "timestamp_col": "timestamp",
+ "value_col": "value",
+ "window_start": "2020-04-27T00:00:00Z",
+ "window_end": "2020-05-03T23:59:59Z",
+ "checkpoint": "ttm_96_28",
+ "save_path": "data/derived/forecast/chiller_9/condenser_water_flow/forecast_2020-04-27_to_2020-05-03_ttm_96_28.json"
+ },
+ "outputs": {
+ "status": "success",
+ "model_checkpoint": "ttm_96_28",
+ "input_read": true,
+ "input_rows": 581,
+ "forecast_span": {"start": "2020-04-27T00:00:00Z", "end": "2020-05-03T23:59:59Z"},
+ "forecast_file": "data/derived/forecast/chiller_9/condenser_water_flow/forecast_2020-04-27_to_2020-05-03_ttm_96_28.json",
+ "message": "Forecast generated and saved (arguments validated; prior unpacking/validation errors resolved)."
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ },
+ {
+ "name": "finalize",
+ "agent": "TSFMAgent",
+ "action": "Finish",
+ "arguments": {
+ "validation_checks": [
+ "exists:data/raw/iot/main/chiller_9_condenser_water_flow_2020-04-27_to_2020-05-03.csv",
+ "exists:data/derived/forecast/chiller_9/condenser_water_flow/forecast_2020-04-27_to_2020-05-03_ttm_96_28.json"
+ ],
+ "summary_request": "Confirm variable binding, input read, forecast span, row counts, and artifact paths. Note prior failure remediation (assetnum correction; correct TSFM input schema)."
+ },
+ "outputs": {
+ "final_state": "completed",
+ "summary": "Forecast for Chiller 9 'Condenser Water Flow' (2020-04-27 → 2020-05-03) generated with checkpoint ttm_96_28. Input CSV (581 rows) read; forecast saved to declared path. Earlier issues (assetnum mismatch, missing JSON file, TSFM arg validation) were identified and resolved.",
+ "forecast_file": "data/derived/forecast/chiller_9/condenser_water_flow/forecast_2020-04-27_to_2020-05-03_ttm_96_28.json"
+ },
+ "deterministic": {"name": false, "action": false, "arguments": false, "outputs": false}
+ }
+ ],
+ "execution_links": [
+ {"source": "sensors", "target": "history"},
+ {"source": "history", "target": "history"},
+ {"source": "history", "target": "jsonreader"},
+ {"source": "jsonreader", "target": "tsfm_list_models"},
+ {"source": "tsfm_list_models", "target": "tsfm_forecast"},
+ {"source": "tsfm_forecast", "target": "finalize"}
+ ]
+}
+```
+
+**Key Lessons from ID 502**:
+- **Error Recovery**: Document both failed and successful attempts
+- **Self-Correction**: Show how agents detect and fix errors (assetnum correction)
+- **Validation**: Include comprehensive validation checks in finalization
+- **Artifact Management**: Track all intermediate and final file paths
+- **Non-Deterministic Throughout**: Complex workflows are typically non-deterministic at every step
+
+---
+
+## 6. Evaluation Using Ground Truth
+
+### 6.1 Evaluation Metrics
+
+Ground truth enables multiple evaluation dimensions:
+
+1. **Planning Accuracy**
+ - Compare agent's planning steps to ground truth planning_steps
+ - Use semantic similarity (not exact match)
+ - Measure: Planning step coverage, agent selection accuracy
+
+2. **Execution Accuracy**
+ - Compare agent's execution DAG to ground truth execution DAG
+ - Check: Tool selection, argument correctness, execution order
+ - Measure: DAG similarity, tool usage accuracy
+
+3. **Output Correctness**
+ - Compare agent's final output to ground truth final_out
+ - For deterministic: Exact match or validation rules
+ - For non-deterministic: Semantic similarity, constraint satisfaction
+ - Measure: Output accuracy, format compliance
+
+4. **Efficiency**
+ - Compare number of steps taken vs. ground truth
+ - Identify unnecessary steps or missing optimizations
+ - Measure: Step count ratio, execution time
+
+### 6.2 Scoring Functions
+
+**DAG Comparison**:
+- Node matching: Compare execution steps
+- Edge matching: Compare execution links
+- Path similarity: Compare execution sequences
+- Tool usage: Verify correct tools are called
+
+**Output Comparison**:
+- Exact match (deterministic)
+- Semantic similarity (non-deterministic)
+- Constraint validation
+- Format compliance
+
+**Planning Comparison**:
+- Agent selection accuracy
+- Step sequence similarity
+- Semantic intent matching
+
+---
+
+## 7. Paradigm-Specific Best Practices
+
+### 7.1 Multi-Agent Ground Truth Best Practices
+
+When creating ground truth for multi-agent systems:
+
+#### Agent Selection
+- Use predefined agent names consistently
+- Document agent capabilities and responsibilities
+- Common agents: `IoTAgent`, `FMSRAgent`, `TSFMAgent`, `WOAgent`
+- Use `multiagent` type for scenarios requiring multiple agents
+
+#### Agent Coordination
+- Show explicit agent handoffs in planning_steps
+- Document inter-agent communication patterns
+- Use `Agent-Ask` for agent-to-agent queries
+- Ensure proper sequencing of agent activities
+
+#### Domain-Specific Actions
+- Use agent-specific function names (e.g., `get_work_orders`, `get_sensor_data`)
+- Document available actions for each agent type
+- Include domain objects (e.g., `Equipment`, `DateRange`, `Sensor`)
+
+**Example Multi-Agent Pattern**:
+```json
+{
+ "type": "multiagent",
+ "planning_steps": [
+ {
+ "agent": "IoTAgent",
+ "instruction": "Retrieve sensor data for equipment"
+ },
+ {
+ "agent": "TSFMAgent",
+ "instruction": "Analyze sensor data for anomalies"
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Recommend work orders based on anomalies"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "get_sensor_data",
+ "action": "sensors",
+ "agent": "IoTAgent",
+ "arguments": {"site_name": "MAIN", "assetnum": "Chiller 6"}
+ },
+ {
+ "name": "detect_anomalies",
+ "action": "detect_anomalies",
+ "agent": "TSFMAgent",
+ "arguments": {"data": "sensor_data"}
+ },
+ {
+ "name": "recommend_wo",
+ "action": "recommend_work_order",
+ "agent": "WOAgent",
+ "arguments": {"anomalies": "detected_anomalies"}
+ }
+ ]
+}
+```
+
+### 7.2 Multi-MCP Ground Truth Best Practices
+
+When creating ground truth for MCP (Model Context Protocol) systems:
+
+#### MCP Server Types
+
+Use consistent names for common MCP server types, for example:
+- `filesystem`: File operations (read, write, list, search)
+- `database`: Database queries (SQL, NoSQL)
+- `web-search`: Web search and scraping
+- `api-client`: REST/GraphQL API calls
+- `git`: Version control operations
+- `slack`: Team communication
+- `email`: Email operations
+- `calendar`: Calendar management
+- `computation`: Mathematical operations
+- `memory`: Persistent storage for LLM context
+
+#### MCP Tool Naming
+
+Qualify each tool name with its server, using the `server/tool` format. Each line below is an alternative value for the `action` field:
+```json
+{"action": "filesystem/read_file"}
+{"action": "database/query"}
+{"action": "web-search/search"}
+{"action": "api-client/post"}
+```
+
+Or simplified format:
+```json
+{
+ "action": "read_file", // Tool name only
+ "agent": "filesystem" // Server specified separately
+}
+```
+
+#### MCP Resource References
+
+Handle MCP resources with proper URIs:
+```json
+{
+ "arguments": {
+ "resource_uri": "file:///path/to/data.csv",
+ "resource_type": "text/csv"
+ }
+}
+```
+
+```json
+{
+ "arguments": {
+ "resource_uri": "db://localhost/sensors",
+ "query": "SELECT * FROM readings WHERE date > '2024-01-01'"
+ }
+}
+```
+
+#### MCP Tool Schemas
+
+Reference MCP tool schemas in execution steps:
+```json
+{
+ "name": "read_config",
+ "action": "read_file",
+ "agent": "filesystem",
+ "arguments": {
+ "path": "/config/app.json"
+ },
+ "outputs": ["file_content"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+}
+```
+
+#### MCP Server Coordination
+
+For multi-MCP scenarios:
+```json
+{
+ "type": "multiagent", // Or "multi-mcp"
+ "planning_steps": [
+ {
+ "agent": "filesystem",
+ "instruction": "Read data file"
+ },
+ {
+ "agent": "database",
+ "instruction": "Query reference data"
+ },
+ {
+ "agent": "api-client",
+ "instruction": "Send results to external API"
+ }
+ ]
+}
+```
+
+**Example Multi-MCP Pattern**:
+```json
+{
+ "id": 600,
+ "text": "Read customer data from CSV, enrich with database info, and send to CRM API.",
+ "type": "multiagent",
+ "category": "Complex Query",
+ "deterministic": true,
+ "planning_steps": [
+ {
+ "agent": "filesystem",
+ "instruction": "Read customer CSV file"
+ },
+ {
+ "agent": "database",
+ "instruction": "Query customer details from database"
+ },
+ {
+ "agent": "api-client",
+ "instruction": "POST enriched data to CRM API"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "read_csv",
+ "action": "read_file",
+ "agent": "filesystem",
+ "arguments": {
+ "path": "/data/customers.csv"
+ },
+ "outputs": ["csv_content"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+ },
+ {
+ "name": "parse_csv",
+ "action": "parse",
+ "agent": "filesystem",
+ "arguments": {
+ "content": "csv_content",
+ "format": "csv"
+ },
+ "outputs": ["customer_ids"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+ },
+ {
+ "name": "query_details",
+ "action": "query",
+ "agent": "database",
+ "arguments": {
+ "sql": "SELECT * FROM customers WHERE id IN (@customer_ids)",
+ "params": {"customer_ids": "customer_ids"}
+ },
+ "outputs": ["customer_details"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": true
+ }
+ },
+ {
+ "name": "send_to_crm",
+ "action": "post",
+ "agent": "api-client",
+ "arguments": {
+ "url": "https://crm.example.com/api/customers",
+ "body": "customer_details",
+ "headers": {"Content-Type": "application/json"}
+ },
+ "outputs": ["api_response"],
+ "deterministic": {
+ "name": false,
+ "action": true,
+ "arguments": true,
+ "outputs": false
+ }
+ }
+ ],
+ "execution_links": [
+ {"source": "read_csv", "target": "parse_csv"},
+ {"source": "parse_csv", "target": "query_details"},
+ {"source": "query_details", "target": "send_to_crm"}
+ ]
+}
+```
+
+### 7.3 Hybrid Approach Best Practices
+
+When combining multi-agent and multi-MCP:
+
+#### Clear Separation of Concerns
+- Agents handle reasoning and decision-making
+- MCP servers provide tools and data access
+- Document which layer handles what
+
+#### Consistent Naming
+- Use `agent` field for both agents and MCP servers
+- Use `type` to indicate primary system type
+- Add `paradigm` field if needed for clarity:
+ ```json
+ {
+ "type": "multiagent",
+ "paradigm": "hybrid",
+ "note": "IoTAgent uses filesystem MCP server"
+ }
+ ```
+
+#### Layered Execution
+```json
+{
+ "execution_steps": [
+ {
+ "name": "agent_decides",
+ "action": "analyze_requirements",
+ "agent": "PlannerAgent",
+ "outputs": ["required_files"]
+ },
+ {
+ "name": "mcp_retrieves",
+ "action": "read_file",
+ "agent": "filesystem",
+ "arguments": {"paths": "required_files"},
+ "outputs": ["file_contents"]
+ },
+ {
+ "name": "agent_processes",
+ "action": "process_data",
+ "agent": "ProcessorAgent",
+ "arguments": {"data": "file_contents"},
+ "outputs": ["results"]
+ }
+ ]
+}
+```
+
+### 7.4 Choosing Between Paradigms
+
+**Use Multi-Agent When**:
+- System has specialized agents with domain expertise
+- Focus on agent reasoning and coordination
+- Agents make complex decisions
+- Example: AssetOpsBench, healthcare diagnosis, financial analysis
+
+**Use Multi-MCP When**:
+- System uses standardized tool interfaces
+- Focus on tool selection and usage
+- LLM orchestrates tool calls
+- Example: File operations, database queries, API integrations
+
+**Use Hybrid When**:
+- Agents need external tools
+- Combining reasoning with data access
+- You want agent-level reasoning combined with standardized tool access
+- Example: AI agent using filesystem and database MCP servers
+
+---
+
+## 8. Using Ground Truth for Agent Evaluation
+
+### 8.1 Three-Level Evaluation Framework
+
+Ground truth enables comprehensive evaluation at three levels:
+
+| Level | Focus | Metrics | Weight |
+|-------|-------|---------|--------|
+| **Outcome** | Final result correctness | Task completeness (y₁), Data retrieval (y₂), Result verification (y₃) | 50% |
+| **Process** | Execution trajectory | Action sequence, DAG structure, Step completeness | 30% |
+| **Planning** | Strategic decomposition | ROUGE scores (planning alignment) | 20% |
+
+### 8.2 Evaluation Workflow
+
+**Step 1: Collect Agent Trace** → **Step 2: Load Ground Truth** → **Step 3: Multi-Level Comparison**
+
+### 8.3 Rubric-Based Scoring (Outcome Level)
+
+**Three Core Metrics**:
+
+1. **Task Completeness (y₁)**: Compare agent output against `characteristic_form` using LLM-as-Judge (0-1 score)
+2. **Data Retrieval Accuracy (y₂)**: Verify correct parameters (equipment IDs, date ranges, sensors)
+3. **Result Verification (y₃)**:
+ - Deterministic: Exact match with `final_out`
+ - Non-deterministic: Semantic similarity
+
+**Combined Score**: `Rubric_Score = (y₁ + y₂ + y₃) / 3`
+
+**Example** (Scenario 421):
+- y₁ = 1.0 (task completed), y₂ = 1.0 (correct params), y₃ = 0.6 (3/5 codes)
+- Rubric Score = 0.87
+
+### 8.4 Execution Alignment (Process Level)
+
+**Comparison Methods**:
+- **Action Sequence**: Use sequence alignment (e.g., Levenshtein distance) to compare action lists
+- **DAG Structure**: Calculate node/edge overlap between ground truth and agent execution graphs
+- **Step Completeness**: Identify missing or extra steps
+
+**Example**:
+- GT: 8 steps, Agent: 6 steps (missing `load_events`, `dedupe`)
+- Alignment Score: 6/8 = 0.75
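+
+The step-completeness and DAG-overlap comparisons can be sketched as below. The function names `step_completeness` and `edge_overlap` are hypothetical, and matching steps by `name` is an assumption made for illustration. Because names are often flagged as non-deterministic, a real evaluator may need to match on `action` or on a semantic mapping instead, which is why the key is a parameter.
+
+```python
+def step_completeness(gt_steps, agent_steps, key="name"):
+    """Fraction of ground-truth steps the agent also produced, plus the differences."""
+    gt_keys = {step[key] for step in gt_steps}
+    agent_keys = {step[key] for step in agent_steps}
+    missing = sorted(gt_keys - agent_keys)
+    extra = sorted(agent_keys - gt_keys)
+    score = (len(gt_keys) - len(missing)) / len(gt_keys) if gt_keys else 1.0
+    return score, missing, extra
+
+
+def edge_overlap(gt_links, agent_links):
+    """Jaccard overlap between the two sets of (source, target) edges."""
+    gt_edges = {(link["source"], link["target"]) for link in gt_links}
+    agent_edges = {(link["source"], link["target"]) for link in agent_links}
+    union = gt_edges | agent_edges
+    return len(gt_edges & agent_edges) / len(union) if union else 1.0
+
+
+# Hypothetical traces: the agent skips the last two of eight ground-truth steps.
+gt = [{"name": f"step{i}"} for i in range(1, 9)]
+agent = gt[:6]
+score, missing, extra = step_completeness(gt, agent)
+print(score, missing, extra)
+```
+
+With eight ground-truth steps and six matched, the score is 6/8 = 0.75, the same ratio as in the example above.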
+
+### 8.5 Planning Alignment (Strategic Level)
+
+Use ROUGE metrics to compare planning strategies:
+- **ROUGE-1**: Unigram overlap
+- **ROUGE-2**: Bigram overlap
+- **ROUGE-L**: Longest common subsequence
+
+**Example**: GT planning vs Agent planning → ROUGE-L: 0.65
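+
+As a rough illustration of ROUGE-L, the sketch below computes an F1 score from the longest common subsequence of lowercase, whitespace-separated tokens. It is a simplification: published ROUGE implementations add their own tokenization and optional stemming, and this guide does not say which variant the benchmark uses. The function names are hypothetical.
+
+```python
+def lcs_length(a, b):
+    """Length of the longest common subsequence of two token lists."""
+    previous = [0] * (len(b) + 1)
+    for token_a in a:
+        current = [0]
+        for j, token_b in enumerate(b, start=1):
+            if token_a == token_b:
+                current.append(previous[j - 1] + 1)
+            else:
+                current.append(max(previous[j], current[j - 1]))
+        previous = current
+    return previous[-1]
+
+
+def rouge_l_f1(reference, candidate):
+    """ROUGE-L F1 between a ground-truth plan and an agent plan (plain strings)."""
+    ref_tokens = reference.lower().split()
+    cand_tokens = candidate.lower().split()
+    lcs = lcs_length(ref_tokens, cand_tokens)
+    if lcs == 0:
+        return 0.0
+    precision = lcs / len(cand_tokens)
+    recall = lcs / len(ref_tokens)
+    return 2 * precision * recall / (precision + recall)
+
+
+# Hypothetical planning instructions joined into one string per plan.
+gt_plan = "list assets for site MAIN"
+agent_plan = "list all assets at site MAIN"
+print(rouge_l_f1(gt_plan, agent_plan))
+```
+
+To score a whole plan, one option is to concatenate the `instruction` strings of all `planning_steps` in order before calling `rouge_l_f1`.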
+
+### 8.6 Hybrid Scoring Formula
+
+```
+Final_Score = 0.5 × Rubric_Score + 0.3 × Execution_Score + 0.2 × Planning_Score
+```
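+
+The rubric average and the weighted combination can be written directly in code. This is a minimal sketch with hypothetical function names; the default weights are the 0.5/0.3/0.2 split from the formula above.
+
+```python
+def rubric_score(y1, y2, y3):
+    """Unweighted mean of the three outcome-level metrics."""
+    return (y1 + y2 + y3) / 3
+
+
+def final_score(rubric, execution, planning, weights=(0.5, 0.3, 0.2)):
+    """Weighted sum of outcome, process, and planning scores."""
+    w_rubric, w_execution, w_planning = weights
+    return w_rubric * rubric + w_execution * execution + w_planning * planning
+
+
+# Outcome-level values from the rubric example: y1 = 1.0, y2 = 1.0, y3 = 0.6.
+print(round(rubric_score(1.0, 1.0, 0.6), 2))  # 0.87
+```
+
+Keeping the weights as a parameter makes it easy to check how sensitive a ranking is to the 50/30/20 split.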
+
+**Complete Example** (Scenario 421):
+```json
+{
+ "final_score": 0.77,
+ "rubric_score": 0.87,
+ "execution_score": 0.68,
+ "planning_score": 0.65,
+ "failure_analysis": {
+ "missing_steps": ["load_events", "dedupe"],
+ "output_issues": "Missing 2 of 5 expected codes"
+ }
+}
+```
+
+### 8.7 Batch Evaluation
+
+**Aggregate Metrics Across Benchmark**:
+- Average final score
+- Task completion rate
+- Per-scenario breakdown
+- Failure mode distribution
+
+**AssetOpsBench Example**: GPT-4.1 achieved 0.72 avg score across 141 scenarios (70% completion rate)
+
+### 8.8 Automated Failure Detection
+
+Ground truth enables automatic identification of:
+- **Tool Selection Errors**: Wrong action at specific step
+- **Parameter Errors**: Incorrect arguments passed
+- **Missing Steps**: Skipped required operations
+- **Incorrect Sequencing**: Steps executed out of order
+- **Premature Termination**: Task incomplete
+
+### 8.9 Evaluation Best Practices
+
+| Scenario Type | Approach |
+|---------------|----------|
+| **Deterministic** | Exact matching, strict validation, focus on correctness |
+| **Non-Deterministic** | Semantic similarity, allow trajectory variation, focus on outcome quality |
+| **Multi-Agent** | Track coordination, evaluate inter-agent communication, verify specialization |
+| **Multi-MCP** | Verify server selection, check tool parameters, validate resource URIs |
+
+---
+
+## 9. Illustrated Examples from AssetOpsBench
+
+### Example 1: Simple Query (Scenario 3)
+**Utterance**: "What assets can be found at the MAIN site?"
+
+```json
+{
+ "id": 3,
+ "type": "IoT",
+ "text": "What assets can be found at the MAIN site?",
+ "category": "Knowledge Query",
+ "deterministic": true,
+ "characteristic_form": "The expected response should be the return value from querying the assets at the MAIN site. The response should be a reference to a file containing the list of assets",
+ "group": "retrospective",
+ "entity": "Site",
+ "note": "Source: IoT data operations; Deterministic query with single correct answer; Category: Knowledge Query",
+ "planning_steps": [
+ {"agent": "IoTAgent", "instruction": "list assets for site MAIN"}
+ ],
+ "execution_steps": [
+ {
+ "name": "step1",
+ "action": "assets",
+ "agent": "IoTAgent",
+ "arguments": {"site_name": "MAIN"},
+ "outputs": []
+ },
+ {
+ "name": "finish",
+ "action": "Finish",
+ "agent": "IoTAgent",
+ "argument": "The assets at the MAIN site are: CQPA AHU 1, CQPA AHU 2B, Chiller 4, Chiller 6, Chiller 9, Chiller 3."
+ }
+ ],
+ "execution_links": [{"source": "step1", "target": "finish"}]
+}
+```
+
+**Key Features**: Linear DAG (2 steps), fully deterministic, single agent. Includes complete meta information: `text`, `characteristic_form`, `group` (retrospective), `entity` (Site), and `note` fields.
+
+### Example 2: Complex Diagnostic (Scenario 421)
+**Utterance**: "Review anomalies/alerts for Chiller 9 (CWC04009) in May-June 2020 and suggest work orders."
+
+```json
+{
+ "id": 421,
+ "type": "Workorder",
+ "deterministic": false,
+ "planning_steps": [
+ {"agent": "WOAgent", "instruction": "Create Equipment(CWC04009) and DateRange(2020-05-01 to 2020-06-30)"},
+ {"agent": "WOAgent", "instruction": "Retrieve and filter events to ALERT and ANOMALY"},
+ {"agent": "WOAgent", "instruction": "Get KPI-based and alert-based recommendations in parallel"},
+ {"agent": "WOAgent", "instruction": "Combine and deduplicate recommendations"}
+ ],
+ "execution_steps": [
+ {"name": "create_equipment", "action": "Equipment", "arguments": {"equipment_id": "CWC04009"}, "outputs": ["equipment"]},
+ {"name": "create_date_range", "action": "DateRange", "arguments": {"start_date": "2020-05-01", "end_date": "2020-06-30"}, "outputs": ["date_range"]},
+ {"name": "get_events", "action": "get_events", "arguments": {"equipment": "equipment", "date_range": "date_range"}, "outputs": ["events_pickle_path"]},
+ {"name": "load_events", "action": "pickle_load", "arguments": {"file_path": "events_pickle_path"}, "outputs": ["events"]},
+ {"name": "filter_anomaly_alert", "action": "filter", "arguments": {"obj": "events", "condition": "event.event_category in {'ALERT','ANOMALY'}"}, "outputs": ["anomaly_alert_events"]},
+ {"name": "get_top_kpi_recommendations", "action": "recommend_from_top_kpi_anomalies", "arguments": {"events": "anomaly_alert_events", "top_k": 3}, "outputs": ["kpi_recommendations"]},
+ {"name": "alert_recommendation", "action": "recommend_work_order_from_alert", "arguments": {"rule_id": "RUL0018"}, "outputs": ["alert_rec"]},
+ {"name": "combine_and_dedupe", "action": "dedupe_by_key", "arguments": {"items": "[alert_rec] + kpi_recommendations", "key": "rec.primary_code"}, "outputs": ["final_result"]},
+ {"name": "print_final", "action": "print", "arguments": {"obj": "final_result"}}
+ ],
+ "execution_links": [
+ {"source": "create_equipment", "target": "get_events"},
+ {"source": "create_date_range", "target": "get_events"},
+ {"source": "get_events", "target": "load_events"},
+ {"source": "load_events", "target": "filter_anomaly_alert"},
+ {"source": "filter_anomaly_alert", "target": "get_top_kpi_recommendations"},
+ {"source": "filter_anomaly_alert", "target": "alert_recommendation"},
+ {"source": "get_top_kpi_recommendations", "target": "combine_and_dedupe"},
+ {"source": "alert_recommendation", "target": "combine_and_dedupe"},
+ {"source": "combine_and_dedupe", "target": "print_final"}
+ ],
+ "final_out": {"CWC04009": ["M005", "M006", "OP004", "M013", "OP002"]}
+}
+```
+
+**Key Features**: Parallel branches (KPI + alert paths converge at dedupe), non-deterministic filtering/reasoning, 9 steps with 9 dependencies, ~3-4 hours to create.
+
+**DAG Pattern**:
+```
+create_equipment ──┐
+                   ├─> get_events ─> load_events ─> filter ─┬─> kpi_rec ───┐
+create_date_range ─┘                                        │              ├─> dedupe ─> print
+                                                            └─> alert_rec ─┘
+```
+
+**Comparison**:
+
+| Aspect | Example 1 | Example 2 |
+|--------|-----------|-----------|
+| Steps | 2 | 9 |
+| Dependencies | 1 | 9 |
+| Deterministic | Fully | Partially |
+| DAG Pattern | Linear | Parallel + Convergence |
+| Creation Time | 30 min | 3-4 hours |
+
+---
+
+## 10. Two Paradigms: Multi-Agent vs Multi-MCP
+
+### 10.1 Unified Ground Truth Structure
+
+**Key Insight**: Both paradigms use the **same ground truth structure**. The only difference is semantic interpretation:
+
+| Field | Multi-Agent Interpretation | Multi-MCP Interpretation |
+|-------|---------------------------|-------------------------|
+| `type` | Agent type (e.g., `IoT`, `Workorder`) | MCP server type (e.g., `filesystem`, `database`) |
+| `agent` | Agent name | MCP server name |
+| `action` | Agent-specific function | MCP tool name |
+| `planning_steps` | Agent orchestration plan | MCP server coordination plan |
+| `execution_steps` | Agent function calls | MCP tool invocations |
+
+### 10.2 Understanding the Two Approaches
+
+Ground truth creation supports two distinct but structurally similar paradigms:
+
+#### Multi-Agent Systems
+
+**Concept**: Multiple specialized AI agents, each with domain expertise, collaborate to solve complex tasks.
+
+**Characteristics**:
+- Each agent has specific capabilities (e.g., IoTAgent handles sensor data, WOAgent manages work orders)
+- Agents communicate and coordinate through orchestration
+- Agent selection is based on task requirements and domain knowledge
+- Example: AssetOpsBench with IoTAgent, FMSRAgent, TSFMAgent, WOAgent
+
+**Ground Truth Representation**:
+```json
+{
+ "type": "Workorder", // Primary agent type
+ "planning_steps": [
+ {
+ "agent": "WOAgent", // Specific agent name
+ "instruction": "Create Equipment instance"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "create_equipment",
+ "action": "Equipment", // Agent-specific action
+ "agent": "WOAgent", // Agent executing this step
+ "arguments": {...}
+ }
+ ]
+}
+```
+
+#### Multi-MCP Systems
+
+**Concept**: Multiple Model Context Protocol (MCP) servers provide standardized tools and resources that LLMs can use to accomplish tasks.
+
+**Characteristics**:
+- Each MCP server exposes specific capabilities (e.g., filesystem operations, database queries, web search)
+- MCP servers expose tools that LLMs can discover and invoke
+- Server selection is based on required operations and data sources
+- Example: MCP-Bench with filesystem, database, web-search, api-client servers
+
+**Ground Truth Representation**:
+```json
+{
+ "type": "filesystem", // Primary MCP server type
+ "planning_steps": [
+ {
+ "agent": "filesystem", // MCP server name
+ "instruction": "Read configuration file"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "read_config",
+ "action": "read_file", // MCP tool name
+ "agent": "filesystem", // MCP server executing this step
+ "arguments": {
+ "path": "/config/app.json"
+ }
+ }
+ ]
+}
+```
+
+### 10.3 When to Use Each Paradigm
+
+**Use Multi-Agent Ground Truth When**:
+- Evaluating specialized AI agent systems
+- Agents have distinct domain expertise and reasoning capabilities
+- Focus is on agent coordination and decision-making
+- Example domains: Industrial operations, healthcare diagnostics, financial analysis
+
+**Use Multi-MCP Ground Truth When**:
+- Evaluating LLM tool-using capabilities
+- Focus is on tool selection and parameter passing
+- Servers provide standardized interfaces (MCP protocol)
+- Example domains: File operations, database queries, API integrations, web research
+
+**Use Hybrid Approach When**:
+- System combines both paradigms
+- AI agents use MCP servers as tools
+- Example: An IoTAgent that uses filesystem MCP server to read sensor data files
+
+### 10.4 Conversion Between Paradigms
+
+Ground truth can be adapted between paradigms by reinterpreting fields:
+
+**Multi-Agent → Multi-MCP**:
+```json
+// Multi-Agent
+{
+ "type": "IoTAgent",
+ "action": "get_sensor_data"
+}
+
+// Equivalent Multi-MCP
+{
+ "type": "database",
+ "action": "query"
+}
+```
+
+**Multi-MCP → Multi-Agent**:
+```json
+// Multi-MCP
+{
+ "type": "filesystem",
+ "action": "read_file"
+}
+
+// Equivalent Multi-Agent
+{
+ "type": "FileAgent",
+ "action": "read"
+}
+```
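+
+The reinterpretation can also be applied mechanically. The following is a minimal sketch, assuming a hand-maintained lookup table between agent names and MCP server names. The helper `reinterpret` and the table names are hypothetical, and the tables contain only the pairs shown in the two examples above.
+
+```python
+# Hypothetical lookup tables built from the two conversion examples above.
+AGENT_TO_SERVER = {"IoTAgent": "database", "FileAgent": "filesystem"}
+AGENT_ACTION_TO_TOOL = {"get_sensor_data": "query", "read": "read_file"}
+
+
+def invert(table):
+    """Swap keys and values so the same table serves the reverse direction."""
+    return {value: key for key, value in table.items()}
+
+
+def reinterpret(record, name_map, action_map):
+    """Return a copy of record with `type`, `agent`, and `action` remapped.
+
+    Values without a mapping are kept unchanged, as are all other fields.
+    """
+    out = dict(record)
+    for key in ("type", "agent"):
+        if key in out:
+            out[key] = name_map.get(out[key], out[key])
+    if "action" in out:
+        out["action"] = action_map.get(out["action"], out["action"])
+    return out
+
+
+# Multi-Agent -> Multi-MCP
+as_mcp = reinterpret(
+    {"type": "IoTAgent", "action": "get_sensor_data"},
+    AGENT_TO_SERVER,
+    AGENT_ACTION_TO_TOOL,
+)
+
+# Multi-MCP -> Multi-Agent, using the inverted tables
+as_agent = reinterpret(
+    {"type": "filesystem", "action": "read_file"},
+    invert(AGENT_TO_SERVER),
+    invert(AGENT_ACTION_TO_TOOL),
+)
+
+print(as_mcp)
+print(as_agent)
+```
+
+Because only field values change, `planning_steps`, `execution_steps`, and `execution_links` keep the same shape in both directions.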
+
+### 10.5 Examples of Both Paradigms
+
+#### Example 1: Multi-Agent Ground Truth
+
+**Scenario**: Retrieve work orders for equipment
+
+```json
+{
+ "id": 400,
+ "text": "Get the work order of equipment CWC04013 for year 2017.",
+ "type": "Workorder",
+ "category": "Data Query",
+ "deterministic": true,
+ "planning_steps": [
+ {
+ "agent": "WOAgent",
+ "instruction": "Create Equipment instance for CWC04013"
+ },
+ {
+ "agent": "WOAgent",
+ "instruction": "Retrieve work orders for 2017"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "create_equipment",
+ "action": "Equipment",
+ "agent": "WOAgent",
+ "arguments": {"equipment_id": "CWC04013"}
+ },
+ {
+ "name": "get_work_orders",
+ "action": "get_work_orders",
+ "agent": "WOAgent",
+ "arguments": {
+ "equipment": "equipment",
+ "date_range": "date_range"
+ }
+ }
+ ]
+}
+```
+
+#### Example 2: Multi-MCP Ground Truth
+
+**Scenario**: Read and analyze configuration file
+
+```json
+{
+ "id": 501,
+ "text": "Read the application configuration and validate required fields.",
+ "type": "filesystem",
+ "category": "Data Query",
+ "deterministic": true,
+ "planning_steps": [
+ {
+ "agent": "filesystem",
+ "instruction": "Read configuration file from disk"
+ },
+ {
+ "agent": "json-parser",
+ "instruction": "Parse JSON configuration"
+ },
+ {
+ "agent": "validator",
+ "instruction": "Validate required fields"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "read_config",
+ "action": "read_file",
+ "agent": "filesystem",
+ "arguments": {
+ "path": "/config/app.json"
+ },
+ "outputs": ["config_content"]
+ },
+ {
+ "name": "parse_json",
+ "action": "parse",
+ "agent": "json-parser",
+ "arguments": {
+ "content": "config_content"
+ },
+ "outputs": ["config_object"]
+ },
+ {
+ "name": "validate",
+ "action": "validate_schema",
+ "agent": "validator",
+ "arguments": {
+ "data": "config_object",
+ "required_fields": ["api_key", "endpoint", "timeout"]
+ },
+ "outputs": ["validation_result"]
+ }
+ ]
+}
+```
+
+#### Example 3: Hybrid Ground Truth
+
+**Scenario**: IoT agent uses filesystem MCP server
+
+```json
+{
+ "id": 502,
+ "text": "Load sensor data from CSV file and detect anomalies.",
+ "type": "multiagent",
+ "category": "Complex Query",
+ "deterministic": false,
+ "planning_steps": [
+ {
+ "agent": "filesystem",
+ "instruction": "Read sensor data CSV file"
+ },
+ {
+ "agent": "IoTAgent",
+ "instruction": "Parse and validate sensor data"
+ },
+ {
+ "agent": "AnomalyDetectionAgent",
+ "instruction": "Detect anomalies in sensor readings"
+ }
+ ],
+ "execution_steps": [
+ {
+ "name": "read_csv",
+ "action": "read_file",
+ "agent": "filesystem",
+ "arguments": {"path": "/data/sensors.csv"},
+ "outputs": ["csv_content"]
+ },
+ {
+ "name": "parse_data",
+ "action": "parse_sensor_data",
+ "agent": "IoTAgent",
+ "arguments": {"csv_data": "csv_content"},
+ "outputs": ["sensor_readings"]
+ },
+ {
+ "name": "detect_anomalies",
+ "action": "detect",
+ "agent": "AnomalyDetectionAgent",
+ "arguments": {"data": "sensor_readings"},
+ "outputs": ["anomalies"]
+ }
+ ]
+}
+```
+
+### 10.6 Choosing the Right Paradigm for Your Ground Truth
+
+**Decision Framework**:
+
+1. **What are you evaluating?**
+ - AI agent reasoning and coordination → Multi-Agent
+ - LLM tool-using capabilities → Multi-MCP
+ - Both → Hybrid
+
+2. **What is your system architecture?**
+ - Specialized agents with domain logic → Multi-Agent
+ - MCP servers with standardized tools → Multi-MCP
+ - Agents using MCP tools → Hybrid
+
+3. **What is your evaluation focus?**
+ - Agent selection and orchestration → Multi-Agent
+ - Tool selection and parameter passing → Multi-MCP
+ - End-to-end task completion → Either/Both
+
+**Recommendation**: Start with the paradigm that matches your system architecture, then adapt as needed. The ground truth structure is flexible enough to support both.
+
+---
+
+## 11. Summary
+
+### Core Requirements for High-Quality Ground Truth
+
+1. **Clear Understanding**: Thoroughly understand utterance intent and domain context
+2. **Structured Planning**: Break down problems into logical `planning_steps`
+3. **Detailed Execution**: Specify exact tools, arguments, and data flow in `execution_steps`
+4. **Valid DAG**: Ensure `execution_links` form a proper directed acyclic graph
+5. **Accurate Determinism**: Mark determinism correctly at all levels
+6. **Complete Output**: Provide `final_out` or `final_out_description` with validation criteria
+7. **Thorough Validation**: Verify completeness, correctness, and testability
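+
+The "Valid DAG" requirement can be checked automatically. The following is a minimal sketch, assuming Python 3.9+ (for the standard-library `graphlib` module) and links shaped like the `execution_links` entries in this document. The function name `check_links` is hypothetical.
+
+```python
+from graphlib import CycleError, TopologicalSorter
+
+
+def check_links(step_names, links):
+    """Return a list of problems; an empty list means the links form a valid DAG."""
+    problems = []
+    known = set(step_names)
+    # Map each step to the set of steps it depends on.
+    graph = {name: set() for name in step_names}
+    for link in links:
+        source, target = link["source"], link["target"]
+        for node in (source, target):
+            if node not in known:
+                problems.append(f"unknown step: {node}")
+        if source in known and target in known:
+            graph[target].add(source)
+    try:
+        # static_order() raises CycleError if the graph contains a cycle.
+        tuple(TopologicalSorter(graph).static_order())
+    except CycleError as exc:
+        problems.append(f"cycle detected: {exc.args[1]}")
+    return problems
+
+
+steps = ["create_equipment", "get_work_orders"]
+links = [{"source": "create_equipment", "target": "get_work_orders"}]
+print(check_links(steps, links))
+```
+
+A link that names a step missing from `execution_steps` is reported separately from a cycle, since the two usually have different causes (a typo versus a modeling error).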
+
+### Key Principles
+
+- **Consistency**: Standard patterns and naming conventions
+- **Completeness**: All necessary steps and information
+- **Clarity**: Explicit and unambiguous expectations
+- **Validity**: Structural and semantic correctness
+- **Testability**: Automated evaluation and comparison
+
+### Document Organization
+
+- **Main Sections** (1-11): Core concepts, workflow, structure, best practices, examples, value proposition
+- **References** (Section 12): Citations and related work
+- **Appendices** (A-E): Detailed technical content
+  - **Appendix A**: JSON Schema
+  - **Appendix B**: Evaluation Metrics Details
+  - **Appendix C**: Common Failure Patterns
+  - **Appendix D**: Domain-Specific Examples
+  - **Appendix E**: Quick Reference Checklist
+
+### Getting Started
+
+1. Review **Section 2** (High-Level Workflow) for the 5-phase process
+2. Study the **Illustrated Examples** section for concrete patterns
+3. Use **Appendix E** (Quick Reference Checklist) during creation
+4. Consult **Section 4** (Ground Truth Structure) for field definitions
+5. Reference **Section 10** (Two Paradigms: Multi-Agent vs Multi-MCP) for multi-agent vs. multi-MCP guidance
+
+### Value Proposition
+
+Ground truth trajectories enable:
+- Systematic benchmarking of agent architectures
+- Objective comparison of models and approaches
+- Automated discovery of failure modes
+- Reproducible research across the community
+- Confident deployment to production systems
+
+**ROI**: 564 hours to create 141 scenarios → 1000+ hours saved in evaluation (AssetOpsBench experience)
+
+By following these best practices, you can create ground truth data that enables systematic, reproducible evaluation of AI agent systems across diverse domains and use cases.
+
+---
+
+## 12. References
+
+### Primary References
+
+1. **Patel, D., Lin, S., Rayfield, J., Zhou, N., Vaculin, R., Martinez, N., O'Donncha, F., & Kalagnanam, J. (2025)**. "AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance." *arXiv preprint arXiv:2506.03828*.
+ - URL: https://arxiv.org/abs/2506.03828
+ - Key Contribution: 141 ground truth scenarios, evaluation metrics (rubric-based + ROUGE), architectural comparison
+2. **AssetOpsBench Utterance Dataset**. IBM Research (2025).
+ - URL: https://huggingface.co/datasets/ibm-research/AssetOpsBench
+ - Contains: 152 task utterances for industrial asset operations and maintenance scenarios
+ - Platform: HuggingFace Datasets
+3. **AssetOpsBench Ground Truth Dataset (2026 - IBM Internal)**.
+ - Repository: `AssetOpsBenchGroundTruth/`
+ - Contains: 152 validated scenarios across IoT, Workorder, TSFM, and FMSR domains
+4. **Lin, C. Y. (2004)**. "ROUGE: A Package for Automatic Evaluation of Summaries." *Proceedings of the ACL Workshop on Text Summarization Branches Out*.
+ - Used for planning and execution alignment scoring in ground truth evaluation
+
+### Technical Specifications
+
+5. **Model Context Protocol (MCP) Specification**. Anthropic.
+ - URL: https://modelcontextprotocol.io/
+ - Framework for standardized tool interfaces in multi-MCP systems
+
+---
+
+## Appendix A: Ground Truth JSON Schema
+
+```json
+{
+ "$schema": "http://json-schema.org/draft-07/schema#",
+ "type": "object",
+ "required": ["id", "text", "type", "category", "deterministic", "characteristic_form", "planning_steps", "execution_steps", "execution_links"],
+ "properties": {
+ "id": {"type": "integer"},
+ "uuid": {"type": "string", "format": "uuid"},
+ "text": {"type": "string"},
+ "type": {"type": "string"},
+ "category": {"type": "string"},
+ "deterministic": {"type": "boolean"},
+ "characteristic_form": {"type": "string"},
+ "planning_steps": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "required": ["agent", "instruction"],
+ "properties": {
+ "agent": {"type": "string"},
+ "instruction": {"type": "string"}
+ }
+ }
+ },
+ "execution_steps": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "required": ["name", "action", "agent"],
+ "properties": {
+ "name": {"type": "string"},
+ "action": {"type": "string"},
+ "agent": {"type": "string"},
+ "arguments": {"type": ["object", "string"]},
+ "outputs": {"type": "array"},
+ "deterministic": {
+ "type": "object",
+ "properties": {
+ "name": {"type": "boolean"},
+ "action": {"type": "boolean"},
+ "arguments": {"type": "boolean"},
+ "outputs": {"type": "boolean"}
+ }
+ }
+ }
+ }
+ },
+ "execution_links": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "required": ["source", "target"],
+ "properties": {
+ "source": {"type": "string"},
+ "target": {"type": "string"}
+ }
+ }
+ },
+ "final_out": {"type": ["object", "null"]},
+ "final_out_description": {
+ "type": "array",
+ "items": {"type": "string"}
+ }
+ }
+}
+```
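+
+As a lightweight pre-check, the `required` lists of the schema can be enforced with the standard library alone. The following is a minimal sketch: it checks only field presence (not types or formats), and the function name `missing_fields` and the sample record are illustrative assumptions. A full JSON Schema validator would still be needed to cover the rest of the schema.
+
+```python
+# Required fields copied from the schema above.
+TOP_LEVEL = [
+    "id", "text", "type", "category", "deterministic",
+    "characteristic_form", "planning_steps", "execution_steps",
+    "execution_links",
+]
+ITEM_REQUIRED = {
+    "planning_steps": ["agent", "instruction"],
+    "execution_steps": ["name", "action", "agent"],
+    "execution_links": ["source", "target"],
+}
+
+
+def missing_fields(record):
+    """Return dotted paths of required fields absent from a ground truth record."""
+    missing = [field for field in TOP_LEVEL if field not in record]
+    for array_name, required in ITEM_REQUIRED.items():
+        for index, item in enumerate(record.get(array_name, [])):
+            for field in required:
+                if field not in item:
+                    missing.append(f"{array_name}[{index}].{field}")
+    return missing
+
+
+# Deliberately incomplete record: no characteristic_form, and the step has no agent.
+record = {
+    "id": 400,
+    "text": "Get the work order of equipment CWC04013 for year 2017.",
+    "type": "Workorder",
+    "category": "Data Query",
+    "deterministic": True,
+    "planning_steps": [
+        {"agent": "WOAgent", "instruction": "Create Equipment instance for CWC04013"}
+    ],
+    "execution_steps": [{"name": "create_equipment", "action": "Equipment"}],
+    "execution_links": [],
+}
+print(missing_fields(record))
+```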
+
+---
+
+## Appendix C: Common Failure Patterns
+
+### C.1 Tool Selection Errors
+
+**Pattern**: Agent uses wrong tool for the task
+```json
+// Ground Truth
+{"action": "get_work_orders"}
+
+// Agent Execution
+{"action": "get_events"} // Wrong tool
+```
+
+**Detection**: Compare `execution_steps[].action` fields
+**Fix**: Improve tool descriptions, add examples
+
+### C.2 Parameter Errors
+
+**Pattern**: Agent uses incorrect arguments
+```json
+// Ground Truth
+{"equipment_id": "CWC04013"}
+
+// Agent Execution
+{"equipment_id": "CWC04009"} // Wrong equipment
+```
+
+**Detection**: Compare `execution_steps[].arguments` fields
+**Fix**: Validate parameter extraction, add constraints
+
+### C.3 Missing Steps
+
+**Pattern**: Agent skips required operations
+```
+// Ground Truth: 5 steps
+create_equipment → create_date_range → get_events → load_events → process
+
+// Agent: 3 steps (missing load_events, process)
+create_equipment → create_date_range → get_events
+```
+
+**Detection**: Compare DAG node counts and structure
+**Fix**: Improve planning, add step-by-step guidance
+
+### C.4 Incorrect Sequencing
+
+**Pattern**: Agent executes steps in wrong order
+```
+// Ground Truth
+A → B → C
+
+// Agent Execution
+A → C → B // Wrong order, may cause dependency errors
+```
+
+**Detection**: Compare `execution_links` topology
+**Fix**: Enforce dependency constraints, validate prerequisites
+
+### C.5 Premature Termination
+
+**Pattern**: Agent stops before completing workflow
+```
+// Ground Truth
+... → final_step → Finish
+
+// Agent Execution
+... → partial_step // Stops early, no Finish
+```
+
+**Detection**: Check for `Finish` action and `final_out` presence
+**Fix**: Add completion criteria, validate output requirements
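+
+The detection rules for patterns C.1 to C.3 can be combined into one comparison pass. The following is a minimal sketch, assuming the agent trajectory reuses the ground truth step names so that steps can be matched by `name`. The function `diagnose`, the finding labels, and the sample `load_events` action are hypothetical.
+
+```python
+def diagnose(truth_steps, agent_steps):
+    """Compare two `execution_steps` lists and return (finding, step_name) pairs."""
+    agent_by_name = {step["name"]: step for step in agent_steps}
+    findings = []
+    for expected in truth_steps:
+        actual = agent_by_name.get(expected["name"])
+        if actual is None:
+            findings.append(("missing_step", expected["name"]))        # C.3
+        elif actual.get("action") != expected.get("action"):
+            findings.append(("wrong_tool", expected["name"]))          # C.1
+        elif actual.get("arguments") != expected.get("arguments"):
+            findings.append(("wrong_arguments", expected["name"]))     # C.2
+    return findings
+
+
+truth = [
+    {"name": "create_equipment", "action": "Equipment",
+     "arguments": {"equipment_id": "CWC04013"}},
+    {"name": "get_work_orders", "action": "get_work_orders", "arguments": {}},
+    {"name": "load_events", "action": "load_events"},
+]
+agent = [
+    {"name": "create_equipment", "action": "Equipment",
+     "arguments": {"equipment_id": "CWC04009"}},
+    {"name": "get_work_orders", "action": "get_events", "arguments": {}},
+]
+print(diagnose(truth, agent))
+```
+
+Sequencing and termination errors (C.4, C.5) depend on `execution_links` and `final_out` rather than on individual steps, so they are left out of this sketch.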
+
+---
+
+## Appendix D: Domain-Specific Examples
+
+### D.1 Healthcare Diagnosis
+
+```json
+{
+ "id": 701,
+ "text": "Review patient symptoms and lab results to recommend diagnostic tests.",
+ "type": "Healthcare",
+ "category": "Diagnostic Recommendation",
+ "planning_steps": [
+ {"agent": "DiagnosticAgent", "instruction": "Retrieve patient symptoms and history"},
+ {"agent": "DiagnosticAgent", "instruction": "Analyze lab results"},
+ {"agent": "DiagnosticAgent", "instruction": "Recommend diagnostic tests based on findings"}
+ ]
+}
+```
+
+### D.2 Financial Analysis
+
+```json
+{
+ "id": 801,
+ "text": "Analyze Q4 revenue trends and forecast Q1 performance.",
+ "type": "Finance",
+ "category": "Forecasting",
+ "planning_steps": [
+ {"agent": "FinanceAgent", "instruction": "Retrieve Q4 revenue data"},
+ {"agent": "FinanceAgent", "instruction": "Calculate trends and growth rates"},
+ {"agent": "FinanceAgent", "instruction": "Generate Q1 forecast using time series model"}
+ ]
+}
+```
+
+### D.3 Customer Support
+
+```json
+{
+ "id": 901,
+ "text": "Resolve customer complaint about delayed shipment.",
+ "type": "CustomerService",
+ "category": "Issue Resolution",
+ "planning_steps": [
+ {"agent": "SupportAgent", "instruction": "Retrieve order and shipment details"},
+ {"agent": "SupportAgent", "instruction": "Identify delay cause"},
+ {"agent": "SupportAgent", "instruction": "Propose resolution and compensation"}
+ ]
+}
+```
+
+---
+
diff --git a/docs/guideline/utterance_design_guideline.md b/docs/guideline/utterance_design_guideline.md
new file mode 100644
index 00000000..ecde023d
--- /dev/null
+++ b/docs/guideline/utterance_design_guideline.md
@@ -0,0 +1,584 @@
+# Utterance Design Guideline for Domain-Agnostic Applications
+
+## Purpose
+
+This guideline helps domain experts and developers create well-structured, consistent utterances across diverse domains. Utterances are natural language queries or commands that users make to interact with AI-powered systems.
+
+**Why This Guideline**: Well-designed utterances enable:
+- **Consistent format**: Standardized structure makes utterances digestible and analyzable by AI agents
+- **Clear categorization**: Organized by problem type (understanding what happened, predicting what will happen, helping humans make decisions)
+- **Domain adaptability**: Reusable patterns that work across different contexts
+
+**Domain-Agnostic Design**: This guideline is designed to handle various domains through a consistent abstraction pattern. The same patterns apply across:
+- **Manufacturing**: Production lines, robotics, quality control systems
+- **Energy & Utilities**: Power generation, transmission, distribution networks
+- **Transportation**: Fleet management, logistics, rail/aviation systems
+- **Oil & Gas**: Refineries, pipelines, drilling operations
+- **Process Industries**: Chemical plants, food processing, pharmaceuticals
+- **Building Operations**: HVAC, elevators, lighting, security systems
+- **Healthcare**: Patient monitoring, diagnosis support, treatment planning
+- **Financial Services**: Risk assessment, fraud detection, portfolio optimization
+- **Insurance**: Claims processing, risk evaluation, policy recommendations
+- **Education**: Student performance tracking, curriculum planning, resource allocation
+- **Industrial Asset Management**: Equipment monitoring, predictive maintenance, lifecycle optimization
+
+**A Good Set of Utterances for a Given Domain Should**:
+- **Help understand what happened**: Enable historical data analysis, knowledge extraction, and current state assessment
+- **Help predict what will happen**: Support future state forecasting, risk estimation, and trend projection
+- **Help humans make decisions**: Provide recommendations, optimization, and strategic guidance
+
+**For Example, in Industrial Asset Management Contexts, This Translates To**:
+
+- Operational telemetry and equipment monitoring (understanding what happened)
+- Predictive maintenance and failure forecasting (predicting what will happen)
+- Performance optimization and decision support (helping humans make decisions)
+
+---
+
+## Core Problem Groups
+
+To design effective utterances across any domain, we define three universal problem categories that span the complete lifecycle of data-driven decision-making:
+
+### 1. Retrospective (Knowledge Extraction)
+
+**Definition**
+Understanding historical or current state through data retrieval and analysis.
+
+**Universal Coverage**
+- Entity inventory and configuration
+- Historical measurements and telemetry
+- Event, incident, and transaction records
+- Knowledge base and documentation lookup
+- Data completeness and quality assessment
+
+**Focus Question**
+*What has happened or what is known?*
+
+**Cross-Domain Examples**
+- **Manufacturing**: "What production lines are operational?" | "List all quality defects from last shift."
+- **Energy & Utilities**: "What substations are in the network?" | "Retrieve all outage events from last month."
+- **Transportation**: "What vehicles are in the fleet?" | "List all maintenance records for Train-456."
+- **Healthcare**: "What is the patient's medical history?" | "List all lab results from the past year."
+- **Financial Services**: "What were the transaction volumes last quarter?" | "Retrieve all trades for account XYZ."
+
+---
+
+### 2. Predictive
+
+**Definition**
+Estimating future behavior or states under status quo conditions (no interventions).
+
+**Universal Coverage**
+- Performance and condition forecasting
+- Risk and anomaly likelihood estimation
+- Event probability and trend projection
+- Demand and workload prediction
+- Degradation and lifecycle estimation
+
+**Focus Question**
+*What is likely to happen next?*
+
+**Cross-Domain Examples**
+- **Manufacturing**: "Predict production line downtime risk." | "Forecast product quality metrics for next batch."
+- **Energy & Utilities**: "Forecast power demand for next week." | "Predict transformer failure probability."
+- **Transportation**: "Predict vehicle maintenance needs." | "Estimate delivery delays for Route-A."
+- **Healthcare**: "What is the patient's readmission risk?" | "Predict disease progression over the next 6 months."
+- **Financial Services**: "Forecast stock price for next quarter." | "Estimate default probability for this loan."
+
+---
+
+### 3. Prescriptive
+
+**Definition**
+Determining optimal interventions, actions, or strategies under operational constraints.
+
+**Universal Coverage**
+- Corrective and preventive action planning
+- Strategy selection and timing optimization
+- Entity prioritization based on risk and impact
+- Resource, budget, and capacity allocation
+- Policy and rule definition
+- Trade-off analysis across multiple objectives
+
+**Focus Question**
+*What should be done and when?*
+
+**Cross-Domain Examples**
+- **Manufacturing**: "Recommend corrective actions for quality issue." | "Prioritize production orders for next shift."
+- **Energy & Utilities**: "Optimize load balancing strategy." | "Recommend preventive maintenance for grid equipment."
+- **Transportation**: "Optimize route assignments." | "Recommend vehicle replacement strategy."
+- **Healthcare**: "Recommend treatment plan for this diagnosis." | "Prioritize patients for ICU admission."
+- **Financial Services**: "Suggest portfolio rebalancing strategy." | "Recommend fraud prevention actions."
+
+---
+
+## Domain Abstraction Pattern
+
+To create domain-agnostic utterances that can be adapted across applications, use a consistent placeholder system:
+
+### Core Placeholders
+
+| Placeholder | Description | Examples |
+|------------|-------------|----------|
+| `{ENTITY}` | Primary object of analysis | Patient, Account, Equipment, Vehicle, Student |
+| `{ENTITY_CLASS}` | Category or type | Diagnosis, Portfolio, HVAC, Fleet, Course |
+| `{LOCATION}` | Physical or logical location | Hospital, Branch, Site, Route, Campus |
+| `{METRIC}` | Measurement or KPI | Heart Rate, Balance, Temperature, Speed, Grade |
+| `{EVENT_TYPE}` | Type of occurrence | Admission, Transaction, Alert, Delay, Enrollment |
+| `{EVENT_NAME}` | Specific event label | Readmission, Withdrawal, Failure, Accident, Dropout |
+| `{TIME_RANGE}` | Temporal scope | "Last month", "2020-06-01 to 2020-06-30", "Next quarter" |
+| `{WINDOW}` | Time horizon | "Within 14 days", "Next maintenance window", "By end of semester" |
+| `{ACTION}` | Intervention or decision | Treatment, Investment, Maintenance, Reroute, Tutoring |
+| `{CONDITION}` | Trigger or state | "When anomaly detected", "If balance < threshold", "When grade drops" |
+| `{OUTPUT_FORM}` | Response format | List, Table, Chart, File, Summary |
+
+### Usage Example
+
+**Generic Template**:
+"Retrieve {METRIC} for {ENTITY} at {LOCATION} over {TIME_RANGE}"
+
+**Domain Adaptations**:
+- Healthcare: "Retrieve blood pressure for Patient-123 at Boston General over last month"
+- Finance: "Retrieve transaction volume for Account-456 at NYC Branch over Q1 2024"
+- Manufacturing: "Retrieve cycle time for Robot-5 at Assembly Line A over last week"
+- Education: "Retrieve attendance for Student-789 at Main Campus over last semester"
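+
+The placeholder substitution above maps directly onto string formatting. The following is a minimal sketch using Python's standard library. The helper names `placeholders` and `fill` are hypothetical, and the values are the healthcare adaptation listed above.
+
+```python
+from string import Formatter
+
+TEMPLATE = "Retrieve {METRIC} for {ENTITY} at {LOCATION} over {TIME_RANGE}"
+
+
+def placeholders(template):
+    """List the placeholder names in a template, in order of appearance."""
+    return [name for _, name, _, _ in Formatter().parse(template) if name]
+
+
+def fill(template, **values):
+    """Substitute values into the template, failing loudly on missing placeholders."""
+    missing = [name for name in placeholders(template) if name not in values]
+    if missing:
+        raise ValueError(f"missing placeholders: {missing}")
+    return template.format(**values)
+
+
+healthcare = fill(
+    TEMPLATE,
+    METRIC="blood pressure",
+    ENTITY="Patient-123",
+    LOCATION="Boston General",
+    TIME_RANGE="last month",
+)
+print(healthcare)
+```
+
+Checking for missing placeholders before formatting keeps a half-filled template from slipping into an utterance set unnoticed.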
+
+---
+
+## Utterance Schema Structure
+
+Each utterance should be documented with the following attributes in this order:
+
+### Required Fields
+
+1. **id** (integer): Unique identifier for the utterance
+2. **text** (string): The actual utterance text in natural language
+3. **type** (string): The MCP server, AI agent, service, or system component that processes this utterance
+ - In MCP (Model Context Protocol) systems: Identifies which MCP server handles the request (e.g., "filesystem", "database", "web-search")
+ - In multi-agent systems: Identifies which agent handles the request (e.g., "IoT", "FMSA", "TSFM", "Workorder", "multiagent")
+ - In single-agent systems: Can represent the domain or service type (e.g., "Healthcare", "Finance", "CustomerService")
+ - In microservices: Can indicate the service endpoint (e.g., "PatientAPI", "BillingService", "InventoryService")
+4. **category** (string): Classification of the utterance type (see Category Definitions section)
+5. **deterministic** (boolean): Whether the utterance has a single correct answer (true) or allows multiple valid responses (false)
+6. **characteristic_form** (string): Description of the expected response format and content
+7. **group** (string or array): Problem category - RETROSPECTIVE, PREDICTIVE, PRESCRIPTIVE, or combination (e.g., ["retrospective", "predictive"])
+8. **entity** (string or array): Primary entity/entities the utterance addresses (e.g., "Chiller", "Patient", "Account", "Vehicle")
+9. **note** (string): A flexible field for any information useful to explain or contextualize the utterance, including source, owner/contributor, omitted details, design rationale, implementation notes, or any other relevant metadata
+
+---
+
+### Field Descriptions
+
+**type**: Specifies which MCP server, AI agent, service, or system component processes this utterance:
+- **MCP (Model Context Protocol) systems**: MCP server identifier (e.g., "filesystem" for file operations, "database" for data queries, "web-search" for internet searches, "git" for version control operations)
+- **Multi-agent systems**: Agent identifier (e.g., "IoT" for sensor data, "FMSA" for failure analysis, "TSFM" for time series forecasting, "Workorder" for maintenance management)
+- **Single-agent systems**: Domain or service type (e.g., "Healthcare" for medical queries, "Finance" for financial operations)
+- **Microservices architecture**: Service endpoint or API name (e.g., "PatientAPI", "BillingService")
+- **Purpose**: Enables routing, load balancing, and proper request handling in distributed systems
+
+**group**: Indicates which of the three core problem categories the utterance belongs to:
+- **RETROSPECTIVE**: Understanding what happened or what is known
+- **PREDICTIVE**: Estimating what will happen next
+- **PRESCRIPTIVE**: Determining what should be done and when
+- Can be a single value or array if utterance spans multiple categories. Note: Use lowercase values in actual JSON (e.g., "retrospective", "predictive", "prescriptive")
+
+**entity**: Specifies the primary physical subject(s) of the utterance - the tangible things being analyzed, monitored, or acted upon:
+- In industrial domains: Physical equipment (e.g., "Chiller", "Pump", "Transformer", "Sensor")
+- In healthcare: Physical entities (e.g., "Patient", "MedicalDevice", "Facility")
+- In finance: Physical or tangible assets (e.g., "Branch", "ATM", "DataCenter")
+- In transportation: Physical assets (e.g., "Vehicle", "Aircraft", "Vessel")
+- Can be a single value or array for multi-entity queries
+- Note: Use physical things, not abstract concepts (e.g., use "Chiller" not "Anomaly", use "Equipment" not "WorkOrder")
+
+**note**: A flexible, required field for storing any information useful to explain or contextualize the utterance. Common uses include:
+- **Source**: Where the utterance originated (e.g., "SME Interview - John Doe", "Customer Request #1234", "Internal Analysis")
+- **Owner/Contributor**: Who provided or is responsible for this utterance
+- **Omitted Details**: Information intentionally left out for brevity or abstraction
+- **Design Rationale**: Why this utterance was structured this way
+- **Implementation Notes**: Technical considerations or constraints
+- **Cross-references**: Related utterances or dependencies
+- **Domain-Specific Context**: Any additional information that aids understanding, maintenance, or usage
+
+The field is intentionally flexible to accommodate diverse documentation needs across different domains and use cases.
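+
+The required fields and the lowercase `group` values described above can be checked before an utterance is added to a dataset. The following is a minimal sketch. The function name `check_utterance` is hypothetical, and the check covers only field presence, `group` values, and the boolean type of `deterministic`.
+
+```python
+REQUIRED_FIELDS = (
+    "id", "text", "type", "category", "deterministic",
+    "characteristic_form", "group", "entity", "note",
+)
+VALID_GROUPS = {"retrospective", "predictive", "prescriptive"}
+
+
+def check_utterance(utterance):
+    """Return a list of problems found in one utterance record (empty if none)."""
+    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in utterance]
+    groups = utterance.get("group", [])
+    if isinstance(groups, str):
+        groups = [groups]  # a single group may be given as a plain string
+    problems += [f"invalid group: {g}" for g in groups if g not in VALID_GROUPS]
+    if "deterministic" in utterance and not isinstance(utterance["deterministic"], bool):
+        problems.append("deterministic must be a boolean")
+    return problems
+
+
+utterance = {
+    "id": 1,
+    "text": "What IoT sites are available?",
+    "type": "IoT",
+    "category": "Knowledge Query",
+    "deterministic": True,
+    "characteristic_form": "The expected response should be the return value of all sites",
+    "group": "RETROSPECTIVE",  # uppercase on purpose, to show the check firing
+    "entity": "Site",
+    "note": "Source: Initial domain analysis",
+}
+print(check_utterance(utterance))
+```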
+
+---
+
+### Examples
+
+**Basic Example**:
+```json
+{
+ "id": 1,
+ "text": "What IoT sites are available?",
+ "type": "IoT",
+ "category": "Knowledge Query",
+ "deterministic": true,
+ "characteristic_form": "The expected response should be the return value of all sites, either as text or as a reference to a file",
+ "group": "retrospective",
+ "entity": "Site",
+ "note": "Source: Initial domain analysis; Owner: Domain SME Team; Basic inventory query pattern"
+}
+```
+
+**Prescriptive Example**:
+```json
+{
+ "id": 416,
+ "text": "When an anomaly happens for equipment CWC04009, can you recommend top three work orders?",
+ "type": "Workorder",
+ "category": "Decision Support",
+ "deterministic": false,
+ "characteristic_form": "It gives a list of work order with a primary failure code.",
+ "group": "prescriptive",
+ "entity": "Chiller",
+ "note": "Source: Operations team request; Owner: Maintenance SME; Supports proactive maintenance workflow"
+}
+```
+
+**Multi-Group Example**:
+```json
+{
+ "id": 420,
+ "text": "Review the performance of chiller 9 for June 2020 and track any anomalies or operation violations as alerts.",
+ "type": "Workorder",
+ "category": "Decision Support",
+ "deterministic": false,
+ "characteristic_form": "Review operational performance and detect anomalies or alerts for Chiller 9 during June 2020.",
+ "group": ["retrospective", "predictive"],
+ "entity": "Chiller",
+ "note": "Source: Operations dashboard requirements; Owner: Analytics Team; Multi-category query combining historical review with anomaly detection"
+}
+```
+
+**Example with Note**:
+```json
+{
+ "id": 409,
+ "text": "Get the daily count of alert, anomaly and work order event for May 2020.",
+ "type": "Workorder",
+ "category": "Knowledge Query",
+ "deterministic": false,
+ "characteristic_form": "There are 26 days with records. Depending on the LLM used, the result could be daily total event summary or daily summary for each event type.",
+ "group": "retrospective",
+ "entity": ["Alert", "Anomaly", "WorkOrder"],
+ "note": "We have both work order business data object and work order event as a group type in the event file."
+}
+```
+
+---
+
+## Category Definitions (Suggested)
+
+The following categories are **suggested classifications** that have proven useful across multiple domains. You may adapt, combine, or create new categories based on your specific domain needs.
+
+### Information Retrieval
+Requests for accessing existing information from knowledge bases or databases.
+- **Characteristics**: Direct lookup, factual responses, deterministic answers
+- **Generic Pattern**: "What is...", "List all...", "Show me...", "Retrieve..."
+- **Cross-Domain Examples**:
+ - Healthcare: "What is the patient's blood type?"
+ - Finance: "List all accounts for customer ID 12345"
+ - Education: "Show me all courses in the Computer Science department"
+ - Manufacturing: "List all equipment in Building A"
+
+### Data Extraction
+Requests for retrieving specific data points, measurements, or records over time.
+- **Characteristics**: Time-bound queries, data export, file generation
+- **Generic Pattern**: "Download...", "Export...", "Get data for...", "Retrieve records from..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Download patient vital signs from last week"
+ - Finance: "Export transaction history for Q1 2024"
+ - Education: "Get attendance records for Student-123 from last semester"
+ - Manufacturing: "Retrieve production metrics from last shift"
+
+### Analysis & Inference
+Requests for AI-powered analysis, predictions, or pattern recognition on provided data.
+- **Characteristics**: Model execution, computational analysis, insight generation
+- **Generic Pattern**: "Analyze...", "Predict...", "Forecast...", "Identify patterns in..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Predict readmission risk using patient data"
+ - Finance: "Analyze portfolio performance and identify trends"
+ - Education: "Predict student dropout risk based on engagement data"
+ - Manufacturing: "Forecast production output for next week"
+
+### Model Customization
+Requests for adapting or fine-tuning AI models to specific contexts or data.
+- **Characteristics**: Model training, parameter optimization, domain adaptation
+- **Generic Pattern**: "Fine-tune...", "Train model on...", "Customize...", "Adapt model for..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Fine-tune diagnosis model using hospital-specific data"
+ - Finance: "Train fraud detection model on our transaction patterns"
+ - Education: "Customize recommendation engine for our curriculum"
+ - Manufacturing: "Adapt quality prediction model to our production line"
+
+### Anomaly & Exception Detection
+Requests for identifying unusual patterns, outliers, or deviations from normal behavior.
+- **Characteristics**: Pattern deviation detection, threshold monitoring, outlier identification
+- **Generic Pattern**: "Detect anomalies in...", "Identify unusual...", "Find outliers in...", "Are there any exceptions in..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Detect anomalies in patient vital signs"
+ - Finance: "Identify unusual transaction patterns"
+ - Education: "Find students with abnormal attendance patterns"
+ - Manufacturing: "Detect quality anomalies in production batch"
+
+### Recommendation & Optimization
+Requests for actionable recommendations, strategic guidance, or optimization suggestions.
+- **Characteristics**: Multi-factor analysis, trade-off evaluation, decision support
+- **Generic Pattern**: "Recommend...", "Suggest...", "What should I do when...", "Optimize..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Recommend treatment plan for this diagnosis"
+ - Finance: "Suggest portfolio rebalancing strategy"
+ - Education: "Recommend courses for student based on interests and performance"
+ - Manufacturing: "Suggest process improvements to reduce defects"
+
+### Future State Prediction
+Requests for forecasting future states, events, or probabilities.
+- **Characteristics**: Probabilistic outputs, time-horizon projections, scenario analysis
+- **Generic Pattern**: "What will happen...", "Forecast...", "Estimate probability of...", "Predict future..."
+- **Cross-Domain Examples**:
+ - Healthcare: "Predict patient deterioration risk over next 48 hours"
+ - Finance: "Forecast revenue for next quarter"
+ - Education: "Predict enrollment numbers for next academic year"
+ - Manufacturing: "Estimate equipment failure probability over next month"
+
+### Multi-Step Orchestration
+Complex requests requiring coordination of multiple operations in sequence.
+- **Characteristics**: Multiple tool invocations, sequential processing, workflow execution
+- **Generic Pattern**: "First... then... finally...", "Retrieve X, analyze Y, and recommend Z"
+- **Cross-Domain Examples**:
+ - Healthcare: "Retrieve patient history, analyze trends, and recommend interventions"
+ - Finance: "Analyze market data, forecast returns, and suggest investment strategy"
+ - Education: "Review student performance, identify at-risk students, and recommend support programs"
+ - Manufacturing: "Collect quality data, detect issues, and recommend corrective actions"
+
+**Note**: These categories are not mutually exclusive. Some utterances may span multiple categories. The key is to choose the primary category that best describes the utterance's main intent. For domain-specific category examples, see the case studies below.
+
+---
+
+## Deterministic vs. Non-Deterministic Utterances
+
+### Deterministic (`deterministic: true`)
+- Has a **single, verifiable correct answer**
+- Response can be validated against ground truth
+- Typically involves data retrieval or well-defined calculations
+
+### Non-Deterministic (`deterministic: false`)
+- May have **multiple valid answers** or interpretations
+- Response depends on model reasoning, domain knowledge, or heuristics
+- Typically involves recommendations, predictions, or complex analysis
+
+---
+
+## Characteristic Form Guidelines
+
+The `characteristic_form` field describes the expected response structure and validation criteria.
+
+### For Deterministic Utterances
+
+1. **Exact Expected Output**: Specify the precise data or format
+2. **Validation Criteria**: How to verify correctness
+3. **Data Format**: Specify file types, data structures
+
+### For Non-Deterministic Utterances
+
+1. **Response Constraints**: What must be included
+2. **Quality Criteria**: What makes a good response
+3. **Acceptable Variations**: Note when multiple approaches are valid
+
+### General Guidelines
+
+- **Be Specific**: Clearly define what constitutes a correct or acceptable response
+- **Include Context**: Mention relevant constraints (time ranges, entity IDs, data sources)
+- **Specify Format**: Indicate if response should be text, file reference, JSON, list, etc.
+- **Note Dependencies**: Mention if response depends on available data or models
+- **Validation Steps**: Describe how to verify the response meets requirements
+
+---
+
+## Domain-Agnostic Utterance Templates
+
+### Template 1: Historical Data Retrieval
+```
+"Retrieve {METRIC} for {ENTITY} at {LOCATION} from {TIME_RANGE}"
+```
+
+### Template 2: Knowledge Lookup
+```
+"List all {ATTRIBUTE} of {ENTITY_CLASS}"
+```
+
+### Template 3: Predictive Analysis
+```
+"Forecast {METRIC} for {ENTITY} for {FUTURE_RANGE} based on data from {PAST_RANGE}"
+```
+
+### Template 4: Anomaly Detection
+```
+"Is there any anomaly detected in {ENTITY}'s {METRIC} in {TIME_RANGE} at {LOCATION}?"
+```
+
+### Template 5: Recommendation
+```
+"When {CONDITION} happens for {ENTITY}, recommend {ACTION_TYPE}"
+```
+
+### Template 6: Multi-Factor Decision Support
+```
+"Review {DATA_SOURCES} for {ENTITY} during {PERIOD} and {recommend/analyze/prioritize} {ACTION}"
+```
+
+---
+
+## Utterance Set Completeness
+
+### Definition
+**Completeness** refers to the degree to which a set of utterances covers the full scope of requirements for a specific domain. A complete utterance set should address all critical operational scenarios, decision points, and information needs within the domain.
+
+### Coverage Dimensions
+
+**Operational Lifecycle Coverage**
+- [ ] Normal operations monitoring
+- [ ] Degradation detection and tracking
+- [ ] Event prediction and diagnosis
+- [ ] Action planning and execution
+- [ ] Performance optimization
+- [ ] Compliance and reporting
+
+**Problem Category Balance**
+- [ ] Retrospective queries (historical analysis, knowledge lookup)
+- [ ] Predictive queries (forecasting, risk assessment)
+- [ ] Prescriptive queries (recommendations, optimization)
+
+**Entity/System Coverage**
+- [ ] All critical entity types in the domain
+- [ ] Key subsystems and components
+- [ ] Cross-entity dependencies and interactions
+
+**Stakeholder Perspective Coverage**
+- [ ] Operations team needs
+- [ ] Management team needs
+- [ ] Engineering/Analysis team needs
+- [ ] End-user needs
+
+### Achieving Completeness
+
+To ensure comprehensive coverage, consider mapping domain requirements to the three core categories (Retrospective, Predictive, Prescriptive) and identifying critical operational scenarios. Create a coverage matrix to highlight gaps, then iteratively refine the utterance set by starting with high-priority scenarios, testing with domain experts, and expanding based on actual usage patterns and user feedback.
+
+---
+
+## Utterance Specificity and Incremental Development
+
+### Collaborative Development with Subject Matter Experts
+
+**Subject Matter Experts (SMEs)** are domain specialists with deep knowledge of specific equipment, processes, or operational contexts. Rather than requiring one SME to create all utterances at once, the guideline supports **incremental, collaborative development**.
+
+**Suggested Approach**:
+- **Divide by expertise**: Different SMEs can contribute utterances for their specific areas (equipment types, operational functions, or problem categories)
+- **Start small**: Begin with 5-10 critical, high-frequency queries per SME
+- **Iterate and expand**: Add edge cases, test with users, refine based on feedback, and continuously improve
+- **Vary specificity**: Use generic templates for reusability, domain-specific utterances for tailored needs, and scenario-specific queries for deep expertise
+- **Check data availability**: Create utterances only for scenarios where data actually exists
+
+**Prioritization**: Focus on critical operations, high-frequency needs, and high-impact decisions first. Quality and relevance matter more than quantity.
+
+---
+
+## Validation Checklist
+
+Before finalizing an utterance, verify:
+
+- [ ] **Clarity**: Is the request unambiguous?
+- [ ] **Completeness**: Are all necessary parameters specified?
+- [ ] **Category**: Is the category correctly assigned?
+- [ ] **Deterministic Flag**: Is the `deterministic` field accurately set?
+- [ ] **Characteristic Form**: Does it clearly describe expected output?
+- [ ] **Domain Relevance**: Does it align with real-world use cases?
+- [ ] **Testability**: Can the response be validated or evaluated?
+- [ ] **Consistency**: Does it follow established patterns and terminology?
+- [ ] **Domain-Neutrality**: Can it be adapted to other domains using placeholders?
+- [ ] **SME Validation**: Has a domain expert reviewed and approved this utterance?
+- [ ] **Coverage Contribution**: Does this utterance fill a gap in the coverage matrix?
+
+---
+
+## Summary
+
+Effective domain-agnostic utterance design requires:
+
+### Framework Fundamentals
+1. **Universal categorization** into Retrospective, Predictive, or Prescriptive
+2. **Consistent abstraction** using domain-agnostic placeholders
+3. **Proper classification** by query type
+4. **Accurate deterministic flag** based on answer uniqueness
+5. **Detailed characteristic form** describing expected responses
+
+### Completeness and Coverage
+6. **Comprehensive coverage** across operational lifecycle and stakeholder needs
+7. **Balanced distribution** across problem categories
+8. **Gap analysis** using coverage matrices
+9. **Iterative refinement** based on feedback
+
+### Collaborative Development
+10. **SME-driven creation** leveraging domain experts
+11. **Incremental development** allowing phased rollout
+12. **Distributed ownership** with clear responsibilities
+13. **Prioritization** focusing on critical scenarios
+14. **Quality over quantity** ensuring practical utility
+
+### Practical Implementation
+15. **Domain-specific adaptation** while maintaining consistency
+16. **Validation criteria** ensuring accuracy
+17. **Documentation standards** including ownership and dependencies
+18. **Continuous improvement** through monitoring and feedback
+
+---
+
+## Getting Started
+
+**For New Domains**:
+1. Identify 3-5 critical entity types or operational scenarios
+2. Engage relevant SMEs for each area
+3. Create 5-10 core utterances per SME covering high-priority needs
+4. Build coverage matrix to track completeness
+5. Iterate and expand based on user feedback
+
+**For Existing Systems**:
+1. Analyze current query patterns and user needs
+2. Map existing queries to the three core categories
+3. Identify gaps using the completeness assessment methods
+4. Prioritize gap-filling based on business impact
+5. Engage SMEs to create missing utterances incrementally
+
+---
+
+## Case Studies
+
+For detailed examples of applying this guideline to specific domains, see:
+
+### 1. Industrial Asset Management Case Study
+**File**: [case_study_industrial_asset_management.md](case_study_industrial_asset_management.md)
+**Domain**: Building automation and HVAC systems
+**Focus**: Chiller and AHU equipment monitoring, predictive maintenance
+**Dataset**: 152 utterances evolved incrementally with SME input
+**Status**: Comprehensive production example showing real-world development process
+
+**Key Highlights**:
+- Demonstrates incremental development with domain experts
+- Shows evolution from initial requirements to comprehensive coverage
+- Heavily weighted toward retrospective queries (77/152) reflecting actual usage patterns
+- Includes detailed coverage analysis and SME contribution tracking
+
+### 2. Wind Turbine Operations Case Study
+**File**: [case_study_wind_turbine.md](case_study_wind_turbine.md)
+**Domain**: Renewable energy and wind power generation
+**Focus**: Wind turbine monitoring, power forecasting, maintenance optimization
+**Dataset**: 30 utterances designed complete from the start
+**Status**: Complete demonstration example showing balanced framework application
+
+**Key Highlights**:
+- Demonstrates clean-slate application of the guideline to a new domain
+- Perfectly balanced distribution (10 retrospective, 10 predictive, 10 prescriptive)
+- Includes rigorous completeness framework with validation criteria
+- Shows how to create a complete case study efficiently using the guideline
+
+**Comparison**: The two case studies complement each other: Industrial Asset Management shows organic growth in a real project, while Wind Turbine demonstrates planned, balanced design from inception. Both validate the guideline's cross-domain applicability.
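Several items in the guideline's validation checklist (category group, deterministic flag, characteristic form) and its "Problem Category Balance" dimension can be checked mechanically. The following is a minimal sketch, assuming records shaped like the utterance examples in this change; the names `validate_utterance` and `group_balance` are hypothetical, and treating all nine fields as required is an assumption rather than something the guideline states.

```python
from collections import Counter

# Assumption: every field seen in the example records is treated as required.
REQUIRED_FIELDS = ("id", "text", "type", "category", "deterministic",
                   "characteristic_form", "group", "entity", "note")
# The three core groups named in the guideline.
GROUPS = {"retrospective", "predictive", "prescriptive"}


def validate_utterance(record):
    """Return a list of problems found in one utterance record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    if "group" in record and record["group"] not in GROUPS:
        problems.append(f"unknown group: {record['group']!r}")
    if "deterministic" in record and not isinstance(record["deterministic"], bool):
        problems.append("deterministic must be true or false")
    if "characteristic_form" in record and not str(record["characteristic_form"]).strip():
        problems.append("characteristic_form is empty")
    return problems


def group_balance(records):
    """Count utterances per group, a one-dimensional slice of a coverage matrix."""
    return Counter(record.get("group") for record in records)


# Record 1 from wind_turbine_utterances.json, used as a known-good example.
record = {
    "id": 1,
    "text": "What wind turbines are available in the wind farm?",
    "type": "IoT",
    "category": "Information Retrieval",
    "deterministic": True,
    "characteristic_form": "List of all wind turbine assets in the farm with their identifiers and locations",
    "group": "retrospective",
    "entity": "WindTurbine",
    "note": "Source: Asset inventory requirements; Owner: Operations Team; Basic inventory query for wind farm overview",
}
print(validate_utterance(record))  # prints []
print(group_balance([record]))
```

The check does not constrain `entity`, since the examples use both a single string and a list of strings for that field. Judgement-based checklist items such as clarity, domain relevance, and SME validation still need a human reviewer.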
diff --git a/docs/guideline/wind_turbine/wind_turbine_utterances.json b/docs/guideline/wind_turbine/wind_turbine_utterances.json
new file mode 100644
index 00000000..5aff94c1
--- /dev/null
+++ b/docs/guideline/wind_turbine/wind_turbine_utterances.json
@@ -0,0 +1,332 @@
+[
+ {
+ "id": 1,
+ "text": "What wind turbines are available in the wind farm?",
+ "type": "IoT",
+ "category": "Information Retrieval",
+ "deterministic": true,
+ "characteristic_form": "List of all wind turbine assets in the farm with their identifiers and locations",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Asset inventory requirements; Owner: Operations Team; Basic inventory query for wind farm overview"
+ },
+ {
+ "id": 2,
+ "text": "List all sensors installed on Wind Turbine WT-105.",
+ "type": "IoT",
+ "category": "Information Retrieval",
+ "deterministic": true,
+ "characteristic_form": "Complete list of sensors including vibration, temperature, wind speed, power output, pitch angle, and yaw position sensors",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Maintenance planning; Owner: Engineering Team; Sensor configuration query for specific turbine"
+ },
+ {
+ "id": 3,
+ "text": "Retrieve power output data for Wind Turbine WT-105 from January 2024.",
+ "type": "IoT",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Time series data of power output in kW for the specified turbine and time period",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Performance analysis; Owner: Analytics Team; Historical power generation data retrieval"
+ },
+ {
+ "id": 4,
+ "text": "Get the maintenance history for Wind Turbine WT-105 in 2023.",
+ "type": "Maintenance",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "List of all maintenance activities including preventive and corrective maintenance with dates and descriptions",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Maintenance records; Owner: Maintenance Team; Historical maintenance activity query"
+ },
+ {
+ "id": 5,
+ "text": "What are the common failure modes for wind turbine gearboxes?",
+ "type": "FMEA",
+ "category": "Information Retrieval",
+ "deterministic": true,
+ "characteristic_form": "List of gearbox failure modes including bearing wear, gear tooth damage, lubrication issues, and overheating",
+ "group": "retrospective",
+ "entity": "Gearbox",
+ "note": "Source: Failure mode analysis; Owner: Reliability Engineering; Knowledge base query for failure modes"
+ },
+ {
+ "id": 6,
+ "text": "Show me the wind speed and power output correlation for Wind Turbine WT-105 last month.",
+ "type": "Analytics",
+ "category": "Analysis & Inference",
+ "deterministic": true,
+ "characteristic_form": "Scatter plot or correlation analysis showing relationship between wind speed and power output",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Performance analysis; Owner: Analytics Team; Performance curve validation query"
+ },
+ {
+ "id": 7,
+ "text": "List all alarm events for Wind Turbine WT-105 in the past week.",
+ "type": "IoT",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Chronological list of alarm events with timestamps, alarm codes, and descriptions",
+ "group": "retrospective",
+ "entity": "WindTurbine",
+ "note": "Source: Operations monitoring; Owner: Operations Team; Recent alarm history query"
+ },
+ {
+ "id": 8,
+ "text": "What is the average capacity factor for the wind farm in Q1 2024?",
+ "type": "Analytics",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Calculated capacity factor percentage based on actual vs. rated power output",
+ "group": "retrospective",
+ "entity": "WindFarm",
+ "note": "Source: Performance reporting; Owner: Management Team; Quarterly performance metric query"
+ },
+ {
+ "id": 9,
+ "text": "Retrieve vibration data from the gearbox of Wind Turbine WT-105 for the last 24 hours.",
+ "type": "IoT",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Time series vibration data in mm/s or g-force from gearbox accelerometers",
+ "group": "retrospective",
+ "entity": "Gearbox",
+ "note": "Source: Condition monitoring; Owner: Predictive Maintenance Team; Recent vibration data for analysis"
+ },
+ {
+ "id": 10,
+ "text": "What was the total energy production of the wind farm yesterday?",
+ "type": "Analytics",
+ "category": "Data Extraction",
+ "deterministic": true,
+ "characteristic_form": "Total energy in MWh produced by all turbines in the farm for the previous day",
+ "group": "retrospective",
+ "entity": "WindFarm",
+ "note": "Source: Daily operations report; Owner: Operations Team; Daily energy production summary"
+ },
+ {
+ "id": 11,
+ "text": "Forecast power output for Wind Turbine WT-105 for the next 48 hours.",
+ "type": "Forecasting",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Predicted power output time series based on weather forecast and historical performance",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Energy trading requirements; Owner: Energy Management Team; Short-term power forecasting for grid planning"
+ },
+ {
+ "id": 12,
+ "text": "Predict the remaining useful life of the gearbox in Wind Turbine WT-105.",
+ "type": "Prognostics",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Estimated remaining operational hours or days before gearbox replacement needed",
+ "group": "predictive",
+ "entity": "Gearbox",
+ "note": "Source: Predictive maintenance initiative; Owner: Reliability Engineering; RUL estimation for maintenance planning"
+ },
+ {
+ "id": 13,
+ "text": "Is there any anomaly detected in the vibration pattern of Wind Turbine WT-105?",
+ "type": "AnomalyDetection",
+ "category": "Anomaly & Exception Detection",
+ "deterministic": false,
+ "characteristic_form": "Boolean result with anomaly score and description of detected abnormal vibration patterns",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Condition monitoring system; Owner: Predictive Maintenance Team; Real-time anomaly detection query"
+ },
+ {
+ "id": 14,
+ "text": "Predict the probability of gearbox failure for Wind Turbine WT-105 in the next 30 days.",
+ "type": "Prognostics",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Failure probability percentage with confidence interval based on condition indicators",
+ "group": "predictive",
+ "entity": "Gearbox",
+ "note": "Source: Risk assessment; Owner: Reliability Engineering; Failure probability estimation for risk management"
+ },
+ {
+ "id": 15,
+ "text": "Forecast wind speed at the wind farm location for the next 7 days.",
+ "type": "Forecasting",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Hourly wind speed forecast in m/s based on meteorological models",
+ "group": "predictive",
+ "entity": "WindFarm",
+ "note": "Source: Weather service integration; Owner: Operations Team; Medium-term wind forecast for production planning"
+ },
+ {
+ "id": 16,
+ "text": "Detect any abnormal temperature rise in the generator of Wind Turbine WT-105.",
+ "type": "AnomalyDetection",
+ "category": "Anomaly & Exception Detection",
+ "deterministic": false,
+ "characteristic_form": "Alert if temperature exceeds normal operating range with severity level",
+ "group": "predictive",
+ "entity": "Generator",
+ "note": "Source: Thermal monitoring; Owner: Operations Team; Temperature anomaly detection for generator protection"
+ },
+ {
+ "id": 17,
+ "text": "Predict the next maintenance event for Wind Turbine WT-105.",
+ "type": "Prognostics",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Predicted maintenance type and estimated date based on condition trends",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Maintenance optimization; Owner: Maintenance Team; Next maintenance event prediction"
+ },
+ {
+ "id": 18,
+ "text": "Forecast the monthly energy production for the wind farm for Q2 2024.",
+ "type": "Forecasting",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Monthly energy production forecast in MWh for April, May, and June 2024",
+ "group": "predictive",
+ "entity": "WindFarm",
+ "note": "Source: Business planning; Owner: Management Team; Quarterly production forecast for revenue planning"
+ },
+ {
+ "id": 19,
+ "text": "Identify any degradation trend in the power curve of Wind Turbine WT-105.",
+ "type": "Analytics",
+ "category": "Analysis & Inference",
+ "deterministic": false,
+ "characteristic_form": "Trend analysis showing power curve degradation over time with degradation rate",
+ "group": "predictive",
+ "entity": "WindTurbine",
+ "note": "Source: Performance monitoring; Owner: Analytics Team; Power curve degradation analysis for performance optimization"
+ },
+ {
+ "id": 20,
+ "text": "Predict ice formation risk on the blades of Wind Turbine WT-105 for tomorrow.",
+ "type": "Forecasting",
+ "category": "Future State Prediction",
+ "deterministic": false,
+ "characteristic_form": "Ice formation probability based on temperature, humidity, and precipitation forecast",
+ "group": "predictive",
+ "entity": "Blade",
+ "note": "Source: Weather-based risk assessment; Owner: Operations Team; Ice formation prediction for cold climate operations"
+ },
+ {
+ "id": 21,
+ "text": "Recommend maintenance actions for Wind Turbine WT-105 based on current condition.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Prioritized list of recommended maintenance actions with justification and urgency level",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Condition-based maintenance; Owner: Maintenance Team; Maintenance recommendation based on condition indicators"
+ },
+ {
+ "id": 22,
+ "text": "Optimize the pitch angle for Wind Turbine WT-105 to maximize power output at current wind speed.",
+ "type": "Optimization",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Optimal pitch angle in degrees for current wind conditions to maximize power capture",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Performance optimization; Owner: Control Systems Team; Real-time pitch optimization for power maximization"
+ },
+ {
+ "id": 23,
+ "text": "Which wind turbines should be prioritized for maintenance this month?",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Ranked list of turbines with maintenance priority scores based on condition and risk",
+ "group": "prescriptive",
+ "entity": "WindFarm",
+ "note": "Source: Maintenance planning; Owner: Maintenance Manager; Monthly maintenance prioritization for resource allocation"
+ },
+ {
+ "id": 24,
+ "text": "Recommend the optimal time to schedule maintenance for Wind Turbine WT-105 to minimize production loss.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Recommended maintenance window with expected wind conditions and production impact",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Maintenance scheduling optimization; Owner: Operations Team; Optimal maintenance timing to minimize downtime cost"
+ },
+ {
+ "id": 25,
+ "text": "Should Wind Turbine WT-105 be curtailed based on grid demand and current wind conditions?",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Curtailment recommendation with reasoning based on grid requirements and economic factors",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Grid integration; Owner: Energy Management Team; Curtailment decision support for grid stability"
+ },
+ {
+ "id": 26,
+ "text": "Recommend spare parts to stock based on failure predictions for the wind farm.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "List of recommended spare parts with quantities based on failure probability and lead time",
+ "group": "prescriptive",
+ "entity": "WindFarm",
+ "note": "Source: Inventory optimization; Owner: Supply Chain Team; Spare parts recommendation for inventory management"
+ },
+ {
+ "id": 27,
+ "text": "Optimize the yaw angle for Wind Turbine WT-105 to align with wind direction.",
+ "type": "Optimization",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Optimal yaw angle adjustment to maximize wind capture efficiency",
+ "group": "prescriptive",
+ "entity": "WindTurbine",
+ "note": "Source: Control optimization; Owner: Control Systems Team; Yaw optimization for wind alignment"
+ },
+ {
+ "id": 28,
+ "text": "Recommend a maintenance strategy for the gearbox of Wind Turbine WT-105 based on its condition trend.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Recommended maintenance strategy (preventive, predictive, or run-to-failure) with justification",
+ "group": "prescriptive",
+ "entity": "Gearbox",
+ "note": "Source: Maintenance strategy optimization; Owner: Reliability Engineering; Strategy recommendation based on condition monitoring"
+ },
+ {
+ "id": 29,
+ "text": "Which turbines should be shut down during the approaching storm to minimize damage risk?",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "List of turbines recommended for shutdown with risk assessment and timing",
+ "group": "prescriptive",
+ "entity": "WindFarm",
+ "note": "Source: Weather risk management; Owner: Operations Team; Storm protection decision support"
+ },
+ {
+ "id": 30,
+ "text": "Recommend the optimal replacement schedule for aging wind turbine components across the farm.",
+ "type": "DecisionSupport",
+ "category": "Recommendation & Optimization",
+ "deterministic": false,
+ "characteristic_form": "Multi-year component replacement schedule optimized for cost and reliability",
+ "group": "prescriptive",
+ "entity": "WindFarm",
+ "note": "Source: Asset lifecycle management; Owner: Asset Manager; Long-term replacement planning for fleet optimization"
+ }
+]
\ No newline at end of file
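The `.jsonl` file that follows appears to carry the same 30 records as the array above, written one compact object per line. A minimal sketch of that conversion, assuming the array is available as text; the function name `json_array_to_jsonl` is hypothetical, and the change itself does not say how the `.jsonl` file was produced.

```python
import json


def json_array_to_jsonl(array_text):
    """Convert the text of a JSON array into JSONL: one compact object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # separators=(",", ":") removes the spaces after commas and colons,
    # matching the compact style of the .jsonl lines; key order is preserved.
    return "\n".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    )


example = '[{"id": 1, "deterministic": true}, {"id": 2, "deterministic": false}]'
print(json_array_to_jsonl(example))
```

Keeping both files generated from one source avoids the two copies drifting apart when an utterance is edited.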
diff --git a/docs/guideline/wind_turbine/wind_turbine_utterances.jsonl b/docs/guideline/wind_turbine/wind_turbine_utterances.jsonl
new file mode 100644
index 00000000..c8e9b882
--- /dev/null
+++ b/docs/guideline/wind_turbine/wind_turbine_utterances.jsonl
@@ -0,0 +1,30 @@
+{"id":1,"text":"What wind turbines are available in the wind farm?","type":"IoT","category":"Information Retrieval","deterministic":true,"characteristic_form":"List of all wind turbine assets in the farm with their identifiers and locations","group":"retrospective","entity":"WindTurbine","note":"Source: Asset inventory requirements; Owner: Operations Team; Basic inventory query for wind farm overview"}
+{"id":2,"text":"List all sensors installed on Wind Turbine WT-105.","type":"IoT","category":"Information Retrieval","deterministic":true,"characteristic_form":"Complete list of sensors including vibration, temperature, wind speed, power output, pitch angle, and yaw position sensors","group":"retrospective","entity":"WindTurbine","note":"Source: Maintenance planning; Owner: Engineering Team; Sensor configuration query for specific turbine"}
+{"id":3,"text":"Retrieve power output data for Wind Turbine WT-105 from January 2024.","type":"IoT","category":"Data Extraction","deterministic":true,"characteristic_form":"Time series data of power output in kW for the specified turbine and time period","group":"retrospective","entity":"WindTurbine","note":"Source: Performance analysis; Owner: Analytics Team; Historical power generation data retrieval"}
+{"id":4,"text":"Get the maintenance history for Wind Turbine WT-105 in 2023.","type":"Maintenance","category":"Data Extraction","deterministic":true,"characteristic_form":"List of all maintenance activities including preventive and corrective maintenance with dates and descriptions","group":"retrospective","entity":"WindTurbine","note":"Source: Maintenance records; Owner: Maintenance Team; Historical maintenance activity query"}
+{"id":5,"text":"What are the common failure modes for wind turbine gearboxes?","type":"FMEA","category":"Information Retrieval","deterministic":true,"characteristic_form":"List of gearbox failure modes including bearing wear, gear tooth damage, lubrication issues, and overheating","group":"retrospective","entity":"Gearbox","note":"Source: Failure mode analysis; Owner: Reliability Engineering; Knowledge base query for failure modes"}
+{"id":6,"text":"Show me the wind speed and power output correlation for Wind Turbine WT-105 last month.","type":"Analytics","category":"Analysis & Inference","deterministic":true,"characteristic_form":"Scatter plot or correlation analysis showing relationship between wind speed and power output","group":"retrospective","entity":"WindTurbine","note":"Source: Performance analysis; Owner: Analytics Team; Performance curve validation query"}
+{"id":7,"text":"List all alarm events for Wind Turbine WT-105 in the past week.","type":"IoT","category":"Data Extraction","deterministic":true,"characteristic_form":"Chronological list of alarm events with timestamps, alarm codes, and descriptions","group":"retrospective","entity":"WindTurbine","note":"Source: Operations monitoring; Owner: Operations Team; Recent alarm history query"}
+{"id":8,"text":"What is the average capacity factor for the wind farm in Q1 2024?","type":"Analytics","category":"Data Extraction","deterministic":true,"characteristic_form":"Calculated capacity factor percentage based on actual vs. rated power output","group":"retrospective","entity":"WindFarm","note":"Source: Performance reporting; Owner: Management Team; Quarterly performance metric query"}
+{"id":9,"text":"Retrieve vibration data from the gearbox of Wind Turbine WT-105 for the last 24 hours.","type":"IoT","category":"Data Extraction","deterministic":true,"characteristic_form":"Time series vibration data in mm/s or g-force from gearbox accelerometers","group":"retrospective","entity":"Gearbox","note":"Source: Condition monitoring; Owner: Predictive Maintenance Team; Recent vibration data for analysis"}
+{"id":10,"text":"What was the total energy production of the wind farm yesterday?","type":"Analytics","category":"Data Extraction","deterministic":true,"characteristic_form":"Total energy in MWh produced by all turbines in the farm for the previous day","group":"retrospective","entity":"WindFarm","note":"Source: Daily operations report; Owner: Operations Team; Daily energy production summary"}
+{"id":11,"text":"Forecast power output for Wind Turbine WT-105 for the next 48 hours.","type":"Forecasting","category":"Future State Prediction","deterministic":false,"characteristic_form":"Predicted power output time series based on weather forecast and historical performance","group":"predictive","entity":"WindTurbine","note":"Source: Energy trading requirements; Owner: Energy Management Team; Short-term power forecasting for grid planning"}
+{"id":12,"text":"Predict the remaining useful life of the gearbox in Wind Turbine WT-105.","type":"Prognostics","category":"Future State Prediction","deterministic":false,"characteristic_form":"Estimated remaining operational hours or days before gearbox replacement needed","group":"predictive","entity":"Gearbox","note":"Source: Predictive maintenance initiative; Owner: Reliability Engineering; RUL estimation for maintenance planning"}
+{"id":13,"text":"Is there any anomaly detected in the vibration pattern of Wind Turbine WT-105?","type":"AnomalyDetection","category":"Anomaly & Exception Detection","deterministic":false,"characteristic_form":"Boolean result with anomaly score and description of detected abnormal vibration patterns","group":"predictive","entity":"WindTurbine","note":"Source: Condition monitoring system; Owner: Predictive Maintenance Team; Real-time anomaly detection query"}
+{"id":14,"text":"Predict the probability of gearbox failure for Wind Turbine WT-105 in the next 30 days.","type":"Prognostics","category":"Future State Prediction","deterministic":false,"characteristic_form":"Failure probability percentage with confidence interval based on condition indicators","group":"predictive","entity":"Gearbox","note":"Source: Risk assessment; Owner: Reliability Engineering; Failure probability estimation for risk management"}
+{"id":15,"text":"Forecast wind speed at the wind farm location for the next 7 days.","type":"Forecasting","category":"Future State Prediction","deterministic":false,"characteristic_form":"Hourly wind speed forecast in m/s based on meteorological models","group":"predictive","entity":"WindFarm","note":"Source: Weather service integration; Owner: Operations Team; Medium-term wind forecast for production planning"}
+{"id":16,"text":"Detect any abnormal temperature rise in the generator of Wind Turbine WT-105.","type":"AnomalyDetection","category":"Anomaly & Exception Detection","deterministic":false,"characteristic_form":"Alert if temperature exceeds normal operating range with severity level","group":"predictive","entity":"Generator","note":"Source: Thermal monitoring; Owner: Operations Team; Temperature anomaly detection for generator protection"}
+{"id":17,"text":"Predict the next maintenance event for Wind Turbine WT-105.","type":"Prognostics","category":"Future State Prediction","deterministic":false,"characteristic_form":"Predicted maintenance type and estimated date based on condition trends","group":"predictive","entity":"WindTurbine","note":"Source: Maintenance optimization; Owner: Maintenance Team; Next maintenance event prediction"}
+{"id":18,"text":"Forecast the monthly energy production for the wind farm for Q2 2024.","type":"Forecasting","category":"Future State Prediction","deterministic":false,"characteristic_form":"Monthly energy production forecast in MWh for April, May, and June 2024","group":"predictive","entity":"WindFarm","note":"Source: Business planning; Owner: Management Team; Quarterly production forecast for revenue planning"}
+{"id":19,"text":"Identify any degradation trend in the power curve of Wind Turbine WT-105.","type":"Analytics","category":"Analysis & Inference","deterministic":false,"characteristic_form":"Trend analysis showing power curve degradation over time with degradation rate","group":"predictive","entity":"WindTurbine","note":"Source: Performance monitoring; Owner: Analytics Team; Power curve degradation analysis for performance optimization"}
+{"id":20,"text":"Predict ice formation risk on the blades of Wind Turbine WT-105 for tomorrow.","type":"Forecasting","category":"Future State Prediction","deterministic":false,"characteristic_form":"Ice formation probability based on temperature, humidity, and precipitation forecast","group":"predictive","entity":"Blade","note":"Source: Weather-based risk assessment; Owner: Operations Team; Ice formation prediction for cold climate operations"}
+{"id":21,"text":"Recommend maintenance actions for Wind Turbine WT-105 based on current condition.","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Prioritized list of recommended maintenance actions with justification and urgency level","group":"prescriptive","entity":"WindTurbine","note":"Source: Condition-based maintenance; Owner: Maintenance Team; Maintenance recommendation based on condition indicators"}
+{"id":22,"text":"Optimize the pitch angle for Wind Turbine WT-105 to maximize power output at current wind speed.","type":"Optimization","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Optimal pitch angle in degrees for current wind conditions to maximize power capture","group":"prescriptive","entity":"WindTurbine","note":"Source: Performance optimization; Owner: Control Systems Team; Real-time pitch optimization for power maximization"}
+{"id":23,"text":"Which wind turbines should be prioritized for maintenance this month?","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Ranked list of turbines with maintenance priority scores based on condition and risk","group":"prescriptive","entity":"WindFarm","note":"Source: Maintenance planning; Owner: Maintenance Manager; Monthly maintenance prioritization for resource allocation"}
+{"id":24,"text":"Recommend the optimal time to schedule maintenance for Wind Turbine WT-105 to minimize production loss.","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Recommended maintenance window with expected wind conditions and production impact","group":"prescriptive","entity":"WindTurbine","note":"Source: Maintenance scheduling optimization; Owner: Operations Team; Optimal maintenance timing to minimize downtime cost"}
+{"id":25,"text":"Should Wind Turbine WT-105 be curtailed based on grid demand and current wind conditions?","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Curtailment recommendation with reasoning based on grid requirements and economic factors","group":"prescriptive","entity":"WindTurbine","note":"Source: Grid integration; Owner: Energy Management Team; Curtailment decision support for grid stability"}
+{"id":26,"text":"Recommend spare parts to stock based on failure predictions for the wind farm.","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"List of recommended spare parts with quantities based on failure probability and lead time","group":"prescriptive","entity":"WindFarm","note":"Source: Inventory optimization; Owner: Supply Chain Team; Spare parts recommendation for inventory management"}
+{"id":27,"text":"Optimize the yaw angle for Wind Turbine WT-105 to align with wind direction.","type":"Optimization","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Optimal yaw angle adjustment to maximize wind capture efficiency","group":"prescriptive","entity":"WindTurbine","note":"Source: Control optimization; Owner: Control Systems Team; Yaw optimization for wind alignment"}
+{"id":28,"text":"Recommend a maintenance strategy for the gearbox of Wind Turbine WT-105 based on its condition trend.","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Recommended maintenance strategy (preventive, predictive, or run-to-failure) with justification","group":"prescriptive","entity":"Gearbox","note":"Source: Maintenance strategy optimization; Owner: Reliability Engineering; Strategy recommendation based on condition monitoring"}
+{"id":29,"text":"Which turbines should be shut down during the approaching storm to minimize damage risk?","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"List of turbines recommended for shutdown with risk assessment and timing","group":"prescriptive","entity":"WindFarm","note":"Source: Weather risk management; Owner: Operations Team; Storm protection decision support"}
+{"id":30,"text":"Recommend the optimal replacement schedule for aging wind turbine components across the farm.","type":"DecisionSupport","category":"Recommendation & Optimization","deterministic":false,"characteristic_form":"Multi-year component replacement schedule optimized for cost and reliability","group":"prescriptive","entity":"WindFarm","note":"Source: Asset lifecycle management; Owner: Asset Manager; Long-term replacement planning for fleet optimization"}
diff --git a/pyproject.toml b/pyproject.toml
index 00ae3929..990dcdb8 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -3,7 +3,7 @@ requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
-packages = ["src/workflow", "src/llm"]
+packages = ["src/agent", "src/llm"]
[project]
name = "assetopsbench-mcp"
@@ -21,13 +21,15 @@ dependencies = [
"numpy>=1.24",
"pandas>=2.0",
"pyyaml>=6.0",
- "litellm>=1.0",
+ "litellm==1.81.13",
+ "claude-agent-sdk>=0.0.14",
"python-dotenv>=1.0",
"scipy>=1.10.0",
]
[project.scripts]
-plan-execute = "workflow.cli:main"
+plan-execute = "agent.cli:main"
+claude-agent = "agent.claude_agent.cli:main"
iot-mcp-server = "servers.iot.main:main"
utilities-mcp-server = "servers.utilities.main:main"
fmsr-mcp-server = "servers.fmsr.main:main"
diff --git a/src/agent/__init__.py b/src/agent/__init__.py
new file mode 100644
index 00000000..a082c898
--- /dev/null
+++ b/src/agent/__init__.py
@@ -0,0 +1,18 @@
+"""MCP plan-execute orchestration package."""
+
+from .runner import AgentRunner
+from .models import AgentResult
+from .plan_execute.runner import PlanExecuteRunner
+from .plan_execute.models import OrchestratorResult, Plan, PlanStep, StepResult
+from .claude_agent.runner import ClaudeAgentRunner
+
+__all__ = [
+ "AgentRunner",
+ "AgentResult",
+ "PlanExecuteRunner",
+ "OrchestratorResult",
+ "Plan",
+ "PlanStep",
+ "StepResult",
+ "ClaudeAgentRunner",
+]
diff --git a/src/agent/claude_agent/__init__.py b/src/agent/claude_agent/__init__.py
new file mode 100644
index 00000000..104bd72f
--- /dev/null
+++ b/src/agent/claude_agent/__init__.py
@@ -0,0 +1,6 @@
+"""Claude Agent SDK runner subpackage."""
+
+from .models import ToolCall, Trajectory, TurnRecord
+from .runner import ClaudeAgentRunner
+
+__all__ = ["ClaudeAgentRunner", "Trajectory", "TurnRecord", "ToolCall"]
diff --git a/src/agent/claude_agent/cli.py b/src/agent/claude_agent/cli.py
new file mode 100644
index 00000000..e083e90d
--- /dev/null
+++ b/src/agent/claude_agent/cli.py
@@ -0,0 +1,137 @@
+"""CLI entry point for the ClaudeAgentRunner.
+
+Usage:
+ claude-agent "What sensors are on Chiller 6?"
+ claude-agent --model-id claude-opus-4-6 --max-turns 20 "List failure modes for pumps"
+ claude-agent --show-trajectory "What sensors are on Chiller 6?"
+ claude-agent --json "What is the current time?"
+"""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import dataclasses
+import json
+import logging
+import sys
+
+_DEFAULT_MODEL = "claude-opus-4-6"
+_LOG_FORMAT = "%(asctime)s %(levelname)-8s %(name)s %(message)s"
+_LOG_DATE_FORMAT = "%H:%M:%S"
+_HR = "─" * 60
+
+
+def _build_parser() -> argparse.ArgumentParser:
+ parser = argparse.ArgumentParser(
+ prog="claude-agent",
+ description="Run a question through the Claude Agent SDK with AssetOpsBench MCP servers.",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog=f"""
+environment variables:
+ LITELLM_API_KEY LiteLLM / Anthropic API key (required)
+ LITELLM_BASE_URL LiteLLM proxy URL (required for litellm_proxy/* models)
+
+examples:
+ claude-agent "What assets are at site MAIN?"
+ claude-agent --model-id claude-opus-4-6 --max-turns 20 "List sensors on Chiller 6"
+ claude-agent --model-id litellm_proxy/aws/claude-opus-4-6 "What is the current time?"
+ claude-agent --show-trajectory "What sensors are on Chiller 6?"
+ claude-agent --json "What is the current time?"
+""",
+ )
+ parser.add_argument("question", help="The question to answer.")
+ parser.add_argument(
+ "--model-id",
+ default=_DEFAULT_MODEL,
+ metavar="MODEL_ID",
+ help=f"Claude model ID (default: {_DEFAULT_MODEL}).",
+ )
+ parser.add_argument(
+ "--max-turns",
+ type=int,
+ default=30,
+ metavar="N",
+ help="Maximum agentic loop turns (default: 30).",
+ )
+ parser.add_argument(
+ "--show-trajectory",
+ action="store_true",
+ help="Print each turn's text, tool calls, and token usage.",
+ )
+ parser.add_argument(
+ "--json",
+ action="store_true",
+ dest="output_json",
+ help="Output the execution trajectory as JSON.",
+ )
+ parser.add_argument(
+ "--verbose",
+ action="store_true",
+ help="Show INFO-level logs on stderr.",
+ )
+ return parser
+
+
+def _setup_logging(verbose: bool) -> None:
+ level = logging.INFO if verbose else logging.WARNING
+ handler = logging.StreamHandler(sys.stderr)
+ handler.setFormatter(logging.Formatter(_LOG_FORMAT, datefmt=_LOG_DATE_FORMAT))
+ logging.root.handlers.clear()
+ logging.root.addHandler(handler)
+ logging.root.setLevel(level)
+
+
+def _print_trace(trajectory) -> None:
+ print(f"\n{_HR}")
+ print(" Trace")
+ print(_HR)
+ for turn in trajectory.turns:
+ print(f"\n [Turn {turn.index}] "
+ f"in={turn.input_tokens} out={turn.output_tokens} tokens")
+ if turn.text:
+ snippet = turn.text[:200] + ("..." if len(turn.text) > 200 else "")
+ print(f" text: {snippet}")
+ for tc in turn.tool_calls:
+ print(f" tool: {tc.name} input: {tc.input}")
+ if tc.output is not None:
+ out_str = str(tc.output)
+ snippet = out_str[:200] + ("..." if len(out_str) > 200 else "")
+ print(f" output: {snippet}")
+ print(f"\n Total: {trajectory.total_input_tokens} input / "
+ f"{trajectory.total_output_tokens} output tokens "
+ f"({len(trajectory.turns)} turns, "
+ f"{len(trajectory.all_tool_calls)} tool calls)")
+
+
+async def _run(args: argparse.Namespace) -> None:
+ from agent.claude_agent.runner import ClaudeAgentRunner
+
+ runner = ClaudeAgentRunner(model=args.model_id, max_turns=args.max_turns)
+ result = await runner.run(args.question)
+
+ if args.output_json:
+ print(json.dumps(dataclasses.asdict(result.trajectory), indent=2))
+ return
+
+ if args.show_trajectory:
+ _print_trace(result.trajectory)
+
+ print(f"\n{_HR}")
+ print(" Answer")
+ print(_HR)
+ print(result.answer)
+ print()
+
+
+def main() -> None:
+ from dotenv import load_dotenv
+
+ load_dotenv()
+ args = _build_parser().parse_args()
+ _setup_logging(args.verbose)
+ asyncio.run(_run(args))
+
+
+if __name__ == "__main__":
+ main()
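As a quick check on how the flags defined in `_build_parser` map onto namespace attributes, the sketch below rebuilds a reduced parser with the same option names and defaults. It is a standalone approximation for illustration, not the module's parser: help text, the epilog, and `--verbose` are omitted.

```python
import argparse

# Reduced copy of the claude-agent options; names and defaults follow the diff above.
parser = argparse.ArgumentParser(prog="claude-agent")
parser.add_argument("question")
parser.add_argument("--model-id", default="claude-opus-4-6")
parser.add_argument("--max-turns", type=int, default=30)
parser.add_argument("--show-trajectory", action="store_true")
# dest renames the attribute so it does not shadow the json module in calling code.
parser.add_argument("--json", action="store_true", dest="output_json")

args = parser.parse_args(["--max-turns", "5", "--json", "What is the current time?"])
print(args.model_id, args.max_turns, args.show_trajectory, args.output_json)
```

Hyphenated option names become underscored attributes (`--show-trajectory` is read as `args.show_trajectory`), which is why `_run` checks `args.show_trajectory` and `args.output_json`.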
diff --git a/src/agent/claude_agent/models.py b/src/agent/claude_agent/models.py
new file mode 100644
index 00000000..b7e6b30a
--- /dev/null
+++ b/src/agent/claude_agent/models.py
@@ -0,0 +1,45 @@
+"""Trajectory data models for ClaudeAgentRunner."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ToolCall:
+ """A single tool invocation made by the agent."""
+
+ name: str
+ input: dict
+ id: str = ""
+ output: object = None
+
+
+@dataclass
+class TurnRecord:
+ """One assistant turn: text output, tool calls, and token usage."""
+
+ index: int
+ text: str
+ tool_calls: list[ToolCall] = field(default_factory=list)
+ input_tokens: int = 0
+ output_tokens: int = 0
+
+
+@dataclass
+class Trajectory:
+ """Full execution trace across all agent turns."""
+
+ turns: list[TurnRecord] = field(default_factory=list)
+
+ @property
+ def total_input_tokens(self) -> int:
+ return sum(t.input_tokens for t in self.turns)
+
+ @property
+ def total_output_tokens(self) -> int:
+ return sum(t.output_tokens for t in self.turns)
+
+ @property
+ def all_tool_calls(self) -> list[ToolCall]:
+ return [tc for turn in self.turns for tc in turn.tool_calls]
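The trajectory models above are plain dataclasses, so their behavior can be checked without the SDK. The sketch below re-declares them locally (mirroring the definitions in this file rather than importing the package) and exercises the aggregate properties. The token counts, tool name, and IDs are made-up sample values.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    input: dict
    id: str = ""
    output: object = None


@dataclass
class TurnRecord:
    index: int
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class Trajectory:
    turns: list[TurnRecord] = field(default_factory=list)

    @property
    def total_input_tokens(self) -> int:
        return sum(t.input_tokens for t in self.turns)

    @property
    def total_output_tokens(self) -> int:
        return sum(t.output_tokens for t in self.turns)

    @property
    def all_tool_calls(self) -> list[ToolCall]:
        return [tc for turn in self.turns for tc in turn.tool_calls]


# Two sample turns: one with a tool call, one with only text.
traj = Trajectory(turns=[
    TurnRecord(0, "", [ToolCall("sensors", {"asset_id": "CH-6"}, id="tu_1")], 50, 10),
    TurnRecord(1, "Done.", [], 60, 5),
])
as_dict = dataclasses.asdict(traj)
print(traj.total_input_tokens, traj.total_output_tokens, len(traj.all_tool_calls))
print(sorted(as_dict))
```

`dataclasses.asdict` serializes fields only, so the computed totals are not part of the resulting dictionary. A caller that wants them in JSON output would have to add them explicitly.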
diff --git a/src/agent/claude_agent/runner.py b/src/agent/claude_agent/runner.py
new file mode 100644
index 00000000..7ad91f77
--- /dev/null
+++ b/src/agent/claude_agent/runner.py
@@ -0,0 +1,205 @@
+"""AgentRunner implementation backed by the claude-agent-sdk.
+
+Each registered MCP server is connected as a stdio MCP server so Claude can
+call IoT / FMSR / TSFM / utilities tools directly without a custom plan loop.
+
+Usage::
+
+ import anyio
+ from agent.claude_agent import ClaudeAgentRunner
+
+ runner = ClaudeAgentRunner()
+ result = anyio.run(runner.run, "What sensors are on Chiller 6?")
+ print(result.answer)
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+from pathlib import Path
+
+from claude_agent_sdk import AssistantMessage, ClaudeAgentOptions, HookMatcher, ResultMessage, query
+from claude_agent_sdk import TextBlock, ToolUseBlock
+
+from ..models import AgentResult
+from .models import ToolCall, Trajectory, TurnRecord
+from ..plan_execute.executor import DEFAULT_SERVER_PATHS
+from ..runner import AgentRunner
+
+_log = logging.getLogger(__name__)
+
+_DEFAULT_MODEL = "claude-opus-4-6"
+_LITELLM_PREFIX = "litellm_proxy/"
+
+
+def _resolve_model(model_id: str) -> str:
+ """Strip the ``litellm_proxy/`` prefix from a model ID.
+
+ Examples::
+
+ "litellm_proxy/aws/claude-opus-4-6" -> "aws/claude-opus-4-6"
+ "claude-opus-4-6" -> "claude-opus-4-6"
+ """
+ if model_id.startswith(_LITELLM_PREFIX):
+ return model_id[len(_LITELLM_PREFIX):]
+ return model_id
+
+
+def _sdk_env(model_id: str) -> dict[str, str] | None:
+ """Build env overrides for the claude-agent-sdk subprocess.
+
+ When routing through a LiteLLM proxy the SDK needs the proxy URL and key
+ under its own env var names. We derive them from the LITELLM_* vars so
+ the user never has to set SDK-internal vars directly.
+ """
+ if not model_id.startswith(_LITELLM_PREFIX):
+ return None
+ env: dict[str, str] = {}
+ if base_url := os.environ.get("LITELLM_BASE_URL"):
+ env["ANTHROPIC_BASE_URL"] = base_url
+ if api_key := os.environ.get("LITELLM_API_KEY"):
+ env["ANTHROPIC_API_KEY"] = api_key
+ return env or None
+
+_SYSTEM_PROMPT = """\
+You are an industrial asset operations assistant with access to MCP tools for
+querying IoT sensor data, failure mode and symptom records, time-series
+forecasting models, and work order management.
+
+Answer the user's question concisely and accurately using the available tools.
+When you retrieve data, include the key numbers or names in your answer.
+"""
+
+
+def _build_mcp_servers(
+ server_paths: dict[str, Path | str],
+) -> dict[str, dict]:
+ """Convert server_paths entries into claude-agent-sdk mcp_servers dicts.
+
+ Entry-point names (str without path separators) become
+ ``{"command": "uv", "args": ["run", name]}``.
+ Path objects become ``{"command": "uv", "args": ["run", str(path)]}``.
+ """
+ mcp: dict[str, dict] = {}
+ for name, spec in server_paths.items():
+ if isinstance(spec, Path):
+ mcp[name] = {"command": "uv", "args": ["run", str(spec)]}
+ else:
+ # uv entry-point name, e.g. "iot-mcp-server"
+ mcp[name] = {"command": "uv", "args": ["run", spec]}
+ return mcp
+
+
+class ClaudeAgentRunner(AgentRunner):
+ """Agent runner that delegates to the claude-agent-sdk agentic loop.
+
+ The SDK handles tool discovery, invocation, and multi-turn conversation
+ against the registered MCP servers.
+
+ Args:
+ llm: Unused — ClaudeAgentRunner uses the claude-agent-sdk directly.
+ Accepted for interface compatibility with ``AgentRunner``.
+ server_paths: MCP server specs identical to ``PlanExecuteRunner``.
+ Defaults to all registered servers.
+ model: Claude model ID to use (default: ``claude-opus-4-6``).
+ max_turns: Maximum agentic loop turns (default: 30).
+ permission_mode: claude-agent-sdk permission mode (default: ``"bypassPermissions"``).
+ """
+
+ def __init__(
+ self,
+ llm=None,
+ server_paths: dict[str, Path | str] | None = None,
+ model: str = _DEFAULT_MODEL,
+ max_turns: int = 30,
+ permission_mode: str = "bypassPermissions",
+ ) -> None:
+ super().__init__(llm, server_paths)
+ self._model = _resolve_model(model)
+ self._sdk_env = _sdk_env(model)
+ self._max_turns = max_turns
+ self._permission_mode = permission_mode
+ self._resolved_server_paths: dict[str, Path | str] = (
+ server_paths if server_paths is not None else dict(DEFAULT_SERVER_PATHS)
+ )
+
+ async def run(self, question: str) -> AgentResult:
+ """Run the claude-agent-sdk loop for *question*.
+
+ Args:
+ question: Natural-language question to answer.
+
+ Returns:
+ AgentResult with the final answer and full execution trajectory.
+ """
+ mcp_servers = _build_mcp_servers(self._resolved_server_paths)
+
+ options = ClaudeAgentOptions(
+ model=self._model,
+ system_prompt=_SYSTEM_PROMPT,
+ mcp_servers=mcp_servers,
+ max_turns=self._max_turns,
+ permission_mode=self._permission_mode,
+ env=self._sdk_env,
+ )
+
+ _log.info("ClaudeAgentRunner: starting query (model=%s)", self._model)
+ answer = ""
+ trajectory = Trajectory()
+ turn_index = 0
+ tool_outputs: dict[str, object] = {}
+
+ async def _capture_tool_output(input_data, tool_use_id: str, context) -> dict:
+ resp = input_data.get("tool_response") if isinstance(input_data, dict) else input_data
+ if isinstance(resp, dict):
+ tool_outputs[tool_use_id] = resp.get("content", resp)
+ else:
+ tool_outputs[tool_use_id] = resp
+ return {}
+
+ options.hooks = {"PostToolUse": [HookMatcher(matcher=".*", hooks=[_capture_tool_output])]}
+
+ def _flush_tool_outputs() -> None:
+ """Patch any pending hook outputs onto the last turn's tool calls."""
+ if tool_outputs and trajectory.turns:
+ for tc in trajectory.turns[-1].tool_calls:
+ if tc.id in tool_outputs:
+ tc.output = tool_outputs.pop(tc.id)
+
+ async for message in query(prompt=question, options=options):
+ if isinstance(message, AssistantMessage):
+ _flush_tool_outputs()
+ text = ""
+ tool_calls: list[ToolCall] = []
+ for block in message.content:
+ if isinstance(block, TextBlock):
+ text += block.text
+ elif isinstance(block, ToolUseBlock):
+ tool_calls.append(
+ ToolCall(name=block.name, input=block.input, id=block.id)
+ )
+ usage = message.usage or {}
+ trajectory.turns.append(
+ TurnRecord(
+ index=turn_index,
+ text=text,
+ tool_calls=tool_calls,
+ input_tokens=usage.get("input_tokens", 0),
+ output_tokens=usage.get("output_tokens", 0),
+ )
+ )
+ turn_index += 1
+ elif isinstance(message, ResultMessage):
+ _flush_tool_outputs()
+ answer = message.result or ""
+ _log.info(
+ "ClaudeAgentRunner: done (stop_reason=%s, turns=%d, "
+ "input_tokens=%d, output_tokens=%d)",
+ message.stop_reason,
+ len(trajectory.turns),
+ trajectory.total_input_tokens,
+ trajectory.total_output_tokens,
+ )
+
+ return AgentResult(question=question, answer=answer, trajectory=trajectory)
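The output-capture bookkeeping in `run` can be read in isolation. The `PostToolUse` hook buffers each tool response under its tool-use ID, and the buffer is attached to the most recent turn's tool calls when the next message is processed. The sketch below reproduces only that bookkeeping with a stand-in `ToolCall` class and plain containers. It makes no SDK calls, and the `capture` and `flush` names, IDs, and payloads are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    id: str
    output: object = None


turns: list[list[ToolCall]] = []      # each turn is reduced to its tool calls
tool_outputs: dict[str, object] = {}  # buffered hook outputs keyed by tool-use ID


def capture(tool_use_id: str, response: object) -> None:
    # Same unwrapping rule as the hook: prefer "content" when the response is a dict.
    if isinstance(response, dict):
        tool_outputs[tool_use_id] = response.get("content", response)
    else:
        tool_outputs[tool_use_id] = response


def flush() -> None:
    # Attach buffered outputs to the tool calls of the most recent turn only.
    if tool_outputs and turns:
        for tc in turns[-1]:
            if tc.id in tool_outputs:
                tc.output = tool_outputs.pop(tc.id)


# Turn 0 requests a tool; its output arrives before the next assistant message.
turns.append([ToolCall("sensors", "tu_1")])
capture("tu_1", {"content": [{"type": "text", "text": "sensor data"}]})
flush()           # runs when the next message is handled
turns.append([])  # the following turn is recorded after the flush
print(turns[0][0].output)
```

Because only the last turn is inspected, an output whose hook fires after a later turn has already been appended would stay in the buffer and never be attached. Whether that ordering can occur depends on the SDK's message sequencing, which this sketch does not model.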
diff --git a/src/tmp/agent_hive/__init__.py b/src/agent/claude_agent/tests/__init__.py
similarity index 100%
rename from src/tmp/agent_hive/__init__.py
rename to src/agent/claude_agent/tests/__init__.py
diff --git a/src/agent/claude_agent/tests/test_runner.py b/src/agent/claude_agent/tests/test_runner.py
new file mode 100644
index 00000000..b90c9b34
--- /dev/null
+++ b/src/agent/claude_agent/tests/test_runner.py
@@ -0,0 +1,269 @@
+"""Unit tests for ClaudeAgentRunner.
+
+These tests patch ``query`` as imported in agent.claude_agent.runner, so no real API calls are made.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from agent.claude_agent.models import Trajectory
+from agent.claude_agent.runner import ClaudeAgentRunner, _build_mcp_servers, _resolve_model, _sdk_env
+from agent.models import AgentResult
+
+
+# ---------------------------------------------------------------------------
+# _resolve_model
+# ---------------------------------------------------------------------------
+
+
+def test_resolve_model_strips_litellm_prefix():
+ assert _resolve_model("litellm_proxy/aws/claude-opus-4-6") == "aws/claude-opus-4-6"
+
+
+def test_resolve_model_passthrough():
+ assert _resolve_model("claude-opus-4-6") == "claude-opus-4-6"
+
+
+def test_resolve_model_stored_on_runner():
+ runner = ClaudeAgentRunner(model="litellm_proxy/aws/claude-opus-4-6")
+ assert runner._model == "aws/claude-opus-4-6"
+
+
+def test_sdk_env_no_prefix_returns_none():
+ assert _sdk_env("claude-opus-4-6") is None
+
+
+def test_sdk_env_litellm_prefix_maps_vars(monkeypatch):
+ monkeypatch.setenv("LITELLM_BASE_URL", "http://localhost:4000")
+ monkeypatch.setenv("LITELLM_API_KEY", "sk-1234")
+ env = _sdk_env("litellm_proxy/aws/claude-opus-4-6")
+ assert env == {
+ "ANTHROPIC_BASE_URL": "http://localhost:4000",
+ "ANTHROPIC_API_KEY": "sk-1234",
+ }
+
+
+def test_sdk_env_missing_litellm_vars_returns_none(monkeypatch):
+ monkeypatch.delenv("LITELLM_BASE_URL", raising=False)
+ monkeypatch.delenv("LITELLM_API_KEY", raising=False)
+ assert _sdk_env("litellm_proxy/aws/claude-opus-4-6") is None
+
+
+# ---------------------------------------------------------------------------
+# _build_mcp_servers
+# ---------------------------------------------------------------------------
+
+
+def test_build_mcp_servers_entrypoint():
+ specs = {"iot": "iot-mcp-server", "utilities": "utilities-mcp-server"}
+ result = _build_mcp_servers(specs)
+ assert result["iot"] == {"command": "uv", "args": ["run", "iot-mcp-server"]}
+ assert result["utilities"] == {
+ "command": "uv",
+ "args": ["run", "utilities-mcp-server"],
+ }
+
+
+def test_build_mcp_servers_path():
+ p = Path("/some/server.py")
+ result = _build_mcp_servers({"custom": p})
+ assert result["custom"] == {"command": "uv", "args": ["run", "/some/server.py"]}
+
+
+def test_build_mcp_servers_empty():
+ assert _build_mcp_servers({}) == {}
+
+
+# ---------------------------------------------------------------------------
+# ClaudeAgentRunner.__init__
+# ---------------------------------------------------------------------------
+
+
+def test_runner_defaults():
+ runner = ClaudeAgentRunner()
+ assert runner._model == "claude-opus-4-6"
+ assert runner._max_turns == 30
+ assert runner._permission_mode == "bypassPermissions"
+ assert "iot" in runner._resolved_server_paths
+
+
+def test_runner_custom_server_paths():
+ paths = {"iot": "iot-mcp-server"}
+ runner = ClaudeAgentRunner(server_paths=paths)
+ assert runner._resolved_server_paths == paths
+
+
+# ---------------------------------------------------------------------------
+# ClaudeAgentRunner.run
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.anyio
+async def test_run_returns_agent_result():
+ from claude_agent_sdk import ResultMessage
+
+ mock_result = MagicMock(spec=ResultMessage)
+ mock_result.result = "42 sensors found"
+ mock_result.stop_reason = "end_turn"
+
+ async def fake_query(prompt, options):
+ yield mock_result
+
+ with patch("agent.claude_agent.runner.query", side_effect=fake_query):
+ runner = ClaudeAgentRunner(server_paths={"iot": "iot-mcp-server"})
+ result = await runner.run("How many sensors are there?")
+
+ assert isinstance(result, AgentResult)
+ assert result.question == "How many sensors are there?"
+ assert result.answer == "42 sensors found"
+ assert isinstance(result.trajectory, Trajectory)
+ assert result.trajectory.total_input_tokens == 0
+ assert result.trajectory.total_output_tokens == 0
+
+
+@pytest.mark.anyio
+async def test_run_collects_trajectory():
+ from claude_agent_sdk import AssistantMessage, ResultMessage, TextBlock, ToolUseBlock
+
+ mock_tool = MagicMock(spec=ToolUseBlock)
+ mock_tool.name = "sensors"
+ mock_tool.input = {"asset_id": "CH-6"}
+ mock_tool.id = "tu_123"
+
+ mock_text = MagicMock(spec=TextBlock)
+ mock_text.text = "Calling sensors tool..."
+
+ mock_assistant = MagicMock(spec=AssistantMessage)
+ mock_assistant.content = [mock_text, mock_tool]
+ mock_assistant.usage = {"input_tokens": 100, "output_tokens": 20}
+
+ mock_result = MagicMock(spec=ResultMessage)
+ mock_result.result = "Chiller 6 has 5 sensors."
+ mock_result.stop_reason = "end_turn"
+
+ async def fake_query(prompt, options):
+ yield mock_assistant
+ yield mock_result
+
+ with patch("agent.claude_agent.runner.query", side_effect=fake_query):
+ runner = ClaudeAgentRunner(server_paths={})
+ result = await runner.run("What sensors are on Chiller 6?")
+
+ traj = result.trajectory
+ assert isinstance(traj, Trajectory)
+ assert len(traj.turns) == 1
+ turn = traj.turns[0]
+ assert turn.text == "Calling sensors tool..."
+ assert len(turn.tool_calls) == 1
+ assert turn.tool_calls[0].name == "sensors"
+ assert turn.tool_calls[0].input == {"asset_id": "CH-6"}
+ assert turn.input_tokens == 100
+ assert turn.output_tokens == 20
+ assert traj.total_input_tokens == 100
+ assert traj.total_output_tokens == 20
+ assert len(traj.all_tool_calls) == 1
+
+
+@pytest.mark.anyio
+async def test_run_tool_output_captured():
+ """PostToolUse hook output is attached to the matching ToolCall."""
+ from claude_agent_sdk import AssistantMessage, ResultMessage, TextBlock, ToolUseBlock
+
+ mock_tool = MagicMock(spec=ToolUseBlock)
+ mock_tool.name = "sensors"
+ mock_tool.input = {"asset_id": "CH-6"}
+ mock_tool.id = "tu_456"
+
+ mock_text = MagicMock(spec=TextBlock)
+ mock_text.text = ""
+
+ mock_assistant1 = MagicMock(spec=AssistantMessage)
+ mock_assistant1.content = [mock_text, mock_tool]
+ mock_assistant1.usage = {"input_tokens": 50, "output_tokens": 10}
+
+ mock_assistant2 = MagicMock(spec=AssistantMessage)
+ mock_assistant2.content = [MagicMock(spec=TextBlock, text="Done.")]
+ mock_assistant2.usage = {"input_tokens": 60, "output_tokens": 5}
+
+ mock_result = MagicMock(spec=ResultMessage)
+ mock_result.result = "5 sensors."
+ mock_result.stop_reason = "end_turn"
+
+ async def fake_query(prompt, options):
+ # Simulate hook firing between turns by calling it directly
+ hook_matcher = options.hooks["PostToolUse"][0]
+ hook_fn = hook_matcher.hooks[0]
+ yield mock_assistant1
+ await hook_fn(
+ {"tool_response": {"content": [{"type": "text", "text": "sensor data"}]}},
+ "tu_456",
+ {},
+ )
+ yield mock_assistant2
+ yield mock_result
+
+ with patch("agent.claude_agent.runner.query", side_effect=fake_query):
+ runner = ClaudeAgentRunner(server_paths={})
+ result = await runner.run("What sensors are on Chiller 6?")
+
+ traj = result.trajectory
+ assert len(traj.turns) == 2
+ tc = traj.turns[0].tool_calls[0]
+ assert tc.id == "tu_456"
+ assert tc.output == [{"type": "text", "text": "sensor data"}]
+
+
+@pytest.mark.anyio
+async def test_run_tool_output_string_response():
+ """PostToolUse hook handles string tool_response (no .get)."""
+ from claude_agent_sdk import AssistantMessage, ResultMessage, TextBlock, ToolUseBlock
+
+ mock_tool = MagicMock(spec=ToolUseBlock)
+ mock_tool.name = "sites"
+ mock_tool.input = {}
+ mock_tool.id = "tu_789"
+
+ mock_assistant = MagicMock(spec=AssistantMessage)
+ mock_assistant.content = [MagicMock(spec=TextBlock, text=""), mock_tool]
+ mock_assistant.usage = {"input_tokens": 10, "output_tokens": 5}
+
+ mock_result = MagicMock(spec=ResultMessage)
+ mock_result.result = "MAIN"
+ mock_result.stop_reason = "end_turn"
+
+ async def fake_query(prompt, options):
+ hook_fn = options.hooks["PostToolUse"][0].hooks[0]
+ yield mock_assistant
+ # Simulate SDK passing tool_response as a plain string
+ await hook_fn(
+ {"tool_response": '{"sites": ["MAIN"]}'},
+ "tu_789",
+ {},
+ )
+ yield mock_result
+
+ with patch("agent.claude_agent.runner.query", side_effect=fake_query):
+ runner = ClaudeAgentRunner(server_paths={})
+ result = await runner.run("What sites?")
+
+ tc = result.trajectory.turns[0].tool_calls[0]
+ assert tc.output == '{"sites": ["MAIN"]}'
+
+
+@pytest.mark.anyio
+async def test_run_empty_result():
+ async def fake_query(prompt, options):
+ return
+ yield # make it an async generator
+
+ with patch("agent.claude_agent.runner.query", side_effect=fake_query):
+ runner = ClaudeAgentRunner(server_paths={})
+ result = await runner.run("What time is it?")
+
+ assert result.answer == ""
+ assert isinstance(result.trajectory, Trajectory)
+ assert result.trajectory.turns == []
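The `return` followed by a bare `yield` in `test_run_empty_result` is a compact way to define an async generator that produces nothing. A minimal standalone sketch of that idiom follows; the names `empty_stream` and `collect` are illustrative and not part of the repository:

```python
import asyncio
import inspect


async def empty_stream():
    """Async generator that finishes without yielding anything."""
    return
    yield  # unreachable, but its presence makes this an async generator


async def collect(stream):
    """Drain an async iterator into a list."""
    return [item async for item in stream]


is_async_gen = inspect.isasyncgenfunction(empty_stream)
items = asyncio.run(collect(empty_stream()))
print(is_async_gen, items)
```

Without the `yield`, `empty_stream()` would return a coroutine instead, and `async for` over it would fail.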
diff --git a/src/workflow/cli.py b/src/agent/cli.py
similarity index 94%
rename from src/workflow/cli.py
rename to src/agent/cli.py
index 7f04d75b..6836cdc8 100644
--- a/src/workflow/cli.py
+++ b/src/agent/cli.py
@@ -47,7 +47,7 @@ def _build_parser() -> argparse.ArgumentParser:
plan-execute "What assets are at site MAIN?"
plan-execute --model-id watsonx/ibm/granite-3-3-8b-instruct --show-plan "List sensors"
plan-execute --model-id litellm_proxy/GCP/claude-4-sonnet "What are the failure modes?"
- plan-execute --verbose --show-history --json "How many IoT observations exist for CH-1?"
+ plan-execute --verbose --show-trajectory --json "How many IoT observations exist for CH-1?"
""",
)
parser.add_argument("question", help="The question to answer.")
@@ -75,7 +75,7 @@ def _build_parser() -> argparse.ArgumentParser:
help="Print the generated plan before execution.",
)
parser.add_argument(
- "--show-history",
+ "--show-trajectory",
action="store_true",
help="Print each step result after execution.",
)
@@ -83,7 +83,7 @@ def _build_parser() -> argparse.ArgumentParser:
"--json",
action="store_true",
dest="output_json",
- help="Output the full result (answer, plan, history) as JSON.",
+ help="Output the full result (answer, plan, trajectory) as JSON.",
)
parser.add_argument(
"--verbose",
@@ -141,7 +141,7 @@ def _print_section(title: str) -> None:
async def _run(args: argparse.Namespace) -> None:
- from workflow.runner import PlanExecuteRunner
+ from agent.plan_execute.runner import PlanExecuteRunner
llm = _build_llm(args.model_id)
server_paths = _parse_servers(args.servers)
@@ -164,7 +164,7 @@ async def _run(args: argparse.Namespace) -> None:
}
for s in result.plan.steps
],
- "history": [
+ "trajectory": [
{
"step": r.step_number,
"task": r.task,
@@ -175,7 +175,7 @@ async def _run(args: argparse.Namespace) -> None:
"error": r.error,
"success": r.success,
}
- for r in result.history
+ for r in result.trajectory
],
}
print(json.dumps(output, indent=2))
@@ -189,9 +189,9 @@ async def _run(args: argparse.Namespace) -> None:
print(f" tool: {step.tool} args: {step.tool_args}")
print(f" deps={deps} | expected: {step.expected_output}")
- if args.show_history:
- _print_section("Execution History")
- for r in result.history:
+ if args.show_trajectory:
+ _print_section("Trajectory")
+ for r in result.trajectory:
status = "OK " if r.success else "ERR"
print(f" [{status}] Step {r.step_number} ({r.server}): {r.task}")
if r.tool and r.tool.lower() not in ("none", "null", ""):
diff --git a/src/agent/models.py b/src/agent/models.py
new file mode 100644
index 00000000..078a9f85
--- /dev/null
+++ b/src/agent/models.py
@@ -0,0 +1,15 @@
+"""Top-level data models for the agent orchestration layer."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+
+@dataclass
+class AgentResult:
+ """Result returned by any AgentRunner."""
+
+ question: str
+ answer: str
+ trajectory: Any
diff --git a/src/agent/plan_execute/__init__.py b/src/agent/plan_execute/__init__.py
new file mode 100644
index 00000000..c53b259e
--- /dev/null
+++ b/src/agent/plan_execute/__init__.py
@@ -0,0 +1 @@
+"""Plan-and-execute agent package."""
diff --git a/src/workflow/executor.py b/src/agent/plan_execute/executor.py
similarity index 99%
rename from src/workflow/executor.py
rename to src/agent/plan_execute/executor.py
index cb82e3ab..10409ad4 100644
--- a/src/workflow/executor.py
+++ b/src/agent/plan_execute/executor.py
@@ -18,8 +18,7 @@
_log = logging.getLogger(__name__)
-_MCP_ROOT = Path(__file__).parent.parent
-_REPO_ROOT = _MCP_ROOT.parent
+_REPO_ROOT = Path(__file__).parent.parent.parent.parent
# Maps agent names to either a uv entry-point name (str) or a script Path.
# Entry-point names are invoked as ``uv run ``; Paths fall back to
diff --git a/src/workflow/models.py b/src/agent/plan_execute/models.py
similarity index 98%
rename from src/workflow/models.py
rename to src/agent/plan_execute/models.py
index f5b16a1f..aa6ba045 100644
--- a/src/workflow/models.py
+++ b/src/agent/plan_execute/models.py
@@ -74,4 +74,4 @@ class OrchestratorResult:
question: str
answer: str
plan: Plan
- history: list[StepResult]
+ trajectory: list[StepResult]
diff --git a/src/workflow/planner.py b/src/agent/plan_execute/planner.py
similarity index 99%
rename from src/workflow/planner.py
rename to src/agent/plan_execute/planner.py
index 45e1b0ed..8c683943 100644
--- a/src/workflow/planner.py
+++ b/src/agent/plan_execute/planner.py
@@ -128,4 +128,4 @@ def generate_plan(
)
prompt = _PLAN_PROMPT.format(servers=servers_text, question=question)
raw = self._llm.generate(prompt)
- return parse_plan(raw)
\ No newline at end of file
+ return parse_plan(raw)
diff --git a/src/workflow/runner.py b/src/agent/plan_execute/runner.py
similarity index 89%
rename from src/workflow/runner.py
rename to src/agent/plan_execute/runner.py
index 5954ffab..c3e9f3f8 100644
--- a/src/workflow/runner.py
+++ b/src/agent/plan_execute/runner.py
@@ -1,4 +1,4 @@
-"""Entry-point runner for the plan-execute workflow using MCP servers.
+"""Plan-and-execute agent runner using MCP servers as tool providers.
Replaces AgentHive's combination of PlanningWorkflow + SequentialWorkflow with
an MCP-native implementation:
@@ -17,11 +17,12 @@
from llm import LLMBackend
-_log = logging.getLogger(__name__)
-
from .executor import Executor
from .models import OrchestratorResult
from .planner import Planner
+from ..runner import AgentRunner
+
+_log = logging.getLogger(__name__)
_SUMMARIZE_PROMPT = """\
You are summarizing the results of a multi-step task execution for an \
@@ -37,12 +38,12 @@
"""
-class PlanExecuteRunner:
+class PlanExecuteRunner(AgentRunner):
"""Entry-point for plan-and-execute workflows using MCP servers as tool providers.
Usage::
- from plan_execute import PlanExecuteRunner
+ from agent import PlanExecuteRunner
from llm import LiteLLMBackend
runner = PlanExecuteRunner(llm=LiteLLMBackend("watsonx/meta-llama/llama-3-3-70b-instruct"))
@@ -62,7 +63,7 @@ def __init__(
llm: LLMBackend,
server_paths: dict[str, Path | str] | None = None,
) -> None:
- self._llm = llm
+ super().__init__(llm, server_paths)
self._planner = Planner(llm)
self._executor = Executor(llm, server_paths)
@@ -80,7 +81,7 @@ async def run(self, question: str) -> OrchestratorResult:
Returns:
OrchestratorResult with the final answer, the generated plan, and
- the per-step execution history.
+ the per-step execution trajectory.
"""
# 1. Discover
_log.info("Discovering server capabilities...")
@@ -92,14 +93,14 @@ async def run(self, question: str) -> OrchestratorResult:
_log.info("Plan has %d step(s).", len(plan.steps))
# 3. Execute
- history = await self._executor.execute_plan(plan, question)
+ trajectory = await self._executor.execute_plan(plan, question)
# 4. Summarise
_log.info("Summarising...")
results_text = "\n\n".join(
f"Step {r.step_number} — {r.task} (server: {r.server}):\n"
+ (r.response if r.success else f"ERROR: {r.error}")
- for r in history
+ for r in trajectory
)
answer = self._llm.generate(
_SUMMARIZE_PROMPT.format(question=question, results=results_text)
@@ -109,5 +110,5 @@ async def run(self, question: str) -> OrchestratorResult:
question=question,
answer=answer,
plan=plan,
- history=history,
+ trajectory=trajectory,
)
diff --git a/src/agent/runner.py b/src/agent/runner.py
new file mode 100644
index 00000000..615d7f2b
--- /dev/null
+++ b/src/agent/runner.py
@@ -0,0 +1,31 @@
+"""Abstract base class for all agent runners."""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from pathlib import Path
+
+from llm import LLMBackend
+
+from .models import AgentResult
+
+
+class AgentRunner(ABC):
+ """Abstract base class for all agent runners.
+
+ Subclasses implement :meth:`run` to handle a natural-language question and
+ return an :class:`AgentResult`. The ``_llm`` and ``_server_paths``
+ attributes are available to all subclasses.
+ """
+
+ def __init__(
+ self,
+ llm: LLMBackend,
+ server_paths: dict[str, Path | str] | None = None,
+ ) -> None:
+ self._llm = llm
+ self._server_paths = server_paths
+
+ @abstractmethod
+ async def run(self, question: str) -> AgentResult:
+ """Run the agent on *question* and return a structured result."""
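As a rough sketch of how a new runner might plug into this base class, the example below re-declares simplified stand-ins for `AgentRunner` and `AgentResult` so that it runs on its own. `EchoRunner` and the use of a plain callable as the LLM are hypothetical and only illustrate the contract (store the shared attributes in `__init__`, implement `run` as a coroutine):

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class AgentResult:
    """Simplified stand-in for agent.models.AgentResult."""
    question: str
    answer: str
    trajectory: Any


class AgentRunner(ABC):
    """Simplified stand-in for agent.runner.AgentRunner."""

    def __init__(self, llm: Any, server_paths: dict | None = None) -> None:
        self._llm = llm
        self._server_paths = server_paths

    @abstractmethod
    async def run(self, question: str) -> AgentResult:
        """Run the agent on *question* and return a structured result."""


class EchoRunner(AgentRunner):
    """Hypothetical runner: calls the 'LLM' once and records the reply."""

    async def run(self, question: str) -> AgentResult:
        answer = self._llm(question)  # assumption: the LLM is a plain callable here
        return AgentResult(question=question, answer=answer, trajectory=[answer])


runner = EchoRunner(llm=str.upper)
result = asyncio.run(runner.run("ping"))

# The abstract base itself cannot be instantiated.
try:
    AgentRunner(llm=None)
    base_instantiable = True
except TypeError:
    base_instantiable = False
```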
diff --git a/src/tmp/agent_hive/agents/__init__.py b/src/agent/tests/__init__.py
similarity index 100%
rename from src/tmp/agent_hive/agents/__init__.py
rename to src/agent/tests/__init__.py
diff --git a/src/workflow/tests/conftest.py b/src/agent/tests/conftest.py
similarity index 100%
rename from src/workflow/tests/conftest.py
rename to src/agent/tests/conftest.py
diff --git a/src/workflow/tests/test_models.py b/src/agent/tests/test_models.py
similarity index 97%
rename from src/workflow/tests/test_models.py
rename to src/agent/tests/test_models.py
index 7ae3c560..b9ce43f0 100644
--- a/src/workflow/tests/test_models.py
+++ b/src/agent/tests/test_models.py
@@ -1,6 +1,6 @@
"""Tests for Plan and StepResult data models."""
-from workflow.models import Plan, PlanStep, StepResult
+from agent.plan_execute.models import Plan, PlanStep, StepResult
def _step(n: int, deps: list[int] | None = None) -> PlanStep:
diff --git a/src/workflow/tests/test_planner.py b/src/agent/tests/test_planner.py
similarity index 98%
rename from src/workflow/tests/test_planner.py
rename to src/agent/tests/test_planner.py
index 61885fc7..77bc4497 100644
--- a/src/workflow/tests/test_planner.py
+++ b/src/agent/tests/test_planner.py
@@ -1,6 +1,6 @@
"""Tests for the Planner and parse_plan()."""
-from workflow.planner import Planner, parse_plan
+from agent.plan_execute.planner import Planner, parse_plan
_TWO_STEP = """\
#Task1: List all available IoT sites
diff --git a/src/workflow/tests/test_runner.py b/src/agent/tests/test_runner.py
similarity index 89%
rename from src/workflow/tests/test_runner.py
rename to src/agent/tests/test_runner.py
index 8441e369..98647f3a 100644
--- a/src/workflow/tests/test_runner.py
+++ b/src/agent/tests/test_runner.py
@@ -7,15 +7,15 @@
import pytest
-from workflow.executor import (
+from agent.plan_execute.executor import (
Executor,
_parse_json,
_parse_tool_call,
_resolve_args,
_resolve_args_with_llm,
)
-from workflow.models import Plan, PlanStep, StepResult
-from workflow.runner import PlanExecuteRunner
+from agent.plan_execute.models import Plan, PlanStep, StepResult
+from agent.plan_execute.runner import PlanExecuteRunner
# ── shared plan strings ───────────────────────────────────────────────────────
@@ -50,9 +50,9 @@
def _patch_mcp(tool_response: str = _TOOL_RESPONSE):
return (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
patch(
- "workflow.executor._call_tool", new=AsyncMock(return_value=tool_response)
+ "agent.plan_execute.executor._call_tool", new=AsyncMock(return_value=tool_response)
),
)
@@ -104,7 +104,7 @@ async def test_orchestrator_run_returns_result(sequential_llm):
assert result.question == "What are the IoT sites?"
assert result.answer == _FINAL_ANSWER
assert len(result.plan.steps) == 2
- assert len(result.history) == 2
+ assert len(result.trajectory) == 2
@pytest.mark.anyio
@@ -113,7 +113,7 @@ async def test_orchestrator_all_steps_succeed(sequential_llm):
with _patch_mcp()[0], _patch_mcp()[1]:
result = await PlanExecuteRunner(llm).run("Q")
- assert all(r.success for r in result.history)
+ assert all(r.success for r in result.trajectory)
@pytest.mark.anyio
@@ -130,9 +130,9 @@ async def test_orchestrator_unknown_server_recorded_as_error(sequential_llm):
with _patch_mcp()[0], _patch_mcp()[1]:
result = await PlanExecuteRunner(llm).run("Q")
- assert len(result.history) == 1
- assert result.history[0].success is False
- assert "ghost" in result.history[0].error
+ assert len(result.trajectory) == 1
+ assert result.trajectory[0].success is False
+ assert "ghost" in result.trajectory[0].error
@pytest.mark.anyio
@@ -150,8 +150,8 @@ async def test_orchestrator_no_tool_returns_expected_output(sequential_llm):
with _patch_mcp()[0], _patch_mcp()[1]:
result = await PlanExecuteRunner(llm).run("Simple Q")
- assert result.history[0].response == "42"
- assert result.history[0].success is True
+ assert result.trajectory[0].response == "42"
+ assert result.trajectory[0].success is True
# ── executor unit tests ───────────────────────────────────────────────────────
@@ -174,7 +174,7 @@ async def test_executor_get_server_descriptions(mock_llm):
executor = Executor(mock_llm(), server_paths={"TestServer": None})
with patch(
- "workflow.executor._list_tools",
+ "agent.plan_execute.executor._list_tools",
new=AsyncMock(
return_value=[{"name": "foo", "description": "does foo", "parameters": []}]
),
@@ -211,8 +211,8 @@ async def test_executor_step_result_carries_resolved_args(sequential_llm):
step = _make_step(1, tool="assets")
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=AsyncMock(return_value="{}")),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=AsyncMock(return_value="{}")),
):
result = await executor.execute_step(step, {}, "List assets at MAIN")
@@ -229,8 +229,8 @@ async def test_executor_tool_call_exception_recorded_as_error(sequential_llm):
step = _make_step(1, tool="sites")
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=AsyncMock(side_effect=RuntimeError("timeout"))),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=AsyncMock(side_effect=RuntimeError("timeout"))),
):
result = await executor.execute_step(step, {}, "Q")
@@ -261,8 +261,8 @@ async def test_executor_calls_llm_to_generate_args(sequential_llm):
json.dumps({"sensors": ["temp"]}),
])
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=call_mock),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=call_mock),
):
results = await executor.execute_plan(plan, "Q")
@@ -290,8 +290,8 @@ async def test_executor_prior_step_results_in_llm_prompt():
site_resp = json.dumps({"sites": ["MAIN"]})
call_mock = AsyncMock(side_effect=[site_resp, '{"sensors": []}'])
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=call_mock),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=call_mock),
):
await executor.execute_plan(plan, "List sensors for CH-1")
@@ -309,8 +309,8 @@ async def test_executor_no_prior_context_shows_none_in_prompt():
step = _make_step(1, tool="sites")
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=AsyncMock(return_value="{}")),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=AsyncMock(return_value="{}")),
):
await executor.execute_step(step, {}, "Q")
@@ -336,8 +336,8 @@ async def test_executor_context_accumulates_across_steps():
resp1, resp2, resp3 = '{"sites":["MAIN"]}', '{"assets":["CH-1"]}', '{"sensors":[]}'
call_mock = AsyncMock(side_effect=[resp1, resp2, resp3])
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=call_mock),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=call_mock),
):
await executor.execute_plan(plan, "Q")
@@ -370,12 +370,12 @@ async def test_pipeline_uses_llm_args_for_each_step(sequential_llm):
call_mock = AsyncMock(side_effect=['{"sites": ["MAIN"]}', '{"assets": ["CH-1"]}'])
with (
- patch("workflow.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
- patch("workflow.executor._call_tool", new=call_mock),
+ patch("agent.plan_execute.executor._list_tools", new=AsyncMock(return_value=_MOCK_TOOLS)),
+ patch("agent.plan_execute.executor._call_tool", new=call_mock),
):
result = await PlanExecuteRunner(llm).run("List all assets at site MAIN")
- assert all(r.success for r in result.history)
+ assert all(r.success for r in result.trajectory)
step2_args = call_mock.call_args_list[1].args[2]
assert step2_args["site_name"] == "MAIN"
diff --git a/src/servers/fmsr/main.py b/src/servers/fmsr/main.py
index 75dbc4d0..1638b762 100644
--- a/src/servers/fmsr/main.py
+++ b/src/servers/fmsr/main.py
@@ -21,6 +21,8 @@
from pathlib import Path
from typing import Dict, List, Union
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
import yaml
from dotenv import load_dotenv
from mcp.server.fastmcp import FastMCP
@@ -116,13 +118,22 @@ def _build_llm():
# ── LLM call helpers with retry ───────────────────────────────────────────────
+_asset2fm_cache: dict[str, list[str]] = {}
+
+
def _call_asset2fm(asset_name: str) -> list[str]:
- """Query the LLM for failure modes of an asset. Retries up to _MAX_RETRIES times."""
+ """Query the LLM for failure modes of an asset. Retries up to _MAX_RETRIES times.
+ Results are cached to avoid redundant LLM calls for the same asset."""
+ if asset_name in _asset2fm_cache:
+ return _asset2fm_cache[asset_name]
+
prompt = _ASSET2FM_PROMPT.format(asset_name=asset_name)
last_exc: Exception | None = None
for _ in range(_MAX_RETRIES):
try:
- return _parse_numbered_list(_llm.generate(prompt))
+ result = _parse_numbered_list(_llm.generate(prompt))
+ _asset2fm_cache[asset_name] = result
+ return result
except Exception as exc:
last_exc = exc
raise last_exc
@@ -177,10 +188,10 @@ class FailureModeSensorMappingResult(BaseModel):
# ── FastMCP server ────────────────────────────────────────────────────────────
-mcp = FastMCP("fmsr")
+mcp = FastMCP("fmsr", instructions="Failure mode and sensor reasoning: get failure modes for assets and determine which sensors can detect each failure.")
-@mcp.tool()
+@mcp.tool(title="Get Failure Modes")
def get_failure_modes(asset_name: str) -> Union[FailureModesResult, ErrorResult]:
"""Returns a list of known failure modes for the given asset.
For chillers and AHUs returns a curated list. For other assets queries the LLM."""
@@ -205,7 +216,7 @@ def get_failure_modes(asset_name: str) -> Union[FailureModesResult, ErrorResult]
return ErrorResult(error=str(exc))
-@mcp.tool()
+@mcp.tool(title="Get Failure Mode Sensor Mapping")
def get_failure_mode_sensor_mapping(
asset_name: str,
failure_modes: List[str],
@@ -233,9 +244,15 @@ def get_failure_mode_sensor_mapping(
sensor2fm: Dict[str, List[str]] = {}
try:
- for s in sensors:
- for fm in failure_modes:
- gen = _call_relevancy(asset_name, fm, s)
+ pairs = [(s, fm) for s in sensors for fm in failure_modes]
+ with ThreadPoolExecutor() as executor:
+ futures = {
+ executor.submit(_call_relevancy, asset_name, fm, s): (s, fm)
+ for s, fm in pairs
+ }
+ for future in as_completed(futures):
+ s, fm = futures[future]
+ gen = future.result()
entry = RelevancyEntry(
asset_name=asset_name,
failure_mode=fm,
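One thing to keep in mind with the `as_completed` loop above: futures are yielded in completion order, not submission order, so any output built inside the loop may be ordered differently from run to run. The sketch below shows the same future-to-pair mapping with results re-ordered afterwards; `is_relevant` is a hypothetical stand-in for the LLM relevancy call and the sample data is invented:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def is_relevant(asset: str, failure_mode: str, sensor: str) -> bool:
    """Hypothetical stand-in for the LLM call: naive keyword match."""
    return sensor.split()[0].lower() in failure_mode.lower()


sensors = ["Temperature probe", "Vibration pickup"]
failure_modes = ["Bearing vibration", "Overheating temperature"]
pairs = [(s, fm) for s in sensors for fm in failure_modes]

results = {}
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the (sensor, failure_mode) pair it was created for.
    futures = {
        executor.submit(is_relevant, "Chiller 6", fm, s): (s, fm)
        for s, fm in pairs
    }
    for future in as_completed(futures):
        # future.result() re-raises any exception from the worker thread.
        results[futures[future]] = future.result()

# Restore submission order so the output is deterministic.
ordered = [(pair, results[pair]) for pair in pairs]
```

If ordering does not matter to callers, the re-ordering step can be dropped.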
diff --git a/src/servers/iot/main.py b/src/servers/iot/main.py
index 04c89c3a..9e473208 100644
--- a/src/servers/iot/main.py
+++ b/src/servers/iot/main.py
@@ -1,6 +1,7 @@
import os
import logging
from datetime import datetime
+from functools import lru_cache
from typing import Any, Dict, List, Optional, Union
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel
@@ -36,7 +37,7 @@
logger.error(f"Failed to connect to CouchDB: {e}")
db = None
-mcp = FastMCP("iot")
+mcp = FastMCP("iot", instructions="IoT sensor data: browse sites, assets, sensors, and query historical readings from CouchDB.")
# Static site as per original requirement
SITES = ["MAIN"]
@@ -75,28 +76,41 @@ class HistoryResult(BaseModel):
message: str
+_asset_list_cache: Optional[List[str]] = None
+
+
def get_asset_list() -> List[str]:
- """Helper to fetch unique asset IDs from CouchDB."""
+ """Helper to fetch unique asset IDs from CouchDB. Result is cached after
+ the first successful call to avoid repeated full-table scans."""
+ global _asset_list_cache
+ if _asset_list_cache is not None:
+ return _asset_list_cache
+
if not db:
return []
- # Using a mango query to find unique asset_ids might be slow without an index,
- # but for this benchmark we'll query documents and unique them.
- # In a production environment, we'd use a CouchDB view.
try:
# We limit the fields to just asset_id to minimize data transfer
res = db.find(
{"asset_id": {"$exists": True}}, fields=["asset_id"], limit=100000
)
assets = {doc["asset_id"] for doc in res["docs"] if "asset_id" in doc}
- return sorted(list(assets))
+ _asset_list_cache = sorted(assets)
+ return _asset_list_cache
except Exception as e:
logger.error(f"Error fetching assets: {e}")
return []
+_sensor_list_cache: Dict[str, List[str]] = {}
+
+
def get_sensor_list(asset_id: str) -> List[str]:
- """Helper to fetch sensor names for a given asset from CouchDB."""
+ """Helper to fetch sensor names for a given asset from CouchDB.
+ Result is cached per asset_id after the first successful call."""
+ if asset_id in _sensor_list_cache:
+ return _sensor_list_cache[asset_id]
+
if not db:
return []
@@ -109,20 +123,21 @@ def get_sensor_list(asset_id: str) -> List[str]:
doc = res["docs"][0]
# Exclude metadata and standard fields
exclude = {"_id", "_rev", "asset_id", "timestamp"}
- sensors = [key for key in doc.keys() if key not in exclude]
- return sorted(sensors)
+ sensors = sorted(key for key in doc.keys() if key not in exclude)
+ _sensor_list_cache[asset_id] = sensors
+ return sensors
except Exception as e:
logger.error(f"Error fetching sensors for {asset_id}: {e}")
return []
-@mcp.tool()
+@mcp.tool(title="List Sites")
def sites() -> SitesResult:
"""Retrieves a list of sites. Each site is represented by a name."""
return SitesResult(sites=SITES)
-@mcp.tool()
+@mcp.tool(title="List Assets")
def assets(site_name: str) -> Union[AssetsResult, ErrorResult]:
"""Returns a list of assets for a given site. Each asset includes an id and a name."""
if site_name not in SITES:
@@ -137,7 +152,7 @@ def assets(site_name: str) -> Union[AssetsResult, ErrorResult]:
)
-@mcp.tool()
+@mcp.tool(title="List Sensors")
def sensors(site_name: str, asset_id: str) -> Union[SensorsResult, ErrorResult]:
"""Lists the sensors available for a specified asset at a given site."""
if site_name not in SITES:
@@ -156,7 +171,7 @@ def sensors(site_name: str, asset_id: str) -> Union[SensorsResult, ErrorResult]:
)
-@mcp.tool()
+@mcp.tool(title="Get Sensor History")
def history(
site_name: str, asset_id: str, start: str, final: Optional[str] = None
) -> Union[HistoryResult, ErrorResult]:
diff --git a/src/servers/tsfm/main.py b/src/servers/tsfm/main.py
index 005ddbbf..288a388b 100644
--- a/src/servers/tsfm/main.py
+++ b/src/servers/tsfm/main.py
@@ -28,7 +28,8 @@
import os
import tempfile
import uuid
-from typing import List, Optional, Union
+from functools import lru_cache
+from typing import Dict, List, Optional, Union
import numpy as np
import pandas as pd
@@ -75,6 +76,13 @@
# ── Internal helpers ──────────────────────────────────────────────────────────
+@lru_cache(maxsize=16)
+def _load_model_config(model_checkpoint: str) -> dict:
+ """Load and cache model config.json to avoid repeated disk reads."""
+ with open(model_checkpoint + "/config.json") as f:
+ return json.load(f)
+
+
def _build_dataset_config(
timestamp_column: str,
target_columns: List[str],
@@ -107,13 +115,13 @@ def _tsad_output_to_df(output: dict) -> pd.DataFrame:
# ── FastMCP server ────────────────────────────────────────────────────────────
-mcp = FastMCP("tsfm")
+mcp = FastMCP("tsfm", instructions="Time-series foundation models: forecasting, finetuning, and anomaly detection using IBM Granite TinyTimeMixer.")
# ── Static tools ──────────────────────────────────────────────────────────────
-@mcp.tool()
+@mcp.tool(title="Get AI Tasks")
def get_ai_tasks() -> AITasksResult:
"""Returns the list of supported AI task types for time-series analysis.
@@ -123,7 +131,7 @@ def get_ai_tasks() -> AITasksResult:
return AITasksResult(tasks=[AITaskEntry(**t) for t in _AI_TASKS])
-@mcp.tool()
+@mcp.tool(title="Get TSFM Models")
def get_tsfm_models() -> TSFMModelsResult:
"""Returns the list of available pre-trained TinyTimeMixer (TTM) model checkpoints.
@@ -136,7 +144,7 @@ def get_tsfm_models() -> TSFMModelsResult:
# ── TSFM Forecasting (zero-shot inference) ────────────────────────────────────
-@mcp.tool()
+@mcp.tool(title="Run TSFM Forecasting")
def run_tsfm_forecasting(
dataset_path: str,
timestamp_column: str,
@@ -191,8 +199,7 @@ def run_tsfm_forecasting(
try:
data_df = _read_ts_data(dataset_path, dataset_config_dictionary=dataset_config)
- with open(model_checkpoint + "/config.json") as _f:
- model_config = json.load(_f)
+ model_config = _load_model_config(model_checkpoint)
output_data_quality = _tsfm_data_quality_filter(
data_df, dataset_config, model_config, task="inference"
@@ -264,7 +271,7 @@ def run_tsfm_forecasting(
# ── TSFM Finetuning ───────────────────────────────────────────────────────────
-@mcp.tool()
+@mcp.tool(title="Run TSFM Finetuning")
def run_tsfm_finetuning(
dataset_path: str,
timestamp_column: str,
@@ -326,8 +333,7 @@ def run_tsfm_finetuning(
try:
data_df = _read_ts_data(dataset_path, dataset_config_dictionary=dataset_config)
- with open(model_checkpoint + "/config.json") as _f:
- model_config = json.load(_f)
+ model_config = _load_model_config(model_checkpoint)
os.makedirs(abs_save_dir, exist_ok=True)
@@ -400,7 +406,7 @@ def run_tsfm_finetuning(
# ── TSAD (conformal anomaly detection on top of TSFM forecasts) ──────────────
-@mcp.tool()
+@mcp.tool(title="Run Anomaly Detection")
def run_tsad(
dataset_path: str,
tsfm_output_json: str,
@@ -508,7 +514,7 @@ def run_tsad(
# ── Integrated TSAD (forecasting + anomaly detection in one call) ─────────────
-@mcp.tool()
+@mcp.tool(title="Run Integrated Forecasting + Anomaly Detection")
def run_integrated_tsad(
dataset_path: str,
timestamp_column: str,
@@ -558,10 +564,20 @@ def run_integrated_tsad(
ad_model_save = _get_outputs_path("tsad_model_save/")
os.makedirs(ad_model_save, exist_ok=True)
- with open(model_checkpoint + "/config.json") as _f:
- model_config = json.load(_f)
+ model_config = _load_model_config(model_checkpoint)
df_combined = pd.DataFrame()
+ # Read the full dataset once with all target columns and reuse it for every column
+ full_config = _build_dataset_config(
+ timestamp_column,
+ target_columns,
+ conditional_columns,
+ id_columns,
+ frequency_sampling,
+ autoregressive_modeling,
+ )
+ full_data_df = _read_ts_data(dataset_path, dataset_config_dictionary=full_config)
+
for col in target_columns:
col_config = _build_dataset_config(
timestamp_column,
@@ -572,8 +588,8 @@ def run_integrated_tsad(
autoregressive_modeling,
)
- # 1. Load and quality-filter data for this column
- data_df = _read_ts_data(dataset_path, dataset_config_dictionary=col_config)
+ # 1. Quality-filter data for this column (reuse already-loaded data)
+ data_df = full_data_df
output_dq = _tsfm_data_quality_filter(
data_df, col_config, model_config, task="inference"
)
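The `_load_model_config` helper introduced in this file relies on `functools.lru_cache`. A small self-contained sketch of that pattern is shown below, with two caveats that may be worth checking against the real call sites: every caller receives the same dict object, and edits to `config.json` on disk are not picked up until the cache is cleared. The function name `load_config` and the sample config contents are illustrative only:

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=16)
def load_config(checkpoint_dir: str) -> dict:
    """Read <checkpoint_dir>/config.json once per distinct path."""
    with open(Path(checkpoint_dir) / "config.json") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"context_length": 512}))
    first = load_config(tmp)   # cache miss: reads the file
    second = load_config(tmp)  # cache hit: no disk access
    info = load_config.cache_info()

# Both calls return the very same dict, so mutating it would affect later callers.
same_object = first is second
```

If callers might mutate the config, returning `dict(load_config(path))` or a deep copy is one way to keep the cached value intact.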
diff --git a/src/servers/utilities/main.py b/src/servers/utilities/main.py
index 7d49eb0b..48e1858b 100644
--- a/src/servers/utilities/main.py
+++ b/src/servers/utilities/main.py
@@ -17,7 +17,7 @@
logging.basicConfig(level=_log_level)
logger = logging.getLogger("utilities-mcp-server")
-mcp = FastMCP("utilities")
+mcp = FastMCP("utilities", instructions="General utilities: read JSON files and get current date/time.")
class DateTimeResult(BaseModel):
@@ -49,7 +49,7 @@ def get_temp_filename() -> str:
# --- JSON Tools ---
-@mcp.tool()
+@mcp.tool(title="Read JSON File")
def json_reader(file_name: str) -> str:
"""Reads a JSON file, parses its content, and returns the parsed data."""
try:
@@ -64,7 +64,7 @@ def json_reader(file_name: str) -> str:
# --- Time Tools ---
-@mcp.tool()
+@mcp.tool(title="Get Current Date and Time")
def current_date_time() -> DateTimeResult:
"""Provides the current date time as a JSON object."""
now = datetime.now(timezone.utc)
@@ -78,7 +78,7 @@ def current_date_time() -> DateTimeResult:
return DateTimeResult(currentDateTime=now_iso, currentDateTimeDescription=description)
-@mcp.tool()
+@mcp.tool(title="Get Current Time in English")
def current_time_english() -> TimeEnglishResult:
"""Returns the current time in English text."""
now = datetime.now(timezone.utc)
diff --git a/src/servers/vibration/main.py b/src/servers/vibration/main.py
index 652cf588..398da8bd 100644
--- a/src/servers/vibration/main.py
+++ b/src/servers/vibration/main.py
@@ -43,7 +43,7 @@
logging.basicConfig(level=_log_level)
logger = logging.getLogger("vibration-mcp-server")
-mcp = FastMCP("vibration")
+mcp = FastMCP("vibration", instructions="Vibration signal analysis: FFT, envelope spectrum, bearing fault detection, and ISO 10816 severity assessment.")
# ---------------------------------------------------------------------------
@@ -130,7 +130,7 @@ def _resolve_signal(data_id: str) -> tuple[np.ndarray, float]:
# ---------------------------------------------------------------------------
-@mcp.tool()
+@mcp.tool(title="Get Vibration Data")
def get_vibration_data(
site_name: str,
asset_id: str,
@@ -174,7 +174,7 @@ def get_vibration_data(
return {"data_id": data_id, **entry.summary()}
-@mcp.tool()
+@mcp.tool(title="List Vibration Sensors")
def list_vibration_sensors(
site_name: str,
asset_id: str,
@@ -198,7 +198,7 @@ def list_vibration_sensors(
}
-@mcp.tool()
+@mcp.tool(title="Compute FFT Spectrum")
def compute_fft_spectrum(
data_id: str,
window: str = "hann",
@@ -236,7 +236,7 @@ def compute_fft_spectrum(
return summary
-@mcp.tool()
+@mcp.tool(title="Compute Envelope Spectrum")
def compute_envelope_spectrum(
data_id: str,
band_low_hz: Optional[float] = None,
@@ -271,7 +271,7 @@ def compute_envelope_spectrum(
return summary
-@mcp.tool()
+@mcp.tool(title="Assess Vibration Severity")
def assess_vibration_severity(
rms_velocity_mm_s: float,
machine_group: str = "group2",
@@ -294,7 +294,7 @@ def assess_vibration_severity(
return assess_iso10816(rms_velocity_mm_s, machine_group)
-@mcp.tool()
+@mcp.tool(title="Calculate Bearing Frequencies")
def calculate_bearing_frequencies(
rpm: float,
n_balls: int,
@@ -324,13 +324,13 @@ def calculate_bearing_frequencies(
return result.to_dict()
-@mcp.tool()
+@mcp.tool(title="List Known Bearings")
def list_known_bearings() -> dict:
"""List all bearings in the built-in database with their geometric parameters."""
return {"bearings": list_bearings()}
-@mcp.tool()
+@mcp.tool(title="Diagnose Vibration")
def diagnose_vibration(
data_id: str,
rpm: Optional[float] = None,
diff --git a/src/servers/wo/data.py b/src/servers/wo/data.py
index 4aeacf71..a2c6bf73 100644
--- a/src/servers/wo/data.py
+++ b/src/servers/wo/data.py
@@ -68,11 +68,21 @@ def _get_db():
}
+_dataset_cache: Dict[str, Optional[pd.DataFrame]] = {}
+
+
def load(dataset: str) -> Optional[pd.DataFrame]:
"""Fetch all documents with ``_dataset == dataset`` and return a DataFrame.
+ Results are cached after the first completed query (an empty dataset is
+ cached as ``None``) to avoid repeated full-collection scans against CouchDB.
+
Returns ``None`` when CouchDB is unavailable or the dataset is empty.
"""
+ if dataset in _dataset_cache:
+ cached = _dataset_cache[dataset]
+ return cached.copy() if cached is not None else None
+
db = _get_db()
if db is None:
return None
@@ -84,6 +94,7 @@ def load(dataset: str) -> Optional[pd.DataFrame]:
docs = result.get("docs", [])
if not docs:
logger.warning("No documents found for dataset '%s'", dataset)
+ _dataset_cache[dataset] = None
return None
df = pd.DataFrame(docs)
@@ -96,7 +107,8 @@ def load(dataset: str) -> Optional[pd.DataFrame]:
df[col] = pd.to_datetime(df[col], errors="coerce")
logger.info("Loaded %d rows for dataset '%s'", len(df), dataset)
- return df
+ _dataset_cache[dataset] = df
+ return df.copy()
except Exception as exc:
logger.error("Failed to load dataset '%s': %s", dataset, exc)
return None
diff --git a/src/servers/wo/main.py b/src/servers/wo/main.py
index 2fb6d853..1dbde8ed 100644
--- a/src/servers/wo/main.py
+++ b/src/servers/wo/main.py
@@ -16,23 +16,23 @@
_log_level = getattr(logging, os.environ.get("LOG_LEVEL", "WARNING").upper(), logging.WARNING)
logging.basicConfig(level=_log_level)
-mcp = FastMCP("wo")
+mcp = FastMCP("wo", instructions="Work order analytics: query work orders, events, failure codes, and predict maintenance patterns.")
# Register tools — imported after mcp is created to avoid circular imports.
from . import tools # noqa: E402
_TOOLS = [
- tools.get_work_orders,
- tools.get_preventive_work_orders,
- tools.get_corrective_work_orders,
- tools.get_events,
- tools.get_failure_codes,
- tools.get_work_order_distribution,
- tools.predict_next_work_order,
- tools.analyze_alert_to_failure,
+ (tools.get_work_orders, "Get Work Orders"),
+ (tools.get_preventive_work_orders, "Get Preventive Work Orders"),
+ (tools.get_corrective_work_orders, "Get Corrective Work Orders"),
+ (tools.get_events, "Get Events"),
+ (tools.get_failure_codes, "Get Failure Codes"),
+ (tools.get_work_order_distribution, "Get Work Order Distribution"),
+ (tools.predict_next_work_order, "Predict Next Work Order"),
+ (tools.analyze_alert_to_failure, "Analyze Alert to Failure"),
]
-for _fn in _TOOLS:
- mcp.tool()(_fn)
+for _fn, _title in _TOOLS:
+ mcp.tool(title=_title)(_fn)
def main():
diff --git a/src/tmp/IoTAgent/BMSFewShots.py b/src/tmp/IoTAgent/BMSFewShots.py
deleted file mode 100644
index 44d3509f..00000000
--- a/src/tmp/IoTAgent/BMSFewShots.py
+++ /dev/null
@@ -1,192 +0,0 @@
-BMSFEWSHOTS = """Question: what sites are there
-Thought 1: I need to get the list of available sites to answer the question.
-Action 1: sites
-Action Input 1: {}
-Observation 1: {"sites": "MAIN"}
-Thought 2: I now have the list of available sites
-Action 2: Finish
-Action Input 2: The sites are: "MAIN"
-
-Question: what assets are at site MAIN
-Thought 1: I need to get the list of assets at site MAIN to answer the question.
-Action 1: assets
-Action Input 1: site_name=MAIN
-Observation 1: Available assets: found the following 483 assets, listed in file /var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/9e17f07b-0b4a-43af-85c1-562dc21f105e.
-Thought 2: I now have the list of assets for site MAIN, which is stored in a file.
-Action 2: Finish
-Action Input 2: The assets at site MAIN are listed in file /var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/9e17f07b-0b4a-43af-85c1-562dc21f105e.
-
-Question: download sensor data for Chiller 4 at MAIN site
-Thought 1: I need to get the sensor data for Chiller 4 at site MAIN to answer the question.
-Action 1: sensors
-Action Input 1: assetnum=Chiller 4, site_name=MAIN
-Observation 1: {"site_name": "MAIN", "assetnum": "Chiller 4", "total_sensors": 2, "file_path": "/var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/c6571941-4857-4701-bd8a-9a28fa2435c3.json", "message": "found 2 sensors for assetnum Chiller 4 and site_name MAIN. file_path contains a JSON array of Sensor data"}
-Thought 2: I now have the sensor data for Chiller 4 at site MAIN, which is stored in a file.
-Action 2: Finish
-Action Input 2: The sensor data for Chiller 4 at site MAIN has been downloaded and is listed in file /var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/c6571941-4857-4701-bd8a-9a28fa2435c3.json.
-
-Question: download asset history for Chiller 4 at MAIN site from 2016-07-14T20:30:00-04:00 to 2016-07-14T23:30:00-04:00
-Thought 1: I need to get the history for Chiller 4 at MAIN site within the specified time range.
-Action 1: history
-Action input 1: site_name=MAIN, assetnum=Chiller 4, start=2025-03-26T00:00:00.000000+00:00, final=2025-04-02T00:00:00.000000+00:00
-Observation 1: {"site_name": "MAIN", "assetnum": "Chiller 4", "total_observations": 3, "start": "2025-03-26T00:00:00.000000+00:00", "final": "2025-04-02T00:00:00.000000+00:00", "file_path": "/var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/44934c53-0ad8-4551-adbe-5b11ad03865a.json", "message": "found 3 observations. file_path contains a JSON array of Observation data"}
-Thought 2: I now have the history for asset Chiller 4 at MAIN site within the specified time range.
-Action 2: Finish
-Action Input 2: The asset history for Chiller 4 at MAIN site from 2025-03-26T00:00:00.000000+00:00 to 2025-04-02T00:00:00.000000+00:00 has been downloaded and is listed in file /var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/44934c53-0ad8-4551-adbe-5b11ad03865a.json.
-
-Question: merge these JSON files file1.json and file2.json into a single JSON file
-Thought 1: I need to combine the data from JSON files file1.json and file2.json into a single file.
-Action 1: jsonfilemerge
-Action input 1: file_name_1=file1.json, file_name_2=file2.json
-Observation 1: The files have been merged: Files file1.json and file2.json have been merged into file3.json.
-Thought 2: I now have the merged data from file1.json and file2.json in a single file.
-Action 2: Finish
-Action Input 2: The contents of file1.json and file2.json have been successfully merged into file3.json.
-
-Question: How do I get a list of properties from a JSON file
-Thought 1: I need to call jsonproperties on the file
-Action 1: jsonproperties
-Action input 1: file_name_1=file3.json
-Observation 1: { "properties": [ "a", "b", "c"] }
-Thought 2: I now have the list of properties from the JSON file
-Action 2: Finish
-Action Input 2: the properties are a, b, c
-
-Question: I need to read the JSON file 0001.json.
-Thought 1: I need to read the contents of the JSON file to access its data.
-Action 1: jsonreader
-Action Input 1: file_name=0001.json
-Observation 1: The contents of the JSON file have been read.
-Thought 2: I now have the data from the JSON file.
-Action 2: Finish
-Action Input 2: The contents of 0001.json have been successfully read.
-
-Question: how do I calculate the start date for last week or past week?
-Thought 1: I should first obtain the current date and then subtract 7 days from the current date.
-Action 1: currentdatetime
-Action Input 1: {}
-Observation 1: {"currentDateTime": "2025-01-08T19:40:04.664816+00:00", "currentDateTimeDescription": "Today's date is 2025-01-08 and time is 19:40:04."}
-Thought 2: Now, I have the current date and I should proceed to generate the start date for last week.
-Action 2: Self-Ask
-Action Input 2: what is the start date for last week if current date is 2025-01-08T19:40:04.664816+00:00
-Observation 2: based on current date, the last week starts on 2025-01-01T00:00:00.000000+00:00
-Thought 3: Now I have the start date for last week
-Action 3: Finish
-Action Input 3: the start date for last week is 2025-01-01T00:00:00.000000+00:00.
-
-"""
-
-temp = """
-
-Question: download asset history for sensor c5_B027_CH4_B/027_CH1_AVGERAGE_AMPS_() on Chiller 4 at MAIN site from 2016-07-14T20:30:00-04:00 to 2016-07-14T23:30:00-04:00
-Thought 1: I need to get the history for sensor c5_B027_CH4_B/027_CH1_AVGERAGE_AMPS_() on Chiller 4 at MAIN site within the specified time range.
-Action 1: history
-Action input 1: site_name=MAIN, assetnum=Chiller 4, sensor_name_list=c5_B027_CH4_B/027_CH1_AVGERAGE_AMPS_(), start=2025-01-01T00:00:00.000000+00:00, final=2025-01-08T00:00:00.000000+00:00
-Observation 1: {"site_name": "MAIN", "assetnum": "Chiller 4", "total_observations": 0, "start": "2025-01-01T00:00:00.000000+00:00", "final": "2025-01-08T00:00:00.000000+00:00", "file_path": "/var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/97fbb3de-f442-438a-88c4-989abfde2c21.json", "message": "found 0 observations. file_path contains a JSON array of Observation data"}
-Thought 2: I now have the history for asset Chiller 4 at MAIN site within the specified time range.
-Action 2: Finish
-Action Input 2: The asset history for Chiller 4 at MAIN site from 2025-01-01T00:00:00.000000+00:00 to 2025-01-08T00:00:00.000000+00:00 has been downloaded and is listed in file /var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/97fbb3de-f442-438a-88c4-989abfde2c21.json.
-
-Question: how do I learn the correct name for the sensor Condenser Return Temperature on Chiller 1 at site MAIN?
-Thought 1: I should first list all the sensors on the asset
-Action 1: sensors
-Action Input 1: asssetnum=Chiller 1, site_name=MAIN
-Observation 1: {"site_name": "MAIN", "asssetnum": "Chiller 1", "total_sensors": 12, "file_path": "/var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/383594ff-bdf4-4d1e-8947-f1f95ef0761c.json", "message": "found 12 sensors for asssetnum Chiller 1 and site_name MAIN. file_path contains a JSON array of Sensor data"}
-Thought 2: I now have the list of sensors for Chiller 1 at site MAIN, which is stored in a file.
-Action 2: jsonreader
-Action Input 2: file_name=/var/folders/fz/l1h7gpv96rv5lg6m_d6bk0gc0000gn/T/cbmdir/383594ff-bdf4-4d1e-8947-f1f95ef0761c.json
-Observation 2: [{"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Condenser Water Return To Tower Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Chiller Efficiency"}]
-Thought 3: I should look at the sensor_name and pick the closest match to Condenser Return Temperature. I should use exactly the sensor name that is specified by the sensor_name property.
-Action 3: Finish
-Action Input 3: the correct sensor_name is Chiller 1 Condenser Water Return To Tower Temperature
-
-"""
-
-DHAVAL_EXTRA = """
-Question: How assets are represented? Give five examples.
-Thought 1: I need to get the list of assets for a specific site to understand how they are represented.
-Action 1: assets
-Action Input 1: site_name=MAIN
-Observation 1: {"site_name": "MAIN", "total_assets": 5, "file_path": "/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/360fea44-6bb2-4f41-89d7-9a8313714589.json", "message": "found 5 assets for site_name MAIN. file_path contains a JSON array of Asset data"}
-Thought 2: I now have the list of assets for site MAIN, which is stored in a file.
-Action 2: jsonreader
-Action Input 2: file_name=/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/360fea44-6bb2-4f41-89d7-9a8313714589.json
-Observation 2: [{"site_name": "MAIN", "asssetnum": "B706_LC26"}, {"site_name": "MAIN", "asssetnum": "B012 SCWP 1"}, {"site_name": "MAIN", "asssetnum": "B710 FCU 1-39"}, {"site_name": "MAIN", "asssetnum": "B006_LC74"}, {"site_name": "MAIN", "asssetnum": "B006A AHU 2"}]
-Thought 3: I now have the list of assets for site MAIN. I will provide five examples of how assets are represented.
-Action 3: Finish
-Action Input 3: Assets are represented with a site_name and an asssetnum. Here are five examples: 1. {"site_name": "MAIN", "asssetnum": "B706_LC26"} 2. {"site_name": "MAIN", "asssetnum": "B012 SCWP 1"} 3. {"site_name": "MAIN", "asssetnum": "B710 FCU 1-39"} 4. {"site_name": "MAIN", "asssetnum": "B006_LC74"} 5. {"site_name": "MAIN", "asssetnum": "B006A AHU 2"}
-
-Question: How sites are represented? Give five examples.
-Thought 1: I need to get the list of available sites to answer the question.
-Action 1: sites
-Action Input 1: {}
-Observation 1: {"total_sites": 6, "file_path": "/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/f096b286-88b9-4817-8af6-a1783c205213.json", "message": "found 6 sites. file_path contains a JSON array of Site data"}
-Thought 2: I now have the list of available sites, which is stored in a file.
-Action 2: jsonreader
-Action Input 2: file_name=/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/f096b286-88b9-4817-8af6-a1783c205213.json
-Observation 2: [{"site_name": "HQ3SBY"}, {"site_name": "CNPANGU"}, {"site_name": "ARCMAIN"}, {"site_name": "AUSWEST"}, {"site_name": "TOR8200"}, {"site_name": "AUSEAST"}]
-Thought 3: I now have the list of available sites.
-Action 3: Finish
-Action Input 3: Sites are represented by their names. Here are five examples: HQ3SBY, CNPANGU, ARCMAIN, AUSWEST, TOR8200.
-
-Question: How sensors are represented? Give five examples.
-Thought 1: I need to get the list of sensors for a specific asset to provide examples of how sensors are represented.
-Action 1: sensors
-Action Input 1: asssetnum=Chiller 1, site_name=MAIN
-Observation 1: {"site_name": "MAIN", "asssetnum": "Chiller 1", "total_sensors": 12, "file_path": "/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/ed99f451-7e4e-4b7c-8bad-b56021a87332.json", "message": "found 12 sensors for asssetnum Chiller 1 and site_name MAIN. file_path contains a JSON array of Sensor data"}
-Thought 2: I now have the list of sensors for Chiller 1 at site MAIN, which is stored in a file.
-Action 2: jsonreader
-Action Input 2: file_name=/var/folders/wn/ddv2p1854tj06tv1h92t4hp00000gn/T/cbmdir/ed99f451-7e4e-4b7c-8bad-b56021a87332.json
-Observation 2: [{"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Supply Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Condenser Water Flow"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Liquid Refrigerant Evaporator Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Chiller % Loaded"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Chiller Efficiency"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Return Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Tonnage"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Condenser Water Supply To Chiller Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Power Input"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Run Status"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Setpoint Temperature"}, {"site_name": "MAIN", "asssetnum": "Chiller 1", "sensor_name": "Chiller 1 Schedule"}]
-Thought 3: I now have the list of sensors for Chiller 1 at site MAIN.
-Action 3: Finish
-Action Input 3: Sensors are represented by their names. Here are five examples: Chiller 1 Supply Temperature, Chiller 1 Condenser Water Flow, Chiller 1 Liquid Refrigerant Evaporator Temperature, Chiller 1 Chiller % Loaded, Chiller 1 Chiller Efficiency.
-
-Question: How asset is described? Give an example.
-Thought 1: I need to get the description of an asset to answer the question.
-Action 1: assetdescription
-Action Input 1: site_name=MAIN, asssetnum=Chiller 6
-Observation 1: {"asssetnum": "Chiller 6", "asset_type": "Chiller", "asset_location": "Building 020", "site_name": "MAIN"}
-Thought 2: I now have the description of the asset Chiller 6 at site MAIN.
-Action 2: Finish
-Action Input 2: The asset Chiller 6 at site MAIN is described as follows: {"asssetnum": "Chiller 6", "asset_type": "Chiller", "asset_location": "Building 020", "site_name": "MAIN"}.
-
-Question: How asset location is described? use MAIN site for B012 SCWP 1 asset as an example.
-Thought 1: I need to get the asset description for B012 SCWP 1 at site MAIN to answer the question.
-Action 1: assetdescription
-Action Input 1: site_name=MAIN, asssetnum=B012 SCWP 1
-Observation 1: {"asssetnum": "B012 SCWP 1", "asset_type": "Pump", "asset_location": "Building 012", "site_name": "MAIN", "asset_id": "B012 SCWP 1"}
-Thought 2: I now have the asset description for B012 SCWP 1 at site MAIN.
-Action 2: Finish
-Action Input 2: The asset location for B012 SCWP 1 at site MAIN is described as "Building 012".
-"""
-
-SKYSPARK1Reflection = """
-
-
-Previous Trial:
-Question: Download sensor data for Chiller 6's Tonnage from last week at the MAIN site.
-Thought 1: To download sensor data for Chiller 6's Tonnage from last week at the MAIN site, I need to follow these steps: 1. Get the current time to determine the date range for last week. 2. Retrieve the asset ID for Chiller 6 at the MAIN site. 3. Get the sensor data for the Tonnage sensor for Chiller 6 within the specified date range.
-Action 1: currenttime
-Action Input 1: {}
-{"current_time": "2024-12-04T01:29:49.006112+00:00"}
-Observation 1: {"current_time": "2024-12-04T01:29:49.006112+00:00"}
-Thought 2: I have the current time. Now I need to calculate the date range for last week.
-Action 2: None
-Action Input 2: None
-Expecting value: line 1 column 1 (char 0)
-Observation 2: Taking action: Issue while parsing Action
-Thought 3: I have the current time. Now I need to calculate the date range for last week.
-Action 3: None
-Action Input 3: None
-Expecting value: line 1 column 1 (char 0)
-Observation 3: Taking action: Issue while parsing Action
-Thought 4: I have the current time. Now I need to calculate the date range for last week.
-Action 4: None
-Action Input 4: None
-Expecting value: line 1 column 1 (char 0)
-Observation 4: Taking action: Issue while parsing Action
-
-Reflection: I got stuck calculating the date range for last week, resulting in an invalid Action. I should have just figured out the date range from the current_time value
-
-"""
diff --git a/src/tmp/IoTAgent/bmstool.py b/src/tmp/IoTAgent/bmstool.py
deleted file mode 100644
index 98288921..00000000
--- a/src/tmp/IoTAgent/bmstool.py
+++ /dev/null
@@ -1,445 +0,0 @@
-import couchdb3
-import json
-import logging
-
-from typing import Optional, Type
-from typing import Optional
-from pydantic import BaseModel, Field
-import json
-from typing import Type
-from langchain_core.tools import BaseTool
-from iotagent.demo.tool import getTempFilename
-from datetime import datetime
-
-client = couchdb3.Server("http://admin:password@localhost:5984/")
-
-db = client.get("main")
-
-print(db)
-
-logger: logging.Logger = logging.getLogger(__name__)
-
-
-def custom_json(obj):
-
- if isinstance(obj, SitesMessage):
- return {
- # "sites": '[' + ', '.join(obj.sites) + ']',
- 'sites': obj.sites
- }
-
- if isinstance(obj, AssetMessage):
- return {
- 'site_name': obj.site_name,
- 'total_assets': obj.total_assets,
- 'file_path': obj.file_path,
- 'message': obj.message,
- }
-
- # if isinstance(obj, SensorDescription):
- # return {
- # 'sensor_name': obj.sensor_name,
- # 'name': obj.name,
- # }
-
- if isinstance(obj, SensorMessage):
- return {
- 'site_name': obj.site_name,
- 'assetnum': obj.assetnum,
- 'total_sensors': obj.total_sensors,
- # 'sensor_list': custom_json(obj.sensor_list),
- 'file_path': obj.file_path,
- 'message': obj.message
- }
-
- if isinstance(obj, HistoryMessage):
- return {
- 'site_name': obj.site_name,
- 'assetnum': obj.assetnum,
- 'total_observations': obj.total_observations,
- 'start': obj.start,
- 'final': obj.final,
- 'file_path': obj.file_path,
- 'message': obj.message,
- }
-
- raise TypeError(f"Cannot serialize object of {type(obj)}")
-
-SITES = [
- 'MAIN'
-]
-
-# ASSETS = [ 'Chiller 4' ]
-
-SENSORS = {
- 'CQPA AHU 1': [
- 'CQPA AHU 1 Pre Heating Temp',
- 'CQPA AHU 1 Cooling Valve %',
- 'CQPA AHU 1 Supply Relative Humidity Setpoint %',
- 'CQPA AHU 1 Preheat Valve %',
- 'CQPA AHU 1 Return Air Temperature',
- 'CQPA AHU 1 Supply Relative Humidity %',
- 'CQPA AHU 1 Zone Relative Humidity %',
- 'CQPA AHU 1 Occupied Command',
- 'CQPA AHU 1 Setpoint Temperature',
- 'CQPA AHU 1 Supply Fan Output %',
- 'CQPA AHU 1 Supply Air Temperature',
- 'CQPA AHU 1 Supply Fan Status',
- 'CQPA AHU 1 Power (Calc)',
- 'CQPA AHU 1 Schedule',
- 'CQPA AHU 1 Mixed Air Temperature',
- 'CQPA AHU 1 Humidifier Valve %',
- ],
- 'CQPA AHU 2B': [
- 'CQPA AHU 2B Cooling Valve %',
- 'CQPA AHU 2B Supply Fan Output %',
- 'CQPA AHU 2B Pre Heating Temp',
- 'CQPA AHU 2B Return Air Temperature',
- 'CQPA AHU 2B Mixed Air Temperature',
- 'CQPA AHU 2B Supply Air Temperature',
- 'CQPA AHU 2B Economizer Mode',
- 'CQPA AHU 2B Setpoint Temperature',
- 'CQPA AHU 2B Power (Calc)',
- 'CQPA AHU 2B Supply Fan Status',
- 'CQPA AHU 2B Supply Fan Current',
- 'CQPA AHU 2B Duct Static Pressure',
- 'CQPA AHU 2B Preheat Valve %',
- 'CQPA AHU 2B Schedule',
- 'CQPA AHU 2B Static Pressure Setpoint',
- 'CQPA AHU 2B Occupied Command',
- ],
- 'Chiller 4': [
- "Chiller 4 Liquid Refrigerant Evaporator Temperature",
- "Chiller 4 Condenser Water Supply To Chiller Temperature",
- "Chiller 4 Return Temperature",
- "Chiller 4 Setpoint Temperature",
- "Chiller 4 Chiller % Loaded",
- "Chiller 4 Supply Temperature",
- "Chiller 4 Condenser Water Flow",
- "Chiller 4 Tonnage",
- "Chiller 4 Power Input",
- "Chiller 4 Chiller Efficiency",
- ],
-
- 'Chiller 6': [
- 'Chiller 6 Condenser Water Return To Tower Temperature',
- 'Chiller 6 Chiller Efficiency',
- 'Chiller 6 Tonnage',
- 'Chiller 6 Supply Temperature',
- 'Chiller 6 Return Temperature',
- 'Chiller 6 Run Status',
- 'Chiller 6 Condenser Water Flow',
- 'Chiller 6 Schedule',
- 'Chiller 6 Power Input',
- 'Chiller 6 Chiller % Loaded',
- 'Chiller 6 Liquid Refrigerant Evaporator Temperature',
- 'Chiller 6 Setpoint Temperature',
- ],
-
- 'Chiller 9': [
- 'Chiller 9 Setpoint Temperature',
- 'Chiller 9 Supply Temperature',
- 'Chiller 9 Tonnage',
- 'Chiller 9 Run Status',
- 'Chiller 9 Return Temperature',
- 'Chiller Efficiency',
- 'Chiller 9 Schedule',
- 'Chiller 9 Power Input',
- 'Chiller 9 Chiller % Loaded',
- 'Chiller 9 Condenser Water Flow',
- 'Chiller 9 Liquid Refrigerant Evaporator Temperature',
- 'Chiller 9 Condenser Water Supply To Chiller Temperature',
- ],
-
- 'Chiller 3': [
- 'Chiller 3 Condenser Water Flow',
- 'Chiller 3 Chiller Efficiency',
- 'Chiller 3 Liquid Refrigerant Evaporator Temperature',
- 'Chiller 3 Run Status',
- 'Chiller 3 Tonnage',
- 'Chiller 3 Chiller % Loaded',
- 'Chiller 3 Supply Temperature',
- 'Chiller 3 Condenser Water Supply To Chiller Temperature',
- 'Chiller 3 Schedule',
- 'Chiller 3 Setpoint Temperature',
- 'Chiller 3 Power Input',
- 'Chiller 3 Return Temperature',
- ]
-}
-
-# class SensorDescription:
-# sensor_name: str
-# name: str
-
-class BMSAssetsInputs(BaseModel):
- site_name: str = Field(description="site or location of the assets")
-
-
-class BMSAssetDescriptionInputs(BaseModel):
- site_name: str = Field(description="site or location of the assets")
- asset_name: str = Field(description="name of asset (not Tag/Reference)")
-
-
-class AssetMessage:
- site_name: str
- total_assets: int
- file_path: str
- message: str
-
-
-class BMSAssets(BaseTool):
- """Tool to fetch assets at a given site from a BMS datastore"""
-
- name: str = "assets"
- description: str = (
- "Returns a list of assets for a given site. Each asset includes an id and a name."
- )
- args_schema: Type[BaseModel] = BMSAssetsInputs
- response_format: str = "JSON"
-
- def _run(self, site_name: str) -> str:
-
- if site_name not in SITES:
- raise ValueError(f'unknown site "{site_name}"')
-
- tmpfilename = getTempFilename()
-
- out = open(tmpfilename, "w")
- assets = list(SENSORS.keys())
- json.dump(assets, out)#, default=custom_json)
- out.close()
-
- total_assets = len(SENSORS)
- message = f"found {total_assets} assets for site_name {site_name}. file_path contains a JSON array of Asset data"
-
- retval = AssetMessage()
- retval.site_name = site_name
- retval.total_assets = total_assets
- retval.file_path = tmpfilename
- retval.message = message
-
- return json.dumps(retval, default=custom_json)
-
-
-class BMSAssetDescription(BaseTool):
- """Tool to fetch description such as type and location information of asset at a given site using asset name from a BMS datastore"""
-
- name: str = "assetdescription"
- description: str = (
- "Return an asset description for a given site and an asset. Asset description includes asset id, asset type and location information."
- )
- args_schema: Type[BaseModel] = BMSAssetDescriptionInputs
- response_format: str = "JSON"
-
- def _run(self, site_name: str, asset_name: str) -> str:
- # Fetch the asset description (assuming it returns a single AssetDescription object)
- asset_description: AssetDescription = self.BMS_functions.asset_description(
- site_name=site_name, asset_name=asset_name
- )
-
- # Convert the single asset description to JSON using custom serialization
- return json.dumps(asset_description, default=self.custom_json)
-
- def custom_json(self, obj):
- """Custom JSON serialization for complex objects like AssetDescription"""
- if isinstance(obj, AssetDescription):
- # Serialize the AssetDescription object into a dictionary
- return {
- "asset_name": obj.asset_name,
- "asset_type": obj.asset_type,
- "asset_location": obj.asset_location,
- "site_name": obj.site_name,
- "asset_id": obj.asset_id,
- # Add other fields if needed
- }
- raise TypeError(f"Type {type(obj)} not serializable")
-
-
-class SitesMessage:
- sites: str
-
-class BMSSites(BaseTool):
- """Tool to fetch sites from a BMS datastore"""
-
- name: str = "sites"
- description: str = "Retrieves a list of sites. Each site is represented by a name."
- response_format: str = "JSON"
-
- def _run(self, args: None = None) -> str:
-
- retval = SitesMessage()
- retval.sites = SITES
-
- return json.dumps(retval, default=custom_json)
-
-
-class BMSHistoryInputs(BaseModel):
- site_name: str = Field(description="site or location of the assets")
- assetnum: str = Field(
- description="Asset Number"
- )
- start: str = Field(description="start datetime to return")
- # sensor_name_list: Optional[str] = Field(
- # description="name of sensor(s) for which to return history. If None return all sensors"
- # )
- final: Optional[str] = Field(
- description="final datetime to return. If None, return only the values from the start datetime"
- )
-
-
-class HistoryMessage:
- site_name: str = Field(description="site or location of the assets")
- assetnum: str
- total_observations: int
- start: str
- final: str
- file_path: str = Field(description="path to file of sensor information")
- message: str = Field(description="response message")
-
-
-class BMSHistory(BaseTool):
- """Tool to return sensor history for an asset at a site"""
-
- name: str = "history"
- description: str = (
- "Returns a list of historical sensor values for the specified asset(s) at a site within a given time range (start to final)."
- )
- args_schema: Type[BaseModel] = BMSHistoryInputs
- response_format: str = "JSON"
-
- def _run(
- self,
- site_name: str,
- assetnum: str,
- start: str,
- final: Optional[str] = None,
- # sensor_name_list: Optional[str] = None,
- ) -> str:
-
- if assetnum not in SENSORS:
- raise ValueError(f'no such assetnum {assetnum}')
-
- selector = {
- "asset_id": assetnum,
- "timestamp": {
- "$gte": datetime.fromisoformat(start).isoformat()
- }
- }
-
- if final is not None:
- selector["timestamp"]["$lt"] = datetime.fromisoformat(final).isoformat()
-
- strt = datetime.fromisoformat(start).isoformat()
- fin = datetime.fromisoformat(final).isoformat()
-
- if strt >= fin:
- raise ValueError('start >= final')
-
- # fields = None
- # assetSensorProps = SENSORS[assetnum]
- # if sensor_name_list is not None:
- # fields = ['_id']
- # sensorList = sensor_name_list.split(',')
-
- # for nonTrimmedSensor in sensorList:
- # sensor = nonTrimmedSensor.strip()
-
- # property = None
- # for entry in assetSensorProps:
- # if entry['sensor_name'] == sensor:
- # property = entry['sensor_name']
- # break
-
- # if property is None:
- # raise ValueError(f'sensor {sensor} not found on assetnum {assetnum}')
-
- # fields.append(property)
-
- # print('selecting:\n', json.dumps(selector, indent=2))
- # print(f'fields = {fields}')
-
- logger.critical(f'******* selector ={json.dumps(selector, indent=2)}')
- res = db.find(selector, limit=100000, sort=[{"asset_id": "asc"}, {"timestamp": "asc"}])
- # limit: int = 25, skip: int = 0, sort: List[Dict] = None, fields: List[str] = None, use_index: Union[str, List[str]] = None, conflicts: bool = False, r: int = 1, bookmark: str = None, update: bool = True, stable: bool = None, execution_stats: bool = False, partition: str = None) ‑> Dict
-
- # print('res =', json.dumps(res, indent=2))
-
-
- tmpfilename = getTempFilename()
-
- docs = res['docs']
- # for doc in docs:
- # doc['timestamp'] = doc['_id']
-
- out = open(tmpfilename, "w")
- json.dump(docs, out)
- out.close()
-
- total_observations = len(docs)
- message = f"found {total_observations} observations. file_path contains a JSON array of Observation data"
-
- retval = HistoryMessage()
- retval.site_name = site_name
- retval.assetnum = assetnum
- retval.total_observations = total_observations
- retval.start = start
- retval.final = final
- # retval.sensor_name = sensor_name
- retval.file_path = tmpfilename
- retval.message = message
-
- return json.dumps(retval, default=custom_json)
-
-
-class SensorMessage:
- site_name: str = Field(description="site or location of the assets")
- assetnum: str = Field(description="asset number")
- total_sensors: int = (Field(description="total number of sensoors"),)
- file_path: str = Field(description="path to file of sensor information")
- message: str = Field(description="response message")
-
-
-class BMSSensorsInputs(BaseModel):
- site_name: str = Field(description="site or location of the assets")
- assetnum: str = Field(description="assetnum of asset")
-
-
-class BMSSensors(BaseTool):
- """given a site and asset, list available metadata for the asset"""
-
- name: str = "sensors"
- description: str = (
- "Lists the sensors available for a specified asset at a given site."
- )
- args_schema: Type[BaseModel] = BMSSensorsInputs
- response_format: str = "json"
-
- def _run(self, site_name: str, assetnum: str) -> str:
-
- if site_name not in SITES:
- raise ValueError(f'unknown site "{site_name}"')
-
- if assetnum not in SENSORS:
- raise ValueError(f'unknown assetnum "{assetnum}"')
-
- sensors = SENSORS[assetnum]
-
- tmpfilename = getTempFilename()
-
- out = open(tmpfilename, "w")
- json.dump(sensors, out, default=custom_json)
- out.close()
-
- total_sensors = len(sensors)
- message = f"found {total_sensors} sensors for assetnum {assetnum} and site_name {site_name}. file_path contains a JSON array of Sensor data"
-
- retval = SensorMessage()
- retval.site_name = site_name
- retval.assetnum = assetnum
- retval.total_sensors = total_sensors
- retval.file_path = tmpfilename
- retval.message = message
-
- out = json.dumps(retval, default=custom_json)
-
- return out
diff --git a/src/tmp/IoTAgent/gettools.py b/src/tmp/IoTAgent/gettools.py
deleted file mode 100644
index 6a63fc52..00000000
--- a/src/tmp/IoTAgent/gettools.py
+++ /dev/null
@@ -1,30 +0,0 @@
-from IoTAgent.jsonreader import JSONReader
-from IoTAgent.jsonreader.jsonwrapper import JSONWrapperFunctions
-from IoTAgent.time.timewrapper import TimeWrapperFunctions
-from IoTAgent.time.timetool import CurrentTimeJSONISO
-from IoTAgent.bmstool import BMSHistory, BMSSites, BMSAssets, BMSSensors
-
-
-def getTools():
-
-    bmsSites = BMSSites()
-    bmsHistory = BMSHistory()
-    bmsAssets = BMSAssets()
-    bmsSensors = BMSSensors()
-
-    jsonWrapper = JSONWrapperFunctions()
-    jsonReaderTool = JSONReader(functions=jsonWrapper)
-
-    fns = TimeWrapperFunctions()
-    currenttimejsoniso = CurrentTimeJSONISO(functions=fns)
-
-    tools = [
-        bmsSites,
-        bmsHistory,
-        bmsAssets,
-        bmsSensors,
-        jsonReaderTool,
-        currenttimejsoniso,
-    ]
-
-    return tools
diff --git a/src/tmp/IoTAgent/jsonreader/jsonreader.py b/src/tmp/IoTAgent/jsonreader/jsonreader.py
deleted file mode 100644
index 67ed16cc..00000000
--- a/src/tmp/IoTAgent/jsonreader/jsonreader.py
+++ /dev/null
@@ -1,122 +0,0 @@
-from typing import Type
-from langchain_core.tools import BaseTool
-from reactxen.tools.jsonreader.jsonwrapper import JSONWrapperFunctions, JSONPropertyReturn
-from pydantic import BaseModel, Field
-import json
-import tempfile
-from pathlib import Path, PurePath
-from uuid import uuid4
-
-def getTempFilename():
-    tmpdir = tempfile.gettempdir()
-    tmppath = Path(tmpdir)
-    basepath = Path("cbmdir")
-    filename = str(uuid4())
-
-    tmpdirpath = PurePath.joinpath(tmppath, basepath)
-
-    tmpdirpath.mkdir(exist_ok=True)
-
-    filepath = PurePath.joinpath(tmpdirpath, Path(filename + ".json"))
-
-    return str(filepath)
-
-class JSONReaderInputs(BaseModel):
-    file_name: str = Field(description="name of JSON input file")
-
-
-class JSONReader(BaseTool):
-    """Tool to read and parse a JSON file"""
-
-    name: str = "jsonreader"
-    description: str = (
-        "Reads a JSON file, parses its content, and returns the parsed data."
-    )
-    args_schema: Type[BaseModel] = JSONReaderInputs
-    response_format: str = "JSON"
-    functions: JSONWrapperFunctions
-
-    def _run(self, file_name: str) -> str:
-        contents: any = self.functions.readFile(file_name)
-
-        jsonContents = json.dumps(contents)
-
-        return jsonContents
-
-class JSONProperties(BaseTool):
-    """Tool to list the properties found in a JSON file"""
-
-    name: str = "jsonproperties"
-    description: str = (
-        "Reads a JSON file, and returns a list of the properties contained in the JSON."
-    )
-    args_schema: Type[BaseModel] = JSONReaderInputs
-    response_format: str = "JSON"
-    functions: JSONWrapperFunctions
-
-    def _run(self, file_name: str) -> str:
-        contents: any = self.functions.properties(file_name)
-
-        jsonContents = json.dumps(contents, default=custom_json)
-
-        return jsonContents
-
-
-def custom_json(obj):
-
-    if isinstance(obj, JSONFileMergeMessage):
-
-        return {
-            "file_name_1": obj.file_name_1,
-            "file_name_2": obj.file_name_2,
-            "file_path": obj.file_path,
-            "message": obj.message,
-        }
-
-    if isinstance(obj, JSONPropertyReturn):
-        return {
-            "properties": ", ".join(obj.properties)
-        }
-
-
-class JSONFileMergeInputs(BaseModel):
-    file_name_1: str = Field(description="name of first JSON input file")
-    file_name_2: str = Field(description="name of second JSON input file")
-
-
-class JSONFileMergeMessage:
-    file_name_1: str
-    file_name_2: str
-    file_path: str
-    message: str
-
-
-class JSONFileMerge(BaseTool):
-    """Tool to merge two JSON files"""
-
-    name: str = "jsonfilemerge"
-    description: str = (
-        "Merges the contents of two JSON files and returns the combined result."
-    )
-    args_schema: Type[BaseModel] = JSONFileMergeInputs
-    response_format: str = "JSON"
-    functions: JSONWrapperFunctions
-
-    def _run(self, file_name_1: str, file_name_2: str) -> str:
-        contents: any = self.functions.mergeFile(file_name_1, file_name_2)
-
-        tmpfilename = getTempFilename()
-
-        # Write the merged contents to the temp file; the context manager closes it.
-        with open(tmpfilename, "w") as out:
-            json.dump(contents, out, default=None)
-
-        message = f"{file_name_1} and {file_name_2} have been merged into {tmpfilename}"
-
-        retval = JSONFileMergeMessage()
-        retval.file_name_1 = file_name_1
-        retval.file_name_2 = file_name_2
-        retval.file_path = tmpfilename
-        retval.message = message
-
-        return json.dumps(retval, default=custom_json)
diff --git a/src/tmp/IoTAgent/jsonreader/jsonwrapper.py b/src/tmp/IoTAgent/jsonreader/jsonwrapper.py
deleted file mode 100644
index ca2763d0..00000000
--- a/src/tmp/IoTAgent/jsonreader/jsonwrapper.py
+++ /dev/null
@@ -1,90 +0,0 @@
-import json
-from pydantic import Field
-from typing import List
-
-class JSONPropertyReturn:
-    properties: List[str] = Field(description="names of JSON properties")
-
-class JSONWrapperFunctions:
-    def readFile(self, fileName: str) -> any:
-        """given a fileName, read the file and parse as JSON
-        Args:
-            fileName: a file name
-        Returns:
-            the parsed data structure
-        """
-
-        # Parse the file as JSON; the context manager closes the handle.
-        with open(fileName, 'r') as fp:
-            retval = json.load(fp)
-
-        return retval
-
-    def properties(self, fileName: str) -> JSONPropertyReturn:
-        """Reads a JSON file, and returns a list of the properties contained in the JSON.
-        Args:
-            fileName: a file name
-        Returns:
-            the properties of the JSON
-        """
-
-        # Parse the file as JSON; the context manager closes the handle.
-        with open(fileName, 'r') as fp:
-            obj = json.load(fp)
-
-        if not isinstance(obj, list):
-            raise ValueError('JSON file does not represent a list')
-
-        props = set()
-        for line in obj:
-            if not isinstance(line, dict):
-                raise ValueError('JSON list entry does not represent a dict')
-
-            props.update(line.keys())
-
-        retval = JSONPropertyReturn()
-        retval.properties = list(props)
-
-        return retval
-
-    def mergeFile(self, file_name_1: str, file_name_2: str) -> any:
-        """given two file names, merge their JSON lists when the keys match
-        Args:
-            file_name_1, file_name_2: names of the JSON files to merge
-        Returns:
-            the concatenated list, or an error string if the files are incompatible
-        """
-
-        # Load the first file; the context manager closes the handle.
-        with open(file_name_1, 'r') as fp1:
-            file1contents = json.load(fp1)
-
-        # The second file is parsed the same way.
-        with open(file_name_2, 'r') as fp2:
-            file2contents = json.load(fp2)
-
-        if not isinstance(file1contents, list):
-            return 'file1 is not a list'
-
-        if not isinstance(file2contents, list):
-            return 'file2 is not a list'
-
-        file1Item = file1contents[0]
-        file2Item = file2contents[0]
-
-        if not isinstance(file1Item, dict):
-            return 'file1 entries are not dicts; unable to compare with file2'
-        if not isinstance(file2Item, dict):
-            return 'file2 entries are not dicts; unable to compare with file1'
-
-        file1KeysSorted = sorted(list(file1Item))
-        file2KeysSorted = sorted(list(file2Item))
-
-        if len(file1KeysSorted) != len(file2KeysSorted):
-            return 'files have a different number of keys'
-
-        for i in range(len(file1KeysSorted)):
-            if file1KeysSorted[i] != file2KeysSorted[i]:
-                return 'file1 and file2 keys differ'
-
-        return file1contents + file2contents
diff --git a/src/tmp/IoTAgent/time/timetool.py b/src/tmp/IoTAgent/time/timetool.py
deleted file mode 100644
index 8ab78a35..00000000
--- a/src/tmp/IoTAgent/time/timetool.py
+++ /dev/null
@@ -1,59 +0,0 @@
-from langchain_core.tools import BaseTool
-from reactxen.tools.time.timewrapper import TimeWrapperFunctions
-
-import json
-import pendulum
-
-class CurrentTimeJSONISO(BaseTool):
-    """Tool to return current date time"""
-
-    # """Tool to return current time in JSON ISO format"""
-
-    name: str = "currentdatetime"
-    description: str = "Provides the current date time as a JSON object."
-    # name: str = "currenttimejsoniso"
-    # description: str = "return the current time in ISO format as a JSON object"
-    response_format: str = "JSON"
-    functions: TimeWrapperFunctions
-
-    def _run(self, args: None = None) -> str:
-
-        nowISO: str = self.functions.currentTime()
-        nowDate: str = nowISO.split('T')[0]
-        nowTime: str = nowISO.split('T')[1].split('.')[0]
-
-        # Create a descriptive string with the current date and time
-        description: str = f"Today's date is {nowDate} and time is {nowTime}."
-
-        obj = {
-            'currentDateTime': nowISO,
-            'currentDateTimeDescription': description
-        }
-
-        retval = json.dumps(obj)
-
-        return retval
-
-class CurrentTimeEnglish(BaseTool):
-    """Tool to return current time in English"""
-
-    name: str = "currenttimeenglish"
-    description: str = "return the current time in English text"
-    response_format: str = "content_and_artifact"
-    functions: TimeWrapperFunctions
-
-    def _run(self, args: None = None) -> tuple[str, str]:
-
-        nowISO: str = self.functions.currentTime()
-
-        obj = {
-            'current_time': nowISO
-        }
-
-        retval = json.dumps(obj)
-
-        dt = pendulum.parse(nowISO)
-
-        eng = dt.to_datetime_string()
-
-        return (eng, retval)
diff --git a/src/tmp/IoTAgent/time/timewrapper.py b/src/tmp/IoTAgent/time/timewrapper.py
deleted file mode 100644
index ca07ed55..00000000
--- a/src/tmp/IoTAgent/time/timewrapper.py
+++ /dev/null
@@ -1,15 +0,0 @@
-# Only the datetime class and timezone are used below.
-from datetime import datetime, timezone
-
-class TimeWrapperFunctions:
-    def currentTime(self) -> str:
-        """Return the current time in ISO format
-        Args: None
-        Returns:
-            the current time UTC as an ISO-formatted string
-        """
-        now = datetime.now(timezone.utc)
-
-        nowISO = now.isoformat()
-
-        return nowISO
diff --git a/src/tmp/TrajFM/README.md b/src/tmp/TrajFM/README.md
deleted file mode 100644
index 6feef90c..00000000
--- a/src/tmp/TrajFM/README.md
+++ /dev/null
@@ -1,199 +0,0 @@
-# TrajFM: LLM Trajectory Failure Mode Analysis
-
-A pipeline for analyzing LLM agent trajectories to identify and categorize failure modes using LLM-based analysis and clustering.
-
-> 📖 **For detailed documentation**, see [README_detail.md](README_detail.md)
-
-## 🚀 Quick Start
-
-### Installation
-
-```bash
-pip install pandas numpy scikit-learn sentence-transformers
-pip install plotly kaleido # For visualization
-```
-
-### Basic Usage
-
-```bash
-# Run the complete pipeline
-python failure_mode_extractor.py \
-  --traj_root_base /path/to/trajectories \
-  --model_id 18 \
-  --summary_dir summary_codabench
-
-# Generate visualizations
-python plot_failure_mode.py
-```
-
-### Python API
-
-```python
-from failure_mode_pipeline import run_failure_mode_pipeline
-
-results = run_failure_mode_pipeline(
-    traj_root_base="/path/to/trajectories",
-    model_id=18,
-    summary_dir="summary_codabench"
-)
-```
-
-## 📁 Project Structure
-
-```
-TrajFM/
-├── failure_mode_extractor.py # Main CLI entry point
-├── failure_mode_generator.py # LLM-based trajectory analysis
-├── failure_mode_reduction.py # Clustering and categorization
-├── failure_mode_pipeline.py # Simplified pipeline wrapper
-├── plot_failure_mode.py # Visualization generation
-├── prompt.py # LLM system prompt
-├── utils.py # Helper functions
-└── README_detail.md # Detailed documentation
-```
-
-## 📊 Input Format
-
-Trajectories should be JSON files with this structure:
-
-```json
-{
-  "text": "Question or task description",
-  "trajectory": [
-    {
-      "task_description": "What the agent is thinking",
-      "agent_name": "AgentName",
-      "response": "Agent's response or observation"
-    }
-  ],
-  "final_answer": "The final answer (optional)"
-}
-```
-
-## 📈 Output Files
-
-1. **`processed_trajectories/combined_m{model_id}_db.pkl`** - Analyzed trajectories with failure modes
-2. **`summary_codabench/addtional_fm.csv`** - Raw additional failure modes
-3. **`summary_codabench/additional_fm_clustered.csv`** - Clustered failure modes
-4. **`failure_modes_sunburst.html`** - Interactive visualization
-5. **`failure_modes_sunburst.png`** - Static visualization
-
-## 🔍 What It Detects
-
-### 14 Predefined Failure Modes
-
-**Task Execution Issues:**
-- Disobey Task Specification / Disobey Role Specification (two separate modes)
-- Step Repetition
-- Loss of Conversation History
-- Unaware of Termination Conditions
-
-**Communication Issues:**
-- Conversation Reset
-- Fail to Ask for Clarification
-- Task Derailment
-- Information Withholding
-- Ignored Other Agent's Input
-- Action-Reasoning Mismatch
-
-**Verification Issues:**
-- Premature Termination
-- No or Incorrect Verification
-- Weak Verification
-
-### Additional Failure Modes
-
-The system also identifies and clusters **additional failure modes** not in the predefined list, such as:
-- Inadequate Error Handling
-- Lack of Final Answer
-- Insufficient File Format Support
-- Inconsistent Data Retrieval
-
-## 🎨 Visualization
-
-The `plot_failure_mode.py` script creates interactive sunburst charts showing:
-- Hierarchical view of failure modes
-- Cluster distributions
-- Interactive drill-down capabilities
-
-**Note:** The script includes sample data. To visualize your results, modify it to load from `summary_codabench/additional_fm_clustered.csv`.
-
-## ⚙️ Configuration
-
-### CLI Arguments
-
-```bash
---traj_root_base # Root directory containing trajectory JSON files
---model_id # Model identifier (default: 18)
---summary_dir # Output directory for CSV files (default: summary)
---k # Number of clusters (optional, auto-determined if not set)
-```
-
-### Clustering Parameters
-
-```python
-# In failure_mode_reduction.py
-model_name = "all-MiniLM-L6-v2" # Sentence transformer model
-k = None # Auto-determine optimal clusters (or set fixed number)
-```
-
-## 🔧 Customization
-
-### Using a Different LLM Provider
-
-Modify `get_llm_answer_from_json()` in `utils.py` so that it calls your LLM API instead of `watsonx_llm()`:
-
-```python
-def get_llm_answer_from_json(data: dict, model_id: int) -> str:
-    # Replace with your LLM API call
-    response = your_llm_api(prompt=formatted_prompt)
-    return response
-```
-
-### Adding New Predefined Failure Modes
-
-Edit `prompt.py` to add new failure mode definitions to the `system_prompt`.
-
-## 📝 Example Workflow
-
-```python
-# 1. Generate failure mode analysis
-from failure_mode_generator import process_trajectories
-
-gen_results = process_trajectories(
-    traj_root_base="/path/to/trajectories",
-    model_id=18
-)
-
-# 2. Cluster additional failure modes
-from failure_mode_reduction import failure_mode_reduction
-
-red_results = failure_mode_reduction(
-    combined_pickle_path=gen_results['combined_path'],
-    out_dir="summary_codabench"
-)
-
-# 3. Visualize results
-# Modify plot_failure_mode.py to load your CSV and run:
-# python plot_failure_mode.py
-```
-
-## 🧪 Testing
-
-```bash
-# Test trajectory processing
-python failure_mode_generator_test.py
-
-# Test clustering
-python failure_mode_reduction_test.py
-```
-
-## 📚 Documentation
-
-- **[README_detail.md](README_detail.md)** - Complete documentation with:
-  - Detailed function references
-  - Step-by-step examples
-  - Troubleshooting guide
-  - Advanced customization options
-  - Complete API documentation
-
diff --git a/src/tmp/TrajFM/README_detail.md b/src/tmp/TrajFM/README_detail.md
deleted file mode 100644
index fd3bc7d4..00000000
--- a/src/tmp/TrajFM/README_detail.md
+++ /dev/null
@@ -1,1259 +0,0 @@
-# TrajFM: LLM Trajectory Failure Mode Analysis
-
-A comprehensive Python pipeline for analyzing Large Language Model (LLM) execution trajectories to automatically identify, categorize, and cluster failure modes in multi-agent systems.
-
-## 📋 Overview
-
-This tool uses LLM-based analysis to detect 14 predefined failure modes and discover additional custom failure patterns in agent execution traces. It then applies machine learning clustering to categorize similar failure modes, providing actionable insights for debugging and quality assurance.
-
-## 🎯 Key Features
-
-- **Automated Failure Detection**: Analyzes trajectory JSON files using LLM to identify 14 predefined failure modes
-- **Custom Failure Discovery**: LLM suggests additional failure modes specific to your system
-- **Intelligent Clustering**: Groups similar failure modes using sentence embeddings and K-Means
-- **Visualization**: Generate interactive sunburst and treemap charts
-- **Scalable Processing**: Handles multiple timestamps and large trajectory datasets
-
-## 📁 Project Structure
-
-```
-TrajFM/
-├── failure_mode_extractor.py # Main CLI entry point
-├── failure_mode_generator.py # Step 1: LLM-based trajectory analysis
-├── failure_mode_reduction.py # Step 2: Clustering and categorization
-├── failure_mode_pipeline.py # Simplified pipeline wrapper
-├── prompt.py # LLM system prompt with failure mode definitions
-├── utils.py # Helper functions for LLM calls and JSON parsing
-├── plot_failure_mode.py # Visualization generation
-├── failure_mode_generator_test.py # Unit tests for generator
-├── failure_mode_reduction_test.py # Unit tests for reduction
-├── processed_trajectories/ # Intermediate pickle files (generated)
-│   ├── 1_m12_db.pkl # Per-timestamp analysis results
-│   └── combined_m12_db.pkl # Combined analysis across all timestamps
-└── summary_codabench/ # Final output CSVs (generated)
-    ├── addtional_fm.csv # Raw additional failure modes
-    └── additional_fm_clustered.csv # Clustered failure modes
-```
-
-## 📚 File Functions Reference
-
-### Core Pipeline Files
-
-| File | Type | Key Functions | Purpose |
-|------|------|---------------|---------|
-| `failure_mode_extractor.py` | **Entry Point** | `main()` | CLI interface for running the complete pipeline. Parses arguments and orchestrates generation + reduction steps. |
-| `failure_mode_generator.py` | Module | `process_trajectories()`, `_load_all_json_files()`, `_normalize_additional_failure_modes()` | Loads trajectory JSON files, sends them to LLM for analysis, extracts failure modes, saves results as pickles. |
-| `failure_mode_reduction.py` | Module | `failure_mode_reduction()` | Loads pickles, extracts additional failure modes, generates embeddings, performs K-Means clustering, assigns representative labels, exports CSVs. |
-| `failure_mode_pipeline.py` | Wrapper | `run_failure_mode_pipeline()` | Simplified API that runs both generation and reduction steps sequentially. Returns combined results. |
-| `utils.py` | Utilities | `get_llm_answer_from_json()`, `extract_json_from_response()` | Formats trajectory data for LLM prompts, calls watsonx API, parses JSON responses (handles markdown formatting). |
-| `prompt.py` | Configuration | `system_prompt` (string) | Contains the complete system prompt with failure mode definitions, examples, and JSON output format instructions for the LLM. |
-| `plot_failure_mode.py` | Visualization | Script execution | Generates interactive hierarchical visualizations (sunburst charts) using Plotly to display failure mode distributions across clusters. |
-
-### Test Files
-
-| File | Purpose |
-|------|---------|
-| `failure_mode_generator_test.py` | Unit tests for trajectory processing and LLM analysis functions |
-| `failure_mode_reduction_test.py` | Unit tests for clustering and reduction algorithms |
-
-### Function Details
-
-#### `failure_mode_generator.py`
-
-```python
-def process_trajectories(
-    timestamps: Optional[Sequence[str]] = None,
-    traj_root_base: str = ".",
-    model_id: int = 18,
-    out_dir: str = "processed_trajectories"
-) -> Dict[str, Any]
-```
-**Returns**: Dictionary with keys:
-- `per_timestamp_paths`: List of individual pickle file paths
-- `combined_path`: Path to combined pickle file
-- `combined_df`: Pandas DataFrame with all results
-
-**Process**:
-1. Auto-discovers subdirectories if `timestamps=None`
-2. Loads all JSON files recursively using `_load_all_json_files()`
-3. For each trajectory:
-   - Formats as prompt using `get_llm_answer_from_json()`
-   - Sends to LLM for analysis
-   - Extracts failure modes from JSON response
-   - Stores in DataFrame row
-4. Saves per-timestamp and combined pickle files
-
-```python
-def _load_all_json_files(root_path: str) -> Dict[str, Any]
-```
-**Returns**: Dictionary mapping file paths to parsed JSON content
-**Process**: Recursively walks directory tree, attempts to load each file as JSON
-
-```python
-def _normalize_additional_failure_modes(obj: Any) -> List[Dict[str, Any]]
-```
-**Returns**: List of dicts with `title` and `description` keys
-**Process**: Handles various formats (list, dict, single object) and normalizes to standard format
-
-#### `failure_mode_reduction.py`
-
-```python
-def failure_mode_reduction(
-    combined_pickle_path: str,
-    out_dir: str = "summary",
-    model_name: str = "all-MiniLM-L6-v2",
-    k: Optional[int] = None,
-    k_min: int = 2,
-    k_max: int = 7,
-    verbose: bool = True
-) -> Dict[str, Any]
-```
-**Returns**: Dictionary with keys:
-- `df_expanded`: Tidy DataFrame with title/description columns
-- `df_clustered`: Clustered DataFrame with cluster/failure mode labels
-- `k`: Number of clusters used
-- `silhouette_scores`: List of (k, score) tuples if auto-selected
-- `paths`: Dictionary with CSV file paths
-
-**Process**:
-1. Loads combined pickle file
-2. Explodes `addi_fm_list` column into separate rows
-3. Normalizes to title/description format
-4. Saves raw CSV (`addtional_fm.csv`)
-5. Generates sentence embeddings using SentenceTransformer
-6. If `k=None`: Tests k_min to k_max clusters, selects best via silhouette score
-7. Performs K-Means clustering
-8. Assigns representative title (closest to centroid) to each cluster
-9. Saves clustered CSV (`additional_fm_clustered.csv`)
-
-#### `utils.py`
-
-```python
-def get_llm_answer_from_json(data: dict, model_id: int) -> str
-```
-**Parameters**:
-- `data`: Dict with keys `text` (question), `trajectory` (list of steps)
-- `model_id`: LLM model identifier
-
-**Returns**: Raw LLM response text
-
-**Process**:
-1. Extracts question, trajectory steps, and final answer
-2. Formats each step as "Thought X / Action X / Observation X"
-3. Combines into formatted trace string
-4. Inserts into system prompt template
-5. Calls `watsonx_llm()` API
-6. Returns response
-
-```python
-def extract_json_from_response(response_text: str) -> dict
-```
-**Returns**: Parsed JSON dictionary
-
-**Process**:
-1. Searches for JSON in markdown code fences (```json...```)
-2. Falls back to finding first {...} block
-3. Parses and returns JSON object
-4. Raises ValueError if no valid JSON found
-
-#### `failure_mode_pipeline.py`
-
-```python
-def run_failure_mode_pipeline(
-    traj_root_base: str,
-    model_id: int = 18,
-    timestamps: Optional[Sequence[str]] = None,
-    summary_dir: str = "summary",
-    model_name: str = "all-MiniLM-L6-v2",
-    k: Optional[int] = None
-) -> Dict[str, Any]
-```
-**Returns**: Dictionary with keys:
-- `generation`: Results from `process_trajectories()`
-- `reduction`: Results from `failure_mode_reduction()`
-
-**Process**:
-1. Calls `process_trajectories()` with provided parameters
-2. Calls `failure_mode_reduction()` using combined pickle from step 1
-3. Returns both result dictionaries
-
-#### `plot_failure_mode.py`
-
-**Type**: Standalone visualization script (not imported as module)
-
-**Purpose**: Creates interactive hierarchical visualizations of failure mode distributions using Plotly sunburst charts.
-
-**Data Structure**:
-The script contains hardcoded sample data with three hierarchical levels:
-1. **Cluster** (0-6): Top-level grouping of related failure modes
-2. **Failure Mode**: General category (e.g., "Inadequate Error Handling", "Lack of Final Answer")
-3. **Title**: Specific failure mode instance with detailed description
-
-**Sample Data Included**:
-- 24 failure mode examples across 7 clusters
-- Categories include:
-  - Inadequate Error Handling
-  - Insufficient File Format Support
-  - Lack of Final Answer
-  - Lack of Adaptive Learning
-  - Inconsistent Data Retrieval
-  - Insufficient Feedback
-  - Inadequate Finalization
-
-**Outputs**:
-1. `failure_modes_sunburst.html` - Interactive HTML visualization with hover details
-2. `failure_modes_sunburst.png` - Static high-resolution PNG (scale=2)
-
-**Visualization Features**:
-- Color-coded by cluster using Viridis color scale
-- Shows label + percentage for each segment
-- Hover displays: label, value, and percentage of parent
-- Interactive drill-down through hierarchy levels
-
-**Prerequisites**:
-```bash
-pip install plotly kaleido # kaleido required for PNG export
-```
-
-**Usage**:
-```bash
-python plot_failure_mode.py
-```
-
-**Note**: This script uses **sample/demonstration data** hardcoded in the file. To visualize your own pipeline results, you would need to:
-1. Load data from `summary_codabench/additional_fm_clustered.csv`
-2. Transform it into the required format (cluster, failure_mode, title columns)
-3. Replace the `data` dictionary in the script
-
-**Customization Example**:
-```python
-import pandas as pd
-import plotly.express as px
-
-# Load your clustered results
-df = pd.read_csv('summary_codabench/additional_fm_clustered.csv')
-
-# Create sunburst (the clustered CSV header is "failure mode"; rename it or adjust `path` to match)
-fig = px.sunburst(
-    df,
-    path=['cluster', 'failure_mode', 'title'],
-    color='cluster',
-    color_continuous_scale='Viridis',
-    title="Your Failure Modes Analysis"
-)
-fig.write_html("your_results.html")
-```
-
-## 🚀 Quick Start
-
-### Prerequisites
-
-```bash
-pip install pandas numpy scikit-learn sentence-transformers plotly kaleido
-```
-
-**Note**: You'll need access to IBM watsonx or modify `utils.py` to use your preferred LLM API.
-
-### Entry Points
-
-There are **three ways** to run the pipeline:
-
-#### 1. **Main Entry Point: Command Line Interface** (Recommended)
-
-The primary entry point is `failure_mode_extractor.py`, which provides a complete CLI:
-
-```bash
-python failure_mode_extractor.py \
-  --traj_directory ./path/to/trajectories \
-  --model_id 18 \
-  --summary_dir summary \
-  --model_name all-MiniLM-L6-v2 \
-  --k 5
-```
-
-**What it does**:
-1. Reads all JSON trajectory files from `--traj_directory`
-2. Analyzes each trajectory using LLM (model specified by `--model_id`)
-3. Saves intermediate results to `processed_trajectories/`
-4. Clusters additional failure modes
-5. Exports final CSVs to `--summary_dir`
-
-#### 2. **Python API: Pipeline Wrapper**
-
-For programmatic access, use `failure_mode_pipeline.py`:
-
-```python
-from failure_mode_pipeline import run_failure_mode_pipeline
-
-results = run_failure_mode_pipeline(
-    traj_root_base="./path/to/trajectories",
-    model_id=18,
-    timestamps=None,  # Auto-discover subdirectories
-    summary_dir="summary",
-    model_name="all-MiniLM-L6-v2",
-    k=None  # Auto-select optimal K using silhouette scores
-)
-
-# Access generation results
-print(f"Analyzed {len(results['generation']['combined_df'])} trajectories")
-print(results["generation"]["combined_df"].head())
-
-# Access clustering results
-print(f"Found {results['reduction']['k']} clusters")
-print(results["reduction"]["df_clustered"].head())
-```
-
-#### 3. **Step-by-Step: Individual Modules**
-
-For fine-grained control, call modules separately:
-
-```python
-# Step 1: Generate failure mode analysis
-from failure_mode_generator import process_trajectories
-
-gen_results = process_trajectories(
-    traj_root_base="./my_trajectories",
-    model_id=18,
-    out_dir="processed_trajectories"
-)
-
-# Step 2: Cluster additional failure modes
-from failure_mode_reduction import failure_mode_reduction
-
-red_results = failure_mode_reduction(
-    combined_pickle_path=gen_results["combined_path"],
-    out_dir="summary",
-    k=None  # Auto-select
-)
-```
-
-### How Trajectories Are Read
-
-The pipeline automatically discovers and reads trajectory files:
-
-```python
-# In failure_mode_generator.py
-def _load_all_json_files(root_path: str) -> Dict[str, Any]:
-    """
-    Recursively walks the directory tree starting from root_path.
-    Attempts to load each file as JSON.
-    Returns a dictionary mapping file_path -> parsed_json_content
-    """
-```
-
-**Directory Structure Expected**:
-```
-traj_root_base/
-├── trajectory_file_1.json
-├── trajectory_file_2.json
-├── subdirectory/
-│   ├── trajectory_file_3.json
-│   └── trajectory_file_4.json
-└── another_subdir/
-    └── trajectory_file_5.json
-```
-
-**Reading Process**:
-1. `os.walk()` recursively traverses all subdirectories
-2. Each file is opened and parsed as JSON
-3. Files that fail JSON parsing are silently skipped
-4. File path is used to extract metadata (e.g., `ut_id` from filename)
-5. All successfully loaded trajectories are processed
-
-**Filename Convention**:
-- Files are typically named like: `{ut_id}_{other_info}.json`
-- The `ut_id` (unit test ID) is extracted from the filename using: `parts = os.path.relpath(path, root_directory).split('_')`
-
-## 📊 Input Format
-
-### Trajectory JSON Structure
-
-Your trajectory files **must** be JSON with the following structure:
-
-```json
-{
-  "text": "User's question or task description",
-  "trajectory": [
-    {
-      "task_description": "Agent's thought process",
-      "agent_name": "Action taken",
-      "response": "Observation or result"
-    }
-  ]
-}
-```
-
-### Required Fields
-
-| Field | Type | Description | Example |
-|-------|------|-------------|---------|
-| `text` | string | The original user question or task | `"Download sensor data for equipment CWC04009"` |
-| `trajectory` | array | List of execution steps (can be empty) | `[{...}, {...}]` |
-| `trajectory[].task_description` | string | Agent's reasoning or thought | `"I need to retrieve the sensor data"` |
-| `trajectory[].agent_name` | string | Name of agent or action taken | `"IoTDataDownload"` |
-| `trajectory[].response` | string | Result or observation from action | `"Data downloaded successfully"` |
-
-### Complete Example
-
-```json
-{
-  "text": "Download sensor data for equipment CWC04009 from 2020",
-  "trajectory": [
-    {
-      "task_description": "I need to retrieve the sensor data for the specified equipment",
-      "agent_name": "IoTDataDownload",
-      "response": "Successfully connected to IoT database"
-    },
-    {
-      "task_description": "Now I'll query the data for year 2020",
-      "agent_name": "DatabaseQuery",
-      "response": "Retrieved 1000 data points"
-    },
-    {
-      "task_description": "I should analyze the data format",
-      "agent_name": "DataAnalyzer",
-      "response": "Error: Invalid file format - expected CSV but got JSON"
-    },
-    {
-      "task_description": "Let me try to convert the format",
-      "agent_name": "FormatConverter",
-      "response": "Conversion failed - missing required fields"
-    }
-  ]
-}
-```
-
-### How Trajectories Are Processed
-
-When a trajectory is loaded, it's formatted into a prompt like this:
-
-```
-Question: Download sensor data for equipment CWC04009 from 2020
-----------------------------------------
-Thought 1: I need to retrieve the sensor data for the specified equipment
-Action 1: IoTDataDownload
-Observation 1: Successfully connected to IoT database
-----------------------------------------
-Thought 2: Now I'll query the data for year 2020
-Action 2: DatabaseQuery
-Observation 2: Retrieved 1000 data points
-----------------------------------------
-Thought 3: I should analyze the data format
-Action 3: DataAnalyzer
-Observation 3: Error: Invalid file format - expected CSV but got JSON
-----------------------------------------
-Thought 4: Let me try to convert the format
-Action 4: FormatConverter
-Observation 4: Conversion failed - missing required fields
-----------------------------------------
-Answer: [No final answer provided]
-```
-
-This formatted trace is then sent to the LLM along with the system prompt for failure mode analysis.
-
-### Edge Cases Handled
-
-- **Empty trajectory**: If `trajectory` is an empty list, only the question is analyzed
-- **Missing final_answer**: Extracted from last trajectory step if available, otherwise marked as `"[No final answer provided]"`
-- **Missing fields**: Default values used (`"[No thought]"`, `"[No action]"`, `"[No observation]"`)
-
-## 🔍 Predefined Failure Modes
-
-The system detects 14 predefined failure modes across three categories:
-
-### Task Execution Issues (1.x)
-- **1.1 Disobey Task Specification**: Fails to follow task constraints
-- **1.2 Disobey Role Specification**: Violates assigned role responsibilities
-- **1.3 Step Repetition**: Unnecessarily repeats completed tasks
-- **1.4 Loss of Conversation History**: Loses critical context
-- **1.5 Unaware of Termination Conditions**: Doesn't recognize when to stop
-
-### Communication Issues (2.x)
-- **2.1 Conversation Reset**: Unexpectedly restarts dialogue
-- **2.2 Fail to Ask for Clarification**: Proceeds with unclear information
-- **2.3 Task Derailment**: Deviates from intended objective
-- **2.4 Information Withholding**: Doesn't share critical information
-- **2.5 Ignored Other Agent's Input**: Disregards suggestions
-- **2.6 Action-Reasoning Mismatch**: Actions contradict reasoning
-
-### Verification Issues (3.x)
-- **3.1 Premature Termination**: Ends before completion
-- **3.2 No or Incorrect Verification**: Fails to verify outputs
-- **3.3 Weak Verification**: Insufficient verification rigor
-
-## 📈 Output Files
-
-### 1. `processed_trajectories/combined_m{model_id}_db.pkl`
-Pandas DataFrame with columns:
-- `model_id`, `counter`, `timestamp`, `vendor`, `model`, `ut_id`
-- Boolean columns for each of the 14 predefined failure modes
-- `addi_fm_cnt`: Count of additional failure modes
-- `addi_fm_list`: List of dicts with `title` and `description`
-
-### 2. `summary_codabench/addtional_fm.csv`
-Raw additional failure modes:
-```csv
-title,description
-Inadequate Error Handling,"System failed to handle incorrect file format..."
-Insufficient Feedback,"No feedback provided about task outcome..."
-```
-
-### 3. `summary_codabench/additional_fm_clustered.csv`
-Clustered failure modes:
-```csv
-cluster,failure mode,title,description
-1,Inadequate Error Handling,Inadequate Error Handling,"System failed..."
-1,Inadequate Error Handling,Lack of Error Handling,"Did not handle..."
-2,Lack of Final Answer,Lack of Final Answer,"Failed to provide..."
-```
-
-## 🎨 Visualization
-
-The project includes a visualization script to create interactive hierarchical charts of failure mode distributions.
-
-### Quick Start
-
-```bash
-python plot_failure_mode.py
-```
-
-### What It Does
-
-Creates a **sunburst chart** - a circular hierarchical visualization with three levels:
-1. **Inner ring**: Cluster groupings (color-coded)
-2. **Middle ring**: General failure mode categories
-3. **Outer ring**: Specific failure mode titles (most detailed)
-
-### Output Files
-
-- **`failure_modes_sunburst.html`** - Interactive HTML visualization
- - Hover over segments to see details
- - Click to drill down into hierarchy
- - Shows percentages and counts
-
-- **`failure_modes_sunburst.png`** - Static high-resolution image (2x scale)
- - Suitable for papers and presentations
- - Requires `kaleido` package
-
-### Prerequisites
-
-```bash
-pip install plotly kaleido
-```
-
-### Sample Data
-
-The script includes **demonstration data** with 24 failure mode examples across 7 clusters, including:
-- Inadequate Error Handling
-- Lack of Final Answer
-- Insufficient File Format Support
-- Lack of Adaptive Learning
-- Inconsistent Data Retrieval
-- Insufficient Feedback
-- Inadequate Finalization
-
-### Using Your Own Data
-
-To visualize results from your pipeline:
-
-```python
-import pandas as pd
-import plotly.express as px
-
-# Load your clustered results
-df = pd.read_csv('summary_codabench/additional_fm_clustered.csv')
-
-# The pipeline's CSV names the column 'failure mode' (with a space); rename it for the path below
-df = df.rename(columns={'failure mode': 'failure_mode'})
-
-fig = px.sunburst(
- df,
- path=['cluster', 'failure_mode', 'title'], # hierarchy levels
- color='cluster',
- color_continuous_scale='Viridis',
- title="Failure Modes Analysis - Your Results"
-)
-
-fig.update_traces(textinfo='label+percent entry', hoverinfo='label+value+percent parent')
-fig.update_layout(margin=dict(t=50, l=0, r=0, b=0))
-
-fig.write_html("my_failure_modes.html")
-fig.write_image("my_failure_modes.png", scale=2)
-```
-
-### Customization Options
-
-**Color Schemes**:
-```python
-color_continuous_scale='Viridis' # Default
-color_continuous_scale='Plasma'
-color_continuous_scale='Inferno'
-color_continuous_scale='RdYlBu'
-```
-
-**Chart Size**:
-```python
-fig.update_layout(width=1000, height=1000) # Adjust dimensions
-```
-
-**Export Quality**:
-```python
-fig.write_image("output.png", scale=3) # Higher resolution (3x)
-```
-
-### Troubleshooting
-
-**Error: "kaleido not found"**
-```bash
-pip install kaleido
-```
-
-**Error: "No module named 'plotly'"**
-```bash
-pip install plotly
-```
-
-**Empty or incorrect visualization**:
-- Verify your DataFrame has the columns `cluster`, `failure_mode`, and `title` (the pipeline's CSV header is `failure mode`, with a space, so rename it first)
-- Check that cluster values are numeric or can be converted to colors
-- Ensure there are no null values in the hierarchy columns
-
-## ⚙️ Configuration Options
-
-### CLI Arguments
-
-| Argument | Type | Default | Description |
-|----------|------|---------|-------------|
-| `--traj_directory` | str | `./localtemp/trajectory/` | Root directory with trajectory JSON files |
-| `--model_id` | int | `18` | LLM model ID for analysis |
-| `--summary_dir` | str | `summary` | Output directory for CSV files |
-| `--model_name` | str | `all-MiniLM-L6-v2` | Sentence transformer model for embeddings |
-| `--k` | int | `None` | Fixed cluster count (None = auto-select) |
-| `--timestamps` | list | `None` | Specific timestamps to process (None = all) |
-
-### Clustering Parameters
-
-In `failure_mode_reduction()`:
-- `k_min`: Minimum clusters to try (default: 2)
-- `k_max`: Maximum clusters to try (default: 7)
-- Optimal K selected using silhouette score analysis
-
-## 🔧 Customization for Your Research
-
-### Critical Modifications Required
-
-#### 1. **Entry Point (`failure_mode_extractor.py`)**
-
-**What to Modify**: Command-line arguments when running
-
-**Current Usage**:
-```bash
-python failure_mode_extractor.py \
- --traj_directory ./localtemp/trajectory/ \
- --model_id 18
-```
-
-**Your Modifications**:
-
-| Argument | Default | What to Change | Example |
-|----------|---------|----------------|---------|
-| `--traj_directory` | `./localtemp/trajectory/` | **REQUIRED**: Path to your trajectory JSON files | `--traj_directory /data/my_trajectories` |
-| `--model_id` | `18` | **OPTIONAL**: Your LLM model ID | `--model_id 20` |
-| `--summary_dir` | `summary` | **OPTIONAL**: Output directory name | `--summary_dir my_results` |
-| `--model_name` | `all-MiniLM-L6-v2` | **OPTIONAL**: Embedding model | `--model_name paraphrase-MiniLM-L6-v2` |
-| `--k` | `None` (auto) | **OPTIONAL**: Fixed cluster count | `--k 5` |
-| `--timestamps` | `None` (all) | **OPTIONAL**: Specific timestamps | `--timestamps 2024-01 2024-02` |
-
-**Example for Your Research**:
-```bash
-python failure_mode_extractor.py \
- --traj_directory /home/researcher/experiment_data/trajectories \
- --model_id 18 \
- --summary_dir results_experiment1 \
- --k 6
-```
-
-**No Code Changes Needed** - Just modify command-line arguments!
-
-#### 2. **Using a Different LLM Provider**
-
-**What to Modify**: `utils.py` file (lines 40-41)
-
-**Current Code**:
-```python
-# ans = watsonx_llm(prompt=prompt, model_id=16)
-ans = watsonx_llm(prompt=prompt, model_id=model_id)
-```
-
-**Modification Options**:
-
-**Option A: OpenAI GPT**
-```python
-from openai import OpenAI
-
-def get_llm_answer_from_json(data: dict, model_id) -> dict:
- # ... existing formatting code ...
-
- client = OpenAI() # replaces watsonx_llm; reads OPENAI_API_KEY from the environment (openai>=1.0)
- response = client.chat.completions.create(
- model="gpt-4", # or "gpt-3.5-turbo"
- messages=[
- {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": prompt}
- ],
- temperature=0.7,
- )
- ans = {
- "generated_text": response.choices[0].message.content
- }
- return ans
-```
-
-**Option B: Anthropic Claude**
-```python
-import anthropic
-
-def get_llm_answer_from_json(data: dict, model_id) -> dict:
- # ... existing formatting code ...
-
- client = anthropic.Anthropic(api_key="your-api-key")
- message = client.messages.create(
- model="claude-3-opus-20240229",
- max_tokens=4096,
- messages=[
- {"role": "user", "content": prompt}
- ]
- )
- ans = {
- "generated_text": message.content[0].text
- }
- return ans
-```
-
-**Option C: Local Model (Ollama)**
-```python
-import requests
-
-def get_llm_answer_from_json(data: dict, model_id) -> dict:
- # ... existing formatting code ...
-
- response = requests.post(
- "http://localhost:11434/api/generate",
- json={
- "model": "llama2", # or your local model
- "prompt": prompt,
- "stream": False
- }
- )
- ans = {
- "generated_text": response.json()["response"]
- }
- return ans
-```
-
-**Option D: Azure OpenAI**
-```python
-from openai import AzureOpenAI
-
-def get_llm_answer_from_json(data: dict, model_id) -> dict:
- # ... existing formatting code ...
-
- client = AzureOpenAI(
- api_key="your-api-key",
- api_version="2024-02-01",
- azure_endpoint="https://your-resource.openai.azure.com"
- )
- response = client.chat.completions.create(
- model="your-deployment-name",
- messages=[
- {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": prompt}
- ]
- )
- ans = {
- "generated_text": response.choices[0].message.content
- }
- return ans
-```
-
-#### 3. **Test Files Modifications**
-
-See the [Testing section](#-testing) below for detailed modification instructions for:
-- `failure_mode_generator_test.py` - Change `traj_root_base` path
-- `failure_mode_reduction_test.py` - Update pickle path and parameters
-
-### Adding New Predefined Failure Modes
-
-1. Update `failure_mode_keys` list in `failure_mode_generator.py`
-2. Add definitions to `prompt.py` system prompt
-3. Update DataFrame columns accordingly
-
-### Customizing Clustering
-
-Adjust parameters in `failure_mode_reduction.py`:
-```python
-# Change embedding model
-model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
-
-# Modify clustering algorithm
-from sklearn.cluster import DBSCAN
-clustering = DBSCAN(eps=0.5, min_samples=2)
-```
-
-## 📝 Complete Example Workflow
-
-### End-to-End Example
-
-```python
-# 1. Analyze trajectories
-from failure_mode_generator import process_trajectories
-
-gen_results = process_trajectories(
- traj_root_base="./my_trajectories",
- model_id=18,
- out_dir="processed_trajectories"
-)
-
-print(f"Analyzed {len(gen_results['combined_df'])} trajectories")
-print(f"Found {gen_results['combined_df']['addi_fm_cnt'].sum()} additional failure modes")
-
-# 2. Cluster failure modes
-from failure_mode_reduction import failure_mode_reduction
-
-red_results = failure_mode_reduction(
- combined_pickle_path=gen_results["combined_path"],
- out_dir="summary",
- k=None # Auto-select
-)
-
-print(f"Optimal K: {red_results['k']}")
-print(f"Silhouette scores: {red_results['silhouette_scores']}")
-
-# 3. Analyze results
-df = red_results['df_clustered']
-print("\nFailure Mode Distribution:")
-print(df['failure mode'].value_counts())
-
-# 4. Generate visualizations
-import subprocess
-subprocess.run(["python", "plot_failure_mode.py"])
-```
-
-### Detailed Workflow with Data Inspection
-
-```python
-import pandas as pd
-from failure_mode_pipeline import run_failure_mode_pipeline
-
-# Run complete pipeline
-results = run_failure_mode_pipeline(
- traj_root_base="./trajectories",
- model_id=18,
- summary_dir="summary",
- k=None # Auto-select optimal K
-)
-
-# === Inspect Generation Results ===
-gen_df = results["generation"]["combined_df"]
-
-print("=== GENERATION RESULTS ===")
-print(f"Total trajectories analyzed: {len(gen_df)}")
-print(f"\nColumns: {list(gen_df.columns)}")
-
-# Check predefined failure modes
-predefined_cols = [col for col in gen_df.columns if col.startswith(('1.', '2.', '3.'))]
-print(f"\nPredefined failure mode occurrences:")
-for col in predefined_cols:
- count = gen_df[col].sum()
- if count > 0:
- print(f" {col}: {count} ({count/len(gen_df)*100:.1f}%)")
-
-# Check additional failure modes
-print(f"\nAdditional failure modes:")
-print(f" Total count: {gen_df['addi_fm_cnt'].sum()}")
-print(f" Trajectories with additional FMs: {(gen_df['addi_fm_cnt'] > 0).sum()}")
-print(f" Average per trajectory: {gen_df['addi_fm_cnt'].mean():.2f}")
-
-# === Inspect Reduction Results ===
-red_df = results["reduction"]["df_clustered"]
-
-print("\n=== CLUSTERING RESULTS ===")
-print(f"Number of clusters: {results['reduction']['k']}")
-print(f"Total additional failure modes: {len(red_df)}")
-
-# Cluster distribution
-print("\nCluster distribution:")
-cluster_counts = red_df.groupby('failure mode').size().sort_values(ascending=False)
-for fm, count in cluster_counts.items():
- print(f" {fm}: {count} instances")
-
-# Show sample from each cluster
-print("\n=== SAMPLE FROM EACH CLUSTER ===")
-for cluster_id in sorted(red_df['cluster'].unique()):
- cluster_data = red_df[red_df['cluster'] == cluster_id]
- print(f"\nCluster {cluster_id}: {cluster_data.iloc[0]['failure mode']}")
- print(f" Sample title: {cluster_data.iloc[0]['title']}")
- print(f" Sample description: {cluster_data.iloc[0]['description'][:100]}...")
-
-# === Export for Further Analysis ===
-# Save detailed report
-with open("analysis_report.txt", "w") as f:
- f.write("=== FAILURE MODE ANALYSIS REPORT ===\n\n")
- f.write(f"Total Trajectories: {len(gen_df)}\n")
- f.write(f"Trajectories with Failures: {(gen_df[predefined_cols].any(axis=1)).sum()}\n")
- f.write(f"Additional Failure Modes: {gen_df['addi_fm_cnt'].sum()}\n")
- f.write(f"Unique Failure Clusters: {results['reduction']['k']}\n\n")
-
- f.write("Top Predefined Failure Modes:\n")
- for col in predefined_cols:
- count = gen_df[col].sum()
- if count > 0:
- f.write(f" {col}: {count}\n")
-
- f.write("\nClustered Additional Failure Modes:\n")
- for fm, count in cluster_counts.items():
- f.write(f" {fm}: {count}\n")
-
-print("\nReport saved to analysis_report.txt")
-```
-
-### Reading Existing Results
-
-If you've already run the pipeline and want to analyze existing results:
-
-```python
-import pandas as pd
-
-# Load generation results
-gen_df = pd.read_pickle("processed_trajectories/combined_m18_db.pkl")
-
-# Load clustering results
-clustered_df = pd.read_csv("summary/additional_fm_clustered.csv")
-raw_df = pd.read_csv("summary/addtional_fm.csv")
-
-# Analyze
-print(f"Trajectories: {len(gen_df)}")
-print(f"Additional FMs: {len(raw_df)}")
-print(f"Clusters: {clustered_df['cluster'].nunique()}")
-
-# Find trajectories with specific failure mode
-step_repetition = gen_df[gen_df['1.3 Step Repetition'] == True]
-print(f"\nTrajectories with Step Repetition: {len(step_repetition)}")
-print(step_repetition[['ut_id', 'addi_fm_cnt']].head())
-```
-
-## 🧪 Testing
-
-### Test Files Overview
-
-The repository includes two test files that demonstrate basic usage:
-
-#### 1. `failure_mode_generator_test.py`
-
-**Purpose**: Tests the trajectory generation/analysis step
-
-**Current Code**:
-```python
-from failure_mode_generator import process_trajectories
-
-res = process_trajectories(
- traj_root_base="/Users/dhaval/Documents/GitHub/CodeBenchBackEnd/codabench/tmp0k8_i0of/trajectory",
- model_id=12,
-)
-
-print(res)
-```
-
-**What You Need to Modify**:
-
-| Parameter | Current Value | What to Change |
-|-----------|---------------|----------------|
-| `traj_root_base` | `/Users/dhaval/Documents/.../trajectory` | **REQUIRED**: Change to your trajectory directory path |
-| `model_id` | `12` | **OPTIONAL**: Change to your LLM model ID (default: 18) |
-
-**Example Modifications**:
-
-```python
-# For your own research
-from failure_mode_generator import process_trajectories
-
-res = process_trajectories(
- traj_root_base="./my_research_data/trajectories", # ← Your path
- model_id=18, # ← Your model ID (or keep default)
- out_dir="processed_trajectories", # ← Optional: change output directory
-)
-
-print(f"Processed {len(res['combined_df'])} trajectories")
-print(f"Saved to: {res['combined_path']}")
-print(res['combined_df'].head())
-```
-
-**To Run**:
-```bash
-# After modifying the path
-python failure_mode_generator_test.py
-```
-
-#### 2. `failure_mode_reduction_test.py`
-
-**Purpose**: Tests the clustering/reduction step
-
-**Current Code**:
-```python
-from failure_mode_reduction import failure_mode_reduction
-
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m12_db.pkl",
- out_dir="summary_codabench",
- # model_name="all-MiniLM-L6-v2", # or another sentence-transformers model
- # k=6, # fix cluster count if you prefer
-)
-
-print(result["k"], result["silhouette_scores"][:3])
-print(result["paths"])
-print(result["df_clustered"].head())
-```
-
-**What You Need to Modify**:
-
-| Parameter | Current Value | What to Change |
-|-----------|---------------|----------------|
-| `combined_pickle_path` | `./processed_trajectories/combined_m12_db.pkl` | **REQUIRED**: Update model ID (m12 → m18) or path if you changed it |
-| `out_dir` | `summary_codabench` | **OPTIONAL**: Change output directory name |
-| `model_name` | (commented) `all-MiniLM-L6-v2` | **OPTIONAL**: Uncomment and change embedding model |
-| `k` | (commented) `6` | **OPTIONAL**: Uncomment to fix cluster count instead of auto-select |
-
-**Example Modifications**:
-
-```python
-# For your own research
-from failure_mode_reduction import failure_mode_reduction
-
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m18_db.pkl", # ← Match your model_id
- out_dir="my_summary", # ← Your output directory
- model_name="all-MiniLM-L6-v2", # ← Embedding model (this is the default)
- k=None, # ← None for auto-select, or set a number (e.g., k=5)
- verbose=True, # ← See detailed progress
-)
-
-print(f"Optimal K: {result['k']}")
-print(f"Silhouette scores: {result['silhouette_scores']}")
-print(f"Output files: {result['paths']}")
-print("\nClustered failure modes:")
-print(result["df_clustered"].head(10))
-```
-
-**To Run**:
-```bash
-# After modifying the parameters
-python failure_mode_reduction_test.py
-```
-
-### Running Tests in Sequence
-
-For a complete test workflow:
-
-```bash
-# Step 1: Generate analysis (modify path first!)
-python failure_mode_generator_test.py
-
-# Step 2: Cluster results (ensure pickle path matches)
-python failure_mode_reduction_test.py
-```
-
-### Common Modifications for Research
-
-#### Scenario 1: Different Data Location
-```python
-# In failure_mode_generator_test.py
-res = process_trajectories(
- traj_root_base="/path/to/your/data", # ← Change this
- model_id=18,
-)
-```
-
-#### Scenario 2: Different Model
-```python
-# In failure_mode_generator_test.py
-res = process_trajectories(
- traj_root_base="./trajectories",
- model_id=20, # ← Different model
-)
-
-# In failure_mode_reduction_test.py (must match!)
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m20_db.pkl", # ← Must match model_id (m20)
- out_dir="summary",
-)
-```
-
-#### Scenario 3: Fixed Cluster Count
-```python
-# In failure_mode_reduction_test.py
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m18_db.pkl",
- out_dir="summary",
- k=7, # ← Force 7 clusters instead of auto-select
-)
-```
-
-#### Scenario 4: Different Embedding Model
-```python
-# In failure_mode_reduction_test.py
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m18_db.pkl",
- out_dir="summary",
- model_name="paraphrase-MiniLM-L6-v2", # ← Different embedding model
-)
-```
-
-### Troubleshooting Test Files
-
-| Error | Cause | Solution |
-|-------|-------|----------|
-| `FileNotFoundError: [Errno 2] No such file or directory` | Wrong `traj_root_base` path | Update path to your actual trajectory directory |
-| `FileNotFoundError: combined_m12_db.pkl` | Pickle file doesn't exist or wrong model ID | Run generator test first, or update model ID in path |
-| `ModuleNotFoundError: No module named 'reactxen'` | Missing dependency for LLM calls | Install required package or modify `utils.py` to use your LLM |
-| `KeyError: 'addi_fm_cnt'` | Pickle file from different version | Regenerate pickle files with current code |
-| Clustering fails with small data | Not enough failure modes to cluster | Ensure you have at least 2 additional failure modes |
-
-## 📊 Performance Considerations
-
-- **LLM Calls**: Generation step makes one LLM call per trajectory (can be slow/expensive)
-- **Caching**: Pickle files enable re-running clustering without re-analyzing
-- **Batch Processing**: Process trajectories in batches if memory is limited
-- **Parallel Processing**: Consider parallelizing LLM calls for large datasets
-
-## 🔄 Pipeline Execution Flow
-
-### Complete Data Flow Diagram
-
-```
-┌─────────────────────────────────────────────────────────────────┐
-│ INPUT: Trajectory JSON Files │
-│ (Located in traj_root_base directory, recursively discovered) │
-└────────────────────────────┬────────────────────────────────────┘
- │
- ▼
-┌─────────────────────────────────────────────────────────────────┐
-│ STEP 1: failure_mode_generator.py │
-│ │
-│ 1. _load_all_json_files() - Recursively load all JSONs │
-│ 2. For each trajectory: │
-│ a. Format as prompt (Question → Thought/Action/Obs → Answer) │
-│ b. get_llm_answer_from_json() - Call LLM API │
-│ c. extract_json_from_response() - Parse JSON response │
-│ d. Extract 14 predefined failure modes (boolean) │
-│ e. Extract additional failure modes (list of dicts) │
-│ 3. Store in DataFrame with columns: │
-│ - model_id, counter, timestamp, vendor, model, ut_id │
-│ - 14 boolean columns for predefined failure modes │
-│ - addi_fm_cnt, addi_fm_list │
-│ 4. Save per-timestamp pickle: {timestamp}_m{model_id}_db.pkl │
-│ 5. Combine all timestamps into: combined_m{model_id}_db.pkl │
-└────────────────────────────┬────────────────────────────────────┘
- │
- ▼
-┌─────────────────────────────────────────────────────────────────┐
-│ OUTPUT: processed_trajectories/ folder │
-│ - Individual pickle files per timestamp │
-│ - Combined pickle file (input for next step) │
-└────────────────────────────┬────────────────────────────────────┘
- │
- ▼
-┌─────────────────────────────────────────────────────────────────┐
-│ STEP 2: failure_mode_reduction.py │
-│ │
-│ 1. Load combined pickle file │
-│ 2. Filter rows where addi_fm_cnt > 0 │
-│ 3. Explode addi_fm_list into separate rows │
-│ 4. Normalize to title/description format │
-│ 5. Save raw CSV: addtional_fm.csv │
-│ 6. Generate embeddings using SentenceTransformer │
-│ 7. Determine optimal K: │
-│ - If k=None: Test k_min to k_max, select best silhouette │
-│ - If k provided: Use that value │
-│ 8. Perform K-Means clustering │
-│ 9. For each cluster: │
-│ - Find point closest to centroid │
-│ - Use its title as cluster representative │
-│ 10. Add cluster and failure_mode columns │
-│ 11. Save clustered CSV: additional_fm_clustered.csv │
-└────────────────────────────┬────────────────────────────────────┘
- │
- ▼
-┌─────────────────────────────────────────────────────────────────┐
-│ OUTPUT: summary_codabench/ folder │
-│ - addtional_fm.csv (raw additional failure modes) │
-│ - additional_fm_clustered.csv (with cluster assignments) │
-└─────────────────────────────────────────────────────────────────┘
-```
-
-### Execution Timeline Example
-
-For a dataset with 100 trajectories:
-
-```
-Time Step Status
-─────────────────────────────────────────────────────────────
-00:00 Start pipeline
-00:01 Load 100 JSON files ✓ Complete
-00:02 Analyze trajectory 1/100 with LLM Processing...
-00:03 Analyze trajectory 2/100 with LLM Processing...
-... (1-2 seconds per trajectory)
-03:20 Analyze trajectory 100/100 with LLM ✓ Complete
-03:21 Save per-timestamp pickles ✓ Complete
-03:22 Save combined pickle ✓ Complete
-03:23 Load combined pickle ✓ Complete
-03:24 Explode additional failure modes ✓ Complete
-03:25 Generate embeddings (50 unique FMs) Processing...
-03:26 Test K=2 (silhouette: 0.45)
-03:27 Test K=3 (silhouette: 0.52)
-03:28 Test K=4 (silhouette: 0.58)
-03:29 Test K=5 (silhouette: 0.61) ← Best
-03:30 Test K=6 (silhouette: 0.55)
-03:31 Test K=7 (silhouette: 0.48)
-03:32 Perform final clustering with K=5 ✓ Complete
-03:33 Assign representative titles ✓ Complete
-03:34 Save CSVs ✓ Complete
-03:35 Pipeline complete ✓ Success
-```
-
-**Total Time**: ~3.5 minutes (mostly LLM calls)
-
-### Key Decision Points
-
-| Decision Point | Logic | Impact |
-|----------------|-------|--------|
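-| **Timestamp handling** | `timestamps` is accepted, but `process_trajectories` as written loads everything under `traj_root_base` as a single timestamp `'1'` | Entire dataset is processed in one pass; per-timestamp selection is not applied |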
-| **Retry on LLM failure** | Max 2 attempts per trajectory | Handles transient API errors |
-| **Skip invalid JSON** | Silent skip with try/except | Robust to malformed files |
-| **Auto-select K** | If `k=None`, test range and use silhouette | Optimal clustering vs. fixed cluster count |
-| **Handle small samples** | Special cases for n=0, n=1, n=2 | Prevents clustering errors |
-| **Representative selection** | Closest to centroid | Most typical example per cluster |
-
-## 🤝 Contributing
-
-When adapting this code for your research:
-
-1. **Cite appropriately** if publishing results
-2. **Document modifications** to failure mode definitions
-3. **Share insights** about new failure patterns discovered
-4. **Report issues** with edge cases or unexpected behaviors
-
-## 📄 License
-
-[Add your license information here]
-
-## 📧 Contact
-
-[Add contact information for questions/support]
-
-## 🙏 Acknowledgments
-
-This tool is designed for analyzing multi-agent LLM systems and identifying failure patterns to improve system reliability and performance.
-
----
-
-**Note**: The `processed_trajectories/` and `summary_codabench/` folders are auto-generated. You can safely delete `summary_codabench/` to regenerate clustering results, but deleting `processed_trajectories/` requires re-running expensive LLM analysis.
\ No newline at end of file
diff --git a/src/tmp/TrajFM/failure_mode_extractor.py b/src/tmp/TrajFM/failure_mode_extractor.py
deleted file mode 100644
index e8e241a4..00000000
--- a/src/tmp/TrajFM/failure_mode_extractor.py
+++ /dev/null
@@ -1,85 +0,0 @@
-import argparse
-from failure_mode_generator import process_trajectories
-from failure_mode_reduction import failure_mode_reduction
-
-
-def main():
- """
- Run the failure mode pipeline:
- 1) Generate combined pickle from trajectories
- 2) Reduce/cluster additional failure modes and export CSVs
- """
- parser = argparse.ArgumentParser(
- description="Analyze LLM execution trajectories to identify and cluster failure modes."
- )
- parser.add_argument(
- "--traj_directory",
- type=str,
- default="./localtemp/trajectory/",
- help="Path to the root directory containing per-timestamp trajectory folders.",
- )
- parser.add_argument(
- "--backstage_directory",
- type=str,
- default=".",
- help="(Optional) Path to auxiliary resources (unused, kept for compatibility).",
- )
- parser.add_argument(
- "--model_id",
- type=int,
- default=18,
- help="Model ID passed to the generator step.",
- )
- parser.add_argument(
- "--summary_dir",
- type=str,
- default="summary",
- help="Directory to write the clustered CSV outputs.",
- )
- parser.add_argument(
- "--model_name",
- type=str,
- default="all-MiniLM-L6-v2",
- help="Sentence-Transformers model for title embeddings.",
- )
- parser.add_argument(
- "--k",
- type=int,
- default=None,
- help="Optional fixed number of clusters (if omitted, silhouette chooses K).",
- )
- parser.add_argument(
- "--timestamps",
- nargs="*",
- default=None,
- help="Optional list of timestamps to process. If omitted, auto-discovers all subfolders.",
- )
-
- args = parser.parse_args()
-
- # Step 1: Generate combined pickle (auto-discovers timestamps if not provided)
- gen = process_trajectories(
- timestamps=args.timestamps, # None => auto-discover
- traj_root_base=args.traj_directory,
- model_id=args.model_id,
- out_dir="processed_trajectories",
- )
- print("\n[Step 1] Combined pickle:", gen["combined_path"])
- print(gen["combined_df"].head())
-
- # Step 2: Reduce/cluster additional failure modes from the combined pickle
- red = failure_mode_reduction(
- combined_pickle_path=gen["combined_path"],
- out_dir=args.summary_dir,
- model_name=args.model_name,
- k=args.k,
- )
- print("\n[Step 2] Chosen K:", red["k"])
- if red.get("silhouette_scores"):
- print("[Step 2] Silhouette scores (first 3):", red["silhouette_scores"][:3])
- print("[Step 2] Outputs:", red["paths"])
- print(red["df_clustered"].head())
-
-
-if __name__ == "__main__":
- main()
diff --git a/src/tmp/TrajFM/failure_mode_generator.py b/src/tmp/TrajFM/failure_mode_generator.py
deleted file mode 100644
index e2ac1eb6..00000000
--- a/src/tmp/TrajFM/failure_mode_generator.py
+++ /dev/null
@@ -1,145 +0,0 @@
-import os
-import json
-import pandas as pd
-from pathlib import Path
-from typing import Dict, Any, List, Sequence, Optional
-
-from utils import get_llm_answer_from_json, extract_json_from_response
-
-
-def _load_all_json_files(root_path: str) -> Dict[str, Any]:
- """Recursively load every file under root_path that parses as JSON; files that fail to open or parse are skipped silently."""
- json_data: Dict[str, Any] = {}
- for dirpath, _, filenames in os.walk(root_path):
- for filename in filenames:
- file_path = os.path.join(dirpath, filename)
- try:
- with open(file_path, "r", encoding="utf-8") as f:
- data = json.load(f)
- json_data[file_path] = data
- except Exception:
- pass
- return json_data
-
-
-def _normalize_additional_failure_modes(obj: Any) -> List[Dict[str, Any]]:
- if obj is None:
- return []
- if isinstance(obj, list):
- return [x for x in obj if isinstance(x, dict)]
- if isinstance(obj, dict):
- if "title" in obj or "description" in obj:
- return [obj]
- return [{"title": t, "description": d} for t, d in obj.items()]
- return []
-
-
-def process_trajectories(
- timestamps: Optional[Sequence[str]] = None,
- traj_root_base: str = ".",
- model_id: int = 18,
- out_dir: str = "processed_trajectories",
-):
- """
- Process trajectories using LLM and save per-timestamp + combined pickles.
-
- NOTE: `timestamps` is currently ignored; everything under `traj_root_base` is loaded as a single timestamp '1'.
- """
-
- failure_mode_keys = [
- "1.1 Disobey Task Specification",
- "1.2 Disobey Role Specification",
- "1.3 Step Repetition",
- "1.4 Loss of Conversation History",
- "1.5 Unaware of Termination Conditions",
- "2.1 Conversation Reset",
- "2.2 Fail to Ask for Clarification",
- "2.3 Task Derailment",
- "2.4 Information Withholding",
- "2.5 Ignored Other Agent's Input",
- "2.6 Action-Reasoning Mismatch",
- "3.1 Premature Termination",
- "3.2 No or Incorrect Verification",
- "3.3 Weak Verification",
- ]
-
- Path(out_dir).mkdir(parents=True, exist_ok=True)
-
- per_timestamp_paths: List[str] = []
- all_dfs: List[pd.DataFrame] = []
-
- timestamp = '1'
- print(f"\nProcessing timestamp {timestamp}")
- root_directory = f"{traj_root_base}"
- all_jsons = _load_all_json_files(root_directory)
- print(f" Loaded {len(all_jsons)} files")
-
- df_columns = [
- "model_id",
- "counter",
- "timestamp",
- "vendor",
- "model",
- "ut_id",
- "addi_fm_cnt",
- "addi_fm_list",
- ] + failure_mode_keys
- df = pd.DataFrame(columns=df_columns)
-
- counter = 1
- for path, content in all_jsons.items():
- parts = os.path.relpath(path, root_directory).split('_')
- ut_id = parts[0]
- model = model_id
- vendor = ''
-
- max_trial = 2
- cur_trial = 0
- while cur_trial < max_trial:
- cur_trial = cur_trial + 1
- try:
- raw_output = get_llm_answer_from_json(data=content, model_id=model_id)
- response_text = raw_output["generated_text"]
- response_json = extract_json_from_response(response_text)
-
- failure_modes = response_json.get("failure_modes", {})
- afm_list = _normalize_additional_failure_modes(
- response_json.get("additional_failure_modes", [])
- )
-
- row = {
- "model_id": model_id,
- "counter": counter,
- "timestamp": timestamp,
- "vendor": vendor,
- "model": model,
- "ut_id": ut_id,
- "addi_fm_cnt": len(afm_list),
- "addi_fm_list": afm_list,
- }
- for key in failure_mode_keys:
- row[key] = bool(failure_modes.get(key, False))
-
- df.loc[len(df)] = row
- break
- except Exception as e:
- print(f" Failed to process {path}: {e}")
-
- counter += 1
-
- df_file_path = f"{out_dir}/{timestamp}_m{model_id}_db.pkl"
- df.to_pickle(df_file_path)
- per_timestamp_paths.append(df_file_path)
- all_dfs.append(df)
- print(f" Saved {df_file_path} with {len(df)} rows")
-
- combined_df = pd.concat(all_dfs, ignore_index=True) if all_dfs else pd.DataFrame()
- combined_file_path = f"{out_dir}/combined_m{model_id}_db.pkl"
- combined_df.to_pickle(combined_file_path)
- print(f"\nSaved combined DataFrame: {combined_file_path} ({len(combined_df)} rows)")
-
- return {
- "per_timestamp_paths": per_timestamp_paths,
- "combined_path": combined_file_path,
- "combined_df": combined_df,
- }
diff --git a/src/tmp/TrajFM/failure_mode_generator_test.py b/src/tmp/TrajFM/failure_mode_generator_test.py
deleted file mode 100644
index c1382441..00000000
--- a/src/tmp/TrajFM/failure_mode_generator_test.py
+++ /dev/null
@@ -1,8 +0,0 @@
-from failure_mode_generator import process_trajectories
-
-res = process_trajectories(
- traj_root_base="/Users/dhaval/Documents/GitHub/CodeBenchBackEnd/codabench/tmp0k8_i0of/trajectory",
- model_id=12,
-)
-
-print(res)
\ No newline at end of file
diff --git a/src/tmp/TrajFM/failure_mode_pipeline.py b/src/tmp/TrajFM/failure_mode_pipeline.py
deleted file mode 100644
index 1048a7b5..00000000
--- a/src/tmp/TrajFM/failure_mode_pipeline.py
+++ /dev/null
@@ -1,34 +0,0 @@
-from failure_mode_generator import process_trajectories # Step 1 (generation)
-from failure_mode_reduction import failure_mode_reduction # Step 2 (reduction)
-
-
-def run_failure_mode_pipeline(
- traj_root_base: str,
- model_id: int = 18,
- timestamps=None, # None => auto-discover subfolders
- summary_dir: str = "summary",
- model_name: str = "all-MiniLM-L6-v2",
- k: int | None = None, # fix cluster count if you want
-):
- # Step 1: generate + save combined pickle
- gen = process_trajectories(
- timestamps=timestamps, # or leave None to auto-discover
- traj_root_base=traj_root_base,
- model_id=model_id,
- )
- print("Combined pickle:", gen["combined_path"])
- print(gen["combined_df"].head())
-
- # Step 2: reduce/cluster using the combined pickle from Step 1
- red = failure_mode_reduction(
- combined_pickle_path=gen["combined_path"],
- out_dir=summary_dir,
- model_name=model_name,
- k=k,
- )
- print("Chosen K:", red["k"])
- print("Paths:", red["paths"])
- print(red["df_clustered"].head())
-
- # Return both results if you want to assert on them in tests
- return {"generation": gen, "reduction": red}
diff --git a/src/tmp/TrajFM/failure_mode_reduction.py b/src/tmp/TrajFM/failure_mode_reduction.py
deleted file mode 100644
index 97e4dbf6..00000000
--- a/src/tmp/TrajFM/failure_mode_reduction.py
+++ /dev/null
@@ -1,214 +0,0 @@
-import pandas as pd
-from pathlib import Path
-from typing import Optional, Dict, Any, Tuple, List
-
-
-def failure_mode_reduction(
- combined_pickle_path: str,
- out_dir: str = "summary",
- model_name: str = "all-MiniLM-L6-v2",
- k: Optional[int] = None,
- k_min: int = 2,
- k_max: int = 7,
- verbose: bool = True,
-) -> Dict[str, Any]:
- """
- Reduce additional failure modes by exploding, embedding, clustering, and labeling.
-
- Returns
- -------
- dict with:
- - 'df_expanded': tidy dataframe with ['title','description']
- - 'df_clustered': clustered dataframe with ['cluster','failure mode','title','description']
- - 'k': number of clusters used
- - 'silhouette_scores': List[Tuple[int, float]] if k was auto-selected, else []
- - 'paths': {'addtional_fm_csv', 'additional_fm_clustered_csv'}
- """
- if verbose:
- print(f"Loading combined pickle: {combined_pickle_path}")
- df = pd.read_pickle(combined_pickle_path)
- print (df)
-
- # --- Step 3: explode addi_fm_list -> title/description ---
- if verbose:
- print("Exploding additional failure modes...")
- if "addi_fm_cnt" not in df.columns or "addi_fm_list" not in df.columns:
- raise KeyError("Expected columns 'addi_fm_cnt' and 'addi_fm_list' not found.")
-
- df_new_fm = df[df["addi_fm_cnt"] > 0][["addi_fm_cnt", "addi_fm_list"]].copy()
- df_new_fm.reset_index(drop=True, inplace=True)
-
- df_exploded = df_new_fm.explode("addi_fm_list", ignore_index=True)
- df_expanded = pd.concat(
- [
- df_exploded.drop(columns=["addi_fm_list"]),
- pd.json_normalize(df_exploded["addi_fm_list"]),
- ],
- axis=1,
- )
-
- keep_cols = [c for c in ["title", "description"] if c in df_expanded.columns]
- if not keep_cols:
- raise KeyError(
- "No 'title'/'description' columns found inside 'addi_fm_list' items."
- )
- df_expanded = df_expanded[keep_cols].copy()
-
- # Save the “addtional_fm.csv” (typo preserved to match notebook)
- out = Path(out_dir)
- out.mkdir(parents=True, exist_ok=True)
- addtional_csv = out / "addtional_fm.csv"
- df_expanded.to_csv(addtional_csv, index=False)
- if verbose:
- print(f"Saved: {addtional_csv} (rows={len(df_expanded)})")
-
- # --- Step 4/5: embeddings + clustering with small-sample handling ---
- titles = df_expanded["title"].fillna("").astype(str).tolist()
- n = len(titles)
-
- # n == 0: nothing to do
- if n == 0:
- if verbose:
- print("No titles to cluster. Returning early.")
- return {
- "df_expanded": df_expanded,
- "df_clustered": pd.DataFrame(
- columns=["cluster", "failure mode", "title", "description"]
- ),
- "k": 0,
- "silhouette_scores": [],
- "paths": {
- "addtional_fm_csv": str(addtional_csv),
- "additional_fm_clustered_csv": None,
- },
- }
-
- # n == 1: assign a single cluster without embeddings
- if n == 1:
- df_clustered = df_expanded.copy()
- df_clustered["cluster"] = 0
- df_clustered["failure mode"] = df_clustered["title"]
- clustered_csv = out / "additional_fm_clustered.csv"
- df_clustered[["cluster", "failure mode", "title", "description"]].to_csv(
- clustered_csv, index=False
- )
- if verbose:
- print(f"Single item: saved {clustered_csv}")
- return {
- "df_expanded": df_expanded,
- "df_clustered": df_clustered[
- ["cluster", "failure mode", "title", "description"]
- ],
- "k": 1,
- "silhouette_scores": [],
- "paths": {
- "addtional_fm_csv": str(addtional_csv),
- "additional_fm_clustered_csv": str(clustered_csv),
- },
- }
-
- # n >= 2: embed
- if verbose:
- print(f"Embedding {n} titles with {model_name} ...")
- from sentence_transformers import SentenceTransformer
- from sklearn.cluster import KMeans
- from sklearn.metrics import silhouette_score
- from sklearn.metrics.pairwise import euclidean_distances
- import numpy as np
-
- model = SentenceTransformer(model_name)
- embeddings = model.encode(titles, convert_to_numpy=True, show_progress_bar=False)
-
- silhouette_scores: List[Tuple[int, float]] = []
-
- # n == 2: only valid K is 2 for silhouette constraints
- if n == 2:
- k = 2
- if verbose:
- print("Only two samples detected; using K=2.")
- else:
- if k is None:
- lo = max(2, k_min)
- hi = min(k_max, n - 1) # silhouette requires k <= n-1
- if lo > hi:
- # Not enough samples for a range; fall back to a valid K
- k = min(2, n - 1)
- if verbose:
- print(f"Insufficient samples for a K range; using K={k}.")
- else:
- if verbose:
- print(f"Selecting K by silhouette over [{lo}..{hi}]")
- best_k, best_score = None, -1.0
- for cand in range(lo, hi + 1):
- km = KMeans(n_clusters=cand, random_state=42, n_init="auto")
- labels = km.fit_predict(embeddings)
- # If all points fall into one cluster (identical embeddings), silhouette is invalid
- if len(set(labels)) <= 1:
- score = -1.0
- else:
- score = float(silhouette_score(embeddings, labels))
- silhouette_scores.append((cand, score))
- if score > best_score:
- best_k, best_score = cand, score
- k = best_k or min(2, n - 1)
- if verbose:
- print("Silhouette scores:", silhouette_scores)
- print(f"Chosen K = {k}")
- else:
- # user-provided K → clamp safely
- if n <= 2:
- k = 2
- else:
- k = max(2, min(int(k), n - 1))
- if verbose:
- print(f"Using K = {k} (validated for n={n})")
-
- # Final clustering
- kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
- clusters = kmeans.fit_predict(embeddings)
-
- df_clustered = df_expanded.copy()
- df_clustered["cluster"] = clusters
-
- # Representative (closest to centroid) title per cluster
- if verbose:
- print("Selecting representative title for each cluster...")
- representative_titles: List[Tuple[int, str]] = []
- for cl in range(k):
- idxs = df_clustered.index[df_clustered["cluster"] == cl].tolist()
- if not idxs:
- continue
- dists = euclidean_distances(
- embeddings[idxs], [kmeans.cluster_centers_[cl]]
- ).flatten()
- closest_local = int(np.argmin(dists))
- rep_idx = idxs[closest_local]
- representative_titles.append((cl, df_clustered.loc[rep_idx, "title"]))
-
- if verbose and representative_titles:
- print("\nRepresentative titles:")
- for cl, title in representative_titles:
- print(f" Cluster {cl}: {title}")
-
- cluster_to_title = dict(representative_titles)
- df_clustered["failure mode"] = df_clustered["cluster"].map(cluster_to_title)
-
- # final column order
- cols = ["cluster", "failure mode", "title", "description"]
- df_clustered = df_clustered[cols].copy()
-
- clustered_csv = out / "additional_fm_clustered.csv"
- df_clustered.to_csv(clustered_csv, index=False)
- if verbose:
- print(f"Saved: {clustered_csv} (rows={len(df_clustered)})")
-
- return {
- "df_expanded": df_expanded,
- "df_clustered": df_clustered,
- "k": k,
- "silhouette_scores": silhouette_scores,
- "paths": {
- "addtional_fm_csv": str(addtional_csv),
- "additional_fm_clustered_csv": str(clustered_csv),
- },
- }
diff --git a/src/tmp/TrajFM/failure_mode_reduction_test.py b/src/tmp/TrajFM/failure_mode_reduction_test.py
deleted file mode 100644
index c1de05d9..00000000
--- a/src/tmp/TrajFM/failure_mode_reduction_test.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from failure_mode_reduction import failure_mode_reduction
-
-result = failure_mode_reduction(
- combined_pickle_path="./processed_trajectories/combined_m12_db.pkl",
- out_dir="summary_codabench",
- # model_name="all-MiniLM-L6-v2", # or another sentence-transformers model
- # k=6, # fix cluster count if you prefer
-)
-
-print(result["k"], result["silhouette_scores"][:3])
-print(result["paths"])
-print(result["df_clustered"].head())
diff --git a/src/tmp/TrajFM/failure_modes_sunburst.html b/src/tmp/TrajFM/failure_modes_sunburst.html
deleted file mode 100644
index bb3e0bb8..00000000
--- a/src/tmp/TrajFM/failure_modes_sunburst.html
+++ /dev/null
@@ -1,3888 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/src/tmp/TrajFM/failure_modes_sunburst.png b/src/tmp/TrajFM/failure_modes_sunburst.png
deleted file mode 100644
index 55db1aa7..00000000
Binary files a/src/tmp/TrajFM/failure_modes_sunburst.png and /dev/null differ
diff --git a/src/tmp/TrajFM/failure_modes_treemap.png b/src/tmp/TrajFM/failure_modes_treemap.png
deleted file mode 100644
index 3a7a2869..00000000
Binary files a/src/tmp/TrajFM/failure_modes_treemap.png and /dev/null differ
diff --git a/src/tmp/TrajFM/plot_failure_mode.py b/src/tmp/TrajFM/plot_failure_mode.py
deleted file mode 100644
index 5e832d29..00000000
--- a/src/tmp/TrajFM/plot_failure_mode.py
+++ /dev/null
@@ -1,81 +0,0 @@
-import pandas as pd
-import plotly.express as px
-
-# Data
-data = {
- "cluster": [
- 1, 6, 2, 4, 4, 2, 3, 1, 2, 5, 4, 1, 0, 3, 1, 5, 1, 2, 1, 6, 2, 4, 4, 2
- ],
- "failure_mode": [
- "Inadequate Error Handling",
- "Insufficient File Format Support",
- "Lack of Final Answer",
- "Lack of Adaptive Learning",
- "Lack of Adaptive Learning",
- "Lack of Final Answer",
- "Inconsistent Data Retrieval",
- "Inadequate Error Handling",
- "Lack of Final Answer",
- "Insufficient Feedback",
- "Lack of Adaptive Learning",
- "Inadequate Error Handling",
- "Inadequate Finalization",
- "Inconsistent Data Retrieval",
- "Inadequate Error Handling",
- "Insufficient Feedback",
- "Inadequate Error Handling",
- "Lack of Final Answer",
- "Inadequate Error Handling",
- "Insufficient File Format Support",
- "Lack of Final Answer",
- "Lack of Adaptive Learning",
- "Lack of Adaptive Learning",
- "Lack of Final Answer"
- ],
- "title": [
- "Inadequate Error Handling",
- "Insufficient File Format Support",
- "Lack of Final Answer",
- "Inadequate Contextual Understanding",
- "Lack of Adaptive Learning",
- "Lack of Final Answer",
- "Inconsistent Data Retrieval",
- "Lack of Error Handling",
- "Lack of Final Answer",
- "Insufficient Feedback",
- "Inadequate Sensor Relevancy Mapping",
- "Insufficient Failure Mode Documentation",
- "Inadequate Finalization",
- "Redundant Data Retrieval",
- "Insufficient Data Handling",
- "Lack of Feedback Mechanism",
- "Inability to Handle Missing Data",
- "Lack of Alternative Solutions",
- "Inadequate Error Handling",
- "Insufficient File Format Support",
- "Lack of Final Answer",
- "Inadequate Contextual Understanding",
- "Lack of Adaptive Learning",
- "Lack of Final Answer"
- ]
-}
-
-df = pd.DataFrame(data)
-
-# Sunburst plot
-fig = px.sunburst(
- df,
- path=['cluster', 'failure_mode', 'title'], # hierarchy
- values=None, # size determined automatically (count of rows)
- color='cluster', # color by cluster
- color_continuous_scale='Viridis',
- title="Hierarchical Visualization of Failure Modes"
-)
-
-# Make it interactive and visually appealing
-fig.update_traces(textinfo='label+percent entry', hoverinfo='label+value+percent parent')
-fig.update_layout(margin=dict(t=50, l=0, r=0, b=0))
-
-# Save as HTML for interactivity or PNG for static image
-fig.write_html("failure_modes_sunburst.html") # interactive
-fig.write_image("failure_modes_sunburst.png", scale=2) # static high-res image
diff --git a/src/tmp/TrajFM/processed_trajectories/1_m12_db.pkl b/src/tmp/TrajFM/processed_trajectories/1_m12_db.pkl
deleted file mode 100644
index 4a016c75..00000000
Binary files a/src/tmp/TrajFM/processed_trajectories/1_m12_db.pkl and /dev/null differ
diff --git a/src/tmp/TrajFM/processed_trajectories/combined_m12_db.pkl b/src/tmp/TrajFM/processed_trajectories/combined_m12_db.pkl
deleted file mode 100644
index a2fcb10e..00000000
Binary files a/src/tmp/TrajFM/processed_trajectories/combined_m12_db.pkl and /dev/null differ
diff --git a/src/tmp/TrajFM/prompt.py b/src/tmp/TrajFM/prompt.py
deleted file mode 100644
index ed52ab94..00000000
--- a/src/tmp/TrajFM/prompt.py
+++ /dev/null
@@ -1,129 +0,0 @@
-system_prompt = """
-You will be provided with a multiagent system trace. Your task is to analyze the system behavior to detect any inefficiencies or failure modes.
-
-Instructions:
-- Review the trace for failures or inefficiencies based on the predefined list below.
-- Only mark a failure mode as true if you can clearly identify an instance of it in the trace.
-- For any detected failure, briefly explain it in the 'summary' field.
-- Indicate whether the task is completed or not using a boolean.
-- Answer all failure modes explicitly as true, or false.
-- Optionally, suggest up to two new failure modes not on the predefined list.
-
-Return your output as a valid JSON object only — do not include any text or explanation outside the JSON.
-
-@@
-{{
- "summary": "",
- "task_completed": ,
- "failure_modes": {{
- "1.1 Disobey Task Specification": ,
- "1.2 Disobey Role Specification": ,
- "1.3 Step Repetition": ,
- "1.4 Loss of Conversation History": ,
- "1.5 Unaware of Termination Conditions": ,
- "2.1 Conversation Reset": ,
- "2.2 Fail to Ask for Clarification": ,
- "2.3 Task Derailment": ,
- "2.4 Information Withholding": ,
- "2.5 Ignored Other Agent's Input": ,
- "2.6 Action-Reasoning Mismatch": ,
- "3.1 Premature Termination": ,
- "3.2 No or Incorrect Verification": ,
- "3.3 Weak Verification":
- }},
- "additional_failure_modes": [
- {{
- "title": "",
- "description": "<1–2 sentence description with evidence from the trace>"
- }},
- {{
- "title": "",
- "description": "<1–2 sentence description with evidence from the trace>"
- }}
- ]
-}}
-@@
-
-If no new failure modes are found, return an empty array for "additional_failure_modes".
-
-Here is the trace:
-{trace}
-
-Below are the definitions of the known failure modes and inefficiencies:
-
-1.1 **Disobey Task Specification**:
-This error occurs when an agent or system fails to adhere to specified constraints, guidelines, or requirements associated with a particular task. Non-compliance can result from unclear, incomplete, or ambiguous instructions provided by the user, system prompts, or task descriptions. It may also arise from an agent's inadequate ability to interpret or apply constraints effectively. Consequences include incorrect or suboptimal outputs, reduced system performance, and increased resource consumption.
-
-1.2 **Disobey Role Specification**:
-Failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another.
-
-1.3 **Step Repetition**:
-This error occurs when an agent or system unnecessarily repeats a task or phase that has already been completed. Redundancy can result from inadequate state or context tracking, inefficient workflow management, unclear instructions, or failure to recognize completed tasks.
-
-1.4 **Loss of Conversation History**:
-Unexpected context truncation, disregarding recent interactions, and reverting to a prior state in the conversation, causing loss of critical context.
-
-1.5 **Unaware of Termination Conditions**:
-This error occurs when an agent or system fails to recognize criteria designed to trigger the termination of an interaction or task. Oversight can lead to unnecessary actions, wasted resources, or incorrect behavior beyond the intended task.
-
-2.1 **Conversation Reset**:
-An unexpected or unwarranted restarting of the dialogue, potentially losing context and progress in the interaction.
-
-2.2 **Fail to Ask for Clarification**:
-Failure to request additional information when faced with unclear or incomplete data, which can result in incorrect actions.
-
-2.3 **Task Derailment**:
-Deviation from the intended task or objective, leading to irrelevant or unproductive actions.
-
-2.4 **Information Withholding**:
-This occurs when an agent possesses critical information but fails to share it with other agents or system components, which may impair the system’s overall operation or decision-making.
-
-2.5 **Ignored Other Agent's Input**:
-When an agent fails to consider or properly act on suggestions from other agents, leading to poor decisions or stalled progress.
-
-2.6 **Action-Reasoning Mismatch**:
-This happens when there’s a discrepancy between an agent’s reasoning and its actual actions, leading to unintended or counterproductive behavior.
-
-3.1 **Premature Termination**:
-Ending a task or conversation before the necessary information has been exchanged or objectives fully met.
-
-3.2 **No or Incorrect Verification**:
-The failure to properly verify task outcomes or system outputs, potentially allowing errors to propagate undetected.
-
-3.3 **Weak Verification**:
-Verification mechanisms that are insufficiently rigorous, potentially missing subtle errors or inconsistencies.
-
-Here are example instances of those failure modes to help guide your analysis:
-
-Example of "Step Repetition":
-In the following trace, the Planner repeats the same thought twice in the workflow:
-Planner's Response: Thought: To address this issue, we need to understand the root cause of the 'Line3D' object not having the '_verts3d' attribute...
-Planner's Response: Thought: To address this issue, we need to understand the root cause of the 'Line3D' object not having the '_verts3d' attribute...
-
-Example of "Unaware of Termination Conditions":
-In this trace, two agents continuously repeat the same instructions despite missing information that prevents the task from progressing:
-User: "Could you provide either the total length of the ribbon or the ribbon length used for each bow?"
-Agent: "Continue. Please keep solving the problem until you need to query..."
-User: "I'm sorry, but I really need more information to solve this problem."
-Agent: "Continue. Please keep solving the problem until you need to query..."
-
-Example of "No or Incorrect Verification":
-In the following case, verification steps failed to detect an error in the game generation process:
-Error: The file 'ship.bmp' was not found in the directory /Users/user/Documents/*/ChatDev/WareHouse/TextBasedSpaceInvaders_DefaultOrganization_20250117121911.
-Traceback (most recent call last):
-  File "/Users/user/Documents/*/ChatDev/WareHouse/TextBasedSpaceInvaders_DefaultOrganization_20250117121911/main.py", line 31, in <module>
- run_game()
- File "/Users/user/Documents/*/ChatDev/WareHouse/TextBasedSpaceInvaders_DefaultOrganization_20250117121911/main.py", line 22, in run_game
- gf.create_fleet(ai_settings, screen, aliens)
- File "/Users/user/Documents/*/ChatDev/WareHouse/TextBasedSpaceInvaders_DefaultOrganization_20250117121911/game_functions.py", line 64, in create_fleet
- alien = Alien(ai_settings, screen)
- File "/Users/user/Documents/*/ChatDev/WareHouse/TextBasedSpaceInvaders_DefaultOrganization_20250117121911/alien.py", line 13, in __init__
- self.image = pygame.image.load('alien.bmp')
-FileNotFoundError: No file 'alien.bmp' found in working directory '/Users/*/Documents/*/ChatDev/'.
-
-Example of "Action-Reasoning Mismatch":
-In the following trace, the agent makes a statement that contradicts its previous reasoning:
-Agent: "Note that the `_add_prefix_for_feature_names_out` method is not explicitly shown in the code snippet, but it is mentioned in the context of the `_iter` method."
-[...]
-However, earlier in the trace, the agent had already shown the full implementation of the `_add_prefix_for_feature_names_out` method, creating a mismatch between its reasoning and the actions taken.
-"""
\ No newline at end of file
diff --git a/src/tmp/TrajFM/summary_codabench/additional_fm_clustered.csv b/src/tmp/TrajFM/summary_codabench/additional_fm_clustered.csv
deleted file mode 100644
index 1e861018..00000000
--- a/src/tmp/TrajFM/summary_codabench/additional_fm_clustered.csv
+++ /dev/null
@@ -1,199 +0,0 @@
-cluster,failure mode,title,description
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-1,Inadequate Error Handling,Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-6,Insufficient File Format Support,Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-4,Lack of Adaptive Learning,Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-4,Lack of Adaptive Learning,Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-2,Lack of Final Answer,Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-3,Inconsistent Data Retrieval,Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-1,Inadequate Error Handling,Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-2,Lack of Final Answer,Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-5,Insufficient Feedback,Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-4,Lack of Adaptive Learning,Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-1,Inadequate Error Handling,Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-0,Inadequate Finalization,Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-3,Inconsistent Data Retrieval,Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-1,Inadequate Error Handling,Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-5,Insufficient Feedback,Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-1,Inadequate Error Handling,Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-2,Lack of Final Answer,Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
diff --git a/src/tmp/TrajFM/summary_codabench/addtional_fm.csv b/src/tmp/TrajFM/summary_codabench/addtional_fm.csv
deleted file mode 100644
index cf4cd2ad..00000000
--- a/src/tmp/TrajFM/summary_codabench/addtional_fm.csv
+++ /dev/null
@@ -1,199 +0,0 @@
-title,description
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
-Inadequate Error Handling,"The system failed to handle the error caused by the incorrect file format of the downloaded sensor data, leading to an inability to proceed with the analysis."
-Insufficient File Format Support,The system's inability to support JSON file formats for the tsfm_integrated_tsad tool limited its functionality and led to the failure in analyzing the sensor data.
-Lack of Final Answer,"The system failed to provide a final answer to the user's question, despite having the necessary information to do so."
-Inadequate Contextual Understanding,"The system fails to understand the context of the task, leading to repeated actions and thoughts without progress, as seen in the repetition of thoughts and actions in the trace."
-Lack of Adaptive Learning,"The system does not adapt to new information or learn from its interactions, resulting in the repetition of the same mistakes and inefficiencies throughout the trace."
-Lack of Final Answer,"The system failed to provide a final answer to the user's question about the TTM model support, despite having the necessary information."
-Inconsistent Data Retrieval,"The system retrieved data for a different equipment (CU02013) and year (2017) instead of the specified equipment (CWC04009) and year (2020), indicating an inconsistency in data retrieval."
-Lack of Error Handling,"The system did not handle the case where no events were found for the specified equipment and date, and instead provided a summary for a different equipment and year, indicating a lack of error handling."
-Lack of Final Answer,"The system did not provide a final answer to the user's question, despite completing the necessary actions to download the sensor data."
-Insufficient Feedback,"The system did not provide sufficient feedback to the user about the outcome of the task, such as confirming whether the data was successfully downloaded."
-Inadequate Sensor Relevancy Mapping,"The system failed to accurately map the failure modes of Chiller 6 to the sensors that can detect them, specifically Chiller 6 Supply Temperature, due to incomplete or inaccurate information."
-Insufficient Failure Mode Documentation,"The system lacked comprehensive documentation of failure modes for Chiller 6, making it difficult to identify and detect specific failures, such as those related to the supply temperature sensor."
-Inadequate Finalization,"The system completed all necessary steps but failed to provide a final answer to the user's question, indicating a potential issue with task finalization or output generation."
-Redundant Data Retrieval,"The system performed multiple IoT Data Downloads that seemed redundant, such as verifying if MAIN is among the available sites after already retrieving the list of available sites, which could be optimized for efficiency."
-Insufficient Data Handling,"The system did not handle the absence of work orders for the specified equipment and years effectively, leading to an incomplete task and lack of a final answer."
-Lack of Feedback Mechanism,"There was no feedback provided to the user about the reasons for the task's incompleteness, such as the absence of work orders, which could have helped in understanding the issue and potentially resolving it."
-Inability to Handle Missing Data,"The system fails to handle the absence of required data for equipment CWC04013, leading to an inability to complete the task, as seen in the repeated attempts to retrieve historical data despite its unavailability."
-Lack of Alternative Solutions,"The system does not provide alternative solutions or suggestions when the primary task cannot be completed due to missing data, such as offering to search for similar equipment or providing a list of available sites."
diff --git a/src/tmp/TrajFM/utils.py b/src/tmp/TrajFM/utils.py
deleted file mode 100644
index 93123df1..00000000
--- a/src/tmp/TrajFM/utils.py
+++ /dev/null
@@ -1,68 +0,0 @@
-import json
-import re
-from prompt import system_prompt
-from reactxen.utils.model_inference import watsonx_llm
-
-
-def get_llm_answer_from_json(data: dict, model_id) -> str:
- """
- Given a parsed JSON dict with keys 'task', 'trajectory', and 'final_answer',
- formats the content and returns the LLM's response.
- """
- try:
- trajectory = data.get("trajectory", [])
- question = data.get("text", "[No question provided]")
- if len(trajectory) > 0:
- final_answer = trajectory[-1].get('final_answer', "[No final answer provided]")
- else:
- final_answer = "[No final answer provided]"
-
- formatted_steps = [f"Question: {question}"]
- for idx, step in enumerate(trajectory, 1):
- thought = step.get("task_description", "[No thought]")
- action = step.get("agent_name", "[No action]")
- observation = step.get("response", "[No observation]")
-
- step_text = (
- f"Thought {idx}: {thought}\n"
- f"Action {idx}: {action}\n"
- f"Observation {idx}: {observation}\n"
- )
- formatted_steps.append(step_text)
-
- formatted_steps.append(f"Answer: {final_answer}")
-
- # Combine all steps into a single formatted prompt
- final_prompt_string = "\n" + "-" * 40 + "\n".join(formatted_steps)
- prompt = system_prompt.format(trace=final_prompt_string)
-
- # Call the model inference
- # ans = watsonx_llm(prompt=prompt, model_id=16)
- ans = watsonx_llm(prompt=prompt, model_id=model_id)
- return ans
-
- except Exception as e:
- return f"Error while processing input data: {e}"
-
-
-def extract_json_from_response(response_text: str) -> dict:
- """
- Extract and parse a JSON object from LLM-generated response text,
- even if it's wrapped in text or markdown formatting.
- """
- # Try to find a JSON block inside markdown-style code fences
- match = re.search(r"```json\s*(\{.*?\})\s*```", response_text, re.DOTALL)
- if match:
- json_str = match.group(1)
- else:
- # Fallback: find the first {...} block in the response
- match = re.search(r"(\{.*\})", response_text, re.DOTALL)
- if match:
- json_str = match.group(1)
- else:
- raise ValueError("No valid JSON found in the response text.")
-
- try:
- return json.loads(json_str)
- except json.JSONDecodeError as e:
- raise ValueError(f"JSON decoding failed: {e}")
diff --git a/src/tmp/agent_hive/agents/base_agent.py b/src/tmp/agent_hive/agents/base_agent.py
deleted file mode 100644
index 17e4266d..00000000
--- a/src/tmp/agent_hive/agents/base_agent.py
+++ /dev/null
@@ -1,29 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import Any, List
-
-from langchain.tools import BaseTool
-from pydantic import Field
-
-
-class BaseAgent(ABC):
- """
- Base class for all agents.
- """
-
- name: str = Field(description="Name of the agent.")
- description: str = Field(description="Description of the agent.")
- llm: str = Field(description="LLM used by the agent.")
- memory: List[str] = Field(default_factory=list, description="Memory of the agent.")
- tools: List[BaseTool] = Field(
- default_factory=list, description="Tools the agent is limited to use."
- )
-
- @abstractmethod
- def execute_task(self, *args, **kwargs):
- pass
-
- def __str__(self):
- tool_names = (
- ", ".join(str(tool.name) for tool in self.tools) if self.tools else "None"
- )
- return f"{self.__class__.__name__}(name={self.name}, description={self.description}, llm={self.llm}, tools=[{tool_names}])"
diff --git a/src/tmp/agent_hive/agents/plan_reviewer_agent.py b/src/tmp/agent_hive/agents/plan_reviewer_agent.py
deleted file mode 100644
index 1d71120c..00000000
--- a/src/tmp/agent_hive/agents/plan_reviewer_agent.py
+++ /dev/null
@@ -1,143 +0,0 @@
-from reactxen.utils.model_inference import watsonx_llm
-from agent_hive.agents.plan_reviewer_prompt import review_plan_system_prompt_template
-import json
-import re
-from agent_hive.agents.base_agent import BaseAgent
-from typing import List, Dict
-
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-class PlanReviewerAgent(BaseAgent):
- """
- This class is responsible for evaluating the generated plan based on the given criteria.
- It uses a language model to generate a review of the plan and then parses the JSON output from the model.
- """
- name = "PlanReviewerAgent"
- description = "This agent evaluates the generated plan based on predefined criteria."
- memory = []
- tools = []
-
- def __init__(self, llm="mistralai/mistral-large", max_retries=3):
- self.llm = llm
- self.max_retries = max_retries
-
- def extract_and_parse_json_using_manual_parser(self, response):
-
- cleaned_json_str = (
- response.strip().replace("\n", " ").replace("\\n", " ").replace("\\", "")
- )
-
- # Define regular expressions to extract each part:
- status_regex = r'"status":\s*"([^"]+)"'
- reasoning_regex = r'"reasoning":\s*"([^"]+)"'
- suggestions_regex = r'"suggestions":\s*"([^"]+)"'
-
- # Extract the values using regex
- status_match = re.search(status_regex, cleaned_json_str)
- reasoning_match = re.search(reasoning_regex, cleaned_json_str)
- suggestions_match = re.search(suggestions_regex, cleaned_json_str)
-
- # Extract and display the results if found
- if status_match and reasoning_match and suggestions_match:
- status = status_match.group(1)
- reasoning = reasoning_match.group(1)
- suggestions = suggestions_match.group(1)
- return {
- "status": status,
- "reasoning": reasoning,
- "suggestions": suggestions,
- }
- else:
- return {
- "status": "Error",
- "reasoning": f"The extracted JSON block could not be parsed.",
- "suggestions": "Ensure the LLM outputs valid JSON inside the ```json``` block.",
- }
-
- def extract_and_parse_json(self, response):
- """
- Extract and parse JSON from the response.
-
- Args:
- response (str): The raw response from the LLM.
-
- Returns:
- dict: Parsed JSON object or an error report.
- """
- try:
- # Extract JSON block enclosed in ```json ... ```
- # match = re.search(r"```json(.*?)```", response, re.DOTALL)
- match = re.search(r"\{.*\}", response.strip(), re.DOTALL)
- if match:
- json_block = match.group(0).strip() # Extract and clean the JSON block
- else:
- json_block = response.strip()
-
- if not json_block:
- raise ValueError("Extracted JSON block is empty.")
-
- parsed_json = json.loads(json_block)
- return parsed_json
-
- except json.JSONDecodeError as ex:
- return {
- "status": "Error",
- "reasoning": f"The extracted JSON block could not be parsed. {ex}",
- "suggestions": "Ensure the LLM outputs valid JSON inside the ```json``` block.",
- }
-
- except ValueError as ex:
- # print(f"Value Error: {ex}")
- return {
- "status": "Error",
- "reasoning": str(ex),
- "suggestions": "Check if the extracted JSON block is empty or improperly formatted.",
- }
-
- def execute_task(self, question: str, agent_descriptions: str, plan: str):
- """
-
- Evaluate the plan based on the question and agent expertise.
-
- Args:
- question (str): The user's question.
- agent_descriptions (str): Descriptions of the agents involved.
- plan (str): The plan to evaluate.
-
- Returns:
- dict: The evaluation result.
- """
-
- prompt = review_plan_system_prompt_template.format(
- question=question,
- agent_expertise=agent_descriptions,
- plan=plan,
- )
- logger.info(f"Review Prompt: {prompt}")
- for it_index in range(self.max_retries):
- review_result = watsonx_llm(
- prompt, model_id=self.llm, stop=["\n(END OF RESPONSE)"]
- )["generated_text"]
- # logger.info(f'review_result: {review_result}')
- parsed_result = self.extract_and_parse_json(review_result)
-
- # Check if parsing succeeded
- if parsed_result.get("status") != "Error":
- return parsed_result
-
- parsed_result = self.extract_and_parse_json_using_manual_parser(
- review_result
- )
- # Check if parsing succeeded
- if parsed_result.get("status") != "Error":
- return parsed_result
-
- # Return error after exceeding retries
- return {
- "status": "Error",
- "reasoning": f"Failed to produce valid JSON after {self.max_retries} attempts.",
- "suggestions": "Review the prompt and refine the LLM response strategy.",
- }
diff --git a/src/tmp/agent_hive/agents/plan_reviewer_prompt.py b/src/tmp/agent_hive/agents/plan_reviewer_prompt.py
deleted file mode 100644
index 3cb61c2c..00000000
--- a/src/tmp/agent_hive/agents/plan_reviewer_prompt.py
+++ /dev/null
@@ -1,65 +0,0 @@
-review_plan_system_prompt_template = """You are a critical reviewer tasked with evaluating the effectiveness and accuracy of a plan. Your goal is to determine whether the plan is valid or not given the context of the input question and agent expertise. A valid plan should:
-
-1. **Ensure all necessary actions are addressed:**
- The plan must cover all required steps to successfully complete the task as specified in the question. Ensure that each action directly contributes to the task goal.
-2. **Include appropriate dependencies between steps:**
- Actions should be logically ordered with clear dependencies. Each step must rely on the completion of the previous step to ensure a coherent and efficient workflow.
-3. **Ensure no crucial steps are missed:**
- The plan must not overlook any essential actions required to solve the task. If any crucial steps are absent, the plan must be flagged as incomplete.
-4. **Confirm all actions align with agent capabilities:**
- Each step in the plan must fall within the designated expertise of the agents involved. No action should require expertise or knowledge outside of the agent's specified capabilities. Any plan that violate this condition is an invalid plan.
-5. **Strictly follow the task's question:**
- Carefully compare the provided question with the task. The plan should only include actions that directly relate to the question's explicit requirements, without introducing any unnecessary tasks or assumptions.
-6. **Avoid Abstract task/step:**
- Ensure steps/tasks are grounded with respect to the data generated by previous steps or the question.
-
-### Evaluation Criteria:
-1. **Completeness:**
- - Verify that the system prompt leads to a plan that includes all necessary steps to accomplish the task.
- - Ensure the description of each step contains all the relevant information needed to execute the step, including any required parameters or inputs that are mentioned in the task's question.
-
-2. **Relevance:**
- - Confirm that each step in the plan directly contributes to solving the task.
- - Eliminate any steps that do not serve a clear purpose in achieving the goal.
-
-3. **Correctness:**
- - Ensure that all steps are logically consistent and ordered correctly.
- - Ensure that the dependencies between the steps are valid and follow a correct sequence.
-
-4. **Expertise Alignment:**
- - Confirm that the steps in the plan are within the capabilities of the agent.
- - Validate that the agents used in each steps are among the available agents mentioned in the agents' expertise.
-
-5. **Efficiency:**
- - Make sure the plan doesn't introduce redundant actions.
- - Avoid unnecessary complexity in the plan.
-
-6. **Clarity:**
- - Ensure that the plan is easy to understand and logically structured.
-
----
-
-**Question:**
-{question}
-
-**Agents' Expertise:**
-{agent_expertise}
-
-**Plan:**
-{plan}
-
----
-
-### Output Format:
-Your review must always be in JSON format. Do not include any additional formatting or Markdown in your response.
-
-```json
-{{
- "status": "Valid | Invalid | Other",
- "reasoning": "A concise explanation for your evaluation. If a specific step is wrong, point it out directly.",
- "suggestions": "Actions or improvements for rectifying the plan if applicable."
-}}
-```
-
-Output:
-"""
diff --git a/src/tmp/agent_hive/agents/react_agent.py b/src/tmp/agent_hive/agents/react_agent.py
deleted file mode 100644
index 34e77cea..00000000
--- a/src/tmp/agent_hive/agents/react_agent.py
+++ /dev/null
@@ -1,51 +0,0 @@
-import re
-from typing import List, Optional
-
-from langchain.tools import BaseTool
-from reactxen.agents.react.agents import ReactAgent as ReactXenReactAgent
-from reactxen.agents.react.prompts.fewshots import MPE_SIMPLE4
-
-from agent_hive.agents.base_agent import BaseAgent
-
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-
-class ReactAgent(BaseAgent):
- """
- This class represents a React agent that can execute a task based on user input.
- It uses a list of tools to execute the task.
- React agent can use multiple tools to execute the task.
- """
- few_shots: Optional[str] = None
- task_examples: Optional[List[str]] = None
-
- def __init__(self, name: str, description: str, tools: list[BaseTool], llm: str, few_shots: str = MPE_SIMPLE4,
- task_examples: Optional[List[str]] = None):
- self.name = name
- self.description = description
- self.tools = tools
- self.llm = llm
- self.memory = []
- self.few_shots = few_shots
- if task_examples:
- self.task_examples = task_examples
- else:
- self.task_examples = re.findall(r"^Question:(.*)$", self.few_shots, re.MULTILINE)
- self.task_examples = [ex.strip() for ex in self.task_examples]
-
- def execute_task(self, user_input):
- logger.info(f'ReactAgent is executing task: {user_input}, with Tools {self.tools}')
- self.agent_executor = ReactXenReactAgent(
- debug=False,
- question=user_input,
- key="",
- cbm_tools=self.tools,
- max_steps=15,
- react_llm_model_id=self.llm,
- react_example=self.few_shots,
- )
- self.agent_executor.run()
- return self.agent_executor.answer
diff --git a/src/tmp/agent_hive/agents/react_reflect_agent.py b/src/tmp/agent_hive/agents/react_reflect_agent.py
deleted file mode 100644
index d4fb9073..00000000
--- a/src/tmp/agent_hive/agents/react_reflect_agent.py
+++ /dev/null
@@ -1,53 +0,0 @@
-import re
-from typing import List, Optional
-
-from langchain.tools import BaseTool
-from reactxen.agents.react.agents import ReactReflectAgent as ReactReflectXenAgent
-from reactxen.agents.react.prompts.fewshots import MPE_SIMPLE4
-
-from agent_hive.agents.base_agent import BaseAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-class ReactReflectAgent(BaseAgent):
- """
- This class represents a ReactReflect agent that can execute a task based on user input.
- It uses a list of tools to execute the task.
- React agent can use multiple tools to execute the task.
- """
- few_shots: Optional[str] = None
- task_examples: Optional[List[str]] = None
-
- def __init__(self, name: str, description: str, tools: list[BaseTool], llm: str, few_shots: str = MPE_SIMPLE4,
- task_examples: Optional[List[str]] = None):
- self.name = name
- self.description = description
- self.tools = tools
- self.llm = llm
- self.memory = []
- self.few_shots = few_shots
- if task_examples:
- self.task_examples = task_examples
- else:
- self.task_examples = re.findall(r"^Question:(.*)$", self.few_shots, re.MULTILINE)
- self.task_examples = [ex.strip() for ex in self.task_examples]
-
- def execute_task(self, user_input):
- logger.info(f'ReactReflectAgent is executing task: {user_input}, with Tools {self.tools}')
- self.agent_executor = ReactReflectXenAgent(
- question=user_input,
- key="",
- cbm_tools=self.tools,
- max_steps=6,
- react_llm_model_id=self.llm,
- reflect_llm_model_id=self.llm,
- react_example=self.few_shots,
- num_reflect_iteration=5,
- handle_context_length_overflow=True,
- apply_loop_detection_check=True,
- log_structured_messages=True,
- early_stop=True,
- )
- self.agent_executor.run()
- return self.agent_executor.answer
diff --git a/src/tmp/agent_hive/agents/simple_agent.py b/src/tmp/agent_hive/agents/simple_agent.py
deleted file mode 100644
index bdc940bf..00000000
--- a/src/tmp/agent_hive/agents/simple_agent.py
+++ /dev/null
@@ -1,50 +0,0 @@
-from reactxen.utils.model_inference import watsonx_llm
-from agent_hive.utils import json_parser
-from langchain.tools import BaseTool
-from agent_hive.agents.base_agent import BaseAgent
-from agent_hive.logger import logger
-
-
-class SimpleAgent(BaseAgent):
- """
- This class represents a simple agent that can execute a task based on user input.
- It uses a list of tools to execute the task.
- Only one tool is executed at a time.
- """
- def __init__(self, name: str, description: str, tools: list[BaseTool], llm: str):
- self.name = name
- self.description = description
- self.tools = tools
- self.llm = llm
- self.memory = []
-
- def execute_task(self, user_input):
- tool_descriptions = "\n".join([f"- {tool.name}: {tool.description}" for tool in self.tools])
- response_format = {"action": "", "args": ""}
-
- prompt = f"""Task:
- {user_input}
-
- Available tools:
- {tool_descriptions}
-
- Based on the user's input and context, decide if you should use a tool or respond directly.
- If you identify an action, respond with the tool name and the arguments for the tool.
- If you decide to respond directly to the user then make the action "respond_to_user" with args as your response in the following format.
-
- Response Format:
- {response_format}
-
- Response:
- """
- logger.info(f"Prompt: {prompt}")
- response = watsonx_llm(prompt, model_id=self.llm, )['generated_text']
- logger.info(f"Agent Response: {response}")
-
- response_dict = json_parser(response)
-
- for tool in self.tools:
- if tool.name.lower() == response_dict["action"].lower():
- return tool.run(response_dict["args"])
-
- return response_dict
diff --git a/src/tmp/agent_hive/agents/summarization_agent.py b/src/tmp/agent_hive/agents/summarization_agent.py
deleted file mode 100644
index 20cf2d24..00000000
--- a/src/tmp/agent_hive/agents/summarization_agent.py
+++ /dev/null
@@ -1,57 +0,0 @@
-from reactxen.utils.model_inference import watsonx_llm
-from agent_hive.agents.plan_reviewer_prompt import review_plan_system_prompt_template
-import json
-import re
-from agent_hive.agents.base_agent import BaseAgent
-from typing import List, Dict
-
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-class SummarizationAgent(BaseAgent):
- """
- This class is responsible for summarizing the given text and answer the question.
- """
-
- name = "SummarizationAgent"
- description = "This agent summarizes the given text and answers the question."
- memory = []
- tools = []
-
- def __init__(self, llm="mistralai/mistral-large", max_retries=3):
- self.llm = llm
- self.max_retries = max_retries
-
- def execute_task(self, user_input):
- """
- This function execute the task by summarizing the question and provided context and then generating an answer.
-
- Args:
- user_input (str): The input question to be summarized and answered.
-
- """
-
- summarization_prompt = f"""Given the following context, extract the most relevant answer to the question provided. If the answer is not explicitly stated, infer it based on the context without adding external information. Output only the answer and do not include any additional text. If the question cannot be answered, respond with "Not enough information available."
-
-Question:
-{user_input}
-
-Answer:
-"""
- logger.info(f"Summarization Prompt: {summarization_prompt}")
- for it_index in range(self.max_retries):
- result = watsonx_llm(
- summarization_prompt,
- model_id=self.llm,
- temperature=it_index / 10,
- stop=["\n(END OF RESPONSE)"],
- )["generated_text"]
- if "not enough information available" not in result.lower():
- return result.strip()
- logger.warning(
- f"Retrying due to insufficient information. Attempt {it_index}/{self.max_retries}"
- )
-
- return ""
diff --git a/src/tmp/agent_hive/agents/wo_agent.py b/src/tmp/agent_hive/agents/wo_agent.py
deleted file mode 100644
index fde5696f..00000000
--- a/src/tmp/agent_hive/agents/wo_agent.py
+++ /dev/null
@@ -1,65 +0,0 @@
-import re
-from typing import List, Optional
-
-from langchain.tools import BaseTool
-from reactxen.agents.react.agents import ReactReflectAgent as ReactReflectXenAgent
-from reactxen.agents.react.prompts.fewshots import MPE_SIMPLE4
-
-from agent_hive.agents.base_agent import BaseAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-class WorderOrderAgent(BaseAgent):
- """
- Work Order agent that can retrieve, analyze, and generate work orders for equipment.
-
- Uses ReAct+Reflection to reason over work order history, anomalies, and alerts,
- and produces recommendations for preventive/corrective maintenance actions.
- """
-
- few_shots: Optional[str] = None
- task_examples: Optional[List[str]] = None
-
- def __init__(
- self,
- name: str,
- description: str,
- tools: list[BaseTool],
- llm: str,
- few_shots: str = MPE_SIMPLE4,
- task_examples: Optional[List[str]] = None,
- reflect_step: int = 1,
- ):
- self.name = name
- self.description = description
- self.tools = tools
- self.llm = llm
- self.memory = []
- self.few_shots = few_shots
- self.reflect_step = reflect_step
- if task_examples:
- self.task_examples = task_examples
- else:
- self.task_examples = re.findall(r"^Question:(.*)$", self.few_shots, re.MULTILINE)
- self.task_examples = [ex.strip() for ex in self.task_examples]
-
- def execute_task(self, user_input: str) -> str:
- logger.info(f"WorderOrderAgent executing task: {user_input}, tools: {self.tools}")
- self.agent_executor = ReactReflectXenAgent(
- question=user_input,
- key="",
- cbm_tools=self.tools,
- max_steps=6,
- react_llm_model_id=self.llm,
- reflect_llm_model_id=self.llm,
- react_example=self.few_shots,
- num_reflect_iteration=self.reflect_step,
- handle_context_length_overflow=True,
- apply_loop_detection_check=True,
- log_structured_messages=True,
- early_stop=True,
- )
- self.agent_executor.run()
- return self.agent_executor.answer
diff --git a/src/tmp/agent_hive/enum.py b/src/tmp/agent_hive/enum.py
deleted file mode 100644
index 4f5a3aae..00000000
--- a/src/tmp/agent_hive/enum.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from enum import Enum
-
-
-class Color(Enum):
- red = 1
- green = 2
- blue = 3
-
-
-class ContextType(Enum):
- DISABLED = 1
- ALL = 2
- SELECTED = 3
- PREVIOUS = 4
-
diff --git a/src/tmp/agent_hive/logger.py b/src/tmp/agent_hive/logger.py
deleted file mode 100644
index 5bc10ad1..00000000
--- a/src/tmp/agent_hive/logger.py
+++ /dev/null
@@ -1,32 +0,0 @@
-import logging
-
-# Function to get the custom logger
-def get_custom_logger(name):
- # Create a logger object
- logger = logging.getLogger(name)
- logger.setLevel(logging.DEBUG) # Set the minimum log level
-
- allowed_logger = [__name__, 'reactxen', 'cbm_gen', 'agent_hive']
- for log_name, log_obj in logging.Logger.manager.loggerDict.items():
- if log_name not in allowed_logger:
- logging.getLogger(log_name).setLevel(logging.INFO)
-
- logging.getLogger("ibm_watsonx_ai").disabled = True
- logging.getLogger("httpx").disabled = True
-
- # Check if handlers are already added to avoid duplicate logs
- if not logger.handlers:
- # Create console handler and set level to debug
- ch = logging.StreamHandler()
- ch.setLevel(logging.DEBUG)
-
- # Create formatter
- formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-
- # Add formatter to ch
- ch.setFormatter(formatter)
-
- # Add ch to logger
- logger.addHandler(ch)
-
- return logger
\ No newline at end of file
diff --git a/src/tmp/agent_hive/task.py b/src/tmp/agent_hive/task.py
deleted file mode 100644
index 4ce84d33..00000000
--- a/src/tmp/agent_hive/task.py
+++ /dev/null
@@ -1,26 +0,0 @@
-from typing import List, Optional
-
-from pydantic import Field
-
-from agent_hive.agents.base_agent import BaseAgent
-
-
-class Task:
- description: str = Field(description="Description of the actual task.")
- agents: List[BaseAgent] = Field(description="Agents responsible for executing the task.")
- expected_output: Optional[str] = Field(default=None,
- description="Clear definition of expected output for the task.")
- context: Optional[List["Task"]] = Field(
- description="Other tasks that will have their output used as context for this task.",
- default=None,
- )
-
- def __init__(self, description: str, agents: List[BaseAgent], expected_output: Optional[str] = None,
- context: Optional[List['Task']] = None):
- self.description = description
- self.agents = agents
- self.expected_output = expected_output
- self.context = context
-
- def __str__(self):
- return f"Task(description={self.description}, agents={self.agents}, expected_output={self.expected_output}, context={self.context})"
diff --git a/src/tmp/agent_hive/tests/evaluate.py b/src/tmp/agent_hive/tests/evaluate.py
deleted file mode 100644
index f60949d4..00000000
--- a/src/tmp/agent_hive/tests/evaluate.py
+++ /dev/null
@@ -1,105 +0,0 @@
-from dotenv import load_dotenv
-
-load_dotenv()
-
-from reactxen.utils.model_inference import azure_openai_llm, watsonx_llm
-import os
-import json
-from reactxen.agents.evaluation_agent.agent import EvaluationAgent
-import re
-
-
-ret = watsonx_llm(
- prompt="what is the capital of usa",
- model_id="openai-azure/gpt-4.1-2025-04-14",
-)
-
-print(ret)
-
-# id range
-fmsr_range = [101, 120]
-iot_range = [1, 48]
-tsfm_range = [201, 223]
-wo_range = [400, 435]
-multi_range = []
-utterances = {}
-
-
-def load_utterances():
- utterance_files = [
- "./scenarios/single_agent/iot_utterance_meta.json",
- "./scenarios/single_agent/fmsr_utterance.json",
- "./scenarios/single_agent/tsfm_utterance.json",
- "./scenarios/single_agent/wo_utterance.json",
- "./scenarios/multi_agent/end2end_utterance.json",
- ]
- for filepath in utterance_files:
- with open(filepath, "r") as f:
- try:
- data = json.load(f)
- for d in data:
- utterances[d["id"]] = d
- except json.JSONDecodeError as e:
- print(f"Error decoding {filepath}: {e}")
- return utterances
-
-
-def llm_eval(data, characteristic):
- eval_agent = EvaluationAgent(model_id="meta-llama/llama-4-maverick-17b-128e-instruct-fp8")
- assert characteristic["text"] == data['text']
- print(characteristic["text"])
- print(data['text'])
- print('='*10)
- agent_think = 'The agent executes the following steps: '
- for item in data['trajectory']:
- agent_think += f"{item['task_number']}. task: {item['task_description']}; agent: {item['agent_name']}; response: {item['response']}. "
- final_answer = data['trajectory'][-1]['response']
- if final_answer.strip() == "":
- final_answer = data['trajectory'][-2]['response']
- try:
- review_resultFull = eval_agent.evaluate_response(
- question=characteristic["text"],
- agent_think=agent_think,
- agent_response=final_answer,
- characteristic_answer=characteristic["characteristic_form"],
- )
- print(review_resultFull)
-
- return review_resultFull
- except Exception as e:
- print(f"EXCEPTION: {e}")
-
- return None
-
-
-utterances = load_utterances()
-print(utterances)
-
-# Directory path
-directory = (
- "./src/agent_hive/tests/multi_agent/trajectory/ReactReflectAgent"
-)
-
-# Store all loaded data
-all_logs = {}
-
-# Loop through all files in the directory
-for filename in os.listdir(directory):
- if filename.endswith(".json"):
- filepath = os.path.join(directory, filename)
- with open(filepath, "r") as f:
- try:
- data = json.load(f)
- all_logs[filename] = data
- # eval
- if 'mvk_evaluation' not in data:
- evaluation_res = llm_eval(data, utterances[int(re.findall(r'\d+', filename)[1])])
- if evaluation_res:
- data["mvk_evaluation"] = evaluation_res
-
- with open(filepath, "w") as f:
- json.dump(data, f, indent=4)
- except json.JSONDecodeError as e:
- print(f"Error decoding {filename}: {e}")
-
-print(f"Loaded {len(all_logs)} JSON files.")
diff --git a/src/tmp/agent_hive/tests/multi_agent/run.py b/src/tmp/agent_hive/tests/multi_agent/run.py
deleted file mode 100644
index 44513df1..00000000
--- a/src/tmp/agent_hive/tests/multi_agent/run.py
+++ /dev/null
@@ -1,153 +0,0 @@
-import argparse
-import json
-import os
-import time
-from tenacity import retry, stop_after_attempt, wait_fixed
-
-from dotenv import load_dotenv
-
-load_dotenv()
-
-from agent_hive.task import Task
-
-from agent_hive.tools.skyspark import (
- iot_bms_tools,
- iot_bms_fewshots,
- iot_agent_description,
- iot_agent_name,
-)
-from agent_hive.tools.fmsr import (
- fmsr_tools,
- fmsr_fewshots,
- fmsr_task_examples,
- fmsr_agent_name,
- fmsr_agent_description,
-)
-from agent_hive.tools.tsfm import (
- tsfm_tools,
- tsfm_fewshots,
- tsfm_agent_name,
- tsfm_agent_description,
-)
-from agent_hive.tools.wo import (
- wo_agent_description,
- wo_agent_name,
- wo_fewshots,
- wo_tools,
-)
-from agent_hive.workflows.planning_review import PlanningReviewWorkflow
-
-from agent_hive.workflows.sequential import SequentialWorkflow
-from agent_hive.agents.react_reflect_agent import ReactReflectAgent
-from agent_hive.agents.wo_agent import WorderOrderAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-import warnings
-
-warnings.filterwarnings("ignore")
-
-PLAN_PREFIX = os.path.dirname(os.path.abspath(__file__)) + "/plan/"
-
-
-@retry(stop=stop_after_attempt(7), wait=wait_fixed(2))
-def run_planning_workflow(question, llm_model, qid):
- iot_rr_agent = ReactReflectAgent(
- name=iot_agent_name,
- description=iot_agent_description,
- tools=iot_bms_tools,
- llm=llm_model,
- few_shots=iot_bms_fewshots,
- reflect_step=1
- )
-
- fmsr_rr_agent = ReactReflectAgent(
- name=fmsr_agent_name,
- description=fmsr_agent_description,
- tools=fmsr_tools,
- llm=llm_model,
- task_examples=fmsr_task_examples,
- few_shots=fmsr_fewshots,
- reflect_step=1
- )
-
- tsfm_rr_agent = ReactReflectAgent(
- name=tsfm_agent_name,
- description=tsfm_agent_description,
- tools=tsfm_tools,
- llm=llm_model,
- few_shots=tsfm_fewshots,
- reflect_step=1
- )
-
- wo_rr_agent = WorderOrderAgent(
- name=wo_agent_name,
- description=wo_agent_description,
- tools=wo_tools,
- llm=llm_model,
- few_shots=wo_fewshots,
- reflect_step=1
- )
-
- task_1 = Task(
- description=question,
- expected_output="",
- agents=[iot_rr_agent, fmsr_rr_agent, tsfm_rr_agent, wo_rr_agent],
- )
-
- wf = PlanningReviewWorkflow(
- tasks=[task_1],
- llm=llm_model
- )
-
- return wf.run(enable_summarization=True)
-
-
-
-def run_react_reflect(utterance_file, react_llm_model_id, reverse=False):
- with open(utterance_file, "r") as json_file:
- data = json.load(json_file)
-
- if reverse:
- data = data[::-1]
-
- for utterance in data:
- print(
- f"ID: {utterance['id']}, Text: {utterance['text']}, model: {react_llm_model_id}, ReactReflectAgent..."
- )
- trajectory_file = f"trajectory/ReactReflectAgent/Model_{react_llm_model_id}_Q_{utterance['id']}_trajectory_output.json"
-
- if os.path.exists(trajectory_file):
- print(f"Skipping {utterance['id']}")
- continue
-
- start_time = time.time()
- ans = run_planning_workflow(
- utterance["text"],
- react_llm_model_id,
- utterance["id"],
- )
-
- end_time = time.time()
- runtime = end_time - start_time
-
- output = {
- "id": utterance["id"],
- "text": utterance["text"],
- "runtime": runtime,
- "trajectory": ans,
- }
-
- with open(trajectory_file, "w") as file:
- json.dump(output, file, indent=4)
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--llm", type=int, default=15)
- parser.add_argument('-f', type=str)
- parser.add_argument("--reverse", action="store_true")
-
- args = parser.parse_args()
- run_react_reflect(args.f, args.llm, reverse=args.reverse)
diff --git a/src/tmp/agent_hive/tests/multi_agent/runall.sh b/src/tmp/agent_hive/tests/multi_agent/runall.sh
deleted file mode 100644
index aff3939c..00000000
--- a/src/tmp/agent_hive/tests/multi_agent/runall.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-python run.py --llm 6 -f ./scenarios/multi_agent/end2end_utterance.json
-python run.py --llm 7 -f ./scenarios/multi_agent/end2end_utterance.json
-python run.py --llm 12 -f ./scenarios/multi_agent/end2end_utterance.json
-python run.py --llm 16 -f ./scenarios/multi_agent/end2end_utterance.json
-python run.py --llm 17 -f ./scenarios/multi_agent/end2end_utterance.json
-python run.py --llm 19 -f ./scenarios/multi_agent/end2end_utterance.json
\ No newline at end of file
diff --git a/src/tmp/agent_hive/tests/single_agent/run.py b/src/tmp/agent_hive/tests/single_agent/run.py
deleted file mode 100644
index 44513df1..00000000
--- a/src/tmp/agent_hive/tests/single_agent/run.py
+++ /dev/null
@@ -1,153 +0,0 @@
-import argparse
-import json
-import os
-import time
-from tenacity import retry, stop_after_attempt, wait_fixed
-
-from dotenv import load_dotenv
-
-load_dotenv()
-
-from agent_hive.task import Task
-
-from agent_hive.tools.skyspark import (
- iot_bms_tools,
- iot_bms_fewshots,
- iot_agent_description,
- iot_agent_name,
-)
-from agent_hive.tools.fmsr import (
- fmsr_tools,
- fmsr_fewshots,
- fmsr_task_examples,
- fmsr_agent_name,
- fmsr_agent_description,
-)
-from agent_hive.tools.tsfm import (
- tsfm_tools,
- tsfm_fewshots,
- tsfm_agent_name,
- tsfm_agent_description,
-)
-from agent_hive.tools.wo import (
- wo_agent_description,
- wo_agent_name,
- wo_fewshots,
- wo_tools,
-)
-from agent_hive.workflows.planning_review import PlanningReviewWorkflow
-
-from agent_hive.workflows.sequential import SequentialWorkflow
-from agent_hive.agents.react_reflect_agent import ReactReflectAgent
-from agent_hive.agents.wo_agent import WorderOrderAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-import warnings
-
-warnings.filterwarnings("ignore")
-
-PLAN_PREFIX = os.path.dirname(os.path.abspath(__file__)) + "/plan/"
-
-
-@retry(stop=stop_after_attempt(7), wait=wait_fixed(2))
-def run_planning_workflow(question, llm_model, qid):
- iot_rr_agent = ReactReflectAgent(
- name=iot_agent_name,
- description=iot_agent_description,
- tools=iot_bms_tools,
- llm=llm_model,
- few_shots=iot_bms_fewshots,
- reflect_step=1
- )
-
- fmsr_rr_agent = ReactReflectAgent(
- name=fmsr_agent_name,
- description=fmsr_agent_description,
- tools=fmsr_tools,
- llm=llm_model,
- task_examples=fmsr_task_examples,
- few_shots=fmsr_fewshots,
- reflect_step=1
- )
-
- tsfm_rr_agent = ReactReflectAgent(
- name=tsfm_agent_name,
- description=tsfm_agent_description,
- tools=tsfm_tools,
- llm=llm_model,
- few_shots=tsfm_fewshots,
- reflect_step=1
- )
-
- wo_rr_agent = WorderOrderAgent(
- name=wo_agent_name,
- description=wo_agent_description,
- tools=wo_tools,
- llm=llm_model,
- few_shots=wo_fewshots,
- reflect_step=1
- )
-
- task_1 = Task(
- description=question,
- expected_output="",
- agents=[iot_rr_agent, fmsr_rr_agent, tsfm_rr_agent, wo_rr_agent],
- )
-
- wf = PlanningReviewWorkflow(
- tasks=[task_1],
- llm=llm_model
- )
-
- return wf.run(enable_summarization=True)
-
-
-
-def run_react_reflect(utterance_file, react_llm_model_id, reverse=False):
- with open(utterance_file, "r") as json_file:
- data = json.load(json_file)
-
- if reverse:
- data = data[::-1]
-
- for utterance in data:
- print(
- f"ID: {utterance['id']}, Text: {utterance['text']}, model: {react_llm_model_id}, ReactReflectAgent..."
- )
- trajectory_file = f"trajectory/ReactReflectAgent/Model_{react_llm_model_id}_Q_{utterance['id']}_trajectory_output.json"
-
- if os.path.exists(trajectory_file):
- print(f"Skipping {utterance['id']}")
- continue
-
- start_time = time.time()
- ans = run_planning_workflow(
- utterance["text"],
- react_llm_model_id,
- utterance["id"],
- )
-
- end_time = time.time()
- runtime = end_time - start_time
-
- output = {
- "id": utterance["id"],
- "text": utterance["text"],
- "runtime": runtime,
- "trajectory": ans,
- }
-
- with open(trajectory_file, "w") as file:
- json.dump(output, file, indent=4)
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--llm", type=int, default=15)
- parser.add_argument('-f', type=str)
- parser.add_argument("--reverse", action="store_true")
-
- args = parser.parse_args()
- run_react_reflect(args.f, args.llm, reverse=args.reverse)
diff --git a/src/tmp/agent_hive/tests/single_agent/runall.sh b/src/tmp/agent_hive/tests/single_agent/runall.sh
deleted file mode 100644
index 1f90ab0e..00000000
--- a/src/tmp/agent_hive/tests/single_agent/runall.sh
+++ /dev/null
@@ -1,32 +0,0 @@
-python run.py --llm 6 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 7 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 12 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 16 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 17 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 19 -f ./scenarios/single_agent/iot_utterance_meta.json
-
-python run.py --llm 6 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 7 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 12 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 16 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 17 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 19 -f ./scenarios/single_agent/fmsr_utterance.json
-
-python run.py --llm 6 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 7 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 12 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 16 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 17 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 19 -f ./scenarios/single_agent/tsfm_utterance.json
-
-python run.py --llm 6 -f ./scenarios/single_agent/wo_utterance.json
-python run.py --llm 7 -f ./scenarios/single_agent/wo_utterance.json
-python run.py --llm 12 -f ./scenarios/single_agent/wo_utterance.json
-python run.py --llm 16 -f ./scenarios/single_agent/wo_utterance.json
-python run.py --llm 17 -f ./scenarios/single_agent/wo_utterance.json
-python run.py --llm 19 -f ./scenarios/single_agent/wo_utterance.json
-
-python run.py --llm 18 -f ./scenarios/single_agent/iot_utterance_meta.json
-python run.py --llm 18 -f ./scenarios/single_agent/fmsr_utterance.json
-python run.py --llm 18 -f ./scenarios/single_agent/tsfm_utterance.json
-python run.py --llm 18 -f ./scenarios/single_agent/wo_utterance.json
diff --git a/src/tmp/agent_hive/tools/__init__.py b/src/tmp/agent_hive/tools/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/src/tmp/agent_hive/tools/fmsr.py b/src/tmp/agent_hive/tools/fmsr.py
deleted file mode 100644
index da599785..00000000
--- a/src/tmp/agent_hive/tools/fmsr.py
+++ /dev/null
@@ -1,27 +0,0 @@
-from reactxen.tools.jsonreader.jsonreader import JSONReader
-from reactxen.tools.jsonreader.jsonwrapper import JSONWrapperFunctions
-
-from fmsr_agent.agent.react import get_fmsr_task_examples, get_fmsr_fewshots
-from fmsr_agent.tools import (
- GetFailureModeTool,
- GetFailureModeAndSensorMappingTool,
- GetSensorMetadataTool,
-)
-
-fns = JSONWrapperFunctions()
-jsonReader = JSONReader(functions=fns)
-
-# fmsr_tools = [GetFailureModeTool(), GetFailureModeAndSensorMappingTool(), GetSensorMetadataTool(), jsonReader]
-
-fmsr_agent_name = "Failure Mode and Sensor Relevancy Expert for Industrial Asset"
-fmsr_agent_description = (
- "Can provide information about failure modes, mapping between failure modes and sensors, "
- "and can generate machine learning recipes for specific failures"
-)
-fmsr_tools = [
- GetFailureModeTool(),
- GetFailureModeAndSensorMappingTool(),
- GetSensorMetadataTool(),
-]
-fmsr_fewshots = get_fmsr_fewshots()
-fmsr_task_examples = get_fmsr_task_examples()
diff --git a/src/tmp/agent_hive/tools/skyspark.py b/src/tmp/agent_hive/tools/skyspark.py
deleted file mode 100644
index 15f78292..00000000
--- a/src/tmp/agent_hive/tools/skyspark.py
+++ /dev/null
@@ -1,8 +0,0 @@
-from iotagent.demo.skysparkfewshots import SKYSPARK1
-from iotagent.demo.run_reactreflect import getTools
-
-iot_agent_name = 'IoT Data Download'
-iot_agent_description = ('Can provide information about IoT sites, asset details, sensor data, and retrieve historical '
- 'data and metadata for various assets and equipment')
-iot_tools = getTools()[0]
-iot_fewshots = SKYSPARK1
diff --git a/src/tmp/agent_hive/tools/time.py b/src/tmp/agent_hive/tools/time.py
deleted file mode 100644
index 26b87db9..00000000
--- a/src/tmp/agent_hive/tools/time.py
+++ /dev/null
@@ -1,19 +0,0 @@
-from langchain_core.tools import BaseTool
-from datetime import datetime
-from typing import Any
-
-
-class TimeTool(BaseTool):
- name: str = "time"
- description: str = "Get the current time"
-
- def _run(self, *args: Any) -> str:
- return "The current time is " + str(datetime.now())
-
-
-def get_time_tools():
- return [TimeTool()]
-
-
-def get_time_agent_name():
- return "Time Agent"
diff --git a/src/tmp/agent_hive/tools/tsfm.py b/src/tmp/agent_hive/tools/tsfm.py
deleted file mode 100644
index 90052294..00000000
--- a/src/tmp/agent_hive/tools/tsfm.py
+++ /dev/null
@@ -1,9 +0,0 @@
-from tsfmagent.agents.tsfmagent.tsfmfewshots import TSFM_FEW_SHOTS
-from tsfmagent.agents.tsfmagent.tsfm_agent import getTSFMTools
-
-tsfm_agent_name = 'Time Series Analytics and Forecasting'
-tsfm_agent_description = ('Can assist with time series analysis, forecasting, anomaly '
- 'detection, and model selection, and supports pretrained models, context length '
- 'specifications, and regression tasks for various time series data')
-tsfm_tools = getTSFMTools()
-tsfm_fewshots = TSFM_FEW_SHOTS
diff --git a/src/tmp/agent_hive/tools/wo.py b/src/tmp/agent_hive/tools/wo.py
deleted file mode 100644
index 1befea5b..00000000
--- a/src/tmp/agent_hive/tools/wo.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from woagent.agents.woagent.wofewshots import WO_FEW_SHOTS
-from woagent.agents.woagent.wo_agent import getWOTools
-
-wo_agent_name = "Work Order Management"
-wo_agent_description = (
- "Can retrieve, analyze, and generate work orders for equipment based on historical data, "
- "anomalies, alerts, and performance metrics, offering recommendations for preventive and "
- "corrective actions, including bundling, prioritization, and predictive maintenance"
-)
-wo_tools = getWOTools()
-wo_fewshots = WO_FEW_SHOTS
diff --git a/src/tmp/agent_hive/utils.py b/src/tmp/agent_hive/utils.py
deleted file mode 100644
index 74887c83..00000000
--- a/src/tmp/agent_hive/utils.py
+++ /dev/null
@@ -1,15 +0,0 @@
-import ast
-import json
-
-
-def json_parser(input_string):
- input_string = input_string.strip()
- input_string = input_string.replace("```json", "").replace("```", "")
- python_dict = ast.literal_eval(input_string)
- json_string = json.dumps(python_dict)
- json_dict = json.loads(json_string)
-
- if isinstance(json_dict, (dict, list)):
- return json_dict
-
- raise ValueError("Invalid JSON response")
diff --git a/src/tmp/agent_hive/workflows/__init__.py b/src/tmp/agent_hive/workflows/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/src/tmp/agent_hive/workflows/base_workflow.py b/src/tmp/agent_hive/workflows/base_workflow.py
deleted file mode 100644
index 8b8d7033..00000000
--- a/src/tmp/agent_hive/workflows/base_workflow.py
+++ /dev/null
@@ -1,16 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import List
-from pydantic import Field
-from agent_hive.task import Task
-
-
-class Workflow(ABC):
- """Base class for all workflows."""
-
- tasks: List[Task] = Field(description="List of tasks to execute.")
- memory: List[str] = Field(default=[], description="Memory of the workflow.")
- max_memory: int = Field(default=10, description="Maximum memory size of the workflow.")
-
- @abstractmethod
- def run(self, *args, **kwargs):
- pass
diff --git a/src/tmp/agent_hive/workflows/planning.py b/src/tmp/agent_hive/workflows/planning.py
deleted file mode 100644
index a40e354c..00000000
--- a/src/tmp/agent_hive/workflows/planning.py
+++ /dev/null
@@ -1,193 +0,0 @@
-from agent_hive.task import Task
-from pydantic import Field
-from typing import List
-from agent_hive.enum import ContextType
-import json
-from agent_hive.workflows.base_workflow import Workflow
-from reactxen.utils.model_inference import watsonx_llm
-import re
-
-from agent_hive.workflows.sequential import SequentialWorkflow
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-
-class PlanningWorkflow(Workflow):
- """
- This class represents a planning workflow, where the (parent) task is decomposed into multiple subtasks by
- LLM-based ReWOO-style planning and then executed sequentially.
-
- Example:
- agent1 = ...
- agent2 = ...
-
- task = Task(..., agents=[agent1, agent2], ...)
-
- workflow = PlanningWorkflow(tasks=[task], ...)
- workflow.run()
- """
-
- llm: str = Field(description="LLM used by the task planning.")
-
- def __init__(self, tasks: List[Task], llm: str):
- self.tasks = tasks
- self.memory = []
- self.max_memory = 10
- self.llm = llm
- self._verify_tasks()
-
- def _verify_tasks(self):
- if not isinstance(self.tasks, list):
- raise ValueError("tasks must be a list of Task objects")
- if len(self.tasks) != 1:
- raise ValueError("Planning only supports one task")
- task = self.tasks[0]
- if task.agents is None or len(task.agents) < 1:
- raise ValueError("Task must have at least one agent")
-
- def run(self, enable_summarization=False):
- generated_steps = self.generate_steps()
- if enable_summarization:
- from agent_hive.agents.summarization_agent import SummarizationAgent
-
- summarization_task = Task(
- description=self.tasks[0].description,
- expected_output=self.tasks[0].expected_output,
- agents=[SummarizationAgent(llm=self.llm)],
- context=generated_steps[:],
- )
- generated_steps.append(summarization_task)
-
- sequential_workflow = SequentialWorkflow(
- tasks=generated_steps, context_type=ContextType.SELECTED
- )
-
- return sequential_workflow.run()
-
- def generate_steps(self, save_plan=False, saved_plan_filename=''):
- task = self.tasks[0]
- agent_descriptions = ""
-
- for ii, aagent in enumerate(task.agents):
- agent_descriptions += f"\n({ii + 1}) Agent name: {aagent.name}"
- agent_descriptions += f"\nAgent description: {aagent.description}"
- if 'task_examples' in aagent.__dict__ and aagent.task_examples:
- agent_descriptions += f"\nTasks that agent can solve:"
- for idx, task_example in enumerate(aagent.task_examples, start=1):
- agent_descriptions += f"\n{idx}. {task_example}" # Numbering each task example on a new line
- agent_descriptions += "\n"
-
- def get_prompt():
-
-
- return f"""
-You are an AI assistant who makes a step-by-step plan to solve a complicated problem with the help of external agents.
-For each step, make one task followed by one agent-call.
-Each step denoted by #S1, #S2, #S3 ... can be referred to in later steps as a dependency.
-
-Each step must contain Task, Agent, Dependency and ExpectedOutput.
-1. **Task**: A detailed description of what needs to be done in this step. It should include all necessary details and requirements.
-2. **Agent**: The external agent to be used for solving this task. Agent needs to be selected from the available agents.
-3. **Dependency**: A list of previous steps (denoted as `#S1`, `#S2`, etc.) that this step depends on. If no previous steps are required, use `None`.
-4. **ExpectedOutput**: The anticipated result from the agent's execution.
-
-## Output Format (Replace '<...>') ##
-
-## Step 1
-#Task1:
-#Agent1:
-#Dependency1: None
-#ExpectedOutput1:
-
-## Step 2
-#Task2:
-#Agent2:
-#Dependency2: []
-#ExpectedOutput2:
-
-And so on...
-
-Here are the available agents:
-{agent_descriptions}
-
-You are going to solve the following complicated problem:
-{task.description}
-
-Guidelines:
-- Task should be something that can be solved by the agent. Task needs to be clear and unambiguous and contain all the information needed to solve it.
-- A plan usually contains fewer than 5 steps.
-- Only output the generated plan, do not output any other text.
-
-Output (your generated plan):
-"""
-
- prompt = get_prompt()
- logger.info(f"Planning Prompt: \n{prompt}")
- llm_response = watsonx_llm(prompt, model_id=self.llm)['generated_text']
- logger.info(f"Plan: \n{llm_response}")
-
- self.memory = []
-
- task_pattern = r"#Task\d+: (.+)"
- agent_pattern = r"#Agent\d+: (.+)"
- dependency_pattern = r"#Dependency\d+: (.+)"
- output_pattern = r"#ExpectedOutput\d+: (.+)"
-
- tasks = re.findall(task_pattern, llm_response)
- agents = re.findall(agent_pattern, llm_response)
- dependencies = re.findall(dependency_pattern, llm_response)
- outputs = re.findall(output_pattern, llm_response)
-
- if save_plan:
- if not saved_plan_filename.endswith('.txt'):
- saved_plan_filename += '.txt'
-
- saved_plan_text = f'Question: {task.description}\nPlan:\n{llm_response}'
- with open(saved_plan_filename, 'w') as f:
- f.write(saved_plan_text)
-
- planned_tasks = []
- for i in range(len(tasks)):
- task_description = tasks[i]
- if i == len(agents):
- break
- agent_name = agents[i]
- if i < len(dependencies):
- dependency = dependencies[i]
- else:
- dependency = "None"
- if i < len(outputs):
- expected_output = outputs[i]
- else:
- expected_output = ""
-
- # identify the agent
- selected_agent = None
- for agent in task.agents:
- if agent.name == agent_name:
- selected_agent = agent
- break
-
- if selected_agent is None:
- # raise ValueError(f"Agent {agent_name} not found in the task.agents")
- selected_agent = task.agents[0]
-
- # identify the dependency
- if dependency != "None":
- numbers = re.findall(r"#S(\d+)", dependency)
- numbers = list(map(int, numbers))
- context = [planned_tasks[i - 1] for i in numbers]
- else:
- context = []
-
- a_task = Task(
- description=task_description,
- expected_output=expected_output,
- agents=[selected_agent],
- context=context,
- )
- planned_tasks.append(a_task)
-
- return planned_tasks
diff --git a/src/tmp/agent_hive/workflows/planning_review.py b/src/tmp/agent_hive/workflows/planning_review.py
deleted file mode 100644
index efe1de35..00000000
--- a/src/tmp/agent_hive/workflows/planning_review.py
+++ /dev/null
@@ -1,242 +0,0 @@
-from agent_hive.task import Task
-from pydantic import Field
-from typing import List
-from agent_hive.enum import ContextType
-import json
-from agent_hive.workflows.base_workflow import Workflow
-from reactxen.utils.model_inference import watsonx_llm
-import re
-from agent_hive.workflows.sequential import SequentialWorkflow
-from agent_hive.agents.plan_reviewer_agent import PlanReviewerAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-class PlanningReviewWorkflow(Workflow):
- """
- This class represents a planning review workflow, where the (parent) task is decomposed into a list of subtasks.
- The workflow is designed to review the planning process and ensure that the plan is correct and feasible
- and that all subtasks are correctly defined and executed.
-
- Example:
- agent1 = ...
- agent2 = ...
-
- task = Task(..., agents=[agent1, agent2], ...)
-
- workflow = PlanningReviewWorkflow(tasks=[task], ...)
- workflow.run()
- """
-
- llm: str = Field(description="LLM used by the task planning.")
-
- def __init__(self, tasks: List[Task], llm: str):
- self.tasks = tasks
- self.memory = []
- self.max_memory = 10
- self.llm = llm
- self.max_retries = 5
- self._verify_tasks()
-
- def _verify_tasks(self):
- if not isinstance(self.tasks, list):
- raise ValueError("tasks must be a list of Task objects")
- if len(self.tasks) != 1:
- raise ValueError("Planning only supports one task")
- task = self.tasks[0]
- if task.agents is None or len(task.agents) < 1:
- raise ValueError("Task must have at least one agent")
-
- def run(self, enable_summarization=False):
- generated_steps = self.generate_steps()
-
- if enable_summarization:
- from agent_hive.agents.summarization_agent import SummarizationAgent
-
- summarization_task = Task(
- description=self.tasks[0].description,
- expected_output=self.tasks[0].expected_output,
- agents=[SummarizationAgent(llm=self.llm)],
- context=generated_steps[:],
- )
- generated_steps.append(summarization_task)
-
- sequential_workflow = SequentialWorkflow(
- tasks=generated_steps, context_type=ContextType.SELECTED
- )
-
- return sequential_workflow.run()
-
- def generate_steps(self, save_plan=False, saved_plan_filename=""):
- task = self.tasks[0]
- agent_descriptions = ""
-
- for ii, aagent in enumerate(task.agents):
- agent_descriptions += f"\n({ii + 1}) Agent name: {aagent.name}"
- agent_descriptions += f"\nAgent description: {aagent.description}"
- if "task_examples" in aagent.__dict__ and aagent.task_examples:
- agent_descriptions += "\nTasks that agent can solve:"
- for idx, task_example in enumerate(aagent.task_examples, start=1):
- agent_descriptions += f"\n{idx}. {task_example}" # Numbering each task example on a new line
- agent_descriptions += "\n"
-
- retry = 0
- final_plan = ""
- prev_plan = ""
- prev_review = ""
- while retry < self.max_retries:
- try:
- prompt = self.get_prompt(task.description, agent_descriptions, prev_plan, prev_review)
- logger.info(f"Plan Generation Prompt: \n{prompt}")
- llm_response = watsonx_llm(
- prompt, model_id=self.llm,
- )["generated_text"]
- logger.info(f"Plan {retry + 1}: \n{llm_response}")
-
- plan_reviewer_agent = PlanReviewerAgent(llm=self.llm)
- review = plan_reviewer_agent.execute_task(
- question=task.description,
- agent_descriptions=agent_descriptions,
- plan=llm_response,
- )
- prev_review = review
- prev_plan = llm_response
- logger.info(f"Plan Review: \n{review}")
- if review["status"].lower() == "valid":
- logger.info(f"Plan {retry + 1} is valid.")
- final_plan = llm_response
- break
- else:
- logger.info(f"Plan {retry + 1} is invalid.")
- retry += 1
- except Exception as e:
- logger.warning(f"Error during plan review: {e}. Retrying...")
- retry += 1
-
- if final_plan == "":
- logger.info(
- "No valid plan found after multiple retries. Use the plan from the last retry."
- )
- final_plan = prev_plan
-
- self.memory = []
-
- task_pattern = r"#Task\d+: (.+)"
- agent_pattern = r"#Agent\d+: (.+)"
- dependency_pattern = r"#Dependency\d+: (.+)"
- output_pattern = r"#ExpectedOutput\d+: (.+)"
-
- tasks = re.findall(task_pattern, final_plan)
- agents = re.findall(agent_pattern, final_plan)
- dependencies = re.findall(dependency_pattern, final_plan)
- outputs = re.findall(output_pattern, final_plan)
-
- if save_plan:
- if not saved_plan_filename.endswith(".txt"):
- saved_plan_filename += ".txt"
-
- saved_plan_text = f"Question: {task.description}\nPlan:\n{final_plan}"
- with open(saved_plan_filename, "w") as f:
- f.write(saved_plan_text)
-
- planned_tasks = []
- for i in range(len(tasks)):
- task_description = tasks[i]
- if i == len(agents):
- break
- agent_name = agents[i]
- if i < len(dependencies):
- dependency = dependencies[i]
- else:
- dependency = "None"
- if i < len(outputs):
- expected_output = outputs[i]
- else:
- expected_output = ""
-
- # identify the agent
- selected_agent = None
- for agent in task.agents:
- if agent.name == agent_name:
- selected_agent = agent
- break
-
- if selected_agent is None:
- # raise ValueError(f"Agent {agent_name} not found in the task.agents")
- selected_agent = task.agents[0]
-
- # identify the dependency
- if dependency != "None":
- numbers = re.findall(r"#S(\d+)", dependency)
- numbers = list(map(int, numbers))
- context = [planned_tasks[i - 1] for i in numbers]
- else:
- context = []
-
- a_task = Task(
- description=task_description,
- expected_output=expected_output,
- agents=[selected_agent],
- context=context,
- )
- planned_tasks.append(a_task)
-
- return planned_tasks
-
-
- def get_prompt(self, task_description, agent_descriptions, prev_plan, prev_review):
- invalid_plan_description = ''
- if prev_plan:
- invalid_reason = prev_review['reasoning']
- invalid_suggestions = prev_review['suggestions']
- invalid_plan_description = f'''
-## Here is one invalid plan, please learn from it and do not repeat its mistakes: ##
-Invalid plan:
-{prev_plan}
-Reason why this plan is invalid: {invalid_reason}
-Suggestion for improvement: {invalid_suggestions}
-'''
- prompt = f"""
-You are an AI assistant who makes a step-by-step plan to solve a complicated problem with the help of external agents.
-For each step, make one task followed by one agent-call.
-Each step denoted by #S1, #S2, #S3 ... can be referred to in later steps as a dependency.
-
-Each step must contain Task, Agent, Dependency and ExpectedOutput.
-1. **Task**: A detailed description of what needs to be done in this step. It should include all necessary details and requirements.
-2. **Agent**: The external agent to be used for solving this task. Agent needs to be selected from the available agents.
-3. **Dependency**: A list of previous steps (denoted as `#S1`, `#S2`, etc.) that this step depends on. If no previous steps are required, use `None`.
-4. **ExpectedOutput**: The anticipated result from the agent's execution.
-
-## Output Format (Replace '<...>') ##
-
-## Step 1
-#Task1:
-#Agent1:
-#Dependency1: None
-#ExpectedOutput1:
-
-## Step 2
-#Task2:
-#Agent2:
-#Dependency2: []
-#ExpectedOutput2:
-
-And so on...
-
-## Here are the available agents: ##
-{agent_descriptions}
-
-## You are going to solve the following complicated problem: ##
-{task_description}
-
-## Guidelines: ##
-- Task should be something that can be solved by the agent. Task needs to be clear and unambiguous and contain all the information needed to solve it.
-- A plan usually contains less than 5 steps.
-- Only output the generated plan, do not output any other text.
-
-{invalid_plan_description}
-
-Output (your generated plan):
-"""
- return prompt
diff --git a/src/tmp/agent_hive/workflows/sequential.py b/src/tmp/agent_hive/workflows/sequential.py
deleted file mode 100644
index 2326bf49..00000000
--- a/src/tmp/agent_hive/workflows/sequential.py
+++ /dev/null
@@ -1,120 +0,0 @@
-import json
-from typing import List
-
-from pydantic import Field
-
-from agent_hive.enum import ContextType
-from agent_hive.task import Task
-from agent_hive.workflows.base_workflow import Workflow
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-
-class SequentialWorkflow(Workflow):
- """
- This class represents a sequential agentic workflow, where each task is executed in order and each task is
- assigned to a specific agent.
-
- Example:
- agent1 = ...
- agent2 = ...
-
- task1 = Task(..., agents=[agent1], ...)
- task2 = Task(..., agents=[agent2], ...)
-
- workflow = SequentialWorkflow(tasks=[task1, task2], ...)
- workflow.run()
-
- """
-
- context_type: ContextType = Field(
- default=ContextType.DISABLED, description="Type of context to use."
- )
-
- def __init__(
- self, tasks: List[Task], context_type: ContextType = ContextType.DISABLED
- ):
- self.tasks = tasks
- self.context_type = context_type
- self.memory = []
- self.max_memory = 10
- self._verify_tasks()
-
- def _verify_tasks(self):
- if not isinstance(self.tasks, list):
- raise ValueError("tasks must be a list of Task objects")
- for i, task in enumerate(self.tasks):
- if task.agents is None or len(task.agents) == 0:
- raise ValueError("Task must have at least one agent")
- if len(task.agents) > 1:
- raise NotImplementedError(
- "SequentialWorkflow only supports one agent per task"
- )
- if self.context_type == ContextType.SELECTED:
- if isinstance(task.context, list):
- for context_task in task.context:
- if context_task not in self.tasks[:i]:
- raise ValueError(
- "task.context must be a list of Task objects that are part of the workflow"
- )
-
- def run(self):
- self.memory = []
- for i, task in enumerate(self.tasks):
- task_no = i + 1
- logger.info(f"Task {task_no}: {task.description}")
- assigned_agent = task.agents[0]
-
- if self.context_type == ContextType.DISABLED:
- user_input = task.description
- response = assigned_agent.execute_task(user_input)
-
- elif self.context_type == ContextType.ALL:
- context = "\n".join(self.memory[-self.max_memory :])
- user_input = f"{task.description}\n\nContext:\n{context}"
- response = assigned_agent.execute_task(user_input)
-
- elif self.context_type == ContextType.PREVIOUS:
- context = self.memory[-1] if self.memory else ""  # first task has no previous response
- user_input = f"{task.description}\n\nContext:\n{context}"
- response = assigned_agent.execute_task(user_input)
-
- elif self.context_type == ContextType.SELECTED:
- context_tasks = task.context
- context = ""
- if context_tasks and len(context_tasks) > 0:
- for context_task in context_tasks:
- idx = self.tasks.index(context_task)
- if idx >= len(self.memory):
- raise IndexError(
- f"Context task {context_task.description} not found in memory"
- )
- context += self.memory[idx] + "\n"
- user_input = f"{task.description}\n\nContext:\n{context}"
- else:
- user_input = f"{task.description}\n"
- response = assigned_agent.execute_task(user_input)
-
- else:
- raise ValueError(f"Invalid context_type: {self.context_type}")
-
- response = response.split("Final Answer:")[0].strip()
- self.memory.append(response)
-
- history = self.generate_history()
- print(json.dumps(history, indent=4))
- return history
-
- def generate_history(self):
- history = []
- for i, task in enumerate(self.tasks):
- history.append(
- {
- "task_number": i + 1,
- "task_description": task.description,
- "agent_name": task.agents[0].name,
- "response": self.memory[i],
- }
- )
- return history
diff --git a/src/tmp/agent_hive/workflows/task_decomposition.py b/src/tmp/agent_hive/workflows/task_decomposition.py
deleted file mode 100644
index c3a487d0..00000000
--- a/src/tmp/agent_hive/workflows/task_decomposition.py
+++ /dev/null
@@ -1,127 +0,0 @@
-from agent_hive.task import Task
-from pydantic import Field
-from typing import List
-from agent_hive.enum import ContextType
-import json
-from reactxen.utils.model_inference import watsonx_llm
-from agent_hive.utils import json_parser
-from agent_hive.logger import logger
-
-
-class TaskDecompositionWorkflow:
- """
- This class represents a task decomposition workflow, where each task is decomposed into multiple subtasks
- by LLM-based task decomposition. The task decomposition is greedy, i.e., the LLM is asked to generate
- the next subtask, which is then executed, and this repeats until the task is completely done.
-
- Example:
- agent1 = ...
- agent2 = ...
-
- task = Task(..., agents=[agent1, agent2], ...)
-
- workflow = TaskDecompositionWorkflow(tasks=[task], ...)
- workflow.run()
- """
-
- llm: str = Field(description="LLM used by the task decomposition.")
-
- def __init__(self, tasks: List[Task], llm: str):
- self.tasks = tasks
- self.memory = []
- self.max_memory = 10
- self.llm = llm
- self._verify_tasks()
-
- def _verify_tasks(self):
- if not isinstance(self.tasks, list):
- raise ValueError("tasks must be a list of Task objects")
- if len(self.tasks) != 1:
- raise ValueError("TaskDecompositionWorkflow only supports one task")
- task = self.tasks[0]
- if task.agents is None or len(task.agents) < 1:
- raise ValueError("Task must have at least one agent")
-
- def run(self):
- self.memory = []
- the_task = self.tasks[0]
- history = []
- i = 0
- while True:
- response = self.decompose(the_task)
-
- if isinstance(response, list) and response[0] == 'respond_to_user':
- logger.info(f"Final Answer: {response[1]}")
- break
-
- history.append(
- {
- 'task_number': i + 1,
- 'task_description': response[1],
- 'agent_name': response[0],
- 'response': response[2]
- }
- )
- i += 1
-
- print(json.dumps(history, indent=4))
- return history
-
- def decompose(self, task: Task):
- self.memory = self.memory[-self.max_memory:]
- context = "\n".join(self.memory)
- response_format = {"action_agent": "selected_agent", "action_item": "describe_the_next_step"}
- last_response_format = {"respond_to_user": "final_answer"}
-
- def get_prompt():
- return f"""
-You are going to solve the following task:
-{task.description}
-
-The expected output is:
-{task.expected_output}
-
-Use the context from memory to plan next steps.
-Context:
-{context}
-
-You will use the context provided and the user's input to classify the intent and select the appropriate agent that executes the next step.
-You need to describe the next step for the selected agent so that the agent can efficiently execute it.
-
-Here are the available agents:
-{", ".join([f"- {aagent.name}" for aagent in task.agents])}
-
-
-###Guidelines###
-- The original task could require multiple steps, you will use the context to understand the previous actions taken and the next steps you should take.
-- You will respond with the next action in the form of {response_format}.
-- If there are no actions to be taken, you will respond in the form of {last_response_format} with your final answer combining all previous responses as input.
-- Respond with "respond_to_user" only when your final answer meets the expected output, or there are no agents to select from or there is no next action.
-- Always return valid JSON and nothing else.
-
-Output:
-"""
-
- prompt = get_prompt()
- logger.info(f"Task Decomposition Prompt: \n{prompt}")
- llm_response = watsonx_llm(prompt, model_id=self.llm)['generated_text']
- logger.info(f"Next Step: {llm_response}")
-
- llm_response = json_parser(llm_response)
- logger.info(f"Next Step (formatted): {llm_response}")
-
- if isinstance(llm_response, dict) and 'respond_to_user' in llm_response:
- return ['respond_to_user', llm_response['respond_to_user']]
-
- elif isinstance(llm_response, dict) and 'action_agent' in llm_response and 'action_item' in llm_response:
- action_agent = llm_response["action_agent"]
- action_item = llm_response["action_item"]
-
- for agent in task.agents:
- if agent.name == action_agent:
- logger.info(f"Found agent: {agent.name}. Action item: {action_item}")
- agent_response = agent.execute_task(action_item)
- logger.info(f"Agent response: {agent_response}")
-
- self.memory.append(agent_response)
- return [agent.name, action_item, agent_response]
diff --git a/src/tmp/agent_hive/workflows/track1_fact_sheet.json b/src/tmp/agent_hive/workflows/track1_fact_sheet.json
deleted file mode 100644
index 6f2a4755..00000000
--- a/src/tmp/agent_hive/workflows/track1_fact_sheet.json
+++ /dev/null
@@ -1,3 +0,0 @@
-{
- "Track": "Task Planning"
- }
\ No newline at end of file
diff --git a/src/tmp/agent_hive/workflows/track1_planning.py b/src/tmp/agent_hive/workflows/track1_planning.py
deleted file mode 100644
index 61b5718e..00000000
--- a/src/tmp/agent_hive/workflows/track1_planning.py
+++ /dev/null
@@ -1,214 +0,0 @@
-from agent_hive.task import Task
-from pydantic import Field
-from typing import List
-from agent_hive.enum import ContextType
-import json
-from agent_hive.workflows.base_workflow import Workflow
-from reactxen.utils.model_inference import watsonx_llm
-import re
-from agent_hive.workflows.sequential import SequentialWorkflow
-from agent_hive.agents.plan_reviewer_agent import PlanReviewerAgent
-from agent_hive.logger import get_custom_logger
-
-logger = get_custom_logger(__name__)
-
-# =========================================================
-# TODO: Participants can edit this section ONLY
-# Add variables or dicts only. No additional imports; inline code only.
-# =========================================================
-# END OF EDITABLE SECTION
-
-
-class NewPlanningWorkflow(Workflow):
- """
- Participant Template for Planning Review Workflow.
- ---------------------------------------------------
- 📝 Instructions for participants:
- - Only modify the section marked with "TODO: Edit prompt here"
- - Do NOT change any workflow logic, agents, or execution components
- - Keep all retry, memory, and sequential execution intact
- """
-
- llm: str = Field(description="LLM used by the task planning.")
-
- def __init__(self, tasks: List[Task], llm: str):
- self.tasks = tasks
- self.memory = []
- self.max_memory = 10
- self.llm = llm
- self.max_retries = 5
- self._verify_tasks()
-
- def _verify_tasks(self):
- if not isinstance(self.tasks, list):
- raise ValueError("tasks must be a list of Task objects")
- if len(self.tasks) != 1:
- raise ValueError("Planning only supports one task")
- task = self.tasks[0]
- if task.agents is None or len(task.agents) < 1:
- raise ValueError("Task must have at least one agent")
-
- def run(self, enable_summarization=False):
- generated_steps = self.generate_steps()
-
- sequential_workflow = SequentialWorkflow(
- tasks=generated_steps, context_type=ContextType.SELECTED
- )
-
- return sequential_workflow.run()
-
- def generate_steps(self, save_plan=False, saved_plan_filename=""):
- task = self.tasks[0]
- agent_descriptions = ""
-
- # =========================================================
- # TODO: Participants can edit this section ONLY
- # 🎨 Purpose: Customize how agent information is collected and formatted
- # ✅ Allowed:
- # - Change numbering style or bullet points
- # - Include additional metadata (e.g., agent capabilities, tags)
- # - Provide examples in a different format
- # - Add emojis or formatting to make the prompt clearer
- # - More thinking
- # ❌ Not allowed:
- # - Modify workflow execution
- # - Replace the base ReAct agent or Executor
- # - Change memory or retry logic
- # =========================================================
-
- for ii, aagent in enumerate(task.agents):
- agent_descriptions += f"\n({ii + 1}) Agent name: {aagent.name}"
- agent_descriptions += f"\nAgent description: {aagent.description}"
- if "task_examples" in aagent.__dict__ and aagent.task_examples:
- agent_descriptions += "\nTasks that agent can solve:"
- for idx, task_example in enumerate(aagent.task_examples, start=1):
- agent_descriptions += f"\n{idx}. {task_example}"
- agent_descriptions += "\n"
-
- # =========================================================
- # END OF EDITABLE SECTION
- # 🚫 Participants should not modify code below this line
- # ❌ No new variables, functions, or workflow logic allowed
- # ✅ Only modify the section marked as TODO above
- # =========================================================
-
- prompt = self.get_prompt(task.description, agent_descriptions)
- logger.info(f"Plan Generation Prompt: \n{prompt}")
- llm_response = watsonx_llm(
- prompt, model_id=self.llm,
- )["generated_text"]
- logger.info(f"Plan: \n{llm_response}")
-
- final_plan = llm_response
-
- # =========================================================
- # TODO: Participants can edit this section ONLY
- # 🎨 Purpose: Customize LLM response post-processing
- # ❌ Not allowed:
- # - Modify workflow execution
- # - Replace the base ReAct agent or Executor or Task
- # - Change memory or retry logic
- # =========================================================
-
- self.memory = []
-
- task_pattern = r"#Task\d+: (.+)"
- agent_pattern = r"#Agent\d+: (.+)"
- dependency_pattern = r"#Dependency\d+: (.+)"
- output_pattern = r"#ExpectedOutput\d+: (.+)"
-
- tasks = re.findall(task_pattern, final_plan)
- agents = re.findall(agent_pattern, final_plan)
- dependencies = re.findall(dependency_pattern, final_plan)
- outputs = re.findall(output_pattern, final_plan)
-
- if save_plan:
- if not saved_plan_filename.endswith(".txt"):
- saved_plan_filename += ".txt"
-
- saved_plan_text = f"Question: {task.description}\nPlan:\n{final_plan}"
- with open(saved_plan_filename, "w") as f:
- f.write(saved_plan_text)
-
- planned_tasks = []
- for i in range(len(tasks)):
- task_description = tasks[i]
- if i == len(agents):
- break
- agent_name = agents[i]
- if i < len(dependencies):
- dependency = dependencies[i]
- else:
- dependency = "None"
- if i < len(outputs):
- expected_output = outputs[i]
- else:
- expected_output = ""
-
- selected_agent = None
- for agent in task.agents:
- if agent.name == agent_name:
- selected_agent = agent
- break
- if selected_agent is None:
- selected_agent = task.agents[0]
-
- if dependency != "None":
- numbers = re.findall(r"#S(\d+)", dependency)
- numbers = list(map(int, numbers))
- context = [planned_tasks[i - 1] for i in numbers]
- else:
- context = []
-
- a_task = Task(
- description=task_description,
- expected_output=expected_output,
- agents=[selected_agent],
- context=context,
- )
- planned_tasks.append(a_task)
-
- logger.info(f"Planned Tasks: \n{planned_tasks}")
-
- # =========================================================
- # END OF EDITABLE SECTION
- # =========================================================
-
- return planned_tasks
-
- def get_prompt(self, task_description, agent_descriptions):
- # =========================================================
- # TODO: Participants can edit this section ONLY
- # 🎨 Purpose: Improve prompt clarity, formatting, emojis, guidance
- # ✅ Allowed: Wording, structure, examples, emojis
- # ❌ Not allowed: Changing workflow, ReAct agent, Executor, or memory logic
- # =========================================================
-
- prompt = f"""
-🚀 You are an AI assistant tasked with creating a step-by-step plan to solve a complex problem using the external agents provided.
-
-⚠️ Constraints:
-- Only use the agents listed below. No new agents may be added.
-- The base ReAct agent and Executor component are fixed. Do not change them.
-- Produce a plan with fewer than 5 steps.
-- Include Task, Agent, Dependency, and ExpectedOutput for each step.
-- Make instructions clear, unambiguous, and actionable.
-
-Each step must follow this format:
-#Task:
-#Agent:
-#Dependency: