This PR adds a new Wazuh integration for Wazuh decoder rule generation tool #79

Open
Hasitha9796 wants to merge 66 commits into wazuh:main from Hasitha9796:main

Member

Hasitha9796 commented May 1, 2026

Summary

This PR adds a new integration named wazuh_decoder_rule_tool — a FastAPI-based tool for analyzing logs, checking existing Wazuh decoder/rule matches through wazuh-logtest, and generating custom decoder and rule XML.

New Features

AI-Powered Generation (Hybrid Approach)

  • Hybrid architecture: programmatically generates correct Wazuh decoder XML, then uses an LLM to review and improve osregex patterns
  • Multiple AI providers: Ollama (local, no rate limits), DashScope (Qwen 3.6 Plus), and OpenRouter
  • wazuh-logtest integration: every AI generation first checks wazuh-logtest to determine:
    • Whether a custom decoder is needed at all
    • The correct parent strategy (<program_name> when available, <prematch> otherwise)
    • Which fields are already decoded by built-in decoders (skipped automatically)
  • Priority fallback: Ollama > DashScope > OpenRouter
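
The priority fallback above could be sketched roughly as follows. This is not the PR's code: `select_provider` and the `DASHSCOPE_API_KEY` / `OPENROUTER_API_KEY` variable names are assumptions made for illustration; only `OLLAMA_BASE_URL` appears in this PR.

```python
from typing import Mapping, Optional

# Priority order stated in the PR description: Ollama > DashScope > OpenRouter.
# Only OLLAMA_BASE_URL is taken from the PR; the other two names are assumed.
PROVIDER_ENV_VARS = [
    ("ollama", "OLLAMA_BASE_URL"),
    ("dashscope", "DASHSCOPE_API_KEY"),    # hypothetical variable name
    ("openrouter", "OPENROUTER_API_KEY"),  # hypothetical variable name
]


def select_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the first configured provider in priority order, or None."""
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var, "").strip():
            return provider
    return None


print(select_provider({"OPENROUTER_API_KEY": "example-key"}))
```

Passing the environment as a mapping keeps the selection logic testable without touching `os.environ`.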

Enhanced ML Decoder Similarity

  • Ensemble model combining TF-IDF (exact token matching) + SBERT (semantic similarity)
  • Configurable weighting (default: 40% TF-IDF, 60% SBERT)
  • Enhanced tokenization preserving regex patterns
  • Backward compatible with existing TF-IDF fallback
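
As a rough illustration of the configurable weighting, a linear blend of the two similarity scores might look like this. The function name and the assumption that both scores lie in [0, 1] are mine; the PR's ensemble may combine scores differently.

```python
def ensemble_score(tfidf: float, sbert: float, tfidf_weight: float = 0.4) -> float:
    """Blend two similarity scores; the SBERT weight is the complement."""
    if not 0.0 <= tfidf_weight <= 1.0:
        raise ValueError("tfidf_weight must be in [0, 1]")
    return tfidf_weight * tfidf + (1.0 - tfidf_weight) * sbert


# Default 40% TF-IDF / 60% SBERT weighting from the PR description.
print(ensemble_score(0.8, 0.5))
# Setting tfidf_weight=1.0 reduces to a TF-IDF-only score, which is one way
# a TF-IDF fallback could be expressed when SBERT is unavailable.
print(ensemble_score(0.8, 0.0, tfidf_weight=1.0))
```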

Improved Decoder Generation

  • Split decoders: one child decoder per field for better accuracy
  • Robust prefix generalization (timestamps, IPs, MAC addresses, PIDs)
  • CEF (Common Event Format) log support with field mapping
  • Per-field validation explaining which fields will/won't be decoded
  • Multiple log type handlers: syslog, JSON, key=value, bracketed, Java dash, Android, Palo Alto CSV
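
To make the prefix generalization idea concrete, here is a minimal sketch that replaces every run of digits with `\d+`, which covers timestamps, IPs, and PIDs in one pass. It is a simplified stand-in, not the PR's `generalize_prefix_text`, and it does not handle MAC addresses (which contain hex letters) or any OS_Regex escaping.

```python
import re

_DIGIT_RUN = re.compile(r"\d+")


def generalize_digits(prefix: str) -> str:
    r"""Replace every run of digits with the token \d+."""
    # A callable replacement avoids backslash processing in re.sub.
    return _DIGIT_RUN.sub(lambda _match: r"\d+", prefix)


print(generalize_digits("myapp[1234]: from 192.168.1.100"))
```

Dots and square brackets are left untouched here, in line with the commit messages in this PR that treat them as literal characters in Wazuh patterns.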

Robustness & Reliability

  • Timeouts on all git subprocess calls (clone, pull, sparse-checkout) to prevent startup hangs
  • Proper Wazuh OS_Regex validation (no PCRE patterns, correct \. vs . semantics)
  • Non-blocking SSH with configurable timeouts
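
A subprocess timeout of the kind described for the git calls can be sketched with `subprocess.run(..., timeout=...)`. The helper name is hypothetical, and the Python interpreter stands in for `git` so the example runs anywhere.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_timeout(
    cmd: Sequence[str], timeout_s: float
) -> Optional[subprocess.CompletedProcess]:
    """Run cmd; return None instead of hanging if it exceeds timeout_s."""
    try:
        return subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None


# Stand-in for a call such as ["git", "pull"]; the command is illustrative.
result = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
print(result.stdout.strip() if result else "timed out")
```

Returning `None` on timeout lets startup code fall back to a cached checkout rather than blocking; whether the PR does exactly that is not stated.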

Included

  • FastAPI backend with streaming AI responses
  • Single-page HTML/JS UI with decoder analysis, rule generation, AI generation, and testing
  • Log analysis using heuristics with regex generation engine for Wazuh OS_Regex compatibility
  • wazuh-logtest validation (local or remote via SSH)
  • ML-based decoder similarity (TF-IDF + optional SBERT ensemble)
  • Rule ML model trained from wazuh-ruleset
  • Per-field feedback collection for continuous improvement
  • README with comprehensive setup instructions including AI provider configuration

Testing

The app can be tested locally:

  1. Set up the virtual environment:
cd integrations/wazuh_decoder_rule_tool
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Generate SSL certificates:
mkdir -p certs
openssl req -x509 -newkey rsa:4096 -keyout certs/localhost.key -out certs/localhost.crt -days 365 -nodes -subj "/CN=localhost"
  3. Start the application (with AI):
export OLLAMA_BASE_URL=http://localhost:11434/v1
export OLLAMA_MODEL=llama3.2:3b
uvicorn app.main:app --host 0.0.0.0 --port 8443 --ssl-certfile certs/localhost.crt --ssl-keyfile certs/localhost.key

Access the application via https://localhost:8443.

Connecting to Wazuh VM for wazuh-logtest

export WAZUH_SSH_HOST=192.168.56.10
export WAZUH_SSH_PORT=22
export WAZUH_SSH_USER=vagrant
export WAZUH_SSH_PASSWORD=vagrant

Example Scenario

  1. Paste a log like: May 19 12:34:56 custom-server myapp[1234]: User 'admin' failed to authenticate from IP 192.168.1.100 due to invalid_password
  2. Click Analyze to detect log type and extract fields
  3. Select fields to extract (e.g., user, srcip)
  4. Click Generate for programmatic decoder+rule generation
  5. Click AI Generate for AI-assisted pattern improvement

Hasitha9796 and others added 30 commits August 20, 2025 17:33
…tterns and full preceding words instead of truncating prefixes
…d generalize them to \d+ to prevent brittle anchors
…of rule conditions

- Add STATIC_FIELD_TAGS set (srcip, dstip, srcport, dstport, protocol,
  action, id, url, data, extra_data, status, system_name, user, hostname,
  program_name) — known Wazuh tags rendered as direct XML children of <rule>
- Add _render_static_tags() to emit <tagname>value</tagname> instead of
  <field name="tagname">value</field> for static tags
- Add child_static_conditions to CandidateRequest, build_rule_xml,
  build_candidate, and derive_child_rule_conditions
- Enhance derive_child_rule_conditions() to use extract_fields, field_hints,
  and parsed_logtest_fields for smarter auto-detection:
  - field_hints take priority and lock fields to prevent override
  - Explicit patterns parsed: 'field X is Y', 'X equals Y'
  - extract_fields guide which field names to look for
  - IP addresses skipped as too specific for conditions
- Update RulePattern and parse_rule_file in decoder_ml.py to extract
  static field tags from real Wazuh rules for ML training
- Include static_conditions in ML rule suggestions and feature_text
- Update AI prompt with static field tag rules (prefer over <field name="">)
- Add Static Field Tags UI section in rule form with add/remove rows
- Update readRulePayload() to collect child_static_conditions
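
The distinction between static tags and generic `<field>` conditions described in this commit could be sketched as below. `render_condition` is a hypothetical helper, not the PR's `_render_static_tags()`, and values are assumed to be XML-safe already.

```python
# Tag names copied from the commit message above.
STATIC_FIELD_TAGS = frozenset({
    "srcip", "dstip", "srcport", "dstport", "protocol", "action", "id",
    "url", "data", "extra_data", "status", "system_name", "user",
    "hostname", "program_name",
})


def render_condition(name: str, value: str) -> str:
    """Render one rule condition as a static tag or a generic <field>."""
    if name in STATIC_FIELD_TAGS:
        return f"<{name}>{value}</{name}>"
    return f'<field name="{name}">{value}</field>'


print(render_condition("srcip", "10.10.50.24"))
print(render_condition("function_code", "0x10"))
```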
…on cleaning

- Add child_only parameter to build_rule_xml() — when True, emits a
  single <rule> with if_sid=parent_rule_id instead of parent+child pair
- Fix build_candidate() to use child_only=True when user sets parent_rule_id
- Improve clean_rule_description() to extract 'use description as X',
  'description should be X', and 'create alert using NNNN parent rule
  by matching X' patterns
- Fix clean_rule_description to handle 'use the description as X' pattern
- Skip <regex> in child_only mode — child rules extending a parent should
  use <match> / <field> / static tags only
- Filter meta fields (program, program_name, hostname, decoder_name)
  from auto-detected conditions unless user explicitly mentions them
- Add 'use' to stopwords to prevent false <match> from 'User' in body
- Use \b word boundary check for match conditions (no substring-inside-word)
- Exclude description words from match condition auto-detection
- Pass clean_description to derive_child_rule_conditions
- Add rule_description field to CandidateRequest (overrides auto-detected)
- Add rule_description input in rule form below rule requirement
- Update readRulePayload() to send rule_description
- build_candidate() uses rule_description when provided
- child_rule is no longer auto-created from rule_requirement alone
- Only build child_rule when parent_rule_id is set or user provides
  explicit child_field_conditions / child_match_conditions / child_static_conditions
- Add description parameter to build_rule_xml() for parent rule
- Derive parent rule description from requirement when no child rule exists
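
The word-boundary check mentioned in this commit can be illustrated with a small sketch. The helper name and the case-insensitive comparison are assumptions; the PR's condition auto-detection is more involved.

```python
import re


def has_word(word: str, body: str) -> bool:
    """True if word occurs in body as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, body, flags=re.IGNORECASE) is not None


body = "User 'admin' failed to authenticate"
print(has_word("use", body))     # 'use' only appears inside 'User'
print(has_word("failed", body))  # standalone word
```

A plain substring test would report `use` as present because of `User`, which is the false `<match>` the commit describes.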
… split regex Path 2

- escape_xml: only escape & and < (required by XML spec). html.escape
  also escapes > which breaks Wazuh regex patterns using -> arrow
  notation (e.g. \.+\s->\s(\d+.\d+.\d+.\d+))
- build_split_regexes_from_fields Path 2: use the raw prefix before the
  value in target_text and pass it through generalize_prefix_text.
  Previously the regex '\b(\w+\s*[=:]\s*)' would falsely match '4:' as
  a key:value separator in '1.2.3.4:1234' because \b matched between .
  and 4, producing \.+4:(\d+). Now it generalizes to
  \.+\d+.\d+.\d+.\d+:(\d+)
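
Both fixes in this commit can be reproduced in a few lines. The `escape_xml` below is a minimal sketch consistent with the description (escape only `&` and `<`), not necessarily the PR's exact implementation.

```python
import html
import re


def escape_xml(text: str) -> str:
    """Escape only & and <, leaving > untouched for patterns such as '->'."""
    # & is replaced first so the & in &lt; is not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")


pattern = r"\.+\s->\s(\d+.\d+.\d+.\d+)"
print(escape_xml(pattern))   # unchanged: no & or < in the pattern
print(html.escape(pattern))  # the arrow becomes -&gt;

# The old key/value regex quoted in the commit message, applied to ip:port.
old_kv = re.compile(r"\b(\w+\s*[=:]\s*)")
false_key = old_kv.search("1.2.3.4:1234").group(1)
print(false_key)  # the last octet plus ':' is mistaken for a key
```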
… fields

- osregex_escape now escapes [ and ] in both inner (generalize_prefix_text)
  and outer (build_split_regexes_from_fields) definitions. Without this,
  suricata signature IDs like [1:2010935:2] in the body would be rendered
  as Wazuh character class [\d+:\d+:\d+] instead of escaped \[\d+:\d+:\d+\].
- parse_logtest_output now extracts phase-2 decoded_fields (e.g. srcip,
  dstip, protocol decoded by built-in suricata decoder).
- analyze_logs_impl computes effective_extract_fields by removing fields
  already decoded by the built-in decoder, avoiding unnecessary decoders.
- Adds skipped_decoded_fields / logtest_decoded_fields to the analysis output.
…ation

- validate_individual_fields() explains why each requested field
  is 'decoded' (built-in), 'skipped' (syslog pre-decoded like
  timestamp/hostname), 'pending' (will be extracted by custom
  decoder), or 'warning' (value not found in body).
- PREDECODED_SYSLOG_FIELDS frozenset defines fields consumed
  during syslog pre-decoding that cannot be re-decoded.
- parse_logtest_output now extracts phase-2 decoded_fields from
  logtest stdout, so the system knows which fields a built-in
  decoder already handles (e.g. suricata extracts srcip, dstip,
  protocol).
- analyze_logs_impl computes effective_extract_fields by filtering
  out already-decoded fields, avoiding redundant decoders.
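
The four per-field statuses and the field filtering described here might be sketched as follows. The contents of `PREDECODED_SYSLOG_FIELDS`, the order of the checks, and both function names are assumptions; only the status labels come from the commit message.

```python
from typing import Iterable, List, Mapping

# Assumed contents: the commit names timestamp and hostname as examples.
PREDECODED_SYSLOG_FIELDS = frozenset({"timestamp", "hostname"})


def field_status(
    name: str, value: str, body: str, decoded_fields: Mapping[str, str]
) -> str:
    """Classify one requested field using the four statuses from the commit."""
    if name in decoded_fields:
        return "decoded"   # a built-in decoder already extracts it
    if name in PREDECODED_SYSLOG_FIELDS:
        return "skipped"   # consumed during syslog pre-decoding
    if value and value in body:
        return "pending"   # the custom decoder will extract it
    return "warning"       # value not found in the log body


def effective_extract_fields(
    requested: Iterable[str], decoded_fields: Mapping[str, str]
) -> List[str]:
    """Drop fields that a built-in decoder already handles."""
    return [name for name in requested if name not in decoded_fields]


decoded = {"srcip": "10.0.0.5"}
print(effective_extract_fields(["srcip", "user"], decoded))
```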
…ckets as literal

Wazuh's OS_REGEX does not use [ ] for character classes or { } for
quantifiers — they are literal characters. The previous commit's
escaping of [ to \[ was wrong and produced invalid Wazuh regex
patterns like \[\d+:\d+:\d+\] instead of the correct [\d+:\d+:\d+].
The result dict for parse_logtest_output was missing the 'decoded_fields'
key initialization. When phase 2 decoded fields were found, the code
tried result['decoded_fields'][fname] = fval which raised KeyError.
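
The KeyError fix amounts to initialising every key of the result dict before parsing. A minimal sketch follows, assuming the logtest report uses `**Phase N:` headers followed by indented `key: 'value'` lines; the sample text is hand-written for illustration and is not captured tool output, and the function is not the PR's `parse_logtest_output`.

```python
import re
from typing import Any, Dict

_PHASE = re.compile(r"^\*\*Phase (\d+):")
_FIELD = re.compile(r"^\s+([\w.]+):\s*'(.*)'\s*$")


def parse_decoded_fields(stdout: str) -> Dict[str, Any]:
    """Collect phase-2 fields from logtest-style text (layout is assumed)."""
    # Initialise every key up front so later writes cannot raise KeyError.
    result: Dict[str, Any] = {"decoder_name": None, "decoded_fields": {}}
    phase = 0
    for line in stdout.splitlines():
        header = _PHASE.match(line)
        if header:
            phase = int(header.group(1))
            continue
        field = _FIELD.match(line)
        if phase == 2 and field:
            key, value = field.group(1), field.group(2)
            if key == "name":
                result["decoder_name"] = value
            elif key != "parent":
                result["decoded_fields"][key] = value
    return result


# Hand-written sample mimicking the assumed layout.
SAMPLE = "\n".join([
    "**Phase 1: Completed pre-decoding.",
    "\tprogram_name: 'example'",
    "",
    "**Phase 2: Completed decoding.",
    "\tname: 'example-decoder'",
    "\tsrcip: '10.0.0.5'",
    "\tprotocol: 'TCP'",
    "",
    "**Phase 3: Completed filtering (rules).",
    "\tid: '100001'",
])
print(parse_decoded_fields(SAMPLE))
```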
…d simplify func

Path 2 used target_text[:val_start] as the raw prefix which included ALL
context before the value (e.g. full suricata signature [1:2010935:2]
plus 'ET MALWARE ...'). Now it uses the same prefix-shortening regex as
Path 3, extracting only the last 1-2 tokens before the captured value.

This produces simpler, more robust regexes:
  BEFORE: \.+[\d+:\d+:\d+] ET MALWARE ... {\S+} \d+.\d+.\d+.\d+:(\d+)
  AFTER:  \.+\S+} \d+.\d+.\d+.\d+:(\d+)

Removed unused simplify_escaped_prefix function.
…rouped'

- Removed trailing comma in <group name="custom,app_name,"> → "custom,app_name"
- rule_description now only sets the child rule description, never the parent.
- Parent description always falls back to "{log_source_name} messages grouped"
  when a child_rule exists. Only when there's NO child_rule does the
  rule_requirement influence the parent description.
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new wazuh_decoder_rule_tool integration: a FastAPI-based UI/API for analyzing pasted logs, optionally validating them via wazuh-logtest, and generating Wazuh decoder/rule XML. It also adds an “enhanced” ML decoder-similarity approach (TF‑IDF + SBERT) plus scripts/datasets to train a custom similarity model from Wazuh ruleset test data.

Changes:

  • Add the FastAPI app’s HTML/JS/CSS frontend and supporting backend utilities for decoder/rule generation workflows.
  • Add ML enhancements: ensemble similarity model wrapper, dataset builder + training script, and accompanying tests/docs.
  • Add local datasets and TLS artifacts for local HTTPS testing (currently including private keys).

Reviewed changes

Copilot reviewed 21 out of 26 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `integrations/wazuh_decoder_rule_tool/tests/test_ml_enhanced.py` | Adds unit tests for enhanced ML similarity components. |
| `integrations/wazuh_decoder_rule_tool/tests/test_integration.py` | Adds a basic integration test for enhanced ML model loading. |
| `integrations/wazuh_decoder_rule_tool/scripts/train_similarity.py` | Adds SBERT contrastive training script for decoder similarity. |
| `integrations/wazuh_decoder_rule_tool/scripts/build_dataset.py` | Adds script to build training/validation datasets from Wazuh rules-testing suites + feedback. |
| `integrations/wazuh_decoder_rule_tool/requirements.txt` | Adds Python dependencies for running the tool (FastAPI/Uvicorn/ML libs). |
| `integrations/wazuh_decoder_rule_tool/README.md` | Documents local HTTPS run instructions, remote VM mode, and ML training workflow. |
| `integrations/wazuh_decoder_rule_tool/ML_ENHANCEMENT_SUMMARY.md` | Documents ML feature-engineering + ensemble approach and future tuning ideas. |
| `integrations/wazuh_decoder_rule_tool/key.pem` | Adds a private key file (should not be committed). |
| `integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094900.xml` | Adds generated decoder XML output artifact. |
| `integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094544.xml` | Adds generated decoder XML output artifact (duplicate-style). |
| `integrations/wazuh_decoder_rule_tool/data/datasets/val.jsonl` | Adds validation dataset records for ML training. |
| `integrations/wazuh_decoder_rule_tool/data/datasets/feedback.jsonl` | Adds feedback dataset examples used for training/tuning. |
| `integrations/wazuh_decoder_rule_tool/data/datasets/feedback_rejections.jsonl` | Adds rejected feedback examples for analysis/training workflows. |
| `integrations/wazuh_decoder_rule_tool/certs/localhost.key` | Adds a private TLS key for local HTTPS (should not be committed). |
| `integrations/wazuh_decoder_rule_tool/certs/localhost.crt` | Adds a self-signed TLS certificate for local HTTPS. |
| `integrations/wazuh_decoder_rule_tool/cert.pem` | Adds a certificate artifact for local HTTPS usage. |
| `integrations/wazuh_decoder_rule_tool/app/wazuh_logtest.py` | Adds a helper to run wazuh-logtest via SSH (currently hardcoded/inconsistent). |
| `integrations/wazuh_decoder_rule_tool/app/templates/index.html` | Adds the single-page HTML UI for the tool. |
| `integrations/wazuh_decoder_rule_tool/app/static/styles.css` | Adds styling for the UI. |
| `integrations/wazuh_decoder_rule_tool/app/static/app.js` | Adds UI logic for navigation, generate/test flows, ML status, AI generation, feedback, history. |
| `integrations/wazuh_decoder_rule_tool/app/decoder_ml.py` | Adds baseline TF‑IDF similarity models + parsing utilities for decoders/rules. |
| `integrations/wazuh_decoder_rule_tool/app/decoder_ml_enhanced.py` | Adds enhanced feature engineering + ensemble similarity model + compatibility wrapper. |
| `integrations/wazuh_decoder_rule_tool/.gitignore` | Adds ignores for venv/cache/model/repo directories. |

function toggleConditionsRow() {
  const req = document.getElementById('ruleRequirement').value.trim();
  document.getElementById('ruleFieldConditionsRow').style.display = req ? 'flex' : 'none';
  document.getElementById('ruleMatchConditionsRow').style.display = req ? 'flex' : 'none';
Comment on lines +18 to +35
    try:
        # This might fail if no Wazuh repo is available, but that's OK for this test
        model = ensure_ml_model_enhanced(force_refresh=False, use_ensemble=True)
        # If we get here without exception, the function works
        assert model is not None or model is None  # Either is fine
        print("✓ ensure_ml_model_enhanced executed successfully")
        return True
    except Exception as e:
        print(f"✗ ensure_ml_model_enhanced failed: {e}")
        return False


if __name__ == "__main__":
    success = test_ensure_ml_model_enhanced()
    if success:
        print("Integration test passed!")
    else:
        print("Integration test failed!")
    parts.extend([self.prematch] * int(prematch_weight))
if self.regex:
    # Extract meaningful tokens from regex
    regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
Comment on lines +38 to +51

parts = []
if self.name:
    parts.extend([self.name] * int(name_weight))
if self.program_name:
    parts.extend([self.program_name] * int(program_weight))
if self.prematch:
    parts.extend([self.prematch] * int(prematch_weight))
if self.regex:
    # Extract meaningful tokens from regex
    regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
    parts.extend(regex_tokens * int(regex_weight))
if self.order:
    parts.extend(self.order * int(order_weight))
Comment on lines +1 to +5
-----BEGIN PRIVATE KEY-----
MIIJQgIBADANBgkqhkiG9w0BAQEFAASCCSwwggkoAgEAAoICAQDeCJuheTkfwUSK
shHW/6XR28sohDtaA+BgE5VQhA/dO0A0OD4Y+FHFvwqDZg4j74mZ1s4BBxdercSO
l1NXmfTJvH0WhY09vSyS3g4N/T1unrtTFUTrC3Dc5ovLAxAUe2AHLGhQcXGWRbTq
pEL1KEoYG89DSisTjSBOcoM3dE8fnU2Gc7YCvLUh8IpIaYLr0GOiQumAGhxIyWGq
Comment on lines +14 to +18
# Cache directories and ML models
data/models/
data/wazuh_repo/
data/wazuh_ruleset_repo/

@@ -0,0 +1,3 @@
{"log":"03-17 16:13:38.811 1702 2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed = false, startingMoved = false, isRelaunching = false","decoder":{"name":"myapp-event","parent":"myapp","prematch":"WindowManager:","regex":"(\\d+-\\d+ \\d+:\\d+:\\d+.\\d+) \\d+ \\d+ \\S WindowManager: \\S+ \\S+ wtoken = (\\.+) token=(\\.+), allDrawn= (\\S+)","order":["logtime","wtoken","token","allDrawn"],"source_file":"feedback/windowmanager.json"}}
{"log":"20171223-22:15:33:144|Step_SPUtils|30002312| getTodayTotalDetailSteps = 1514038440000##7013##548365##8661##12836##27176966","decoder":{"name":"myapp-event","parent":"myapp","prematch":"Step_SPUtils","regex":"(\\.+)\\|Step_SPUtils\\|30002312\\| getTodayTotalDetailSteps = (\\.+)","order":["logtime","getTodayTotalDetailSteps"],"source_file":"feedback/pipemetric.json"}}
{"timestamp": "2026-05-16T08:56:11.647689Z", "approved": true, "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": "", "decoder": {"name": "myapp-event", "parent": "myapp", "prematch": "scada-engine", "regex": "ALERT\\s+Modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+", "order": ["srcip", "function_code"], "source_file": "feedback/myapp.json"}, "target_text": "myapp-event myapp scada-engine alert\\s+modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+ srcip function_code feedback/myapp.json"}
{"timestamp": "2026-04-29T05:52:13.354712Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": ["logtime", "loglevel", "message"], "notes": "[(\\d+-\\d+-\\S+:\\d+:\\d+,\\d+)][(\\S+)\\s][\\.+] [\\S+] (\\.+)"}
{"timestamp": "2026-04-29T08:50:41.323760Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"}
{"timestamp": "2026-04-29T08:50:41.368350Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"}
{"timestamp": "2026-05-16T08:56:23.312599Z", "approved": false, "app_name": "myapp", "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": ""}
Comment on lines +71 to +78
For this workspace, the app now defaults to:

```bash
WAZUH_SSH_HOST=192.168.56.10
WAZUH_SSH_PORT=22
WAZUH_SSH_USER=vagrant
WAZUH_SSH_PASSWORD=vagrant
```
Comment on lines +4 to +18
WAZUH_HOST = "127.0.0.1"
WAZUH_PORT = "2222"
WAZUH_USER = "vagrant"

# read from environment variable
WAZUH_LOGTEST = os.getenv("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")


def run_logtest(log_line):
    cmd = [
        "ssh",
        "-p", WAZUH_PORT,
        f"{WAZUH_USER}@{WAZUH_HOST}",
        f"sudo {WAZUH_LOGTEST}"
    ]
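
One way to reconcile the hardcoded values above with the `WAZUH_SSH_*` variables documented in the README is to build the command from the environment. This is a sketch under assumptions: key-based SSH authentication is assumed (the `ssh` CLI does not read `WAZUH_SSH_PASSWORD`), `build_logtest_command` is a hypothetical helper, and the timeout values are illustrative.

```python
import os
import subprocess
from typing import List, Mapping


def build_logtest_command(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the ssh command from WAZUH_SSH_* variables with README defaults."""
    host = env.get("WAZUH_SSH_HOST", "192.168.56.10")
    port = env.get("WAZUH_SSH_PORT", "22")
    user = env.get("WAZUH_SSH_USER", "vagrant")
    logtest = env.get("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")
    if not port.isdigit():
        raise ValueError(f"WAZUH_SSH_PORT must be numeric, got {port!r}")
    return [
        "ssh", "-p", port,
        "-o", "ConnectTimeout=10",  # OpenSSH option; the value is illustrative
        f"{user}@{host}",
        f"sudo {logtest}",
    ]


def run_logtest(log_line: str, timeout_s: float = 30.0) -> str:
    """Send one log line to wazuh-logtest over SSH and return its output."""
    completed = subprocess.run(
        build_logtest_command(),
        input=log_line + "\n",
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    # Both streams are returned; which one carries the report is not assumed.
    return completed.stdout + completed.stderr


print(build_logtest_command({"WAZUH_SSH_PORT": "2222"}))
```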
- Hybrid AI generation: programmatic base XML (guaranteed correct) + AI review for regex improvement
- wazuh-logtest always checked before AI generation to determine parent strategy
- Parent decoder uses <program_name> when log has a decoded program name
- Fields already decoded by built-in decoders are skipped automatically
- AI prompt refocused on reviewing/improving regex patterns instead of writing XML from scratch
- Git subprocess calls now have timeouts to prevent startup hangs
- Updated README with AI provider setup and hybrid approach documentation
…ation

- Removed Decoder Generator and Rule Generator sections from HTML
- Moved input fields (appName, logsInput, extractFields, etc.) into AI view
- Removed 'Generate Decoder' and 'Generate Rule' sidebar nav items
- Made 'AI Generate' the default active view
- Cleaned up app.js: removed unused functions (showAnalysis, showXml,
  syncFeedback, readRulePayload, rule conditions UI, old button handlers)
- Updated history loading and test function to work without decoder view
- Added POST /api/install endpoint to write decoder/rule XML to Wazuh's
  custom decoders/rules directories (SSH or local)
- Added POST /api/uninstall endpoint to remove installed files
- Added POST /api/logtest/raw endpoint for running wazuh-logtest with
  arbitrary log samples and returning raw output + parsed fields
- Redesigned Test view with three cards: Installed Decoder (install/
  uninstall), Test Logs (editable sample input), and wazuh-logtest
  Output (raw stdout + parsed fields table)
- Added state management storing installed file paths in localStorage
- AI-generated XML is now persisted in JS so it can be installed from
  the Test view without re-running AI generation
…ailure

- Add generation_mode (auto/decoder_only/rule_only/both) to AI request
- Add validate_with_logtest flag and /api/ai/generate-validated endpoint
- Add _collect_ai_response, _extract_xml_from_ai_response helpers
- Add _validate_ai_decoder_with_logtest for auto-install+test validation
- Refactor _build_ai_prompt: shorter config block, concise ML/logtest context
- Add system prompt for Ollama (system+user roles), fix URL path
- Lower default temperature to 0.05 for more deterministic output
- Default model changed to wazuh-decoder
- UI: generation mode dropdown, validate checkbox, Generate & Validate button
- UI: show validation badge & details in AI output section
- UI: hide rule section when generation_mode=decoder_only
…ndpoint and automate rule group/static field sanitization
…coring, and sigmoid calibration

- Add log-type detection (_detect_log_type) with type-based boosting to bias results toward relevant decoder families (JSON, Windows, syslog, etc.)
- Add regex token overlap scoring (_regex_overlap_score) to boost patterns whose OS_Regex tokens match query log literals
- Add sigmoid confidence calibration for well-calibrated probabilities in [0,1]
- Tune ensemble weights: TF-IDF 0.3, SBERT 0.7 (semantic model is stronger for unseen formats)
- Raise minimum confidence gate to 0.15 to avoid low-confidence noise
- Add fine-tuned SBERT checkpoint loading with graceful fallback
- Enhance tokenizer to preserve more OS_Regex character classes
… Modelfile

- Lower temperature (0.05→0.02) and top_p (0.85→0.80) for more deterministic output
- Increase repeat_penalty (1.15→1.20) and lower top_k (20→15) to reduce repetition
- Add self-validation checklist to catch common errors before output
- Add JSON log decoder and DHCP/MAC address examples
- Fix sshd example to use same decoder name for multiple children
- Add instruction: 'No text before or after' the XML block
…lization

- Default OLLAMA_BASE_URL to http://localhost:11434/v1 so it works without env vars
- Normalize /v1 suffix to prevent double-/v1 404 errors in URL construction
- Add 60s timeout to streaming client with retry on ReadTimeout (up to 3 attempts)
- Add decoder rule: multiple child decoders must use exact same decoder name
- Fix IP regex guidance: do not escape dots in \d+.\d+.\d+.\d+
- Update top_k to 15 and repeat_penalty to 1.20 to match Modelfile tuning
- Improve error messages for network/server issues
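
The `/v1` normalization can be sketched as a small pure function. The helper names are hypothetical; the `/chat/completions` path is the usual OpenAI-compatible route and is used here as an assumption about which endpoint the tool calls.

```python
def normalize_base_url(url: str) -> str:
    """Return url with exactly one trailing /v1 and no trailing slash."""
    base = url.strip().rstrip("/")
    # Strip any number of existing /v1 suffixes before adding a single one.
    while base.endswith("/v1"):
        base = base[: -len("/v1")].rstrip("/")
    return base + "/v1"


def chat_endpoint(url: str) -> str:
    """Build the chat endpoint without risking a doubled /v1 segment."""
    return normalize_base_url(url) + "/chat/completions"


print(chat_endpoint("http://localhost:11434"))
print(chat_endpoint("http://localhost:11434/v1/"))
```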
…to dataset builder

- Add load_rejection_records(): convert rejection notes with regex corrections into positive training pairs
- Add augment_with_dropout(): create robustness variants by randomly masking log tokens (15% prob)
- Rejection corrections teach SBERT to distinguish correct from broken regex patterns
- Dropout augmentation teaches model that partial log lines still map to same decoder
- Add structured logging of record counts throughout pipeline
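
Token dropout of the kind described could be sketched as below. Whitespace tokenization, the `[MASK]` placeholder, and the signature are assumptions; the PR's `augment_with_dropout()` may drop tokens instead of replacing them.

```python
import random
from typing import Optional


def augment_with_dropout(
    log: str,
    drop_prob: float = 0.15,
    mask: str = "[MASK]",
    rng: Optional[random.Random] = None,
) -> str:
    """Replace each whitespace-separated token with mask at drop_prob."""
    if rng is None:
        rng = random.Random()
    tokens = log.split()
    return " ".join(mask if rng.random() < drop_prob else tok for tok in tokens)


line = "ALERT Modbus unauthorized write request detected from 10.10.50.24"
# A seeded generator makes the augmentation reproducible across runs.
print(augment_with_dropout(line, rng=random.Random(0)))
```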
…nting to SBERT training

- 5 epochs with best-checkpoint saving (by validation AUC)
- Larger batch size (64 configurable) for better in-batch negatives with MultipleNegativesRankingLoss
- Hard-negative augmentation: pair logs with categorically distinct decoders (30% ratio)
- Token dropout data augmentation for robustness on partial input
- Early stopping with patience=2 epochs
- Add binary evaluator with both positive and negative pairs for AUC measurement
- Configurable training device (default CPU to avoid MPS OOM with Ollama)
- Copy best checkpoint to 'final' directory for easy model loading
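
Early stopping with patience=2 on validation AUC, together with best-checkpoint tracking, can be illustrated independently of the training framework. The class below is a generic sketch, not the training script's code, and the AUC values are made up for the demonstration.

```python
class EarlyStopper:
    """Track the best validation score and stop after `patience` bad epochs."""

    def __init__(self, patience: int = 2) -> None:
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch: int, score: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            # New best: this is where a checkpoint would be saved.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, auc in enumerate([0.80, 0.85, 0.84, 0.83, 0.90], start=1):  # made-up AUCs
    if stopper.update(epoch, auc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch)
```

With these values the loop halts before the fifth epoch is seen, which shows the usual trade-off of a small patience: a later improvement can be missed.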
The sidebar defaulted to AI Generate as active, but the corresponding
#view-ai div was missing the 'active' class, so CSS display:none kept
the entire AI generation page blank on initial load.