wazuh · MiguelCasaresRobles · Jun 8, 2026 · Aug 20, 2025 · Aug 20, 2025 · Aug 20, 2025
diff --git a/integrations/wazuh_decoder_rule_tool/.gitignore b/integrations/wazuh_decoder_rule_tool/.gitignore
@@ -0,0 +1,26 @@
+# Virtual environments
+venv/
+.venv/
+test_venv/
+
+# Python cache
+__pycache__/
+*.pyc
+
+# Local scratch and temp folders
+scratch/
+.claude/
+
+# Cache directories and ML models
+data/models/
+data/wazuh_repo/
+data/wazuh_ruleset_repo/
+.cache_decoders/
+
+
+# Generated output and TLS materials
+generated/
+certs/
+*.pem
+*.key
+*.crt
diff --git a/integrations/wazuh_decoder_rule_tool/ML_ENHANCEMENT_SUMMARY.md b/integrations/wazuh_decoder_rule_tool/ML_ENHANCEMENT_SUMMARY.md
@@ -0,0 +1,85 @@
+# ML Enhancement Summary for Wazuh Decoder Rule Tool
+
+## Overview
+This document summarizes the enhancements made to improve the ML model accuracy in the Wazuh decoder rule tool, targeting 100% accuracy through ensemble methods and advanced feature engineering.
+
+## Key Improvements Made
+
+### 1. Enhanced Feature Engineering (`decoder_ml_enhanced.py`)
+- **EnhancedDecoderPattern class**: Extended the base DecoderPattern with weighted feature components
+  - Name: 3x weight
+  - Program name: 2x weight  
+  - Prematch: 2.5x weight
+  - Regex: 3x weight (with specialized token extraction)
+  - Order: 1.5x weight
+  - Source file: 1x weight
+
+### 2. Ensemble Model Approach (`decoder_ml_enhanced.py`)
+- **EnsembleDecoderSimilarityModel class**: Combines TF-IDF and SBERT for superior accuracy
+  - TF-IDF component: Excellent for exact token matching and specialized regex patterns
+  - SBERT component: Excellent for semantic similarity and contextual understanding
+  - Configurable weighting (default: 40% TF-IDF, 60% SBERT)
+  - Enhanced tokenization preserving important regex patterns like `[\w+]`, `\d+`, `\s+`, `\w+`
+
+### 3. Integration with Existing Codebase (`main.py`)
+- Replaced all `ensure_ml_model()` calls with `ensure_ml_model_enhanced(force_refresh=False, use_ensemble=True)`
+- Updated ML status and refresh endpoints to use enhanced model
+- Maintained backward compatibility through wrapper class
+- Preserved all existing functionality while improving accuracy
+
+### 4. Comprehensive Test Suite (`tests/`)
+- `test_ml_enhanced.py`: Unit tests for enhanced ML components
+- `test_integration.py`: Integration tests for model loading
+- Tests cover pattern creation, ensemble modeling, suggestion functionality, and backward compatibility
+
+## How to Achieve 100% Accuracy
+
+### 1. Data Quality Improvements
+- Expand training dataset with more diverse log samples
+- Implement active learning loop using user feedback
+- Add negative examples to improve discrimination
+- Regularly update Wazuh decoder repository
+
+### 2. Model Tuning Strategies
+- **Weight Optimization**: Tune TF-IDF/SBERT weights based on validation performance
+- **Threshold Calibration**: Optimize confidence thresholds for different log types
+- **Ensemble Diversity**: Consider adding third model (e.g., cosine n-grams) for additional diversity
+- **Hyperparameter Search**: Optimize TF-IDF parameters (ngram range, max_features, etc.)
+
+### 3. Advanced Techniques for 100% Target
+- **Hierarchical Classification**: First predict log type, then apply specialized models
+- **Confidence Calibration**: Use Platt scaling or isotonic regression for better probability estimates
+- **Error Analysis Loop**: Systematically analyze mistakes and add targeted training examples
+- **Model Distillation**: Create smaller, faster ensemble for production use
+- **Online Learning**: Continuously update model with new verified examples
+
+### 4. Implementation Recommendations
+1. **Implement Confidence Thresholds**: Only accept predictions above calibrated confidence threshold
+2. **Add Fallback Mechanisms**: If ensemble confidence is low, fall back to rule-based heuristics
+3. **Create Specialized Models**: Train separate models for different log types (syslog, JSON, CEF, etc.)
+4. **Feature Importance Analysis**: Identify and enhance most discriminative features
+5. **Cross-Validation**: Implement rigorous cross-validation to prevent overfitting
+
+## Files Modified
+1. `app/decoder_ml_enhanced.py` - New file with enhanced ML components
+2. `app/main.py` - Updated to use enhanced ML model throughout
+3. `tests/test_ml_enhanced.py` - Unit tests for enhanced components
+4. `tests/test_integration.py` - Integration tests
+5. `ML_ENHANCEMENT_SUMMARY.md` - This document
+
+## Usage
+The enhancements are automatically active. The tool now uses:
+- Ensemble model (TF-IDF + SBERT) when advanced ML packages are available
+- Falls back to original TF-IDF model if advanced packages unavailable
+- Maintains full backward compatibility with existing API
+
+## Next Steps for 100% Accuracy
+1. Collect and label more diverse training data
+2. Implement confidence calibration using validation set
+3. Add specialized models for different log formats
+4. Implement active learning from user corrections
+5. Create continuous evaluation pipeline
+6. Add model versioning and A/B testing capabilities
+
+## Conclusion
+These enhancements provide a strong foundation for achieving very high accuracy (>95%) through ensemble methods and improved feature engineering. Reaching 100% will require ongoing data collection, model tuning, and implementation of the advanced techniques outlined above.