Fixes stack stats with a work item job#2129

Merged
niemyjski merged 10 commits into main from bugfix/stack-usage-job
Feb 24, 2026

@niemyjski (Member)

Addresses an issue where stack statistics may become inaccurate due to a bug.

Introduces a work item handler and job to recalculate stack stats based on event data within a specified time range. An admin endpoint has been added to queue the work item.

Adds tests to ensure the stats are correctly repaired.

Copilot AI left a comment

Pull request overview

This PR introduces a fix for stack statistics that may have become corrupted due to a bug. The solution adds a work item handler and admin endpoint to recalculate stack statistics from event data for stacks created within a specified time window.

Changes:

  • Adds a new work item type (FixStackStatsWorkItem) and handler to repair stack stats by aggregating event data
  • Introduces new repository methods: GetByCreatedUtcRangeAsync for querying stacks by creation date, SetEventCounterAsync for setting stack counters with monotonic updates, and GetEventStatsForStacksAsync for aggregating event statistics
  • Adds admin endpoint /admin/maintenance/fix-stack-stats with configurable UTC time window (defaults to 2026-02-10 start)
  • Refactors IncrementEventCounterAsync to use the PatchAsync pattern for consistency
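
One plausible reading of the "monotonic updates" mentioned for SetEventCounterAsync is that a repair may only widen the occurrence range and raise the total, never shrink them. The sketch below illustrates that rule in isolation. `StackCounters` and `MonotonicStackStats` are hypothetical names invented for this illustration, and the real method applies its update through the repository rather than in memory.

```csharp
using System;

// Hypothetical stand-in for the three counters kept on a stack document.
public sealed record StackCounters(DateTime FirstOccurrence, DateTime LastOccurrence, long TotalOccurrences);

public static class MonotonicStackStats
{
    // Merges recalculated values into the stored ones without moving a counter backwards:
    // the first occurrence can only get earlier, the last occurrence later, the total larger.
    public static StackCounters Merge(StackCounters stored, StackCounters recalculated) =>
        new(
            stored.FirstOccurrence <= recalculated.FirstOccurrence ? stored.FirstOccurrence : recalculated.FirstOccurrence,
            stored.LastOccurrence >= recalculated.LastOccurrence ? stored.LastOccurrence : recalculated.LastOccurrence,
            Math.Max(stored.TotalOccurrences, recalculated.TotalOccurrences));

    // True when the merge would change nothing, so the write can be skipped.
    public static bool IsUpToDate(StackCounters stored, StackCounters recalculated) =>
        Merge(stored, recalculated) == stored;
}
```

Under this assumption, a stack whose stored values already cover the recalculated ones is left alone, which is one way a repair job could avoid unnecessary writes.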

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/http/admin.http | Adds HTTP test cases for the new fix-stack-stats endpoint with default and explicit UTC windows |
| tests/Exceptionless.Tests/Repositories/StackRepositoryTests.cs | Adds tests for GetByCreatedUtcRangeAsync and SetEventCounterAsync monotonic update behavior |
| tests/Exceptionless.Tests/Repositories/EventRepositoryTests.cs | Adds tests for GetEventStatsForStacksAsync aggregation logic and removes unused import |
| tests/Exceptionless.Tests/Jobs/FixStackStatsJobTests.cs | Comprehensive test suite covering repair scenarios, boundary conditions, and edge cases |
| tests/Exceptionless.Tests/Controllers/AdminControllerTests.cs | End-to-end tests for admin endpoint with various window configurations and validation |
| src/Exceptionless.Web/Controllers/AdminController.cs | Adds fix-stack-stats case to RunJobAsync with parameter validation and default date |
| src/Exceptionless.Core/Repositories/StackRepository.cs | Implements GetByCreatedUtcRangeAsync and SetEventCounterAsync, refactors IncrementEventCounterAsync to use PatchAsync |
| src/Exceptionless.Core/Repositories/Interfaces/IStackRepository.cs | Adds interface definitions for new repository methods |
| src/Exceptionless.Core/Repositories/Interfaces/IEventRepository.cs | Adds GetEventStatsForStacksAsync interface and StackEventStats record |
| src/Exceptionless.Core/Repositories/EventRepository.cs | Implements GetEventStatsForStacksAsync with pagination support for aggregating event statistics |
| src/Exceptionless.Core/Models/WorkItems/FixStackStatsWorkItem.cs | Defines the work item model with UTC time window properties |
| src/Exceptionless.Core/Jobs/WorkItemHandlers/FixStackStatsWorkItemHandler.cs | Implements the handler with pagination, progress reporting, and monotonic update logic |
| src/Exceptionless.Core/Bootstrapper.cs | Registers the new work item handler (handlers reordered alphabetically) |

Refactors the stack event stats calculation to use aggregations, resulting in significantly improved performance.
Removes unnecessary data loading and simplifies the stats computation logic.
Ensures the stack stats job correctly processes and updates stack statistics by adjusting the UTC timestamp handling and fixing validation issues.

Improves the accuracy of the fix stack stats job and updates related tests to reflect the changes and ensure they account for all edge cases.

Comment on lines 208 to 236
[Fact]
public async Task GetEventStatsForStacksAsync_WhenStacksHaveEvents_ShouldReturnExpectedAggregates()
{
    // Arrange
    var stack1 = await _stackRepository.AddAsync(_stackData.GenerateStack(generateId: true, organizationId: TestConstants.OrganizationId, projectId: TestConstants.ProjectId), o => o.ImmediateConsistency());
    var stack2 = await _stackRepository.AddAsync(_stackData.GenerateStack(generateId: true, organizationId: TestConstants.OrganizationId, projectId: TestConstants.ProjectId), o => o.ImmediateConsistency());

    var stack1First = new DateTimeOffset(2026, 2, 12, 0, 0, 0, TimeSpan.Zero);
    var stack1Last = new DateTimeOffset(2026, 2, 13, 0, 0, 0, TimeSpan.Zero);
    var stack2Only = new DateTimeOffset(2026, 2, 14, 0, 0, 0, TimeSpan.Zero);

    await _repository.AddAsync([
        _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack1.Id, occurrenceDate: stack1First),
        _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack1.Id, occurrenceDate: stack1Last),
        _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack2.Id, occurrenceDate: stack2Only)
    ], o => o.ImmediateConsistency());

    // Act
    var stats = await _repository.GetEventStatsForStacksAsync([stack1.Id, stack2.Id]);

    // Assert
    Assert.Equal(2, stats.Count);
    Assert.Equal(2, stats[stack1.Id].TotalOccurrences);
    Assert.Equal(stack1First.UtcDateTime, stats[stack1.Id].FirstOccurrence);
    Assert.Equal(stack1Last.UtcDateTime, stats[stack1.Id].LastOccurrence);
    Assert.Equal(1, stats[stack2.Id].TotalOccurrences);
    Assert.Equal(stack2Only.UtcDateTime, stats[stack2.Id].FirstOccurrence);
    Assert.Equal(stack2Only.UtcDateTime, stats[stack2.Id].LastOccurrence);
}

Copilot AI Feb 24, 2026

The new GetEventStatsForStacksAsync test only covers the case where PersistentEvent.Count is null (effectively 1 per document). Please add a case where an event has Count > 1 to ensure the aggregate uses summed occurrences rather than document count.
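
To make the distinction in this comment concrete, here is a minimal in-memory sketch of the two totals. It is not the repository code: `SampleEvent`, `SampleStackStats`, and `StackStatsSketch` are hypothetical names, and LINQ stands in for the Elasticsearch aggregation. A missing `Count` is treated as 1, following the comment's description of a null count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down event: Count is null for a single occurrence,
// otherwise the number of occurrences the document represents.
public sealed record SampleEvent(string StackId, DateTime Date, int? Count = null);

public sealed record SampleStackStats(DateTime FirstOccurrence, DateTime LastOccurrence, long DocumentCount, long TotalOccurrences);

public static class StackStatsSketch
{
    // Groups events by stack and computes both the document count and the
    // summed occurrence count so the two can be compared side by side.
    public static Dictionary<string, SampleStackStats> Aggregate(IEnumerable<SampleEvent> events) =>
        events
            .GroupBy(e => e.StackId)
            .ToDictionary(
                g => g.Key,
                g => new SampleStackStats(
                    g.Min(e => e.Date),
                    g.Max(e => e.Date),
                    g.LongCount(),
                    g.Sum(e => (long)(e.Count ?? 1))));
}
```

A test case built on this idea would add at least one event with `Count` greater than 1 and assert on the summed value, since the two totals only diverge in that situation.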

Comment on lines 33 to 45
var wi = context.GetData<FixStackStatsWorkItem>();
var utcEnd = wi.UtcEnd ?? _timeProvider.GetUtcNow().UtcDateTime;

Log.LogInformation("Fixing stack stats for stacks created between {UtcStart:O} and {UtcEnd:O}", wi.UtcStart, utcEnd);
await context.ReportProgressAsync(0, $"Starting stack stats repair for window {wi.UtcStart:O} – {utcEnd:O}");

int pagesProcessed = 0;
int totalFixed = 0;
int totalSkipped = 0;

var results = await _stackRepository.GetByCreatedUtcRangeAsync(wi.UtcStart, utcEnd);
long totalStacks = results.Total;

Copilot AI Feb 24, 2026

The work item’s UtcStart/UtcEnd is described as a time window, but the handler uses it only to filter stacks by CreatedUtc and then aggregates events across all time for those stacks. If the underlying stats bug affected stacks that had events during the window (regardless of when the stack was created), those stacks won’t be repaired. Consider driving stack selection and/or event aggregation off the event date window (e.g., aggregate stack_ids from events in the window) so the repair matches the intended scope.
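
The difference between the two scopes can be sketched in memory as follows. This is only an illustration: `WindowStack`, `WindowEvent`, and `RepairScopeSketch` are hypothetical names, LINQ stands in for the repository queries, and treating both window bounds as inclusive is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down shapes used only for this illustration.
public sealed record WindowStack(string Id, DateTime CreatedUtc);
public sealed record WindowEvent(string StackId, DateTime Date);

public static class RepairScopeSketch
{
    // Scope criticized in the comment: stacks whose creation date falls in the window.
    public static HashSet<string> ByStackCreation(IEnumerable<WindowStack> stacks, DateTime utcStart, DateTime utcEnd) =>
        stacks.Where(s => s.CreatedUtc >= utcStart && s.CreatedUtc <= utcEnd)
              .Select(s => s.Id)
              .ToHashSet();

    // Suggested scope: any stack that received at least one event in the window.
    public static HashSet<string> ByEventWindow(IEnumerable<WindowEvent> events, DateTime utcStart, DateTime utcEnd) =>
        events.Where(e => e.Date >= utcStart && e.Date <= utcEnd)
              .Select(e => e.StackId)
              .ToHashSet();
}
```

A stack created before the window that still received events inside it appears only in the second set, which is the case the comment is concerned about.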

Comment on lines 73 to 80
Log.LogInformation(
    "Fixing stack {StackId}: first={OldFirst:O}→{NewFirst:O} last={OldLast:O}→{NewLast:O} total={OldTotal}→{NewTotal}",
    stack.Id,
    stack.FirstOccurrence, firstOccurrenceToSet,
    stack.LastOccurrence, lastOccurrenceToSet,
    stack.TotalOccurrences, totalOccurrencesToSet);

await _stackRepository.SetEventCounterAsync(stack.Id, firstOccurrenceToSet, lastOccurrenceToSet, totalOccurrencesToSet, sendNotifications: false);

Copilot AI Feb 24, 2026

Logging every repaired stack at Information level can generate very high log volume for large windows and may impact job runtime/ingestion costs. Consider dropping the per-stack message to Debug/Trace (or sampling/batching) while keeping the start/finish summary at Information.

Comment on lines 91 to 101
    .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date)"));

var result = new Dictionary<string, StackEventStats>(stackIds.Count);
foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
{
    var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
    var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
    if (first is null || last is null || bucket.Total is null)
        continue;

    result[bucket.Key] = new StackEventStats(first.Value, last.Value, bucket.Total.Value);

Copilot AI Feb 24, 2026

GetEventStatsForStacksAsync is using the terms bucket doc count (bucket.Total) as TotalOccurrences, which ignores PersistentEvent.Count (duplicates). Other aggregations in the codebase use sum:count~1 to compute occurrence totals. Consider adding a sum aggregation for count (defaulting to 1) and using that value for TotalOccurrences instead of doc_count.

Suggested change
-     .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date)"));
- var result = new Dictionary<string, StackEventStats>(stackIds.Count);
- foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
- {
-     var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
-     var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
-     if (first is null || last is null || bucket.Total is null)
-         continue;
-     result[bucket.Key] = new StackEventStats(first.Value, last.Value, bucket.Total.Value);
+     .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date sum:count~1)"));
+ var result = new Dictionary<string, StackEventStats>(stackIds.Count);
+ foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
+ {
+     var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
+     var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
+     var totalOccurrences = bucket.Aggregations.Sum<double>("sum_count")?.Value;
+     if (first is null || last is null || totalOccurrences is null)
+         continue;
+     result[bucket.Key] = new StackEventStats(first.Value, last.Value, (long)totalOccurrences.Value);

@exceptionless deleted a comment from Copilot AI on Feb 24, 2026
Refactors the stack stats fix job to improve performance and correctness.

The job now processes stacks within a specific organization or all organizations with events in the time window.
It also uses aggregations to calculate stack event stats more efficiently and avoid unnecessary stack updates.
The previous implementation was inefficient and could lead to incorrect stack stats.

Updates the `FixStackStatsWorkItem` model and related code to use `OrganizationId` instead of `Organization` for clarity and consistency with the rest of the codebase.

This change ensures that the correct organization is targeted when fixing stack statistics.

Refactors stack patching to ensure that notifications are always sent after a stack is updated. This addresses an issue where the stack usage job was not triggering notifications due to a conditional check.

Replaces the custom DocumentNotFoundException with the one provided by Foundatio.

Handles potential DocumentNotFoundException when patching a stack, preventing unexpected errors if a stack has been deleted.

Ensures that stack event counter updates succeed even if the stack document does not exist.

This prevents issues in background jobs that process event counts asynchronously, where the stack may have been deleted between event processing and the update operation.
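
The deleted-stack case described above can be sketched as a small wrapper that treats a missing document as "nothing to update". This is an illustration under assumptions, not the merged code: `StackPatchSketch` and `TryPatchAsync` are hypothetical names, and the `DocumentNotFoundException` declared here is a local stand-in so the snippet compiles on its own, whereas the PR uses the type provided by Foundatio.

```csharp
using System;
using System.Threading.Tasks;

// Local stand-in so the sketch compiles on its own; the PR relies on Foundatio's type instead.
public sealed class DocumentNotFoundException : Exception
{
    public DocumentNotFoundException(string id) : base($"Document \"{id}\" was not found.") { }
}

public static class StackPatchSketch
{
    // Runs the patch and reports whether it was applied. A missing document
    // (for example, a stack deleted while events were still being processed)
    // is reported as false instead of surfacing as a failure.
    public static async Task<bool> TryPatchAsync(string stackId, Func<string, Task> patchAsync)
    {
        try
        {
            await patchAsync(stackId);
            return true;
        }
        catch (DocumentNotFoundException)
        {
            return false;
        }
    }
}
```

Returning a flag rather than swallowing the exception silently lets a background job count skipped stacks if it wants to report them.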
@niemyjski merged commit f112cd7 into main on Feb 24, 2026
7 checks passed
@niemyjski deleted the bugfix/stack-usage-job branch on February 24, 2026 14:24
@github-actions

Code Coverage

| Package | Line Rate | Branch Rate | Complexity |
| --- | --- | --- | --- |
| Exceptionless.Core | 67% | 60% | 7524 |
| Exceptionless.AppHost | 26% | 14% | 55 |
| Exceptionless.Web | 56% | 43% | 3499 |
| Exceptionless.Insulation | 24% | 23% | 208 |
| Summary | 61% (12100 / 19711) | 53% (5758 / 10784) | 11286 |
