Fixes stack stats with a work item job #2129
Addresses an issue where stack statistics may become inaccurate due to a bug. Introduces a work item handler and job to recalculate stack stats based on event data within a specified time range. An admin endpoint has been added to queue the work item. Adds tests to ensure the stats are correctly repaired.
Pull request overview
This PR introduces a fix for stack statistics that may have become corrupted due to a bug. The solution adds a work item handler and admin endpoint to recalculate stack statistics from event data for stacks created within a specified time window.
Changes:
- Adds a new work item type (`FixStackStatsWorkItem`) and handler to repair stack stats by aggregating event data
- Introduces new repository methods: `GetByCreatedUtcRangeAsync` for querying stacks by creation date, `SetEventCounterAsync` for setting stack counters with monotonic updates, and `GetEventStatsForStacksAsync` for aggregating event statistics
- Adds admin endpoint `/admin/maintenance/fix-stack-stats` with a configurable UTC time window (start defaults to 2026-02-10)
- Refactors `IncrementEventCounterAsync` to use the `PatchAsync` pattern for consistency
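The "monotonic updates" mentioned for `SetEventCounterAsync` can be illustrated with a minimal sketch. It assumes that monotonic means the first occurrence only moves earlier, the last occurrence only moves later, and the total only grows; the PR does not spell this out, and `MergeMonotonic` is a hypothetical helper, not a method from the codebase.

```csharp
using System;

// Hypothetical helper (not part of the PR): merges recalculated stats into stored
// values under the ASSUMED monotonic rule: first = min, last = max, total = max.
static (DateTime First, DateTime Last, long Total) MergeMonotonic(
    (DateTime First, DateTime Last, long Total) stored,
    (DateTime First, DateTime Last, long Total) recalculated)
{
    var first = recalculated.First < stored.First ? recalculated.First : stored.First;
    var last = recalculated.Last > stored.Last ? recalculated.Last : stored.Last;
    var total = Math.Max(stored.Total, recalculated.Total);
    return (first, last, total);
}

var stored = (First: new DateTime(2026, 2, 12, 0, 0, 0, DateTimeKind.Utc),
              Last: new DateTime(2026, 2, 12, 0, 0, 0, DateTimeKind.Utc),
              Total: 1L);
var recalculated = (First: new DateTime(2026, 2, 11, 0, 0, 0, DateTimeKind.Utc),
                    Last: new DateTime(2026, 2, 13, 0, 0, 0, DateTimeKind.Utc),
                    Total: 5L);

var merged = MergeMonotonic(stored, recalculated);
Console.WriteLine($"first={merged.First:O} last={merged.Last:O} total={merged.Total}");
```

Under this rule a repair pass can never shrink a counter or narrow the occurrence range, so running it concurrently with normal event processing would not roll back newer values.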
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/http/admin.http | Adds HTTP test cases for the new fix-stack-stats endpoint with default and explicit UTC windows |
| tests/Exceptionless.Tests/Repositories/StackRepositoryTests.cs | Adds tests for GetByCreatedUtcRangeAsync and SetEventCounterAsync monotonic update behavior |
| tests/Exceptionless.Tests/Repositories/EventRepositoryTests.cs | Adds tests for GetEventStatsForStacksAsync aggregation logic and removes unused import |
| tests/Exceptionless.Tests/Jobs/FixStackStatsJobTests.cs | Comprehensive test suite covering repair scenarios, boundary conditions, and edge cases |
| tests/Exceptionless.Tests/Controllers/AdminControllerTests.cs | End-to-end tests for admin endpoint with various window configurations and validation |
| src/Exceptionless.Web/Controllers/AdminController.cs | Adds fix-stack-stats case to RunJobAsync with parameter validation and default date |
| src/Exceptionless.Core/Repositories/StackRepository.cs | Implements GetByCreatedUtcRangeAsync and SetEventCounterAsync, refactors IncrementEventCounterAsync to use PatchAsync |
| src/Exceptionless.Core/Repositories/Interfaces/IStackRepository.cs | Adds interface definitions for new repository methods |
| src/Exceptionless.Core/Repositories/Interfaces/IEventRepository.cs | Adds GetEventStatsForStacksAsync interface and StackEventStats record |
| src/Exceptionless.Core/Repositories/EventRepository.cs | Implements GetEventStatsForStacksAsync with pagination support for aggregating event statistics |
| src/Exceptionless.Core/Models/WorkItems/FixStackStatsWorkItem.cs | Defines the work item model with UTC time window properties |
| src/Exceptionless.Core/Jobs/WorkItemHandlers/FixStackStatsWorkItemHandler.cs | Implements the handler with pagination, progress reporting, and monotonic update logic |
| src/Exceptionless.Core/Bootstrapper.cs | Registers the new work item handler (handlers reordered alphabetically) |
Refactors the stack event stats calculation to use aggregations, which significantly improves performance. Removes unnecessary data loading and simplifies the stats computation logic.
Ensures the stack stats job correctly processes and updates stack statistics by adjusting the UTC timestamp handling and fixing validation issues. Improves the accuracy of the fix stack stats job and updates related tests to reflect the changes and ensure they account for all edge cases.

    [Fact]
    public async Task GetEventStatsForStacksAsync_WhenStacksHaveEvents_ShouldReturnExpectedAggregates()
    {
        // Arrange
        var stack1 = await _stackRepository.AddAsync(_stackData.GenerateStack(generateId: true, organizationId: TestConstants.OrganizationId, projectId: TestConstants.ProjectId), o => o.ImmediateConsistency());
        var stack2 = await _stackRepository.AddAsync(_stackData.GenerateStack(generateId: true, organizationId: TestConstants.OrganizationId, projectId: TestConstants.ProjectId), o => o.ImmediateConsistency());

        var stack1First = new DateTimeOffset(2026, 2, 12, 0, 0, 0, TimeSpan.Zero);
        var stack1Last = new DateTimeOffset(2026, 2, 13, 0, 0, 0, TimeSpan.Zero);
        var stack2Only = new DateTimeOffset(2026, 2, 14, 0, 0, 0, TimeSpan.Zero);

        await _repository.AddAsync([
            _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack1.Id, occurrenceDate: stack1First),
            _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack1.Id, occurrenceDate: stack1Last),
            _eventData.GenerateEvent(TestConstants.OrganizationId, TestConstants.ProjectId, stack2.Id, occurrenceDate: stack2Only)
        ], o => o.ImmediateConsistency());

        // Act
        var stats = await _repository.GetEventStatsForStacksAsync([stack1.Id, stack2.Id]);

        // Assert
        Assert.Equal(2, stats.Count);
        Assert.Equal(2, stats[stack1.Id].TotalOccurrences);
        Assert.Equal(stack1First.UtcDateTime, stats[stack1.Id].FirstOccurrence);
        Assert.Equal(stack1Last.UtcDateTime, stats[stack1.Id].LastOccurrence);
        Assert.Equal(1, stats[stack2.Id].TotalOccurrences);
        Assert.Equal(stack2Only.UtcDateTime, stats[stack2.Id].FirstOccurrence);
        Assert.Equal(stack2Only.UtcDateTime, stats[stack2.Id].LastOccurrence);
    }

The new GetEventStatsForStacksAsync test only covers the case where PersistentEvent.Count is null (effectively 1 per document). Please add a case where an event has Count > 1 to ensure the aggregate uses summed occurrences rather than document count.
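To make the requested test case concrete, here is a minimal in-memory sketch of why an event with `Count > 1` distinguishes summed occurrences from document count. The tuple shape and both helper functions are hypothetical stand-ins for illustration; they do not use the repository or Elasticsearch, and a missing count is assumed to mean one occurrence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for persisted events: (StackId, Count).
// ASSUMPTION: a null Count represents a single occurrence.
static Dictionary<string, long> SumOccurrences(IEnumerable<(string StackId, int? Count)> events) =>
    events.GroupBy(e => e.StackId)
          .ToDictionary(g => g.Key, g => g.Sum(e => (long)(e.Count ?? 1)));

// Counts documents per stack and ignores Count entirely.
static Dictionary<string, long> CountDocuments(IEnumerable<(string StackId, int? Count)> events) =>
    events.GroupBy(e => e.StackId)
          .ToDictionary(g => g.Key, g => g.LongCount());

var events = new (string StackId, int? Count)[]
{
    ("stack1", null), // one occurrence
    ("stack1", 5),    // one document representing five occurrences
    ("stack2", null),
};

var summed = SumOccurrences(events);
var docs = CountDocuments(events);
Console.WriteLine($"stack1 summed={summed["stack1"]} docs={docs["stack1"]}");
// prints: stack1 summed=6 docs=2
```

A test that only uses events without a count cannot tell the two strategies apart, because both return the same totals in that case.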

    var wi = context.GetData<FixStackStatsWorkItem>();
    var utcEnd = wi.UtcEnd ?? _timeProvider.GetUtcNow().UtcDateTime;

    Log.LogInformation("Fixing stack stats for stacks created between {UtcStart:O} and {UtcEnd:O}", wi.UtcStart, utcEnd);
    await context.ReportProgressAsync(0, $"Starting stack stats repair for window {wi.UtcStart:O} – {utcEnd:O}");

    int pagesProcessed = 0;
    int totalFixed = 0;
    int totalSkipped = 0;

    var results = await _stackRepository.GetByCreatedUtcRangeAsync(wi.UtcStart, utcEnd);
    long totalStacks = results.Total;

The work item’s UtcStart/UtcEnd are described as a time window, but the handler uses them only to filter stacks by CreatedUtc and then aggregates events across all time for those stacks. If the underlying stats bug affected stacks that had events during the window (regardless of when the stack was created), those stacks won’t be repaired. Consider driving stack selection and/or event aggregation off the event date window (e.g., aggregate stack_ids from events in the window) so the repair matches the intended scope.
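The alternative selection strategy suggested here can be sketched in memory as follows. This is only an illustration: `StackIdsWithEventsInWindow` is a hypothetical helper, the tuple shape stands in for real events, and inclusive window bounds are an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: picks the stacks to repair from event dates rather than
// from the stack's creation date. ASSUMPTION: both window bounds are inclusive.
static HashSet<string> StackIdsWithEventsInWindow(
    IEnumerable<(string StackId, DateTime DateUtc)> events,
    DateTime utcStart,
    DateTime utcEnd) =>
    events.Where(e => e.DateUtc >= utcStart && e.DateUtc <= utcEnd)
          .Select(e => e.StackId)
          .ToHashSet();

var windowStart = new DateTime(2026, 2, 10, 0, 0, 0, DateTimeKind.Utc);
var windowEnd = new DateTime(2026, 2, 20, 0, 0, 0, DateTimeKind.Utc);

var sampleEvents = new[]
{
    // A stack that could have been created long before the window but received an event inside it.
    (StackId: "old-stack", DateUtc: new DateTime(2026, 2, 12, 0, 0, 0, DateTimeKind.Utc)),
    // A stack whose only event falls outside the window.
    (StackId: "untouched-stack", DateUtc: new DateTime(2026, 1, 5, 0, 0, 0, DateTimeKind.Utc)),
};

var toRepair = StackIdsWithEventsInWindow(sampleEvents, windowStart, windowEnd);
Console.WriteLine(string.Join(", ", toRepair)); // prints: old-stack
```

With this selection, a stack created before the window is still picked up when it has events inside it, which is the case the CreatedUtc filter would miss.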

    Log.LogInformation(
        "Fixing stack {StackId}: first={OldFirst:O}→{NewFirst:O} last={OldLast:O}→{NewLast:O} total={OldTotal}→{NewTotal}",
        stack.Id,
        stack.FirstOccurrence, firstOccurrenceToSet,
        stack.LastOccurrence, lastOccurrenceToSet,
        stack.TotalOccurrences, totalOccurrencesToSet);

    await _stackRepository.SetEventCounterAsync(stack.Id, firstOccurrenceToSet, lastOccurrenceToSet, totalOccurrencesToSet, sendNotifications: false);

Logging every repaired stack at Information level can generate very high log volume for large windows and may impact job runtime/ingestion costs. Consider dropping the per-stack message to Debug/Trace (or sampling/batching) while keeping the start/finish summary at Information.

        .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date)"));

    var result = new Dictionary<string, StackEventStats>(stackIds.Count);
    foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
    {
        var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
        var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
        if (first is null || last is null || bucket.Total is null)
            continue;

        result[bucket.Key] = new StackEventStats(first.Value, last.Value, bucket.Total.Value);

GetEventStatsForStacksAsync is using the terms bucket doc count (bucket.Total) as TotalOccurrences, which ignores PersistentEvent.Count (duplicates). Other aggregations in the codebase use sum:count~1 to compute occurrence totals. Consider adding a sum aggregation for count (defaulting to 1) and using that value for TotalOccurrences instead of doc_count.

    -     .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date)"));
    - var result = new Dictionary<string, StackEventStats>(stackIds.Count);
    - foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
    - {
    -     var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
    -     var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
    -     if (first is null || last is null || bucket.Total is null)
    -         continue;
    -     result[bucket.Key] = new StackEventStats(first.Value, last.Value, bucket.Total.Value);
    +     .AggregationsExpression($"terms:(stack_id~{stackIds.Count} min:date max:date sum:count~1)"));
    + var result = new Dictionary<string, StackEventStats>(stackIds.Count);
    + foreach (var bucket in countResult.Aggregations.Terms<string>("terms_stack_id")?.Buckets ?? [])
    + {
    +     var first = bucket.Aggregations.Min<DateTime>("min_date")?.Value;
    +     var last = bucket.Aggregations.Max<DateTime>("max_date")?.Value;
    +     var totalOccurrences = bucket.Aggregations.Sum<double>("sum_count")?.Value;
    +     if (first is null || last is null || totalOccurrences is null)
    +         continue;
    +     result[bucket.Key] = new StackEventStats(first.Value, last.Value, (long)totalOccurrences.Value);

Refactors the stack stats fix job to improve performance and correctness. The job now processes stacks within a specific organization or all organizations with events in the time window. It also uses aggregations to calculate stack event stats more efficiently and avoid unnecessary stack updates. The previous implementation was inefficient and could lead to incorrect stack stats.
Updates the `FixStackStatsWorkItem` model and related code to use `OrganizationId` instead of `Organization` for clarity and consistency with the rest of the codebase. This change ensures that the correct organization is targeted when fixing stack statistics.
Refactors stack patching to ensure that notifications are always sent after a stack is updated. This addresses an issue where the stack usage job was not triggering notifications due to a conditional check.
Replaces the custom DocumentNotFoundException with the one provided by Foundatio. Handles potential DocumentNotFoundException when patching a stack, preventing unexpected errors if a stack has been deleted.
Ensures that stack event counter updates succeed even if the stack document does not exist. This prevents issues in background jobs that process event counts asynchronously, where the stack may have been deleted between event processing and the update operation.
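The tolerant update described in the last two commits might look roughly like the sketch below. It is an in-memory stand-in, not the repository code: the dictionary plays the role of the stack index, `KeyNotFoundException` stands in for Foundatio's `DocumentNotFoundException`, and `TryIncrementTotal` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: applies a counter update and reports a missing stack as
// "nothing to do" instead of letting the exception fail the background job.
static bool TryIncrementTotal(Dictionary<string, long> totals, string stackId, long count)
{
    try
    {
        // The indexer getter throws KeyNotFoundException when the stack is gone,
        // which stands in for a patch against a deleted document.
        totals[stackId] = totals[stackId] + count;
        return true;
    }
    catch (KeyNotFoundException)
    {
        // The stack was deleted between event processing and this update.
        return false;
    }
}

var totals = new Dictionary<string, long> { ["stack1"] = 2 };
Console.WriteLine(TryIncrementTotal(totals, "stack1", 3));        // prints: True
Console.WriteLine(TryIncrementTotal(totals, "deleted-stack", 1)); // prints: False
```

Returning a flag rather than rethrowing keeps asynchronous counter jobs from failing on stacks that no longer exist, while still letting the caller log or count the skipped update.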