Add bin_prop computed variable to stat_bin #6477
Open
kieran-mace wants to merge 1 commit into tidyverse:main
Brings feature parity with stat_count by adding `after_stat(bin_prop)` functionality to stat_bin. The bin_prop variable shows the proportion of each group within each bin, enabling proportion-based visualizations for binned continuous data.

Key features:
- bin_prop = count_in_group / total_count_in_bin
- Works with multiple groups and respects weights
- Backwards compatible (bin_prop = 1 for single groups)
- Properly handles empty bins

Usage:

    ggplot(data, aes(x = continuous_var, y = after_stat(bin_prop), fill = group)) +
      stat_bin(geom = "col", position = "dodge")

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This was referenced May 23, 2025
Open
teunbrand (Collaborator) requested changes on May 23, 2025 and left a comment:
Hi there, thanks for the PR! There are a few concerns that I hope can be alleviated, see related comments.
Comment on lines +78 to +109

```r
if (!is.null(data) && nrow(data) > 0 &&
    all(c("count", "xmin", "xmax") %in% names(data))) {

  # Calculate bin_prop: proportion of each group within each bin
  # Create a unique bin identifier using rounded values to handle floating point precision
  data$bin_id <- paste(round(data$xmin, 10), round(data$xmax, 10), sep = "_")

  # Calculate total count per bin across all groups
  bin_totals <- stats::aggregate(data$count, by = list(bin_id = data$bin_id), FUN = sum)
  names(bin_totals)[2] <- "bin_total"

  # Merge back to get bin totals for each row
  data <- merge(data, bin_totals, by = "bin_id", sort = FALSE)

  # Calculate bin_prop: count within group / total count in bin
  # When bin_total = 0 (empty bin), set bin_prop based on whether there are multiple groups
  n_groups <- length(unique(data$group))
  if (n_groups == 1) {
    # With only one group, bin_prop is always 1 (100% of the bin belongs to this group)
    data$bin_prop <- 1
  } else {
    # With multiple groups, bin_prop = count / total_count_in_bin, or 0 for empty bins
    data$bin_prop <- ifelse(data$bin_total > 0, data$count / data$bin_total, 0)
  }

  # Remove the temporary columns
  data$bin_id <- NULL
  data$bin_total <- NULL
} else {
  # If we don't have the necessary data, just add a default bin_prop column
  data$bin_prop <- if (nrow(data) > 0) rep(1, nrow(data)) else numeric(0)
}
```
Collaborator
This all seems more complicated than it needs to be. Can't this be computed more directly?
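For illustration only, one more direct formulation might look like the sketch below. It is not the reviewer's proposal or the PR's code: the `binned` data frame is a hypothetical stand-in for the per-group output of the binning step, and the choice of base R's `ave()` is an assumption about what "more directly" could mean here.

```r
# Hypothetical stand-in for the binned data: two groups on the same breaks.
binned <- data.frame(
  group = rep(1:2, each = 3),
  xmin  = rep(c(0, 1, 2), times = 2),
  xmax  = rep(c(1, 2, 3), times = 2),
  count = c(3, 0, 1, 1, 0, 3)
)

# Total count per bin across groups, already aligned row-by-row with `binned`,
# so no temporary id column, aggregate(), or merge() is needed.
bin_total <- ave(binned$count, binned$xmin, binned$xmax, FUN = sum)

# Proportion of each group within its bin; empty bins get 0 instead of NaN.
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)
```

Because `ave()` returns a vector in the original row order, the row order of the data is preserved, which `merge()` does not guarantee.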
Comment on lines +159 to +160

```r
#'   width = "widths of bins.",
#'   bin_prop = "proportion of points in bin that belong to each group."
```
Collaborator
Can you regenerate the .Rd files as well?
Comment on lines +270 to +271

```r
# Test with 5 bins to get predictable overlap
p <- ggplot(test_data, aes(x, fill = group)) + geom_histogram(bins = 5)
```
Collaborator
Breaks can be set directly if predictability is an issue
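As a rough sketch of what setting breaks directly could look like, the example below passes `breaks` to `geom_histogram()` so the bin edges no longer depend on the data range. The `test_data` values here are hypothetical and are not the data used in the PR's tests.

```r
library(ggplot2)

# Hypothetical test data; values sit at bin midpoints so that no observation
# falls exactly on a break.
test_data <- data.frame(
  x     = c(0.5, 0.5, 1.5, 2.5, 0.5, 1.5, 1.5, 2.5),
  group = rep(c("a", "b"), each = 4)
)

# Explicit breaks fix the bin edges at 0, 1, 2, 3 for every group.
p <- ggplot(test_data, aes(x, fill = group)) +
  geom_histogram(breaks = c(0, 1, 2, 3))

# Computed layer data: one row per group and bin.
binned <- layer_data(p)
```

With fixed edges, a test can refer to bins by their known `xmin` and `xmax` values instead of discovering them from the output.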
Comment on lines +281 to +288

```r
bins_with_both_groups <- aggregate(data$count > 0, by = list(paste(data$xmin, data$xmax)), sum)
overlapping_bins <- bins_with_both_groups[bins_with_both_groups$x == 2, ]$Group.1

for (bin in overlapping_bins) {
  bin_data <- data[paste(data$xmin, data$xmax) == bin, ]
  total_prop <- sum(bin_data$bin_prop)
  expect_equal(total_prop, 1, tolerance = 1e-6)
}
```
Collaborator
Isn't it simpler to test that the sum over bins is 1, regardless of how many groups?
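A simpler check along these lines might be sketched as follows. The `binned` data frame is a hypothetical stand-in for layer data that already carries a `bin_prop` column, and base R's `stopifnot()` is used here only to keep the sketch free of dependencies; in the test suite the comparison would presumably go through `expect_equal()`.

```r
# Hypothetical layer data with a bin_prop column: three groups, two bins.
binned <- data.frame(
  group    = rep(1:3, times = 2),
  xmin     = rep(c(0, 1), each = 3),
  xmax     = rep(c(1, 2), each = 3),
  count    = c(2, 1, 1, 0, 3, 1),
  bin_prop = c(0.5, 0.25, 0.25, 0, 0.75, 0.25)
)

# Sum bin_prop within each bin, however many groups contribute to it.
prop_sums <- tapply(binned$bin_prop, binned$xmin, sum)

# Every non-empty bin should sum to 1.
stopifnot(isTRUE(all.equal(as.vector(prop_sums), rep(1, length(prop_sums)))))
```

Bins whose total count is zero would sum to 0 under the PR's definition, so they would need to be filtered out before a check like this.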
Comment on lines +327 to +328

```r
bin1_data <- data[data$x == min(data$x), ]
bin2_data <- data[data$x == max(data$x), ]
```
Collaborator
Suggested change

```diff
- bin1_data <- data[data$x == min(data$x), ]
- bin2_data <- data[data$x == max(data$x), ]
+ bin1_data <- data[data$x == 1, ]
+ bin2_data <- data[data$x == 2, ]
```
We know from the test data what these values should be
Summary

Adds `after_stat(bin_prop)` functionality to `stat_bin`, bringing feature parity with `stat_count`. The new `bin_prop` computed variable shows the proportion of each group within each bin.

Closes #6478

Motivation

`stat_count` provides `after_stat(prop)` for proportion-based visualizations, but `stat_bin` lacked equivalent functionality. This made it difficult to create proportion-based histograms for continuous data.

Implementation

- … `compute_panel` method to `StatBin` that calculates `bin_prop = count_in_group / total_count_in_bin`
- … (`bin_prop = 1`)

Usage Example

This addresses the feature gap where users could use `after_stat(prop)` with `stat_count` for discrete data but had no equivalent for continuous data with `stat_bin`.

Test plan

- … `stat_bin` tests pass (no regressions)
- … `bin_prop` functionality
- `after_stat(bin_prop)` works correctly in plots

Some example output:
Created on 2025-05-22 with reprex v2.1.1
🤖 Generated with Claude Code