Update PR volume counting from GitHub Archive to GitHub Search API #43

Merged
ashleyzhang01 merged 9 commits into main from update-total-volume-to-use-github-search on Jun 24, 2026

Conversation

@ashleyzhang01 ashleyzhang01 commented Jun 23, 2026

Contributor

GitHub Archive (which mirrors the public Events API) structurally undercounts GitHub App review activity. Over a 2-week comparison window (Jun 3–17), GH Archive captured only ~8–25% of actual PR review volume, depending on the bot.

This PR adds a --source search-api option (now the default) to the volumes command, using reviewed-by:<bot> type:pr queries.
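
For reference, a query of this shape can be sent to the GitHub REST search endpoint (`GET /search/issues`), whose response includes a `total_count` field. The sketch below only builds the request parameters and makes no network call. The helper name `build_search_params` and the `created:` date qualifier are assumptions for illustration and are not taken from this PR.

```python
from datetime import date

SEARCH_URL = "https://api.github.com/search/issues"


def build_search_params(bot_username: str, start: date, end: date) -> dict[str, str]:
    """Build query parameters for counting PRs reviewed by a bot.

    Assumption: the date window is applied with a `created:` qualifier;
    the PR itself may filter on a different date field.
    """
    query = (
        f"reviewed-by:{bot_username} type:pr "
        f"created:{start.isoformat()}..{end.isoformat()}"
    )
    # per_page=1 keeps the payload small; only total_count is of interest.
    return {"q": query, "per_page": "1"}


params = build_search_params("test[bot]", date(2026, 6, 3), date(2026, 6, 17))
print(params["q"])
```

Passing `params` to an HTTP client's GET against `SEARCH_URL` and reading `total_count` from the JSON body would then give the count for that window.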

Copilot AI left a comment

Pull request overview

Adds a GitHub Search API–backed implementation for computing PR volume counts (as an alternative to the existing BigQuery/GH Archive approach) and wires it into the CLI so volumes can be fetched without BigQuery.

Changes:

  • Added Search API querying helpers (_search_api_count, _date_range, _weekly_chunks) and a new fetch_pr_volumes_search_api pipeline path.
  • Updated the volumes CLI subcommand to support --source {bq,search-api} (default search-api) and --weekly.
  • Expanded unit tests to cover the new date helpers and Search API count behavior.
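
As an illustration of the weekly chunking idea listed above, here is a minimal sketch that splits an inclusive date range into spans of at most seven days. The function name `weekly_chunks` and its signature are hypothetical; the PR's `_weekly_chunks` helper may behave differently (for example, by aligning to calendar weeks).

```python
from datetime import date, timedelta


def weekly_chunks(start: date, end: date) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into consecutive spans of at most 7 days."""
    chunks: list[tuple[date, date]] = []
    cursor = start
    while cursor <= end:
        # Each chunk covers up to 7 days, clipped to the end of the range.
        chunk_end = min(cursor + timedelta(days=6), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks


chunks = weekly_chunks(date(2026, 6, 3), date(2026, 6, 17))
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Issuing one query per chunk keeps each result set small, which is presumably the motivation for the `--weekly` flag.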

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| online/etl/pipeline/volumes.py | Adds Search API volume counting, retry logic, and date chunking utilities. |
| online/etl/main.py | Extends volumes CLI to select data source and enable weekly chunking. |
| online/etl/tests/test_volumes.py | Adds tests for date range/chunking and Search API counting behavior. |

Comment on lines +168 to +176
    if resp.status_code == 403:
        retry_after = resp.headers.get("Retry-After")
        wait = int(retry_after) if retry_after else 60
        logger.warning(
            f"Search API rate limited for {bot_username}, "
            f"waiting {wait}s (attempt {attempt + 1}/3)"
        )
        await asyncio.sleep(wait)
        continue
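
One detail in the fragment above: `int(retry_after)` raises `ValueError` if the header is present but not an integer (the HTTP specification also allows `Retry-After` to be an HTTP date). A minimal sketch of a more defensive parse follows. The helper `retry_wait_seconds` is hypothetical and not part of this PR; the 60-second fallback mirrors the fragment.

```python
from typing import Mapping, Optional


def retry_wait_seconds(headers: Mapping[str, str], default: int = 60) -> int:
    """Return how long to wait before retrying, based on Retry-After.

    Falls back to `default` when the header is missing or is not a
    non-negative integer (for example, when it is an HTTP date).
    """
    value: Optional[str] = headers.get("Retry-After")
    if value is None:
        return default
    try:
        seconds = int(value.strip())
    except ValueError:
        return default
    return seconds if seconds >= 0 else default


print(retry_wait_seconds({"Retry-After": "1"}))
print(retry_wait_seconds({}))
```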
Comment on lines +181 to +196
    @pytest.mark.asyncio
    async def test_search_api_count_retries_on_rate_limit() -> None:
        rate_limited = MagicMock(spec=httpx.Response)
        rate_limited.status_code = 403
        rate_limited.headers = {"Retry-After": "1"}

        success = MagicMock(spec=httpx.Response)
        success.status_code = 200
        success.json.return_value = {"total_count": 42}

        mock_client = AsyncMock(spec=httpx.AsyncClient)
        mock_client.get.side_effect = [rate_limited, success]

        result = await _search_api_count(mock_client, "test[bot]", "2026-06-15")
        assert result == 42
        assert mock_client.get.call_count == 2
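
If `asyncio.sleep` is not patched elsewhere in the test module, a test of this shape waits for the real `Retry-After` delay. The sketch below shows one way to exercise a retry path without waiting, using `unittest.mock.patch`. `count_with_retry` is a simplified stand-in written for this example, not the PR's `_search_api_count`, and the URL is a placeholder.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch


async def count_with_retry(client, url: str, attempts: int = 3) -> int:
    """Stand-in for the function under test (hypothetical, not the PR's code)."""
    for _ in range(attempts):
        resp = await client.get(url)
        if resp.status_code == 403:
            await asyncio.sleep(int(resp.headers.get("Retry-After", "60")))
            continue
        return resp.json()["total_count"]
    return 0


def run_retry_test() -> tuple[int, int]:
    rate_limited = MagicMock(status_code=403, headers={"Retry-After": "30"})
    success = MagicMock(status_code=200)
    success.json.return_value = {"total_count": 42}

    client = MagicMock()
    client.get = AsyncMock(side_effect=[rate_limited, success])

    # Replace asyncio.sleep so the retry path runs without actually waiting.
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = asyncio.run(count_with_retry(client, "https://example.invalid"))
        # The requested delay is still checked, just not slept.
        fake_sleep.assert_awaited_once_with(30)
    return result, client.get.call_count


result, calls = run_retry_test()
print(result, calls)
```

Asserting on the mocked sleep also verifies that the header value was honored, which a real one-second wait does not check.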
@ashleyzhang01 ashleyzhang01 changed the title Update total volume to use GitHub search Update PR volume counting from GitHub Archive to GitHub Search API Jun 23, 2026
@ashleyzhang01 ashleyzhang01 merged commit 7e4fada into main Jun 24, 2026
2 checks passed
3 participants