
Sync main branch to develop #149

Merged
snowfox1003 merged 91 commits into main from develop on Apr 29, 2026

Conversation


@snowfox1003 snowfox1003 commented Apr 24, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • YouTube Script Tracker: Track and transcribe C++ community videos from YouTube channels.
    • WG21 Paper Tracker: Automatically monitor and catalog C++ standards committee papers.
    • Clang GitHub Tracker: New database for incremental GitHub issue and PR syncing.
    • Startup health notifications sent to Slack/Discord on deployment.
    • GitHub repository dispatch integration for downstream workflow triggers.
  • Improvements

    • Enhanced Boost version parsing and comparison utilities.
    • Refined Markdown export and publication workflows for better error handling.
    • Improved database watermark-based cursoring for efficient incremental syncing.
    • Configurable git commit author metadata.
  • Documentation

    • New service API docs for YouTube and Clang trackers.
    • Updated schema documentation with new entities.
    • Operations guide for GitHub dispatch integration.

leostar0412 and others added 30 commits March 9, 2026 09:25
… use raw/wg21_paper_tracker/YYYY/<mailing_date>/ #24
…ne validation, mailing race recovery, logging, and API docs #24
…ceededError and most importantly introducing rich name extractor
snowfox1003 and others added 26 commits March 31, 2026 17:24
add 403 error fix logic in upload folder to github - #134
…docs-tracker

#126-fixed this app and cppa-pinecone app
#38-add youtube tracker. confirmed that transcripts are downloaded.
… backfill upserts, batch DB helpers, stale-export detection in github_export, safer git clone errors without leaking tokens, and aligned tests including isolated test_settings for publish - #136
Clang GitHub tracker: DB-backed sync, markdown publish, chunked backfill (#136)
Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

This PR significantly expands the data collector with two new tracker apps (YouTube and WG21 papers), introduces a refactored Clang GitHub tracker with database persistence and publishing, unifies GitHub issue/PR fetching with Link-header pagination, improves Boost tracker publishing flows, adds shared version/text utilities, and updates configurations and documentation to integrate all changes.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **YouTube Tracker (New App)**<br>`cppa_youtube_script_tracker/*` | Adds complete YouTube video/channel tracking with fetcher (Google API v3), transcript downloader (yt-dlp), speaker extraction utils, service layer (CRUD), Pinecone preprocessor, models, migrations, admin UI, workspace helpers, and comprehensive test suite. |
| **WG21 Paper Tracker (New App)**<br>`wg21_paper_tracker/*` | Introduces WG21 C++ standards papers tracker with fetcher (BeautifulSoup scraper), pipeline (date-driven mailing/paper processing), CSV import command, GitHub dispatch triggering, models (Mailing/Paper/PaperAuthor), migrations, services, admin UI, workspace utils, and full test coverage. |
| **Clang GitHub Tracker (New App)**<br>`clang_github_tracker/*` (excluding migrations/models) | Refactors to database-backed architecture with issue/PR/commit models, backfill and run commands, DB-driven preprocessors (replacing delegation), publisher (persistent clone with git ops), cursor-based state management, and comprehensive test suite for all components. |
| **Clang GitHub Tracker Models/Migrations**<br>`clang_github_tracker/models.py`, `clang_github_tracker/migrations/0001_initial.py` | Defines ClangGithubIssueItem and ClangGithubCommit ORM models with timestamp watermarks for incremental sync and creates corresponding database schema migration. |
| **Clang GitHub Tracker Services**<br>`clang_github_tracker/services.py` | Adds upsert helpers (single and batch) for issues/PRs and commits with UTC normalization, timestamp merging rules, and watermark/cursor functions for incremental fetch bounds. |
| **GitHub Activity Tracker Refactor**<br>`github_activity_tracker/fetcher.py`, `github_activity_tracker/sync/*.py`, `github_ops/client.py` | Consolidates issue and PR fetching into fetch_issues_and_prs_from_github, refactors commit fetching with Link-header backward pagination (see the sketch after this table), adds Redis ETag caching, introduces rest_request_*_with_all_links client methods, and unifies sync_issues_and_prs function replacing separate sync logic. |
| **GitHub Ops Infrastructure**<br>`github_ops/git_ops.py`, `github_ops/tests/test_git_ops.py` | Enhances git operations with token-safe subprocess error reporting (sanitize_git_output), configurable commit author identity via env vars, 403 retry logic using rate-limit headers, and new prepare_repo_for_pull helper for repo setup before pulling. |
| **Boost Library Docs Tracker**<br>`boost_library_docs_tracker/...` | Updates fetcher (skip existing zips), HTML-to-markdown (text normalization), preprocessor (improved link scoping), management command (cleanup-extract, DB watermark-based sync), and services (remove content_changed status) with matching test updates. |
| **Boost Library Usage Dashboard**<br>`boost_library_usage_dashboard/...` | Refactors analyzer (remove base_dir parameter), introduces new publish_dashboard function with persistent clone/sync/push workflow using GITHUB_TOKEN_WRITE and git commit identity, updates command to use `--skip-*` flags and delegate to publisher, removes legacy _publish_via_raw_clone method. |
| **Boost Mailing List Tracker**<br>`boost_mailing_list_tracker/...` | Corrects misspelled preprocesser module import, changes preprocessor output metadata field from table_ids to source_ids, updates tests accordingly. |
| **Boost Usage Tracker & Boost GitHub Activity Tracker**<br>`boost_usage_tracker/boost_searcher.py`, `boost_library_tracker/release_check.py`, `clang_github_tracker/__init__.py` | Centralizes version parsing via shared boost_version_operations utilities, adds helper function usage, updates docstrings to reflect DB persistence alongside raw JSON. |
| **Core Utilities**<br>`core/utils/boost_version_operations.py`, `core/utils/datetime_parsing.py`, `core/utils/text_processing.py` | Introduces boost_version_operations module (encode/decode, parsing, normalization, comparison, stable tag detection), adds ensure_aware_utc datetime helper, extends text processing to handle nullable inputs and additional character normalization/HTML entity decoding. |
| **Startup Notification Command**<br>`core/management/commands/send_startup_notification.py` | New management command sending health-check summaries (DB/Celery/beat status) to Slack/Discord webhooks with configurable message formatting and error handling. |
| **CPPA Pinecone Sync**<br>`cppa_pinecone_sync/...` | Updates preprocess contract to return optional metadata-update documents, adds configurable parallelism for metadata updates via ThreadPoolExecutor, adjusts document building to use source_ids for ID tracking, supports separate upsert vs metadata-only sync flows. |
| **CPPA Slack Tracker**<br>`cppa_slack_tracker/...` | Relocates text-processing imports to core.utils, changes preprocessor metadata field from ids to source_ids, removes cppa_slack_tracker/tests/test_text_processing.py (duplicate coverage in core), updates test assertions. |
| **CPPA User Tracker**<br>`cppa_user_tracker/models.py`, `cppa_user_tracker/migrations/...`, `cppa_user_tracker/services.py` | Adds YoutubeSpeaker profile type with external_id/display_name fields and migrations, introduces author_alias field on WG21PaperAuthorProfile, adds service helpers for profile/speaker creation and linking. |
| **Configuration & Environment**<br>`.env.example`, `config/settings.py`, `config/test_settings.py`, `Makefile`, `Dockerfile`, `docker-compose.yml`, `.gitignore` | Updates Clang config naming (`PRIVATE_REPO_*` → `CONTEXT_REPO_*`), adds YouTube/WG21 app settings, introduces `GIT_AUTHOR_*` vars for commit identity, refactors Slack token parsing, adds ENABLE_STARTUP_NOTIFICATIONS flag, switches workspace storage from named volumes to bind mounts, adds git safe-directory config in Dockerfile, introduces notify Makefile target. |
| **Operations & Markdown Export**<br>`operations/md_ops/github_export.py`, `operations/tests/test_github_export.py` | Refactors rename/stale-file detection with shared _stale_titled_paths_vs_listing helper, adds detect_stale_titled_paths for filesystem-based stale detection, updates docstrings to reference "target" repo. |
| **Documentation**<br>`docs/...`, `docs/service_api/...` | Extends schema documentation to include Clang/YouTube tracker models, WG21 schema updates, new service API docs for YouTube/Clang/WG21 trackers, updates workspace structure, adds WG21 GitHub dispatch operations guide, refines Pinecone preprocessing guide. |
| **Test Infrastructure & Requirements**<br>`requirements.txt`, `workflow/management/commands/run_all_collectors.py`, `github_activity_tracker/tests/...` | Adds beautifulsoup4, google-api-python-client, yt-dlp==2026.2.4 dependencies, extends test suite with new modules for client Link parsing, fetcher pagination/dates, consolidated sync flow, appends new collector commands to execution sequence. |
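
The GitHub Activity Tracker Refactor row mentions Link-header backward pagination. As a rough, self-contained illustration of that general GitHub REST technique (this is not the project's rest_request_*_with_all_links client, and the Redis ETag caching layer is omitted), a sketch:

```python
import re

import requests

# Matches entries like: <https://api.github.com/...&page=4>; rel="last"
LINK_RE = re.compile(r'<(?P<url>[^>]+)>;\s*rel="(?P<rel>[^"]+)"')


def parse_link_header(link_header: str) -> dict[str, str]:
    """Map rel names ('prev', 'next', 'first', 'last') to page URLs."""
    return {m.group("rel"): m.group("url") for m in LINK_RE.finditer(link_header or "")}


def iter_commits_oldest_first(list_url: str, token: str):
    """Yield commits oldest to newest by walking pages from the rel='last' link backward."""
    headers = {"Authorization": f"Bearer {token}"}
    first = requests.get(list_url, headers=headers, timeout=30)
    first.raise_for_status()
    links = parse_link_header(first.headers.get("Link", ""))
    url = links.get("last")
    if url is None:
        # Single page: GitHub lists newest commits first, so reverse within the page.
        yield from reversed(first.json())
        return
    while url:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        yield from reversed(resp.json())  # reverse within each page as well
        url = parse_link_header(resp.headers.get("Link", "")).get("prev")
```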

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as WG21 Pipeline
    participant Fetcher
    participant DB as Django DB
    participant Dispatch as GitHub API
    participant Webhook as External Webhook

    User->>CLI: run_wg21_paper_tracker
    Note over CLI: Resolve mailing date range
    CLI->>Fetcher: fetch_papers_for_mailing(year, date)
    Fetcher-->>CLI: Paper metadata list
    
    loop For each paper
        CLI->>DB: get_or_create_paper(...)
        DB-->>CLI: (paper, created)
        alt Paper created (new)
            CLI->>DB: get_or_create_paper_author(...)
            DB-->>CLI: Author link
        end
    end
    
    alt New papers exist
        Note over CLI: Collect new paper URLs
        CLI->>Dispatch: repository_dispatch(repo, event, payload)
        Dispatch-->>Webhook: GitHub event trigger
        Dispatch-->>CLI: 202 Accepted
    end
    CLI-->>User: Pipeline complete
sequenceDiagram
    actor User
    participant CLI as YouTube Tracker
    participant Fetcher as YouTube Fetcher
    participant API as YouTube API v3
    participant DB as Django DB
    participant Transcript as yt-dlp
    participant Pinecone as Pinecone Sync

    User->>CLI: run_cppa_youtube_script_tracker
    Note over CLI: Phase 1: Process metadata queue
    
    loop Metadata queue JSON files
        CLI->>DB: Persist video/channel/speaker
        DB-->>CLI: (video, created)
        alt Processing success
            CLI->>CLI: Archive JSON to raw/metadata
        end
    end
    
    Note over CLI: Phase 2: Fetch new videos
    CLI->>Fetcher: fetch_videos(start_date, end_date)
    Fetcher->>API: search().list() + videos().list()
    API-->>Fetcher: Video metadata
    loop For each video
        Fetcher->>DB: Persist to metadata queue
        CLI->>DB: Upsert video/channel
    end
    
    Note over CLI: Phase 3: Download transcripts
    loop For videos without transcripts
        CLI->>Transcript: download_vtt(video_id)
        Transcript-->>CLI: Transcript file path
        CLI->>DB: update_video_transcript()
    end
    
    Note over CLI: Phase 4: Pinecone sync
    CLI->>Pinecone: run_cppa_pinecone_sync(...)
    Pinecone-->>CLI: Sync status
    CLI-->>User: Tracker complete
sequenceDiagram
    actor User
    participant CLI as Clang GitHub Tracker
    participant StateManager as State Manager
    participant GitHub as GitHub API
    participant SyncRaw as Sync Raw
    participant DB as Django DB
    participant Preprocessor as Preprocessor
    participant Publisher as Publisher
    participant GitOps as Git Operations

    User->>CLI: run_clang_github_tracker --since/--until
    CLI->>StateManager: resolve_start_end_dates(since, until)
    StateManager->>DB: Query watermarks (max updated_at)
    DB-->>StateManager: Issue/Commit timestamps
    StateManager-->>CLI: (start_commit, start_item, end_date)
    
    alt Not skip-github-sync
        CLI->>SyncRaw: sync_clang_github_activity(...)
        SyncRaw->>GitHub: Fetch commits/issues/PRs
        GitHub-->>SyncRaw: Raw JSON payloads
        SyncRaw->>DB: Upsert via services
        DB-->>SyncRaw: (inserted, updated counts)
        SyncRaw-->>CLI: Sync results
    end
    
    alt Not skip-markdown-export
        CLI->>Preprocessor: preprocess_for_pinecone(...)
        Preprocessor->>DB: Query issue/PR candidates
        Preprocessor-->>CLI: Markdown documents
        Note over CLI: Export to md_output_dir
    end
    
    alt Not skip-remote-push
        CLI->>Publisher: publish_clang_markdown(...)
        Publisher->>GitOps: clone_repo() / prepare_repo_for_pull()
        GitOps-->>Publisher: Repo ready
        Publisher->>Publisher: Copy markdown files
        Publisher->>GitOps: push(branch, commit_identity)
        GitOps-->>Publisher: Push complete
        Publisher-->>CLI: Publish success
    end
    
    alt Not skip-pinecone
        CLI->>CLI: run_cppa_pinecone_sync(...)
    end
    CLI-->>User: Sync + export + publish complete

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

enhancement, new-feature, major-refactor, multi-subsystem

Suggested reviewers

  • jonathanMLDev

Poem

🐰 Hop along, new trackers bloom bright,
YouTube and papers in data's delight,
Clang GitHub now dances with DB in sync,
Link-header pagination—what a wink!
Publish and push, the pipeline takes flight! 🚀



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
clang_github_tracker/tests/test_commands.py (1)

50-67: ⚠️ Potential issue | 🟡 Minor

Make the alias test assert the parsed bounds.

Lines 57-66 only prove that some "Resolved:" log line was emitted. If --from-date / --to-date were accidentally ignored and the command fell back to DB-derived defaults, this test would still pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/tests/test_commands.py` around lines 50 - 67, The test
test_run_clang_github_tracker_since_until_aliases should assert the actual
parsed bounds rather than just that a "Resolved:" message exists; update the
test to either (A) inspect caplog.records for the specific resolved start/end
strings (e.g. look for "Resolved: since=2024-01-01" and "until=2024-06-30" in
the log record produced by the command) or (B) remove "--dry-run" and assert
sync_clang_github_activity (sync_mock) was called with the expected keyword args
(e.g. since="2024-01-01", until="2024-06-30") by checking sync_mock.call_args;
modify test_run_clang_github_tracker_since_until_aliases accordingly to assert
those exact parsed bounds.
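
For illustration, a minimal sketch of option (B); the patch target, flag spelling, and the keyword names asserted here are assumptions about the command module rather than its confirmed interface:

```python
from unittest import mock

from django.core.management import call_command


@mock.patch(
    # Hypothetical patch target; use the path the command actually imports from.
    "clang_github_tracker.management.commands.run_clang_github_tracker.sync_clang_github_activity"
)
def test_since_until_aliases_pass_parsed_bounds(sync_mock):
    call_command(
        "run_clang_github_tracker",
        "--since=2024-01-01",
        "--until=2024-06-30",
    )
    # Assert the resolved bounds instead of only checking that a "Resolved:" log line exists.
    _, kwargs = sync_mock.call_args
    assert kwargs["since"] == "2024-01-01"
    assert kwargs["until"] == "2024-06-30"
```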
🟠 Major comments (23)
requirements.txt-27-28 (1)

27-28: ⚠️ Potential issue | 🟠 Major

Address duplicate dependency declarations throughout requirements.txt:

Verification confirmed beautifulsoup4 is duplicated at Lines 13 (>=4.12) and 28 (>=4.12.0). Additionally, three other packages have duplicate entries with conflicting version constraints: langchain-core (0.2 vs 0.1), langchain-text-splitters (0.2 vs 0.0.1), and pinecone (5.0 vs 3.0). These conflicting constraints risk dependency resolution failures. Remove all duplicates and consolidate to single, canonical entries for each package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 27 - 28, requirements.txt contains duplicate
and conflicting entries for beautifulsoup4, langchain-core,
langchain-text-splitters, and pinecone; remove all duplicate lines and
consolidate each package to a single canonical constraint (prefer the highest
compatible/stable version for each package, e.g., choose one of the
langchain-core constraints and one of langchain-text-splitters constraints
rather than keeping both 0.2 and 0.1/0.0.1 entries, and pick the appropriate
pinecone version) so that each package appears only once; update the file to a
single entry for beautifulsoup4, langchain-core, langchain-text-splitters, and
pinecone, and verify no other duplicates remain (and regenerate any lockfile or
dependency pinning if you use one).
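
For illustration only, a consolidated set of entries might look like the following; the specific version pins are assumptions and should be chosen to match what the code actually imports:

```text
# requirements.txt (consolidated; versions are illustrative)
beautifulsoup4>=4.12
langchain-core>=0.2
langchain-text-splitters>=0.2
pinecone>=5.0
```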
core/utils/boost_version_operations.py-59-83 (1)

59-83: ⚠️ Potential issue | 🟠 Major

Reject malformed versions and canonicalize normalization through the strict parser.

These two helpers now disagree on what a valid Boost version looks like. parse_boost_version_string() silently accepts 1.2.3.4 as (1, 2, 3), while normalize_boost_version_string() returns malformed/non-canonical values such as boost.1.2.3 for boost_1_2_3 and Boost.1.2.3 for Boost-1.2.3. Because boost_usage_tracker.detect_boost_version_in_repo() now trusts this shared normalizer, those bad values can leak into persisted version fields and break comparison/sorting downstream.

Proposed fix
 def parse_boost_version_string(version_str: str) -> tuple[int, int, int] | None:
@@
     s = _VERSION_STRIP_PREFIX.sub("", str(version_str).strip())
     s = s.replace("_", ".")
     parts = s.split(".")
-    if not parts or not parts[0].strip():
+    if not 1 <= len(parts) <= 3 or any(not part.strip() for part in parts):
         return None
@@
 def normalize_boost_version_string(version_str: str) -> str | None:
@@
-    version = (version_str or "").strip().replace("boost-", "")
-    version = version.replace("-", ".").replace("_", ".")
-    if not version or version.startswith("0."):
+    triple = parse_boost_version_string(version_str)
+    if triple is None:
         return None
-    if len(version.split(".")) == 2:
-        version = f"{version}.0"
-    return version
+    major, minor, patch = triple
+    if major == 0:
+        return None
+    return f"{major}.{minor}.{patch}"

Also applies to: 123-136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/utils/boost_version_operations.py` around lines 59 - 83,
parse_boost_version_string currently tolerates extra segments (e.g. "1.2.3.4")
and normalize_boost_version_string emits non-canonical prefixes (e.g.
"boost.1.2.3" or "Boost.1.2.3"); update both to use a single strict parsing
flow: have normalize_boost_version_string call parse_boost_version_string (or
share a single canonicalizer) and reject any input with more than three numeric
segments or non-numeric segments, only returning canonical "major.minor.patch"
(with underscores/hyphens/prefixes stripped via _VERSION_STRIP_PREFIX before
parsing), and ensure parse_boost_version_string returns None for malformed
inputs so no malformed values are produced or persisted by
boost_usage_tracker.detect_boost_version_in_repo.
Dockerfile-39-39 (1)

39-39: ⚠️ Potential issue | 🟠 Major

Git safe.directory wildcard syntax requires Git 2.46+.

The wildcard pattern /app/workspace/* in safe.directory is not supported in Git versions prior to 2.46. If the base image contains Git 2.35–2.45, the pattern will be treated as a literal path and fail to match mounted repositories. Either verify the base image includes Git 2.46 or later, or use safe.directory '*' to trust all directories (less restrictive).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` at line 39, The Dockerfile currently adds a Git safe.directory
with a wildcard using the RUN instruction "git config --system --add
safe.directory '/app/workspace/*'", which requires Git >=2.46; update the
Dockerfile so it either ensures the base image installs Git 2.46+ before this
RUN (upgrade Git) or replace the unsafe wildcard pattern with a supported
alternative such as "git config --system --add safe.directory '*'" (to trust all
directories) or explicitly add the concrete paths you expect, and keep the
change scoped to the same RUN that performs the git config to ensure
compatibility.
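
A minimal sketch of the broader-trust alternative, assuming the config stays in the same layer that already installs git; the per-path wildcard form would instead require Git 2.46+:

```dockerfile
# Trust all directories for git operations inside the container.
# Works on Git releases older than 2.46, unlike the '/app/workspace/*' wildcard.
RUN git config --system --add safe.directory '*'
```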
core/utils/datetime_parsing.py-19-20 (1)

19-20: ⚠️ Potential issue | 🟠 Major

Replace django_timezone.utc with timezone.utc — required for Django 5.0 compatibility.

Lines 19–20 use the deprecated django.utils.timezone.utc alias, which was removed in Django 5.0. The stdlib timezone from datetime is already imported at line 5, so use that instead.

Proposed fix
-        return django_timezone.make_aware(dt, django_timezone.utc)
-    return dt.astimezone(django_timezone.utc)
+        return django_timezone.make_aware(dt, timezone.utc)
+    return dt.astimezone(timezone.utc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/utils/datetime_parsing.py` around lines 19 - 20, Replace uses of the
removed django_timezone.utc alias with the stdlib timezone.utc in the datetime
parsing logic: update the calls in the function that currently return
django_timezone.make_aware(dt, django_timezone.utc) and
dt.astimezone(django_timezone.utc) to use timezone.utc (the datetime.timezone
imported at top) so the function (e.g., the aware/astimezone branches in
core/utils/datetime_parsing.py) is compatible with Django 5.0.
boost_library_docs_tracker/fetcher.py-323-342 (1)

323-342: ⚠️ Potential issue | 🟠 Major

Use path-segment checks instead of raw substring matching for library scope.

Line 339 checks lib_segment via substring on the whole URL, which can admit out-of-scope links (false positives) when the token appears elsewhere in the path/query.

🛠️ Proposed fix
-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse
-                # Stay within this library's doc subtree (path contains lib segment)
-                if lib_segment not in abs_url:
+                # Stay within this library's doc subtree using path segments
+                path_parts = [p for p in urlparse(abs_url).path.split("/") if p]
+                if lib_segment not in path_parts:
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 323 - 342, The current
check "if lib_segment not in abs_url" uses a raw substring match on the full URL
and yields false positives; change this to parse abs_url (e.g., via
urllib.parse.urlparse) and test lib_segment against the path segments only
(split the parsed.path on "/" and check membership), ignoring fragments/queries;
update the check in the link discovery loop that currently references
lib_segment and abs_url so only URLs whose path contains lib_segment as a
distinct segment are allowed (still keep the existing base_url and visited/queue
checks).
boost_library_docs_tracker/fetcher.py-79-81 (1)

79-81: ⚠️ Potential issue | 🟠 Major

Validate existing ZIP before skipping download.

Lines 79-81 currently skip network fetch on mere file existence. A truncated/corrupt local zip will be treated as valid and can break downstream extraction.

🛠️ Proposed fix
-    if zip_path.exists():
-        logger.info("Source zip already present, skipping download: %s", zip_path)
-        return zip_path
+    if zip_path.exists():
+        if zipfile.is_zipfile(zip_path):
+            logger.info("Source zip already present, skipping download: %s", zip_path)
+            return zip_path
+        logger.warning("Existing zip is invalid; re-downloading: %s", zip_path)
+        zip_path.unlink(missing_ok=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 79 - 81, The code
currently returns early if zip_path.exists() which risks using a
truncated/corrupt archive; change the logic in the fetcher block that checks
zip_path to validate the archive (e.g., use zipfile.is_zipfile or attempt to
open/test the ZipFile) before returning; if validation fails, log a warning via
logger (including zip_path), remove or rename the bad file, and proceed to
download the zip as originally intended so downstream extraction uses a verified
archive.
boost_mailing_list_tracker/preprocessor.py-123-123 (1)

123-123: ⚠️ Potential issue | 🟠 Major

source_ids value no longer matches retry selector.

Line 123 emits message.pk, but retry selection still uses msg_id (Line 96) and the function contract describes failed_ids as msg_id values. This can break failed-item retries.

🔧 Proposed fix
         metadata: dict[str, Any] = {
             "doc_id": msg_id,
             "type": "mailing",
             "thread_id": message.thread_id or "",
             "subject": message.subject or "",
             "author": sender_name,
             "timestamp": safe_timestamp,
             "parent_id": message.parent_id or "",
-            "source_ids": str(message.pk),
+            "source_ids": msg_id,
             "list_name": message.list_name or "",
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_mailing_list_tracker/preprocessor.py` at line 123, The emitted
"source_ids" currently uses message.pk but the retry selector and the function
contract expect msg_id values; update the emitter so "source_ids" is set to the
message's msg_id (e.g., str(message.msg_id)) instead of message.pk to keep
failed_ids and retry selection consistent with the contract (ensure any
surrounding code that parses failed_ids expects the same string format).
cppa_youtube_script_tracker/workspace.py-63-75 (1)

63-75: ⚠️ Potential issue | 🟠 Major

Validate video_id/lang before composing filesystem paths.

Line 65, Line 70, and Line 75 directly interpolate identifiers into paths. If a separator or traversal token slips in, writes can escape the workspace subtree.

🔧 Proposed fix
+def _validate_path_component(value: str, field: str) -> str:
+    value = (value or "").strip()
+    if not value or "/" in value or "\\" in value or ".." in value:
+        raise ValueError(f"Invalid {field}: {value!r}")
+    return value
+
 def get_metadata_queue_path(video_id: str) -> Path:
     """Return workspace/cppa_youtube_script_tracker/metadata/{video_id}.json."""
-    return get_metadata_queue_dir() / f"{video_id}.json"
+    safe_video_id = _validate_path_component(video_id, "video_id")
+    return get_metadata_queue_dir() / f"{safe_video_id}.json"
 
 def get_raw_metadata_path(video_id: str) -> Path:
     """Return workspace/raw/cppa_youtube_script_tracker/metadata/{video_id}.json."""
-    return get_raw_metadata_dir() / f"{video_id}.json"
+    safe_video_id = _validate_path_component(video_id, "video_id")
+    return get_raw_metadata_dir() / f"{safe_video_id}.json"
 
 def get_transcript_path(video_id: str, lang: str = "en") -> Path:
     """Return workspace/raw/cppa_youtube_script_tracker/transcripts/{video_id}.{lang}.vtt."""
-    return get_raw_transcripts_dir() / f"{video_id}.{lang}.vtt"
+    safe_video_id = _validate_path_component(video_id, "video_id")
+    safe_lang = _validate_path_component(lang, "lang")
+    return get_raw_transcripts_dir() / f"{safe_video_id}.{safe_lang}.vtt"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_youtube_script_tracker/workspace.py` around lines 63 - 75, The three
path helpers (get_metadata_queue_path, get_raw_metadata_path,
get_transcript_path) currently interpolate video_id and lang directly; validate
and sanitize these inputs before composing paths by rejecting any value
containing path separators or traversal tokens (e.g. "/", "\\", "..") and
enforce a safe pattern (e.g. alphanumeric plus limited chars like "-" and "_"
for video_id and a strict lang pattern such as two-letter or [a-zA-Z0-9_-]+); if
validation fails raise ValueError. Apply this validation inside each helper (or
extract a small sanitizer used by them) so the returned Path always uses only
the sanitized basename components.
cppa_user_tracker/migrations/0007_youtubespeaker_external_id.py-26-31 (1)

26-31: ⚠️ Potential issue | 🟠 Major

Bound generated external_id length before saving.

Generated values can exceed 255 chars for long names, which can fail this migration before uniqueness is enforced.

Proposed fix
-        slug = _slugify_speaker_name(speaker.display_name)
-        candidate = f"youtube:name:{slug}"
-        if candidate in used:
-            candidate = f"{candidate}:{speaker.baseprofile_ptr_id}"
+        slug = _slugify_speaker_name(speaker.display_name)
+        prefix = "youtube:name:"
+        max_slug_len = 255 - len(prefix)
+        slug = slug[:max_slug_len]
+        candidate = f"{prefix}{slug}"
+        if candidate in used:
+            suffix = f":{speaker.baseprofile_ptr_id}"
+            candidate = f"{prefix}{slug[: max_slug_len - len(suffix)]}{suffix}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/migrations/0007_youtubespeaker_external_id.py` around lines
26 - 31, The migration can generate external_id strings longer than 255 chars;
before assigning speaker.external_id and calling
speaker.save(update_fields=["external_id"]), truncate the generated candidate to
a safe length (<=255) — if you append the uniqueness suffix
f":{speaker.baseprofile_ptr_id}" ensure you reserve enough characters for that
suffix by truncating the slug portion produced by _slugify_speaker_name
accordingly so the final candidate (including any appended suffix) is at most
255 characters; update the logic around candidate creation and the branch that
adds the id to guarantee the saved external_id never exceeds 255.
clang_github_tracker/preprocessors/pr_preprocessor.py-70-72 (1)

70-72: ⚠️ Potential issue | 🟠 Major

Guard document building so one bad PR payload doesn’t abort the whole run.

build_pr_document(...) is currently unguarded. If one record raises, preprocessing exits early and skips all remaining candidates.

Proposed fix
-        doc = build_pr_document(path, data, repo)
-        if doc:
-            documents.append(doc)
+        try:
+            doc = build_pr_document(path, data, repo)
+        except (KeyError, TypeError, ValueError) as e:
+            logger.warning("preprocess pr #%s: build failed %s", number, e)
+            continue
+        if doc:
+            documents.append(doc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/preprocessors/pr_preprocessor.py` around lines 70 - 72,
Wrap the call to build_pr_document(...) in a try/except inside the loop that
populates documents so a single failing PR payload doesn't abort the run: catch
Exception around the build_pr_document(path, data, repo) invocation in
pr_preprocessor.py (the loop that appends to documents), log the error
(including exception details) and continue to the next record instead of
re-raising, ensuring only valid docs are appended to documents.
core/management/commands/send_startup_notification.py-91-105 (1)

91-105: ⚠️ Potential issue | 🟠 Major

Validate webhook URLs before urlopen().

Both helpers trust the configured URL scheme and hand it straight to request.urlopen(). A bad DISCORD_WEBHOOK_URL / SLACK_WEBHOOK_URL value can invoke file: or another unintended scheme from the deploy container. Fail fast unless the URL is HTTPS.

Suggested fix
 from urllib import request
 from urllib.error import URLError
+from urllib.parse import urlparse
 ...
+def _validate_webhook_url(webhook_url: str) -> str:
+    parsed = urlparse(webhook_url)
+    if parsed.scheme != "https" or not parsed.netloc:
+        raise ValueError("Webhook URL must be an https URL")
+    return webhook_url
+
+
 def post_discord(webhook_url: str, title: str, description: str) -> None:
+    webhook_url = _validate_webhook_url(webhook_url)
     embed = {
         "title": title,
         "description": description[:4000],
         "color": 0x3498DB,
         "timestamp": datetime.now(timezone.utc).isoformat(),
@@
 def post_slack(webhook_url: str, title: str, text: str) -> None:
+    webhook_url = _validate_webhook_url(webhook_url)
     blocks = [
         {
             "type": "header",
             "text": {"type": "plain_text", "text": title, "emoji": True},
         },

Also applies to: 110-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/management/commands/send_startup_notification.py` around lines 91 - 105,
The post_discord function currently passes webhook_url straight to
request.urlopen; validate the URL first by parsing it (e.g.,
urllib.parse.urlparse) and ensure scheme is 'https' and netloc is present before
calling urlopen, failing fast (raise ValueError or log and return) on invalid or
non-HTTPS schemes; apply the same validation to the corresponding Slack helper
(the function handling lines ~110-129) so both only allow secure https webhook
URLs.
core/management/commands/send_startup_notification.py-205-208 (1)

205-208: ⚠️ Potential issue | 🟠 Major

Don’t make notification outages fail an otherwise healthy deploy.

This command now runs after the health check, so sys.exit(1) turns a transient Slack/Discord problem into a failed deployment even when the stack is up. That is a pretty expensive failure mode for a best-effort notification path. Prefer logging and returning success, or gate the hard failure behind an explicit strict setting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/management/commands/send_startup_notification.py` around lines 205 -
208, The current notification command calls sys.exit(1) when notification errors
exist (errors list logged via logger.error), which will fail deployments; change
it to log each error and return normally (or exit 0) so notification outages
don’t mark the run as failed. Replace the sys.exit(1) call in the
send_startup_notification command/management command with a non-fatal path (just
logging and return), and if you need strict behavior add an explicit
flag/setting (e.g., --strict or NOTIFICATIONS_STRICT) that, when enabled, will
still call sys.exit(1); reference the errors list, logger.error calls, and the
sys.exit(1) site when making the change.
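
A hypothetical sketch of that opt-in strict gate; the flag name, the _send_notifications helper, and its return shape are illustrative rather than the command's actual structure:

```python
import logging
import sys

from django.core.management.base import BaseCommand

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Send startup health notification (best-effort unless --strict is passed)."

    def add_arguments(self, parser):
        parser.add_argument(
            "--strict",
            action="store_true",
            default=False,
            help="Exit non-zero if any notification fails.",
        )

    def handle(self, *args, **options):
        errors = self._send_notifications()
        for err in errors:
            logger.error("Startup notification failed: %s", err)
        if errors and options["strict"]:
            sys.exit(1)  # hard failure only when explicitly requested
        # Default path: log and return success so a webhook outage never fails the deploy.

    def _send_notifications(self) -> list[str]:
        # Placeholder for the existing Slack/Discord posting logic; returns
        # human-readable error strings for any webhook that could not be reached.
        return []
```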
cppa_youtube_script_tracker/preprocessor.py-66-73 (1)

66-73: ⚠️ Potential issue | 🟠 Major

This defeats the prefetch and re-queries per video.

_build_candidate_queryset() already prefetches video_speakers__speaker, but _get_speaker_names() builds a fresh queryset with select_related(...).values_list(...). That bypasses the prefetched cache and adds one SQL query per video during Pinecone preprocessing. Read from video.video_speakers.all() instead and sort in Python.

Suggested fix
 def _get_speaker_names(video: YouTubeVideo) -> list[str]:
     """Return a sorted list of speaker display_names linked to this video."""
-    names = list(
-        video.video_speakers.select_related("speaker")
-        .values_list("speaker__display_name", flat=True)
-        .order_by("speaker__display_name")
-    )
-    return [n for n in names if n]
+    names = [
+        rel.speaker.display_name
+        for rel in video.video_speakers.all()
+        if getattr(rel, "speaker", None) and rel.speaker.display_name
+    ]
+    return sorted(names)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_youtube_script_tracker/preprocessor.py` around lines 66 - 73,
_get_speaker_names is bypassing the prefetched video_speakers by issuing a new
queryset (select_related/values_list), causing an extra SQL query per video;
modify _get_speaker_names to iterate over the already-prefetched related set
(use video.video_speakers.all() or video.video_speakers iterator), extract
speaker.display_name from each prefetched speaker object, filter out falsy
names, sort the resulting list in Python, and return that sorted list so it
works with the prefetch done in _build_candidate_queryset.
cppa_youtube_script_tracker/migrations/0001_initial.py-72-75 (1)

72-75: ⚠️ Potential issue | 🟠 Major

Use bigint counters for YouTube metrics.

view_count, like_count, and comment_count are IntegerFields here. view_count can exceed 2.1B on real channels, which will overflow or reject inserts on 32-bit integer columns. These should be PositiveBigIntegerFields before this migration lands.

Suggested fix
-                ("view_count", models.IntegerField(blank=True, null=True)),
-                ("like_count", models.IntegerField(blank=True, null=True)),
-                ("comment_count", models.IntegerField(blank=True, null=True)),
+                ("view_count", models.PositiveBigIntegerField(blank=True, null=True)),
+                ("like_count", models.PositiveBigIntegerField(blank=True, null=True)),
+                ("comment_count", models.PositiveBigIntegerField(blank=True, null=True)),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_youtube_script_tracker/migrations/0001_initial.py` around lines 72 - 75,
The migration defines YouTube metric columns using models.IntegerField which
will overflow for large values; update the three fields ("view_count",
"like_count", "comment_count") in migrations/0001_initial.py to use
models.PositiveBigIntegerField (preserving blank=True and null=True) instead of
models.IntegerField so the DB columns are created as bigint unsigned-capable
types; also ensure the migration imports or references
models.PositiveBigIntegerField correctly so the generated schema uses big
integer columns for these counters.
cppa_user_tracker/services.py-372-388 (1)

372-388: ⚠️ Potential issue | 🟠 Major

This falls through to create() for existing authors too often.

With a provided email, the loop only returns on an exact match. A single existing display_name with a different/new email still drops into WG21PaperAuthorProfile.objects.create(...), even though the docstring says “If one exists, returns it” and “otherwise returns the first.” That will fragment the same author across multiple profiles on repeated imports.

Suggested fix
     candidates = list(
         WG21PaperAuthorProfile.objects.filter(display_name=display_name_val).order_by(
             "id"
         )
     )

-    # Disambiguate by email if provided.
-    for p in candidates:
-        if email_val and p.emails.filter(email=email_val).exists():
-            return p, False
-        elif not email_val and not p.emails.exists():
-            return p, False
+    if candidates:
+        if email_val:
+            for p in candidates:
+                if p.emails.filter(email=email_val).exists():
+                    return p, False
+        return candidates[0], False

     profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val)
     if email_val:
         add_email(profile, email_val, is_primary=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/services.py` around lines 372 - 388, The current logic
creates a new WG21PaperAuthorProfile whenever no exact email match is found,
even if candidates (same display_name) exist; update the flow so that after
iterating candidates you do not fall through to
WG21PaperAuthorProfile.objects.create when candidates is non-empty—instead
select the first candidate (candidates[0]) and return it (and if email_val is
provided, call add_email(profile, email_val, ...) to attach the address to that
existing profile); keep creating only when candidates is empty. Ensure you still
use the emails.filter(email=...) check to detect exact matches before this
fallback.
cppa_pinecone_sync/sync.py-206-249 (1)

206-249: ⚠️ Potential issue | 🟠 Major

Metadata-update failures are dropped from retry bookkeeping.

attempted_source_ids and new_failed_ids only come from the upsert path. On a metadata-only run, or when update_documents() returns errors, this code still clears the fail list and advances sync status, so those metadata refreshes won't be retried.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_pinecone_sync/sync.py` around lines 206 - 249, The code only derives
attempted_source_ids and new_failed_ids from the upsert path, so metadata-only
runs or metadata update failures are cleared from retry bookkeeping; fix by
including metadata update results in the bookkeeping: when calling
update_documents (function update_documents) capture its attempted IDs and
failed IDs (from update_result), merge those attempted IDs into
attempted_source_ids (or create a union) and merge failed IDs into
new_failed_ids before the transaction block, and then use that combined
attempted_source_ids/new_failed_ids when calling services.clear_failed_ids and
services.record_failed_ids and when deciding whether to advance sync status
(services.update_sync_status). Ensure this change references the existing
symbols upsert_documents, update_documents, meta_documents, metas_to_update,
attempted_source_ids, new_failed_ids, update_result, result,
services.clear_failed_ids, services.record_failed_ids, and
services.update_sync_status.
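
A sketch of the merge under the assumption that both upsert_documents() and update_documents() expose attempted and failed source IDs; the attribute names below (attempted_ids, failed_ids) are illustrative, not the module's real return shape:

```python
def merge_sync_bookkeeping(upsert_result, update_result) -> tuple[set[str], set[str]]:
    """Combine upsert and metadata-update outcomes so both feed retry state."""
    attempted: set[str] = set()
    failed: set[str] = set()
    for result in (upsert_result, update_result):
        if result is None:  # e.g. a metadata-only run has no upsert result
            continue
        attempted |= set(result.attempted_ids)
        failed |= set(result.failed_ids)
    return attempted, failed


# Usage idea: clear only IDs that were attempted and succeeded, re-record failures,
# and advance the sync status only when nothing failed.
# attempted_source_ids, new_failed_ids = merge_sync_bookkeeping(result, update_result)
# services.clear_failed_ids(attempted_source_ids - new_failed_ids)
# services.record_failed_ids(new_failed_ids)
```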
wg21_paper_tracker/pipeline.py-212-216 (1)

212-216: ⚠️ Potential issue | 🟠 Major

Don't pass "0" into the mailing fetcher.

_parse_mailing_year() deliberately returns 0 for missing/invalid years, but fetch_papers_for_mailing(str(year), mailing_date) uses that sentinel as a real fetch parameter. For those mailings, the fetch URL/path becomes wrong, so retries can never recover.

Proposed fix
     year = _parse_mailing_year(m_info)
+    fetch_year = mailing_date[:4] if year == 0 else str(year)
     mailing_obj, _ = get_or_create_mailing(mailing_date, title)
 
-    papers = fetch_papers_for_mailing(str(year), mailing_date)
+    papers = fetch_papers_for_mailing(fetch_year, mailing_date)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 212 - 216, _parse_mailing_year
can return 0 for missing/invalid years, so avoid passing the string "0" into
fetch_papers_for_mailing; update the call site in pipeline.py to check if year
== 0 and, in that case, call fetch_papers_for_mailing with a null/absent year
(e.g., pass None or remove the year argument) instead of str(year) so the
fetcher receives a clear "no year" signal (refer to _parse_mailing_year,
fetch_papers_for_mailing, year, and mailing_date).
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py-34-37 (1)

34-37: ⚠️ Potential issue | 🟠 Major

Tighten YYYY-MM validation before writing mailings.

^\d{4}-\d{2}$ accepts impossible values like 2025-00 and 2025-13. Those rows currently become real WG21Mailing keys, which can poison later ordering and incremental imports.

Proposed fix
-MAILING_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}$")
+MAILING_DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 34 - 37, MAILING_DATE_PATTERN currently allows invalid months;
tighten validation by changing the pattern from r"^\d{4}-\d{2}$" to
r"^\d{4}-(0[1-9]|1[0-2])$" or, alternatively, validate with datetime.strptime
when creating WG21Mailing keys; update the code that uses MAILING_DATE_PATTERN
(the import/validation path that writes WG21Mailing keys) to reject or
substitute PLACEHOLDER_MAILING_DATE for any date that fails the stricter check
so only YYYY-01..12 become real mailing keys.
cppa_youtube_script_tracker/models.py-54-56 (1)

54-56: ⚠️ Potential issue | 🟠 Major

Use BigIntegerField for YouTube engagement counters.

YouTube videos regularly exceed the 2^31-1 (≈2.1 billion) limit of a 32-bit signed integer. As of April 2026, over 50 videos have surpassed 2 billion views—including "Baby Shark Dance" at 16.85 billion and "Despacito" at 8.99 billion. Using IntegerField will cause data truncation or insertion failures for these videos. Switch to BigIntegerField to accommodate the full range of engagement metrics:

-    view_count = models.IntegerField(null=True, blank=True)
-    like_count = models.IntegerField(null=True, blank=True)
-    comment_count = models.IntegerField(null=True, blank=True)
+    view_count = models.BigIntegerField(null=True, blank=True)
+    like_count = models.BigIntegerField(null=True, blank=True)
+    comment_count = models.BigIntegerField(null=True, blank=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_youtube_script_tracker/models.py` around lines 54 - 56, The three
engagement fields view_count, like_count, and comment_count currently use
models.IntegerField which can overflow for large YouTube metrics; change each to
models.BigIntegerField while preserving null=True and blank=True (i.e., replace
IntegerField with BigIntegerField for view_count, like_count, comment_count in
models.py) and then create and apply a Django migration (makemigrations and
migrate) to update the database schema.
clang_github_tracker/services.py-97-99 (1)

97-99: ⚠️ Potential issue | 🟠 Major

Validate SHA contents, not just length.

A 40-character non-hex string currently passes both paths and can be stored as a commit key. That breaks the stated 40-hex contract and weakens data integrity.

Suggested fix
+import re
@@
 logger = logging.getLogger(__name__)
 
 DEFAULT_UPSERT_BATCH_SIZE = 500
+_SHA40_RE = re.compile(r"^[0-9a-f]{40}$")
@@
 def upsert_commit(
     sha: str,
     *,
     github_committed_at: datetime | None,
 ) -> tuple[ClangGithubCommit, bool]:
@@
     sha_clean = (sha or "").strip().lower()
-    if len(sha_clean) != 40:
+    if not _SHA40_RE.fullmatch(sha_clean):
         raise ValueError(f"commit sha must be 40 hex chars, got {sha_clean!r}")
@@
     for sha, dt in rows:
         s = (sha or "").strip().lower()
-        if len(s) != 40:
+        if not _SHA40_RE.fullmatch(s):
             continue

Also applies to: 178-181

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/services.py` around lines 97 - 99, The current check
only enforces length for sha_clean and allows non-hex characters; update the
validation to ensure sha_clean is exactly 40 lowercase hex chars (e.g. use a
full-match against [0-9a-f]{40} or attempt hex decode) before raising the
ValueError. Apply this stronger validation in both places where sha_clean is
used (the block that currently sets sha_clean and raises ValueError and the
second occurrence around lines 178-181), keep the same error message but only
raise it when the string fails the hex-content check as well as length.
clang_github_tracker/services.py-244-248 (1)

244-248: ⚠️ Potential issue | 🟠 Major

Guard non-positive batch_size in upsert_issue_items_batch.

Unlike upsert_commits_batch, this helper never normalizes batch_size. 0 will crash at Line 277, and a negative value makes the loop skip every row.

Suggested fix
 def upsert_issue_items_batch(
     rows: Sequence[tuple[int, bool, datetime | None, datetime | None]],
     *,
     batch_size: int = DEFAULT_UPSERT_BATCH_SIZE,
 ) -> tuple[int, int]:
@@
     Returns:
         (inserted, updated) counts across all batches.
     """
+    if batch_size <= 0:
+        logger.warning(
+            "batch_size must be positive, using %s", DEFAULT_UPSERT_BATCH_SIZE
+        )
+        batch_size = DEFAULT_UPSERT_BATCH_SIZE
     merged: dict[int, tuple[bool, datetime | None, datetime | None]] = {}

Also applies to: 277-278

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/services.py` around lines 244 - 248, The
upsert_issue_items_batch helper doesn't normalize or guard batch_size, so a
value of 0 or negative will crash or skip rows; modify upsert_issue_items_batch
to normalize batch_size at the start (similar to upsert_commits_batch) by
ensuring it's a positive integer (e.g., set batch_size = max(1, int(batch_size))
or default to DEFAULT_UPSERT_BATCH_SIZE when non-positive) before the loop that
slices rows, so the loop and slicing at lines referencing batch_size cannot fail
or skip items.
wg21_paper_tracker/services.py-142-150 (1)

142-150: ⚠️ Potential issue | 🟠 Major

Make author replacement atomic and distinguish [] from None.

if author_names: means an explicit empty list never clears stale authors, and the delete/recreate sequence runs outside the transaction used for the paper upsert. If one insert fails midway, the paper is left with a partially rewritten author set.

Suggested fix
-    if author_names:
-        if not created:
-            for author in paper.authors.all():
-                author.delete()
-        emails = author_emails or []
-        for i, name in enumerate(author_names):
-            email = emails[i] if i < len(emails) else None
-            profile, _ = get_or_create_wg21_paper_author_profile(name, email=email)
-            get_or_create_paper_author(paper, profile, i + 1)
+    if author_names is not None:
+        with transaction.atomic():
+            if not created:
+                paper.authors.all().delete()
+            emails = author_emails or []
+            for i, name in enumerate(author_names):
+                email = emails[i] if i < len(emails) else None
+                profile, _ = get_or_create_wg21_paper_author_profile(name, email=email)
+                get_or_create_paper_author(paper, profile, i + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/services.py` around lines 142 - 150, The current logic
uses "if author_names:" which treats an explicit empty list as falsy and
therefore never clears existing authors, and it performs delete/recreate outside
the paper upsert transaction; change this to explicitly check "if author_names
is not None" so [] clears authors, and perform the delete-and-create sequence
inside the same atomic transaction used for the paper upsert (e.g., use
transaction.atomic() or the existing upsert transaction scope) so that deleting
existing authors (paper.authors.all().delete()) and creating new author rows via
get_or_create_wg21_paper_author_profile(...) and
get_or_create_paper_author(paper, profile, position) are all committed or rolled
back together; also compute emails = author_emails or [] and map by index as
before but keep the explicit None-vs-empty distinction.
config/settings.py-338-361 (1)

338-361: ⚠️ Potential issue | 🟠 Major

Add fallback for legacy Slack token environment variables.

The token parsing functions at lines 360–361 now build token maps exclusively from SLACK_TEAM_IDS and per-team env vars (e.g., SLACK_BOT_TOKEN_<team_id>). Environments still using the legacy single-team format (SLACK_BOT_TOKEN + SLACK_TEAM_ID env vars) will load empty credentials and fail at startup with "No teams configured." Keep a compatibility fallback to check the legacy format if the new format yields no tokens.

Suggested fix
def _slack_per_team_tokens_from_env(env_key_prefix: str):
    """
    Build team_id -> token from SLACK_TEAM_IDS and ``{prefix}_{team_id}`` env vars
    (e.g. prefix SLACK_BOT_TOKEN → SLACK_BOT_TOKEN_T123).
    """
    out = {}
    for tid in _slack_team_ids_from_env():
        key = f"{env_key_prefix}_{tid}"
        token = (env(key, default="") or "").strip()
        if token:
            out[tid] = token
+    if out:
+        return out
+
+    legacy_token = (env(env_key_prefix, default="") or "").strip()
+    if legacy_token and SLACK_TEAM_ID:
+        return {SLACK_TEAM_ID: legacy_token}
    return out
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/settings.py` around lines 338 - 361, The current per-team token
builder (_slack_per_team_tokens_from_env) only reads tokens for team ids listed
in _slack_team_ids_from_env and so SLACK_BOT_TOKEN / SLACK_APP_TOKEN mappings
are empty in legacy single-team setups; modify _slack_per_team_tokens_from_env
to, after building out from per-team keys, check if out is empty and a legacy
single-token env exists (env("SLACK_BOT_TOKEN") or env("SLACK_APP_TOKEN"))
together with a legacy SLACK_TEAM_ID (env("SLACK_TEAM_ID")), and if so populate
out with {legacy_team_id: legacy_token} (ensuring you strip whitespace) so
SLACK_BOT_TOKEN and SLACK_APP_TOKEN variables (and SLACK_TEAM_ID) serve as
fallbacks when the new per-team keys produce no entries; keep references to
_slack_team_ids_from_env, _slack_per_team_tokens_from_env, SLACK_BOT_TOKEN and
SLACK_APP_TOKEN.
🟡 Minor comments (11)
core/utils/text_processing.py-125-130 (1)

125-130: ⚠️ Potential issue | 🟡 Minor

Normalize ellipsis after HTML unescape to cover entity inputs.

Lines 125-130 normalize \u2026 before html.unescape, so inputs like &hellip; stay as "…" instead of becoming "...".

🛠️ Proposed fix
     text = (
         text.replace("\xad", "")
         .replace("\u200b", "")
         .replace("\u200c", "")
         .replace("\u200d", "")
         .replace("\xa0", " ")
         .replace("\u2002", " ")
         .replace("\u2003", " ")
-        .replace("\u2026", "...")
         .replace("\u202f", " ")
     )
 
     text = html.unescape(text)
+    text = text.replace("\u2026", "...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/utils/text_processing.py` around lines 125 - 130, The normalization of
Unicode ellipsis and narrow no-break space currently runs before html.unescape
so HTML entities like &hellip; remain as “…” after unescape; move the
.replace("\u2026", "...").replace("\u202f", " ") calls to after the
html.unescape(text) call (i.e., perform html.unescape(text) first, then apply
those .replace calls on the unescaped text variable) so entities are converted
to the normalized forms; update usages of text in the function (the variable
named text and the call to html.unescape) accordingly.
github_activity_tracker/preprocessors/github_preprocess.py-111-113 (1)

111-113: ⚠️ Potential issue | 🟡 Minor

Rename parameter type to avoid shadowing built-in.

The parameter name `type` shadows Python's built-in and generates linter warnings. All call sites (lines 179, 224, 272, 330) use positional arguments, making the rename safe. Update the parameter name to entity_type and the corresponding f-string reference.

♻️ Proposed refactor
-def get_ids_for_pinecone(repo: str, type: Literal["issue", "pr"], number: int) -> str:
+def get_ids_for_pinecone(
+    repo: str, entity_type: Literal["issue", "pr"], number: int
+) -> str:
     """Get the ids for Pinecone from a repo, type, and number."""
-    return f"{repo}:{type}:{number}"
+    return f"{repo}:{entity_type}:{number}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@github_activity_tracker/preprocessors/github_preprocess.py` around lines 111
- 113, Rename the parameter `type` in get_ids_for_pinecone to avoid shadowing
the built-in: change the parameter name to `entity_type` and update the f-string
inside get_ids_for_pinecone to use `{entity_type}`; ensure all call sites that
pass arguments positionally remain unaffected (calls at the noted locations use
positional args so no caller changes needed).
boost_library_docs_tracker/html_to_md.py-12-12 (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Minor: Use HYPHEN-MINUS instead of EN DASH in docstring.

Static analysis flagged an ambiguous character: the – (EN DASH) should be - (HYPHEN-MINUS) for consistency in ASCII documentation.

📝 Proposed fix
-3. _postprocess_markdown – strip residual HTML artefacts, rejoin split lines, then clean_text (unicode/line endings only)
+3. _postprocess_markdown - strip residual HTML artefacts, rejoin split lines, then clean_text (unicode/line endings only)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/html_to_md.py` at line 12, The docstring for
_postprocess_markdown contains an EN DASH character (–) that static analysis
flagged; open the _postprocess_markdown function's docstring and replace the EN
DASH with a standard HYPHEN-MINUS (-) so the comment reads "strip residual HTML
artefacts, rejoin split lines, then clean_text (unicode/line endings only)"
using the ASCII hyphen-minus character.
cppa_pinecone_sync/ingestion.py-564-566 (1)

564-566: ⚠️ Potential issue | 🟡 Minor

Remove table_ids assignment from sync.py; migration to source_ids is incomplete.

Preprocessors correctly populate source_ids, but sync.py line 97 still writes metadata["table_ids"], creating redundant dual representation. The function _extract_source_ids_from_documents (line 119) reads from this table_ids field it just wrote, bypassing the source_ids already present. Remove the table_ids assignment and update sync consumers to read source_ids directly from preprocessor output.
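
A minimal sketch of the direction, assuming the writer builds a per-document metadata dict and the preprocessors have already populated source_ids; aside from the source_ids / table_ids field names and the _extract_source_ids_from_documents name mentioned above, everything here is illustrative rather than the actual sync code:

# Hedged sketch: field names beyond "source_ids"/"table_ids" and the helper
# signatures are assumptions, not the real ingestion/sync implementation.
def build_update(doc_id: str, metadata: dict) -> dict:
    # Previously: metadata["table_ids"] = ...  (legacy dual write, now dropped)
    # Rely solely on the source_ids the preprocessor already set.
    return {
        "id": doc_id,
        "metadata": metadata,
        "track_ids": list(metadata.get("source_ids", [])),
    }


def _extract_source_ids_from_documents(documents: list[dict]) -> set[str]:
    # Read source_ids directly from document metadata instead of table_ids.
    return {
        source_id
        for doc in documents
        for source_id in doc.get("metadata", {}).get("source_ids", [])
    }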

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_pinecone_sync/ingestion.py` around lines 564 - 566, Remove the legacy
"table_ids" write and ensure all consumers use "source_ids": stop assigning
metadata["table_ids"] in the sync writer (the block that builds updates with
metadata and track_ids), change the logic that sets track_ids to use
metadata.get("source_ids") only, and update the consumer function
_extract_source_ids_from_documents to read source_ids directly from document
metadata instead of reading table_ids; also remove any metadata.pop/usage
related to "table_ids" so only source_ids is propagated.
github_activity_tracker/tests/test_fetcher_commits_backward.py-63-108 (1)

63-108: ⚠️ Potential issue | 🟡 Minor

Use multi-item pages in the backward-traversal fixture.

Lines 79-95 only put one commit on each fetched page, so this still passes if the implementation walks pages in the right order but forgets to reverse commits within each page before yielding. Add at least two commits to one non-first page and assert the full oldest→newest sequence.
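
A minimal sketch of the stricter page data, assuming the mocked client returns (items, links) pairs per page as described in the prompt below; the commit payload shape and the expected-order comment are illustrative, not the real fixture:

# Hedged sketch: the (commits, links) tuple shape and commit dict fields are
# assumptions about the mocked GitHub responses, not the actual test fixture.
page_two = (
    [
        {"sha": "sha-newer", "commit": {"committer": {"date": "2024-01-02T00:00:00Z"}}},
        {"sha": "sha-older", "commit": {"committer": {"date": "2024-01-01T00:00:00Z"}}},
    ],
    {},  # no further pagination links
)

# After wiring page_two into rest_request_url_with_all_links.side_effect,
# the assertion should pin the full oldest -> newest sequence, e.g.
#   assert shas == ["sha-older", "sha-newer", <shas from the newer page>]
# so a missing within-page reversal fails the test.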

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@github_activity_tracker/tests/test_fetcher_commits_backward.py` around lines
63 - 108, The test fixture uses single-commit pages so it won't catch a bug
where fetch_commits_from_github fails to reverse commits within a page; update
the side_effect responses for client.rest_request_url_with_all_links to include
at least two commits on a non-first page (for example, make the Page 2 tuple
contain two commit dicts with distinct shas and dates), then update the final
expected sha sequence in the assertion to reflect the full oldest→newest
ordering across commits within pages (and keep the check that
rest_request_url_with_all_links.call_count == 2). Ensure you reference the same
MagicMock calls (client.rest_request_with_all_links and
client.rest_request_url_with_all_links) and the test function name
test_fetch_commits_multiple_pages_backward_traversal when making the change.
wg21_paper_tracker/workspace.py-22-35 (1)

22-35: ⚠️ Potential issue | 🟡 Minor

Reject mismatched year / mailing_date pairs.

Right now get_raw_dir("2024-05", 2025) will happily create .../2025/2024-05. That makes the raw layout depend on caller mistakes and can scatter one mailing across different directories.

🛠️ Proposed fix
 def get_raw_dir(mailing_date: str | None, year: int) -> Path:
     """Return workspace/raw/wg21_paper_tracker/<year>/<mailing_date>/; creates if missing."""
     if mailing_date is not None and not _MAILING_DATE_RE.fullmatch(mailing_date):
         raise ValueError("mailing_date must be in YYYY-MM format")
+    if mailing_date is not None and int(mailing_date[:4]) != year:
+        raise ValueError("year must match mailing_date")
     if getattr(settings, "RAW_DIR", None):
         raw_root = Path(settings.RAW_DIR) / _APP_SLUG
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/workspace.py` around lines 22 - 35, The function
get_raw_dir allows mismatched mailing_date and year (e.g., "2024-05" with
year=2025); modify get_raw_dir to validate that when mailing_date is provided
its year portion matches the year parameter and raise ValueError on mismatch.
After the existing _MAILING_DATE_RE check, parse the year from mailing_date
(e.g., take the YYYY prefix and int() it) and compare it to the year argument;
if they differ raise ValueError("mailing_date year does not match year
argument") so directories like <year>/<mailing_date> remain consistent.
clang_github_tracker/tests/test_commands.py-69-93 (1)

69-93: ⚠️ Potential issue | 🟡 Minor

Assert the forwarded sync values, not just kwarg presence.

This currently passes even if start_commit, start_item, or end_date are wrong or None. Patch the date resolver to return fixed sentinel values and assert those exact values are forwarded to sync_clang_github_activity.

🛠️ Proposed fix
 @pytest.mark.django_db
 def test_run_clang_github_tracker_calls_sync_clang_github_activity_when_not_dry_run(
     caplog,
 ):
     """Without --dry-run, command calls sync_clang_github_activity with start_item."""
-    with patch(
+    start = object()
+    end = object()
+    with patch(
+        "clang_github_tracker.management.commands.run_clang_github_tracker.resolve_start_end_dates",
+        return_value=(start, start, end),
+    ), patch(
         "clang_github_tracker.management.commands.run_clang_github_tracker.sync_clang_github_activity",
         return_value=(0, [], []),
     ) as sync_mock:
         with caplog.at_level(logging.INFO):
             call_command(
@@
     sync_mock.assert_called_once()
     call_kw = sync_mock.call_args[1]
-    assert "start_commit" in call_kw
-    assert "start_item" in call_kw
-    assert "end_date" in call_kw
+    assert call_kw["start_commit"] is start
+    assert call_kw["start_item"] is start
+    assert call_kw["end_date"] is end
     assert "start_issue" not in call_kw
     assert any("commits=" in r.getMessage() for r in caplog.records)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/tests/test_commands.py` around lines 69 - 93, Update the
test_run_clang_github_tracker_calls_sync_clang_github_activity_when_not_dry_run
test to assert exact forwarded values by patching the date resolver used inside
the run_clang_github_tracker management command to return fixed sentinel values
(e.g., SENTINEL_COMMIT, SENTINEL_ITEM, SENTINEL_DATE), then call
call_command(CMD_NAME, ...) as before and assert that sync_clang_github_activity
was called with those exact sentinel values for start_commit, start_item and
end_date (use sync_clang_github_activity from
clang_github_tracker.management.commands.run_clang_github_tracker to inspect
call args); keep the existing check that start_issue is not present and that log
records contain "commits=" unchanged.
wg21_paper_tracker/models.py-12-12 (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Add field-level validation for mailing_date to enforce YYYY-MM format.

The field accepts any 7-character string. While the scraper, workspace helper, and pipeline all validate YYYY-MM format externally, the model has no constraints. Invalid values can slip in through Django admin or direct database access, breaking ordering and workspace path generation.

Use the regex pattern already employed in workspace.py to validate at the model level.

🛠️ Proposed fix
+from django.core.validators import RegexValidator
 from django.db import models
+
+_MAILING_DATE_VALIDATOR = RegexValidator(
+    r"^\d{4}-(0[1-9]|1[0-2])$",
+    "mailing_date must be in YYYY-MM format",
+)
@@
-    mailing_date = models.CharField(max_length=7, unique=True, db_index=True)
+    mailing_date = models.CharField(
+        max_length=7,
+        unique=True,
+        db_index=True,
+        validators=[_MAILING_DATE_VALIDATOR],
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/models.py` at line 12, The mailing_date CharField in
models.py accepts any 7-char string; add a field-level RegexValidator using the
same YYYY-MM regex from workspace.py to enforce format at the model layer (keep
unique=True and db_index=True). Import RegexValidator from
django.core.validators, attach validators=[RegexValidator(regex=… ,
message="Expected YYYY-MM")] to the mailing_date field, and run migrations if
needed so invalid values cannot be saved via admin or direct model use.
cppa_youtube_script_tracker/management/commands/run_cppa_youtube_script_tracker.py-291-291 (1)

291-291: ⚠️ Potential issue | 🟡 Minor

Typo in comment: "unkown" should be "unknown".

📝 Proposed fix
-    # If we discovered a concrete speaker name, remove fallback "unkown" links first.
+    # If we discovered a concrete speaker name, remove fallback "unknown" links first.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cppa_youtube_script_tracker/management/commands/run_cppa_youtube_script_tracker.py`
at line 291, Fix the typo in the inline comment in
run_cppa_youtube_script_tracker.py: change "unkown" to "unknown" in the comment
that reads "# If we discovered a concrete speaker name, remove fallback "unkown"
links first." so the comment correctly reads "unknown" (search for that exact
comment text in the RunCppaYoutubeScriptTracker command or the
run_cppa_youtube_script_tracker module to locate and update it).
docs/Schema.md-915-927 (1)

915-927: ⚠️ Potential issue | 🟡 Minor

Appendix A is stale for the WG21 schema.

The table summary still describes WG21Paper as (paper_id, url, title, publication_date) and does not list WG21Mailing, but Section 7 now documents a (paper_id, year) key plus document_date, mailing_id, subgroup, and is_downloaded. The appendix should match the schema above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/Schema.md` around lines 915 - 927, The Appendix A table summary for WG21
is out of sync: update the WG21-related rows so they match the current schema in
Section 7—replace the outdated WG21Paper signature `(paper_id, url, title,
publication_date)` with the current WG21Paper key `(paper_id, year)` and include
the additional columns `document_date, mailing_id, subgroup, is_downloaded`, and
add a new WG21Mailing entry (or update existing row) reflecting its fields;
ensure the WG21PaperAuthor/related rows still reference the correct FK names
(paper_id, profile_id) so the appendix mirrors the schema in Section 7 exactly.
docs/Schema.md-933-973 (1)

933-973: ⚠️ Potential issue | 🟡 Minor

Appendix B is missing new relationships.

Lines 937-938 still omit BaseProfile → YoutubeSpeaker, and Lines 967-968 omit WG21Mailing → WG21Paper, so the relationship summary no longer matches the diagrams in Sections 1, 7, and 10.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/Schema.md` around lines 933 - 973, Appendix B's relationship table is
missing two rows: add a row for BaseProfile -> YoutubeSpeaker (e.g.,
"BaseProfile | YoutubeSpeaker | May be a speaker / One profile can be a YouTube
speaker") and a row for WG21Mailing -> WG21Paper (e.g., "WG21Mailing | WG21Paper
| Has many / Mailing contains papers"), ensuring the relationship wording
matches the diagrams in Sections 1, 7, and 10 and uses the exact symbols
BaseProfile, YoutubeSpeaker, WG21Mailing, and WG21Paper.

@snowfox1003 snowfox1003 changed the title Sync develop branch to main Sync main branch to develop Apr 29, 2026
@snowfox1003 snowfox1003 merged commit 2ce7167 into main Apr 29, 2026
7 checks passed
