Sync main branch to develop#149
Conversation
add 403 error fix logic in upload folder to github - #134
…ub.com/jonathanMLDev/boost-data-collector into 126-bug-fix-boost-library-docs-tracker
…docs-tracker #126-fixed this app and cppa-pinecone app
#38-add youtube tracker. confirmed that transcripts are downloaded.
… backfill upserts, batch DB helpers, stale-export detection in github_export, safer git clone errors without leaking tokens, and aligned tests including isolated test_settings for publish - #136
…act git clone logs; ignore /nul - #136
…r_watermark as +1 ms - #136
…ted git clone errors - #136
…nup uses export tree - #136
… redaction for clone/pull - #136
…_raw issue-number guards - #136
Clang GitHub tracker: DB-backed sync, markdown publish, chunked backfill (#136)
Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)
📝 Walkthrough
This PR significantly expands the data collector with two new tracker apps (YouTube and WG21 papers), introduces a refactored Clang GitHub tracker with database persistence and publishing, unifies GitHub issue/PR fetching with Link-header pagination, improves Boost tracker publishing flows, adds shared version/text utilities, and updates configurations and documentation to integrate all changes.
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant CLI as WG21 Pipeline
participant Fetcher
participant DB as Django DB
participant Dispatch as GitHub API
participant Webhook as External Webhook
User->>CLI: run_wg21_paper_tracker
Note over CLI: Resolve mailing date range
CLI->>Fetcher: fetch_papers_for_mailing(year, date)
Fetcher-->>CLI: Paper metadata list
loop For each paper
CLI->>DB: get_or_create_paper(...)
DB-->>CLI: (paper, created)
alt Paper created (new)
CLI->>DB: get_or_create_paper_author(...)
DB-->>CLI: Author link
end
end
alt New papers exist
Note over CLI: Collect new paper URLs
CLI->>Dispatch: repository_dispatch(repo, event, payload)
Dispatch-->>Webhook: GitHub event trigger
Dispatch-->>CLI: 202 Accepted
end
CLI-->>User: Pipeline complete
sequenceDiagram
actor User
participant CLI as YouTube Tracker
participant Fetcher as YouTube Fetcher
participant API as YouTube API v3
participant DB as Django DB
participant Transcript as yt-dlp
participant Pinecone as Pinecone Sync
User->>CLI: run_cppa_youtube_script_tracker
Note over CLI: Phase 1: Process metadata queue
loop Metadata queue JSON files
CLI->>DB: Persist video/channel/speaker
DB-->>CLI: (video, created)
alt Processing success
CLI->>CLI: Archive JSON to raw/metadata
end
end
Note over CLI: Phase 2: Fetch new videos
CLI->>Fetcher: fetch_videos(start_date, end_date)
Fetcher->>API: search().list() + videos().list()
API-->>Fetcher: Video metadata
loop For each video
Fetcher->>DB: Persist to metadata queue
CLI->>DB: Upsert video/channel
end
Note over CLI: Phase 3: Download transcripts
loop For videos without transcripts
CLI->>Transcript: download_vtt(video_id)
Transcript-->>CLI: Transcript file path
CLI->>DB: update_video_transcript()
end
Note over CLI: Phase 4: Pinecone sync
CLI->>Pinecone: run_cppa_pinecone_sync(...)
Pinecone-->>CLI: Sync status
CLI-->>User: Tracker complete
sequenceDiagram
actor User
participant CLI as Clang GitHub Tracker
participant StateManager as State Manager
participant GitHub as GitHub API
participant SyncRaw as Sync Raw
participant DB as Django DB
participant Preprocessor as Preprocessor
participant Publisher as Publisher
participant GitOps as Git Operations
User->>CLI: run_clang_github_tracker --since/--until
CLI->>StateManager: resolve_start_end_dates(since, until)
StateManager->>DB: Query watermarks (max updated_at)
DB-->>StateManager: Issue/Commit timestamps
StateManager-->>CLI: (start_commit, start_item, end_date)
alt Not skip-github-sync
CLI->>SyncRaw: sync_clang_github_activity(...)
SyncRaw->>GitHub: Fetch commits/issues/PRs
GitHub-->>SyncRaw: Raw JSON payloads
SyncRaw->>DB: Upsert via services
DB-->>SyncRaw: (inserted, updated counts)
SyncRaw-->>CLI: Sync results
end
alt Not skip-markdown-export
CLI->>Preprocessor: preprocess_for_pinecone(...)
Preprocessor->>DB: Query issue/PR candidates
Preprocessor-->>CLI: Markdown documents
Note over CLI: Export to md_output_dir
end
alt Not skip-remote-push
CLI->>Publisher: publish_clang_markdown(...)
Publisher->>GitOps: clone_repo() / prepare_repo_for_pull()
GitOps-->>Publisher: Repo ready
Publisher->>Publisher: Copy markdown files
Publisher->>GitOps: push(branch, commit_identity)
GitOps-->>Publisher: Push complete
Publisher-->>CLI: Publish success
end
alt Not skip-pinecone
CLI->>CLI: run_cppa_pinecone_sync(...)
end
CLI-->>User: Sync + export + publish complete
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Actionable comments posted: 3
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
clang_github_tracker/tests/test_commands.py (1)
50-67: ⚠️ Potential issue | 🟡 Minor
Make the alias test assert the parsed bounds.
Lines 57-66 only prove that some "Resolved:" log line was emitted. If --from-date/--to-date were accidentally ignored and the command fell back to DB-derived defaults, this test would still pass.
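A sketch of a tighter drop-in assertion for that test, checking the resolved bounds themselves instead of the bare "Resolved:" prefix; the exact log wording is an assumption, and the dates should match whatever the test passes to --from-date/--to-date:

```python
# Sketch only: assert the parsed bounds, not just that some "Resolved:" line exists.
# The log message format here is an assumption; match it to the command's output.
resolved = [rec.getMessage() for rec in caplog.records if "Resolved:" in rec.getMessage()]
assert resolved, "expected a Resolved: log line"
assert any("2024-01-01" in msg and "2024-06-30" in msg for msg in resolved)
```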
🟠 Major comments (23)
requirements.txt-27-28 (1)
27-28: ⚠️ Potential issue | 🟠 Major
Address duplicate dependency declarations throughout requirements.txt.
Verification confirmed: beautifulsoup4 is duplicated at Lines 13 (>=4.12) and 28 (>=4.12.0). Additionally, three other packages have duplicate entries with conflicting version constraints: langchain-core (0.2 vs 0.1), langchain-text-splitters (0.2 vs 0.0.1), and pinecone (5.0 vs 3.0). These conflicting constraints risk dependency resolution failures. Remove all duplicates and consolidate to a single, canonical entry for each package.
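For illustration, a hedged sketch of consolidated entries, keeping the newer constraint from each duplicated pair; the exact pins to keep depend on what the code actually requires:

```
beautifulsoup4>=4.12
langchain-core>=0.2
langchain-text-splitters>=0.2
pinecone>=5.0
```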
core/utils/boost_version_operations.py-59-83 (1)
59-83: ⚠️ Potential issue | 🟠 Major
Reject malformed versions and canonicalize normalization through the strict parser.
These two helpers now disagree on what a valid Boost version looks like. parse_boost_version_string() silently accepts 1.2.3.4 as (1, 2, 3), while normalize_boost_version_string() returns malformed/non-canonical values such as boost.1.2.3 for boost_1_2_3 and Boost.1.2.3 for Boost-1.2.3. Because boost_usage_tracker.detect_boost_version_in_repo() now trusts this shared normalizer, those bad values can leak into persisted version fields and break comparison/sorting downstream.
Proposed fix
def parse_boost_version_string(version_str: str) -> tuple[int, int, int] | None: @@ s = _VERSION_STRIP_PREFIX.sub("", str(version_str).strip()) s = s.replace("_", ".") parts = s.split(".") - if not parts or not parts[0].strip(): + if not 1 <= len(parts) <= 3 or any(not part.strip() for part in parts): return None @@ def normalize_boost_version_string(version_str: str) -> str | None: @@ - version = (version_str or "").strip().replace("boost-", "") - version = version.replace("-", ".").replace("_", ".") - if not version or version.startswith("0."): + triple = parse_boost_version_string(version_str) + if triple is None: return None - if len(version.split(".")) == 2: - version = f"{version}.0" - return version + major, minor, patch = triple + if major == 0: + return None + return f"{major}.{minor}.{patch}"Also applies to: 123-136
Dockerfile-39-39 (1)
39-39: ⚠️ Potential issue | 🟠 Major
Git safe.directory wildcard syntax requires Git 2.46+.
The wildcard pattern /app/workspace/* in safe.directory is not supported in Git versions prior to 2.46. If the base image contains Git 2.35–2.45, the pattern will be treated as a literal path and fail to match mounted repositories. Either verify the base image includes Git 2.46 or later, or use safe.directory '*' to trust all directories (less restrictive).
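A hedged sketch of the two options; which one applies depends on the Git version shipped in the base image:

```dockerfile
# Option A: keep the scoped wildcard (requires Git >= 2.46 in the image)
RUN git config --system --add safe.directory '/app/workspace/*'

# Option B: trust all directories (works on older Git, but is less restrictive)
RUN git config --system --add safe.directory '*'
```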
core/utils/datetime_parsing.py-19-20 (1)
19-20: ⚠️ Potential issue | 🟠 Major
Replace django_timezone.utc with timezone.utc (required for Django 5.0 compatibility).
Lines 19–20 use the deprecated django.utils.timezone.utc alias, which was removed in Django 5.0. The stdlib timezone from datetime is already imported at line 5, so use that instead.
Proposed fix
-    return django_timezone.make_aware(dt, django_timezone.utc)
-    return dt.astimezone(django_timezone.utc)
+    return django_timezone.make_aware(dt, timezone.utc)
+    return dt.astimezone(timezone.utc)
boost_library_docs_tracker/fetcher.py-323-342 (1)
323-342: ⚠️ Potential issue | 🟠 Major
Use path-segment checks instead of raw substring matching for library scope.
Line 339 checks lib_segment via substring on the whole URL, which can admit out-of-scope links (false positives) when the token appears elsewhere in the path/query.
🛠️ Proposed fix
-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse

-    # Stay within this library's doc subtree (path contains lib segment)
-    if lib_segment not in abs_url:
+    # Stay within this library's doc subtree using path segments
+    path_parts = [p for p in urlparse(abs_url).path.split("/") if p]
+    if lib_segment not in path_parts:
         continue
boost_library_docs_tracker/fetcher.py-79-81 (1)
79-81: ⚠️ Potential issue | 🟠 Major
Validate existing ZIP before skipping download.
Lines 79-81 currently skip network fetch on mere file existence. A truncated/corrupt local zip will be treated as valid and can break downstream extraction.
🛠️ Proposed fix
-    if zip_path.exists():
-        logger.info("Source zip already present, skipping download: %s", zip_path)
-        return zip_path
+    if zip_path.exists():
+        if zipfile.is_zipfile(zip_path):
+            logger.info("Source zip already present, skipping download: %s", zip_path)
+            return zip_path
+        logger.warning("Existing zip is invalid; re-downloading: %s", zip_path)
+        zip_path.unlink(missing_ok=True)
boost_mailing_list_tracker/preprocessor.py-123-123 (1)
123-123: ⚠️ Potential issue | 🟠 Major
source_ids value no longer matches retry selector.
Line 123 emits message.pk, but retry selection still uses msg_id (Line 96) and the function contract describes failed_ids as msg_id values. This can break failed-item retries.
🔧 Proposed fix
metadata: dict[str, Any] = { "doc_id": msg_id, "type": "mailing", "thread_id": message.thread_id or "", "subject": message.subject or "", "author": sender_name, "timestamp": safe_timestamp, "parent_id": message.parent_id or "", - "source_ids": str(message.pk), + "source_ids": msg_id, "list_name": message.list_name or "", }🤖 Prompt for AI Agents
cppa_youtube_script_tracker/workspace.py-63-75 (1)
63-75: ⚠️ Potential issue | 🟠 Major
Validate video_id/lang before composing filesystem paths.
Line 65, Line 70, and Line 75 directly interpolate identifiers into paths. If a separator or traversal token slips in, writes can escape the workspace subtree.
🔧 Proposed fix
+def _validate_path_component(value: str, field: str) -> str: + value = (value or "").strip() + if not value or "/" in value or "\\" in value or ".." in value: + raise ValueError(f"Invalid {field}: {value!r}") + return value + def get_metadata_queue_path(video_id: str) -> Path: """Return workspace/cppa_youtube_script_tracker/metadata/{video_id}.json.""" - return get_metadata_queue_dir() / f"{video_id}.json" + safe_video_id = _validate_path_component(video_id, "video_id") + return get_metadata_queue_dir() / f"{safe_video_id}.json" def get_raw_metadata_path(video_id: str) -> Path: """Return workspace/raw/cppa_youtube_script_tracker/metadata/{video_id}.json.""" - return get_raw_metadata_dir() / f"{video_id}.json" + safe_video_id = _validate_path_component(video_id, "video_id") + return get_raw_metadata_dir() / f"{safe_video_id}.json" def get_transcript_path(video_id: str, lang: str = "en") -> Path: """Return workspace/raw/cppa_youtube_script_tracker/transcripts/{video_id}.{lang}.vtt.""" - return get_raw_transcripts_dir() / f"{video_id}.{lang}.vtt" + safe_video_id = _validate_path_component(video_id, "video_id") + safe_lang = _validate_path_component(lang, "lang") + return get_raw_transcripts_dir() / f"{safe_video_id}.{safe_lang}.vtt"🤖 Prompt for AI Agents
cppa_user_tracker/migrations/0007_youtubespeaker_external_id.py-26-31 (1)
26-31: ⚠️ Potential issue | 🟠 Major
Bound generated external_id length before saving.
Generated values can exceed 255 chars for long names, which can fail this migration before uniqueness is enforced.
Proposed fix
- slug = _slugify_speaker_name(speaker.display_name) - candidate = f"youtube:name:{slug}" - if candidate in used: - candidate = f"{candidate}:{speaker.baseprofile_ptr_id}" + slug = _slugify_speaker_name(speaker.display_name) + prefix = "youtube:name:" + max_slug_len = 255 - len(prefix) + slug = slug[:max_slug_len] + candidate = f"{prefix}{slug}" + if candidate in used: + suffix = f":{speaker.baseprofile_ptr_id}" + candidate = f"{prefix}{slug[: max_slug_len - len(suffix)]}{suffix}"🤖 Prompt for AI Agents
clang_github_tracker/preprocessors/pr_preprocessor.py-70-72 (1)
70-72: ⚠️ Potential issue | 🟠 Major
Guard document building so one bad PR payload doesn’t abort the whole run.
build_pr_document(...) is currently unguarded. If one record raises, preprocessing exits early and skips all remaining candidates.
Proposed fix
- doc = build_pr_document(path, data, repo) - if doc: - documents.append(doc) + try: + doc = build_pr_document(path, data, repo) + except (KeyError, TypeError, ValueError) as e: + logger.warning("preprocess pr #%s: build failed %s", number, e) + continue + if doc: + documents.append(doc)🤖 Prompt for AI Agents
core/management/commands/send_startup_notification.py-91-105 (1)
91-105: ⚠️ Potential issue | 🟠 Major
Validate webhook URLs before urlopen().
Both helpers trust the configured URL scheme and hand it straight to request.urlopen(). A bad DISCORD_WEBHOOK_URL/SLACK_WEBHOOK_URL value can invoke file: or another unintended scheme from the deploy container. Fail fast unless the URL is HTTPS.
Suggested fix
from urllib import request from urllib.error import URLError +from urllib.parse import urlparse ... +def _validate_webhook_url(webhook_url: str) -> str: + parsed = urlparse(webhook_url) + if parsed.scheme != "https" or not parsed.netloc: + raise ValueError("Webhook URL must be an https URL") + return webhook_url + + def post_discord(webhook_url: str, title: str, description: str) -> None: + webhook_url = _validate_webhook_url(webhook_url) embed = { "title": title, "description": description[:4000], "color": 0x3498DB, "timestamp": datetime.now(timezone.utc).isoformat(), @@ def post_slack(webhook_url: str, title: str, text: str) -> None: + webhook_url = _validate_webhook_url(webhook_url) blocks = [ { "type": "header", "text": {"type": "plain_text", "text": title, "emoji": True}, },Also applies to: 110-129
core/management/commands/send_startup_notification.py-205-208 (1)
205-208: ⚠️ Potential issue | 🟠 Major
Don’t make notification outages fail an otherwise healthy deploy.
This command now runs after the health check, so sys.exit(1) turns a transient Slack/Discord problem into a failed deployment even when the stack is up. That is a pretty expensive failure mode for a best-effort notification path. Prefer logging and returning success, or gate the hard failure behind an explicit strict setting.
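A minimal sketch of the opt-in approach, assuming the command accumulates its failures in an errors list; NOTIFY_STRICT is a hypothetical setting name:

```python
# Sketch only: log failures and keep the deploy green unless strict mode is enabled.
if errors:
    for err in errors:
        logger.error("startup notification failed: %s", err)
    if getattr(settings, "NOTIFY_STRICT", False):  # hypothetical opt-in flag
        sys.exit(1)
```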
cppa_youtube_script_tracker/preprocessor.py-66-73 (1)
66-73: ⚠️ Potential issue | 🟠 Major
This defeats the prefetch and re-queries per video.
_build_candidate_queryset() already prefetches video_speakers__speaker, but _get_speaker_names() builds a fresh queryset with select_related(...).values_list(...). That bypasses the prefetched cache and adds one SQL query per video during Pinecone preprocessing. Read from video.video_speakers.all() instead and sort in Python.
Suggested fix
def _get_speaker_names(video: YouTubeVideo) -> list[str]: """Return a sorted list of speaker display_names linked to this video.""" - names = list( - video.video_speakers.select_related("speaker") - .values_list("speaker__display_name", flat=True) - .order_by("speaker__display_name") - ) - return [n for n in names if n] + names = [ + rel.speaker.display_name + for rel in video.video_speakers.all() + if getattr(rel, "speaker", None) and rel.speaker.display_name + ] + return sorted(names)🤖 Prompt for AI Agents
cppa_youtube_script_tracker/migrations/0001_initial.py-72-75 (1)
72-75: ⚠️ Potential issue | 🟠 Major
Use bigint counters for YouTube metrics.
view_count, like_count, and comment_count are IntegerFields here. view_count can exceed 2.1B on real channels, which will overflow or reject inserts on 32-bit integer columns. These should be PositiveBigIntegerFields before this migration lands.
Suggested fix
- ("view_count", models.IntegerField(blank=True, null=True)), - ("like_count", models.IntegerField(blank=True, null=True)), - ("comment_count", models.IntegerField(blank=True, null=True)), + ("view_count", models.PositiveBigIntegerField(blank=True, null=True)), + ("like_count", models.PositiveBigIntegerField(blank=True, null=True)), + ("comment_count", models.PositiveBigIntegerField(blank=True, null=True)),🤖 Prompt for AI Agents
cppa_user_tracker/services.py-372-388 (1)
372-388: ⚠️ Potential issue | 🟠 Major
This falls through to create() for existing authors too often.
A profile that already exists for the provided display_name but has a different/new email still drops into WG21PaperAuthorProfile.objects.create(...), even though the docstring says “If one exists, returns it” and “otherwise returns the first.” That will fragment the same author across multiple profiles on repeated imports.
Suggested fix
candidates = list( WG21PaperAuthorProfile.objects.filter(display_name=display_name_val).order_by( "id" ) ) - # Disambiguate by email if provided. - for p in candidates: - if email_val and p.emails.filter(email=email_val).exists(): - return p, False - elif not email_val and not p.emails.exists(): - return p, False + if candidates: + if email_val: + for p in candidates: + if p.emails.filter(email=email_val).exists(): + return p, False + return candidates[0], False profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val) if email_val: add_email(profile, email_val, is_primary=True)🤖 Prompt for AI Agents
cppa_pinecone_sync/sync.py-206-249 (1)
206-249: ⚠️ Potential issue | 🟠 Major
Metadata-update failures are dropped from retry bookkeeping.
attempted_source_ids and new_failed_ids only come from the upsert path. On a metadata-only run, or when update_documents() returns errors, this code still clears the fail list and advances sync status, so those metadata refreshes won't be retried.
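A hedged sketch of folding the metadata-update results into the same bookkeeping used by the upsert path; the attribute names on the update result are assumptions:

```python
# Sketch only: count metadata updates as attempted work and record their failures
# before clearing the fail list or advancing sync status.
attempted_source_ids.update(update_result.attempted_ids)  # assumed attribute names
new_failed_ids.update(update_result.failed_ids)
```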
wg21_paper_tracker/pipeline.py-212-216 (1)
212-216: ⚠️ Potential issue | 🟠 Major
Don't pass "0" into the mailing fetcher.
_parse_mailing_year() deliberately returns 0 for missing/invalid years, but fetch_papers_for_mailing(str(year), mailing_date) uses that sentinel as a real fetch parameter. For those mailings, the fetch URL/path becomes wrong, so retries can never recover.
Proposed fix
year = _parse_mailing_year(m_info) + fetch_year = mailing_date[:4] if year == 0 else str(year) mailing_obj, _ = get_or_create_mailing(mailing_date, title) - papers = fetch_papers_for_mailing(str(year), mailing_date) + papers = fetch_papers_for_mailing(fetch_year, mailing_date)🤖 Prompt for AI Agents
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py-34-37 (1)
34-37: ⚠️ Potential issue | 🟠 Major
Tighten YYYY-MM validation before writing mailings.
^\d{4}-\d{2}$ accepts impossible values like 2025-00 and 2025-13. Those rows currently become real WG21Mailing keys, which can poison later ordering and incremental imports.
Proposed fix
-MAILING_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}$")
+MAILING_DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
cppa_youtube_script_tracker/models.py-54-56 (1)
54-56: ⚠️ Potential issue | 🟠 Major
Use BigIntegerField for YouTube engagement counters.
YouTube videos regularly exceed the 2^31-1 (≈2.1 billion) limit of a 32-bit signed integer. As of April 2026, over 50 videos have surpassed 2 billion views—including "Baby Shark Dance" at 16.85 billion and "Despacito" at 8.99 billion. Using IntegerField will cause data truncation or insertion failures for these videos. Switch to BigIntegerField to accommodate the full range of engagement metrics:
-    view_count = models.IntegerField(null=True, blank=True)
-    like_count = models.IntegerField(null=True, blank=True)
-    comment_count = models.IntegerField(null=True, blank=True)
+    view_count = models.BigIntegerField(null=True, blank=True)
+    like_count = models.BigIntegerField(null=True, blank=True)
+    comment_count = models.BigIntegerField(null=True, blank=True)
clang_github_tracker/services.py-97-99 (1)
97-99: ⚠️ Potential issue | 🟠 Major
Validate SHA contents, not just length.
A 40-character non-hex string currently passes both paths and can be stored as a commit key. That breaks the stated 40-hex contract and weakens data integrity.
Suggested fix
+import re @@ logger = logging.getLogger(__name__) DEFAULT_UPSERT_BATCH_SIZE = 500 +_SHA40_RE = re.compile(r"^[0-9a-f]{40}$") @@ def upsert_commit( sha: str, *, github_committed_at: datetime | None, ) -> tuple[ClangGithubCommit, bool]: @@ sha_clean = (sha or "").strip().lower() - if len(sha_clean) != 40: + if not _SHA40_RE.fullmatch(sha_clean): raise ValueError(f"commit sha must be 40 hex chars, got {sha_clean!r}") @@ for sha, dt in rows: s = (sha or "").strip().lower() - if len(s) != 40: + if not _SHA40_RE.fullmatch(s): continueAlso applies to: 178-181
clang_github_tracker/services.py-244-248 (1)
244-248: ⚠️ Potential issue | 🟠 Major
Guard non-positive batch_size in upsert_issue_items_batch.
Unlike upsert_commits_batch, this helper never normalizes batch_size. 0 will crash at Line 277, and a negative value makes the loop skip every row.
Suggested fix
def upsert_issue_items_batch( rows: Sequence[tuple[int, bool, datetime | None, datetime | None]], *, batch_size: int = DEFAULT_UPSERT_BATCH_SIZE, ) -> tuple[int, int]: @@ Returns: (inserted, updated) counts across all batches. """ + if batch_size <= 0: + logger.warning( + "batch_size must be positive, using %s", DEFAULT_UPSERT_BATCH_SIZE + ) + batch_size = DEFAULT_UPSERT_BATCH_SIZE merged: dict[int, tuple[bool, datetime | None, datetime | None]] = {}Also applies to: 277-278
wg21_paper_tracker/services.py-142-150 (1)
142-150: ⚠️ Potential issue | 🟠 Major
Make author replacement atomic and distinguish [] from None.
if author_names: means an explicit empty list never clears stale authors, and the delete/recreate sequence runs outside the transaction used for the paper upsert. If one insert fails midway, the paper is left with a partially rewritten author set.
Suggested fix
- if author_names: - if not created: - for author in paper.authors.all(): - author.delete() - emails = author_emails or [] - for i, name in enumerate(author_names): - email = emails[i] if i < len(emails) else None - profile, _ = get_or_create_wg21_paper_author_profile(name, email=email) - get_or_create_paper_author(paper, profile, i + 1) + if author_names is not None: + with transaction.atomic(): + if not created: + paper.authors.all().delete() + emails = author_emails or [] + for i, name in enumerate(author_names): + email = emails[i] if i < len(emails) else None + profile, _ = get_or_create_wg21_paper_author_profile(name, email=email) + get_or_create_paper_author(paper, profile, i + 1)🤖 Prompt for AI Agents
config/settings.py-338-361 (1)
338-361: ⚠️ Potential issue | 🟠 Major
Add fallback for legacy Slack token environment variables.
The token parsing functions at lines 360–361 now build token maps exclusively from SLACK_TEAM_IDS and per-team env vars (e.g., SLACK_BOT_TOKEN_<team_id>). Environments still using the legacy single-team format (SLACK_BOT_TOKEN + SLACK_TEAM_ID env vars) will load empty credentials and fail at startup with "No teams configured." Keep a compatibility fallback to check the legacy format if the new format yields no tokens.
Suggested fix
def _slack_per_team_tokens_from_env(env_key_prefix: str): """ Build team_id -> token from SLACK_TEAM_IDS and ``{prefix}_{team_id}`` env vars (e.g. prefix SLACK_BOT_TOKEN → SLACK_BOT_TOKEN_T123). """ out = {} for tid in _slack_team_ids_from_env(): key = f"{env_key_prefix}_{tid}" token = (env(key, default="") or "").strip() if token: out[tid] = token + if out: + return out + + legacy_token = (env(env_key_prefix, default="") or "").strip() + if legacy_token and SLACK_TEAM_ID: + return {SLACK_TEAM_ID: legacy_token} return out🤖 Prompt for AI Agents
🟡 Minor comments (11)
core/utils/text_processing.py-125-130 (1)
125-130: ⚠️ Potential issue | 🟡 Minor
Normalize ellipsis after HTML unescape to cover entity inputs.
Lines 125-130 normalize \u2026 before html.unescape, so HTML-entity inputs that unescape to … stay as … instead of becoming ....
🛠️ Proposed fix
 text = (
     text.replace("\xad", "")
     .replace("\u200b", "")
     .replace("\u200c", "")
     .replace("\u200d", "")
     .replace("\xa0", " ")
     .replace("\u2002", " ")
     .replace("\u2003", " ")
-    .replace("\u2026", "...")
     .replace("\u202f", " ")
 )
 text = html.unescape(text)
+text = text.replace("\u2026", "...")
github_activity_tracker/preprocessors/github_preprocess.py-111-113 (1)
111-113: ⚠️ Potential issue | 🟡 Minor
Rename parameter type to avoid shadowing a built-in.
The parameter type shadows Python's built-in and generates linter warnings. All call sites (lines 179, 224, 272, 330) use positional arguments, making the rename safe. Update the parameter name to entity_type and the corresponding f-string reference.
♻️ Proposed refactor
-def get_ids_for_pinecone(repo: str, type: Literal["issue", "pr"], number: int) -> str:
+def get_ids_for_pinecone(
+    repo: str, entity_type: Literal["issue", "pr"], number: int
+) -> str:
     """Get the ids for Pinecone from a repo, type, and number."""
-    return f"{repo}:{type}:{number}"
+    return f"{repo}:{entity_type}:{number}"
boost_library_docs_tracker/html_to_md.py-12-12 (1)
12-12: ⚠️ Potential issue | 🟡 Minor
Use HYPHEN-MINUS instead of EN DASH in docstring.
Static analysis flagged an ambiguous character: the – (EN DASH) should be - (HYPHEN-MINUS) for consistency in ASCII documentation.
📝 Proposed fix
-3. _postprocess_markdown – strip residual HTML artefacts, rejoin split lines, then clean_text (unicode/line endings only)
+3. _postprocess_markdown - strip residual HTML artefacts, rejoin split lines, then clean_text (unicode/line endings only)
cppa_pinecone_sync/ingestion.py-564-566 (1)
564-566: ⚠️ Potential issue | 🟡 Minor
Remove table_ids assignment from sync.py; migration to source_ids is incomplete.
Preprocessors correctly populate source_ids, but sync.py line 97 still writes metadata["table_ids"], creating redundant dual representation. The function _extract_source_ids_from_documents (line 119) reads from this table_ids field it just wrote, bypassing the source_ids already present. Remove the table_ids assignment and update sync consumers to read source_ids directly from preprocessor output.
github_activity_tracker/tests/test_fetcher_commits_backward.py-63-108 (1)
63-108: ⚠️ Potential issue | 🟡 Minor
Use multi-item pages in the backward-traversal fixture.
Lines 79-95 only put one commit on each fetched page, so this still passes if the implementation walks pages in the right order but forgets to reverse commits within each page before yielding. Add at least two commits to one non-first page and assert the full oldest→newest sequence.
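A hedged sketch of a multi-commit page for that fixture; the page-tuple shape and commit dict layout are assumptions and should follow the existing mock responses:

```python
# Sketch only: a non-first page with two commits (distinct shas and dates) so the
# test fails if per-page ordering is not reversed before yielding.
page_two_commits = [
    {"sha": "b" * 40, "commit": {"committer": {"date": "2024-01-02T00:00:00Z"}}},
    {"sha": "c" * 40, "commit": {"committer": {"date": "2024-01-03T00:00:00Z"}}},
]
```

The final assertion should then list the full oldest→newest sha sequence across all pages.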
wg21_paper_tracker/workspace.py-22-35 (1)
22-35: ⚠️ Potential issue | 🟡 Minor
Reject mismatched year/mailing_date pairs.
Right now get_raw_dir("2024-05", 2025) will happily create .../2025/2024-05. That makes the raw layout depend on caller mistakes and can scatter one mailing across different directories.
🛠️ Proposed fix
def get_raw_dir(mailing_date: str | None, year: int) -> Path: """Return workspace/raw/wg21_paper_tracker/<year>/<mailing_date>/; creates if missing.""" if mailing_date is not None and not _MAILING_DATE_RE.fullmatch(mailing_date): raise ValueError("mailing_date must be in YYYY-MM format") + if mailing_date is not None and int(mailing_date[:4]) != year: + raise ValueError("year must match mailing_date") if getattr(settings, "RAW_DIR", None): raw_root = Path(settings.RAW_DIR) / _APP_SLUG🤖 Prompt for AI Agents
clang_github_tracker/tests/test_commands.py-69-93 (1)
69-93: ⚠️ Potential issue | 🟡 Minor
Assert the forwarded sync values, not just kwarg presence.
This currently passes even if start_commit, start_item, or end_date are wrong or None. Patch the date resolver to return fixed sentinel values and assert those exact values are forwarded to sync_clang_github_activity.
🛠️ Proposed fix
`@pytest.mark.django_db` def test_run_clang_github_tracker_calls_sync_clang_github_activity_when_not_dry_run( caplog, ): """Without --dry-run, command calls sync_clang_github_activity with start_item.""" - with patch( + start = object() + end = object() + with patch( + "clang_github_tracker.management.commands.run_clang_github_tracker.resolve_start_end_dates", + return_value=(start, start, end), + ), patch( "clang_github_tracker.management.commands.run_clang_github_tracker.sync_clang_github_activity", return_value=(0, [], []), ) as sync_mock: with caplog.at_level(logging.INFO): call_command( @@ sync_mock.assert_called_once() call_kw = sync_mock.call_args[1] - assert "start_commit" in call_kw - assert "start_item" in call_kw - assert "end_date" in call_kw + assert call_kw["start_commit"] is start + assert call_kw["start_item"] is start + assert call_kw["end_date"] is end assert "start_issue" not in call_kw assert any("commits=" in r.getMessage() for r in caplog.records)🤖 Prompt for AI Agents
wg21_paper_tracker/models.py-12-12 (1)
12-12: ⚠️ Potential issue | 🟡 Minor
Add field-level validation for mailing_date to enforce YYYY-MM format.
The field accepts any 7-character string. While the scraper, workspace helper, and pipeline all validate YYYY-MM format externally, the model has no constraints. Invalid values can slip in through Django admin or direct database access, breaking ordering and workspace path generation. Use the regex pattern already employed in workspace.py to validate at the model level.
🛠️ Proposed fix
+from django.core.validators import RegexValidator from django.db import models + +_MAILING_DATE_VALIDATOR = RegexValidator( + r"^\d{4}-(0[1-9]|1[0-2])$", + "mailing_date must be in YYYY-MM format", +) @@ - mailing_date = models.CharField(max_length=7, unique=True, db_index=True) + mailing_date = models.CharField( + max_length=7, + unique=True, + db_index=True, + validators=[_MAILING_DATE_VALIDATOR], + )🤖 Prompt for AI Agents
cppa_youtube_script_tracker/management/commands/run_cppa_youtube_script_tracker.py-291-291 (1)
291-291: ⚠️ Potential issue | 🟡 Minor
Typo in comment: "unkown" should be "unknown".
📝 Proposed fix
- # If we discovered a concrete speaker name, remove fallback "unkown" links first.
+ # If we discovered a concrete speaker name, remove fallback "unknown" links first.
docs/Schema.md-915-927 (1)
915-927: ⚠️ Potential issue | 🟡 Minor
Appendix A is stale for the WG21 schema.
The table summary still describes WG21Paper as (paper_id, url, title, publication_date) and does not list WG21Mailing, but Section 7 now documents a (paper_id, year) key plus document_date, mailing_id, subgroup, and is_downloaded. The appendix should match the schema above.
docs/Schema.md-933-973 (1)
933-973: ⚠️ Potential issue | 🟡 Minor
Appendix B is missing new relationships.
Lines 937-938 still omit BaseProfile → YoutubeSpeaker, and Lines 967-968 omit WG21Mailing → WG21Paper, so the relationship summary no longer matches the diagrams in Sections 1, 7, and 10.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Documentation