Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)#104
Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)#104snowfox1003 merged 23 commits intocppalliance:developfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a new Django app wg21_paper_tracker (models, migrations, admin, services, fetcher, pipeline, workspace, management commands), a containerized Cloud Run PDF→Markdown conversion job with multiple converters, tests, docs, CI/test-settings tweaks, and config constants for GCS/Cloud Run integration. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Management\nCommand
participant Fetch as Fetcher
participant DB as Database
participant GCS as Google\nCloud Storage
participant CR as Cloud Run\nJob
CLI->>Fetch: fetch_all_mailings()
Fetch-->>CLI: mailings[]
loop per new mailing
CLI->>Fetch: fetch_papers_for_mailing(year,date)
Fetch-->>CLI: papers[]
CLI->>DB: get_or_create_mailing / get_or_create_paper
DB-->>CLI: persisted
CLI->>CLI: _download_file(url)
CLI->>GCS: upload raw/wg21_paper_tracker/<year>/<mailing>/<file>
GCS-->>CLI: upload result
end
alt total_new_papers > 0 and CloudRun configured
CLI->>CR: trigger_cloud_run_job(project, location, job)
CR->>GCS: list raw PDFs
CR-->>CR: convert_pdf_to_md (Docling → PDFPlumber → OpenAI)
CR->>GCS: upload converted/wg21_papers/*.md
GCS-->>CR: upload result
end
CLI-->>CLI: return total_new_papers
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 18
🧹 Nitpick comments (8)
wg21_paper_tracker/tests/test_pipeline.py (1)
49-62: Usetmp_pathhere instead of/tmp/out.This writes outside pytest's sandbox and is not portable to Windows or restricted runners. Using
tmp_pathkeeps the test isolated.♻️ Proposed fix
-def test_download_file_uses_timeout(): +def test_download_file_uses_timeout(tmp_path): """_download_file calls requests.get with DOWNLOAD_TIMEOUT.""" url = "https://example.com/f" - filepath = Path("/tmp/out") + filepath = tmp_path / "out" resp = MagicMock() resp.raise_for_status = MagicMock() resp.headers = {"content-type": "text/plain"} resp.content = b"x" resp.apparent_encoding = "utf-8"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 49 - 62, The test test_download_file_uses_timeout writes to a hardcoded /tmp/out; change it to use pytest's tmp_path fixture to keep the test sandboxed and portable: update the test signature to accept tmp_path, replace filepath = Path("/tmp/out") with filepath = tmp_path / "out" (or similar), and keep the rest of the assertions the same; this targets the test function test_download_file_uses_timeout and the _download_file call so no other behavior changes are required.wg21_paper_tracker/tests/test_models.py (1)
57-76: Add the failing duplicate case for(paper_id, year).This only proves that different years are allowed. If the unique constraint disappears, this test still passes. Add a same-year duplicate insert that expects
IntegrityError.🧪 Suggested test hardening
Add this import near the top of the file:
from django.db import IntegrityErrorThen extend the test like this:
def test_wg21_paper_unique_together_paper_id_year(): """WG21Paper allows same paper_id with different year.""" @@ p2 = WG21Paper.objects.create( paper_id="sd-1", url="https://example.com/2.pdf", title="T2", mailing=m2, year=2025, ) assert p2.pk is not None + with pytest.raises(IntegrityError): + WG21Paper.objects.create( + paper_id="sd-1", + url="https://example.com/3.pdf", + title="T3", + mailing=m2, + year=2025, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_models.py` around lines 57 - 76, Test only covered different-year case; add a failing same-year duplicate to assert the unique constraint. Import IntegrityError from django.db at the top of the test file, then modify test_wg21_paper_unique_together_paper_id_year to attempt creating another WG21Paper with the same paper_id and same year (use the existing m1/year=2024) and assert it raises IntegrityError (e.g., with pytest.raises(IntegrityError)). Keep existing successful different-year creation (WG21Paper creation for year=2025) unchanged.wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)
220-232: Move import to module level.Importing
get_or_create_wg21_paper_author_profileinside the exception handler is an anti-pattern. If the import fails, it will mask the originalIntegrityErrorwith anImportError, making debugging harder. Move this import to the top of the file.♻️ Proposed fix
Add to imports at the top of the file (after line 24):
from cppa_user_tracker.services import get_or_create_wg21_paper_author_profileThen remove the inline import at lines 221-223.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 220 - 232, The inline import of get_or_create_wg21_paper_author_profile inside the block that creates WG21PaperAuthor can mask the original IntegrityError; move the import to module level by adding "from cppa_user_tracker.services import get_or_create_wg21_paper_author_profile" to the top imports and remove the inline import inside the loop so get_or_create_wg21_paper_author_profile is resolved at module import time and won’t convert an IntegrityError into an ImportError during exception handling.
233-242: Uselogging.exceptionfor better diagnostics.When catching exceptions and logging errors,
logger.exception()automatically includes the stack trace without needing to pass the exception object. This aids debugging for unexpected failures.♻️ Example change for line 235-239
except Exception as inner: stats["skipped"] += 1 - logger.error( + logger.exception( "Error for paper_id=%s (after IntegrityError): %s", paper_id, - inner, )Apply similarly to lines 165-169, 210, and 242.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 233 - 242, Replace the logger.error calls in the exception handlers with logger.exception so stack traces are captured: specifically, in the inner except Exception as inner block that currently does stats["skipped"] += 1 and logger.error("Error for paper_id=%s (after IntegrityError): %s", paper_id, inner), change that to logger.exception with the same message and remove passing the exception object; likewise change the outer except Exception as e handler (stats["skipped"] += 1; logger.error("Error for paper_id=%s: %s", paper_id, e)) to use logger.exception, and apply the same replacement to the other similar catch blocks earlier in the file (the blocks around the IntegrityError handling and the other exception catches referenced in the review) so all unexpected exceptions use logger.exception to include stack traces.wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)
11-20:InputFormatis imported but never used.The
InputFormatsymbol is imported fromdocling.datamodel.base_modelsbut is not used anywhere in the function. Remove it to avoid confusion.♻️ Proposed fix
try: from docling.document_converter import DocumentConverter # type: ignore[import-untyped] - from docling.datamodel.base_models import InputFormat # type: ignore[import-untyped] DOCLING_AVAILABLE = True except ImportError: DocumentConverter = None # type: ignore[assignment,misc] - InputFormat = None # type: ignore[assignment,misc] DOCLING_AVAILABLE = False logger.warning("Docling not available. Install with: pip install docling")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around lines 11 - 20, The import of InputFormat from docling.datamodel.base_models is unused; remove InputFormat from the try-import block so only DocumentConverter is imported (and keep the DOCLING_AVAILABLE assignment and the except fallback). Update the top-of-file import statement around DocumentConverter/InputFormat to drop InputFormat and any related type-ignore comments for it, leaving DocumentConverter, DOCLING_AVAILABLE, and the logger.warning logic intact.wg21_paper_tracker/cloud_run_job/main.py (1)
36-42: Clarify the error pattern matching logic.The conditional
if pattern.startswith("error:") or pattern.startswith("exception:")is checking whether the literal pattern string (from the list) starts with these prefixes, not whether the content does. While this works correctly, it's confusing. Consider restructuring for clarity.♻️ Suggested clarification
- for pattern in error_patterns: - if pattern in first_part: - if pattern.startswith("error:") or pattern.startswith("exception:"): - return False - idx = first_part.find(pattern) - if idx < 100: - return False + # Patterns that invalidate anywhere in first 1000 chars + critical_patterns = ["error:", "exception:"] + # Patterns that only invalidate if in first 100 chars + early_patterns = ["traceback", "failed to", "unable to convert", + "conversion failed", "error processing"] + + for pattern in critical_patterns: + if pattern in first_part: + return False + for pattern in early_patterns: + idx = first_part.find(pattern) + if idx != -1 and idx < 100: + return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 36 - 42, The logic in the loop over error_patterns is confusing because it checks pattern.startswith("error:")/("exception:") instead of whether the content (first_part) starts with those prefixes; update the conditional to test first_part (e.g. first_part.lower().startswith("error:") or first_part.lower().startswith("exception:")) so the intent is explicit, or alternatively document/rename entries in error_patterns if they are meant to be literal sentinel patterns; change the check inside the for-loop that references error_patterns and first_part accordingly and keep the subsequent idx/threshold logic unchanged.wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
87-89: Remove redundant exception object fromlogging.exception.
logger.exception()automatically includes the exception info from the current context. Passingeexplicitly is redundant (TRY401).♻️ Proposed fix
except Exception as e: - logger.exception("WG21 Paper Tracker failed: %s", e) + logger.exception("WG21 Paper Tracker failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 87 - 89, In the except block that catches Exception as e in run_wg21_paper_tracker (the try/except surrounding the WG21 Paper Tracker run), remove the redundant exception object from the logger.exception call so it reads logger.exception("WG21 Paper Tracker failed") instead of logger.exception("WG21 Paper Tracker failed: %s", e); keep the existing raise to re-raise the exception.wg21_paper_tracker/tests/test_services.py (1)
165-225: Add a same-paper_id/different-yearregression for download marking.This file already proves
paper_idcan repeat across years, butmark_paper_downloadedis only asserted against a single row. A multi-year case here would catch cross-year updates immediately.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` around lines 165 - 225, Add a new test that creates two papers with the same paper_id but different years (use get_or_create_paper and two mailings), call mark_paper_downloaded using the normalized paper_id for one of them, then refresh both papers and assert that only the intended year's paper has is_downloaded=True while the other remains False; name it something like test_mark_paper_downloaded_respects_year_for_same_paper_id and reuse mark_paper_downloaded, get_or_create_paper, and get_or_create_mailing to locate the relevant logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-365: The code uses display_name as the lookup key in
get_or_create_wg21_paper_author_profile but WG21PaperAuthorProfile.display_name
is not unique, causing race-created duplicates and MultipleObjectsReturned; fix
by using a true unique identifier or making the author key unique. Update the
WG21PaperAuthorProfile model to include or mark a unique field (e.g.,
author_key, canonical_name, or external_id) with unique=True and create the
corresponding migration, then change get_or_create_wg21_paper_author_profile to
call WG21PaperAuthorProfile.objects.get_or_create(author_key=..., defaults={
'display_name': display_name_val, ... }) (or perform an atomic
transaction/select_for_update around creation) so lookups use the unique field
instead of display_name.
In `@docs/operations/WG21_Cloud_Run.md`:
- Around line 41-48: The doc shows embedding OPENROUTER_API_KEY via
--set-env-vars in the gcloud run jobs create wg21-convert command which leaks
secrets; update the example to remove OPENROUTER_API_KEY from --set-env-vars and
instead add a short note instructing operators to inject OPENROUTER_API_KEY via
Cloud Run secret injection (or the platform's secret manager/secret-mount flow),
e.g., reference using a secret resource rather than CLI env, and show the safer
pattern around the command that uses only WG21_GCS_BUCKET in --set-env-vars plus
a sentence pointing to the secret-management steps for OPENROUTER_API_KEY.
In `@docs/Schema.md`:
- Around line 637-640: Update the schema documentation to include the
WG21Paper.year field and explicitly state the composite uniqueness constraint
(paper_id, year) wherever WG21Paper is described; update the diagram and the
table rows that list paper_id to show year as a column and change any uniqueness
notes that imply paper_id is globally unique to instead document the composite
unique index on (paper_id, year). Ensure the same change is applied consistently
to the other occurrences mentioned (the other WG21Paper descriptions and summary
rows).
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 226-256: The code currently treats fallback error stubs as real
output because it only checks markdown_content length; update the conversion
loop (where page_markdown is appended to markdown_parts) to set a boolean flag
(e.g., had_model_output) to True whenever a real page_markdown is produced, and
leave it False for fallback stubs and exceptions; then, before logging success
and returning markdown_content in the end block, check that had_model_output is
True (and that markdown_content.strip() is non-empty) and only then log success
and return content, otherwise log the conversion-as-empty warning and return
None. Ensure the flag is defined before the page loop and reference symbols
page_markdown, markdown_parts, markdown_content, total_pages and pdf_path.name
in your changes.
- Line 1: Add "from __future__ import annotations" at the top of the module so
annotations like Image.Image are not evaluated at import time, and move the
import io to the top-level (outside the try/except) so the fallback path always
has io available; keep the optional pdf2image/Pillow import block setting
PDF2IMAGE_AVAILABLE and binding Image in the except, but with the future-import
and top-level io the functions that reference Image.Image in their annotations
(the functions defined around lines 34, 62 and 94) will no longer cause a
NameError during module import.
In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 55-69: The Markdown table output in the tables loop (in
pdfplumber_converter.py where you iterate "for table in tables" and append rows
into markdown_parts) never emits the required header separator row, so add logic
to treat the first non-empty row of each table as the header and immediately
after appending that first row insert a separator row with the same column count
(e.g., "| --- | --- | ... |"), generating one '---' cell per column (respecting
empty cells) before appending the remaining body rows; update the code around
the table row appends that reference markdown_parts and table to insert this
separator only once per table.
In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 1-24: The image runs as root; create and switch to a dedicated
non-root user to reduce blast radius: add steps in the Dockerfile to create a
sysuser (e.g., "appuser" or UID 1000) with no-login, chown the WORKDIR (/app)
and any written paths to that user, and add a USER appuser directive before CMD
so the container runs unprivileged; keep apt-get and pip install steps running
as root, but ensure files copied into /app are chown'd to the new user (use
chown -R appuser:appuser /app) and then switch to USER appuser after copying and
permission changes.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 19-43: The function is_content_valid accepts None but is annotated
as str; change its type annotation to Optional[str] (or Union[str, None] for
<3.10) and import Optional from typing if not present so the signature matches
actual behavior (update the def is_content_valid parameter type and ensure
typing import exists); no other logic changes required.
In `@wg21_paper_tracker/cloud_run_job/requirements.txt`:
- Line 4: Update the Pillow pin in requirements.txt to require a non-vulnerable
minimum version: replace the existing "Pillow>=10.0.0" requirement with a
minimum of "Pillow>=10.3.0" so the job parsing untrusted PDFs/images excludes
the known vulnerable releases (refer to the requirement entry "Pillow>=10.0.0"
to locate the line to change).
In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The current fetch code swallows network/HTTP errors by
logging and returning [], which hides failures; instead define and raise a typed
fetch exception (e.g., FetchError or HTTPFetchError) and propagate the original
exception (use exception chaining) instead of returning an empty list. Replace
the except block around requests.get(BASE_URL) to raise the new FetchError with
context (include the caught exception and response status when available), and
make the same change for the other fetch block (the similar try/except at lines
65-70) so callers can decide to retry or fail.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 68-70: Accessing settings.GCP_PROJECT_ID, settings.GCP_LOCATION,
and settings.WG21_CLOUD_RUN_JOB_NAME can raise AttributeError if not set; change
those direct attribute accesses (project_id, location, job_name) to use
getattr(settings, 'GCP_PROJECT_ID', <sensible_default>) etc., providing
appropriate defaults or failing fast with a clear error message; update the code
in run_wg21_paper_tracker.py where project_id, location, and job_name are
assigned so it uses getattr(settings, ...) instead of direct attribute access.
In `@wg21_paper_tracker/models.py`:
- Line 6: The current unique_together allows multiple (paper_id, NULL) rows;
replace it with explicit Meta.constraints using two UniqueConstraint entries:
one UniqueConstraint(fields=['paper','year'],
condition=models.Q(year__isnull=False), name='unique_paper_year_not_null') to
enforce uniqueness for non-null years, and a second
UniqueConstraint(fields=['paper'], condition=models.Q(year__isnull=True),
name='unique_paper_null_year') to ensure only one (paper, NULL) row per paper;
update the model Meta (remove unique_together) and import models.Q if needed so
get_or_create_paper(..., year=None) cannot create duplicates.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 200-209: best_paper["filename"] is taken from scraped HTML and
used directly to build local_path (raw_dir / filename) and gcs_path, allowing
path traversal (e.g., "../../etc/passwd") or unexpected blob names; before using
filename in _download_file and _upload_to_gcs, sanitize it by extracting a safe
basename and removing path separators/parent refs (e.g., use
pathlib.Path(filename).name or an equivalent sanitize_name function), reject
empty or suspicious names, and then build local_path and gcs_path from that
sanitized_name (keep mailing_date and raw/wg21_papers prefix as-is) so
_download_file(local_path) and _upload_to_gcs(...) cannot escape the workspace
or create malicious object keys.
- Around line 189-193: The loop currently skips downloads based only on paper_id
and .first(), causing different years with the same ID to be suppressed and
relying on arbitrary ordering; instead, for each paper in papers_by_id[p_id]
check WG21Paper.objects.filter(paper_id=pid, year=paper.year,
is_downloaded=True).exists() (avoid .first()), and increment skipped_downloaded
and skip only that specific paper when the (paper_id, year) tuple is already
downloaded so decisions are deterministic and year-aware.
In `@wg21_paper_tracker/services.py`:
- Around line 86-88: The mark_paper_downloaded function is currently only
filtering by paper_id and therefore updates all WG21Paper rows with that id
across years; change the function signature to accept (or require) a year
parameter and normalize it, then update the WG21Paper.objects.filter(...) call
to include both paper_id and year so you only set is_downloaded=True for the
specific (paper_id, year) row; validate/normalize the inputs (e.g., strip/lower
paper_id and ensure year is an int or valid string) and preserve existing
behavior when inputs are missing by raising or returning early as appropriate.
In `@wg21_paper_tracker/tests/test_fetcher.py`:
- Around line 16-31: Update both the implementation of fetch_all_mailings and
the tests so network errors use requests-specific exceptions: change the
fetch_all_mailings exception handlers that currently catch bare Exception to
catch requests.RequestException (or requests.HTTPError for raise_for_status
paths) and update tests in test_fetcher.py to set side_effect to
requests.RequestException("network error") and to a requests.HTTPError for the
raise_for_status case; ensure patches still target
wg21_paper_tracker.fetcher.requests.get and that response.raise_for_status
side_effect uses the appropriate requests exception class.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 64-74: The tests patching requests.get in
test_download_file_retries_on_failure and
test_download_file_succeeds_on_second_attempt should raise
requests.RequestException rather than a bare Exception so they model real
requests library failures; update the patch side_effects to use
requests.RequestException("connection error") (and the second test's side_effect
sequence) while keeping assertions on _download_file, DOWNLOAD_MAX_RETRIES, and
sleep behavior intact.
In `@wg21_paper_tracker/workspace.py`:
- Around line 18-23: Validate the mailing_date input in get_raw_dir before using
it as a path component: ensure it matches the strict YYYY-MM pattern (e.g., with
a regex like r'^\d{4}-\d{2}$') and reject any value containing path separators
or traversal sequences by raising a ValueError; then only build the path using
get_workspace_path(_RAW_APP_SLUG) and the validated mailing_date so callers
cannot escape the workspace/root directory.
---
Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 11-20: The import of InputFormat from
docling.datamodel.base_models is unused; remove InputFormat from the try-import
block so only DocumentConverter is imported (and keep the DOCLING_AVAILABLE
assignment and the except fallback). Update the top-of-file import statement
around DocumentConverter/InputFormat to drop InputFormat and any related
type-ignore comments for it, leaving DocumentConverter, DOCLING_AVAILABLE, and
the logger.warning logic intact.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 36-42: The logic in the loop over error_patterns is confusing
because it checks pattern.startswith("error:")/("exception:") instead of whether
the content (first_part) starts with those prefixes; update the conditional to
test first_part (e.g. first_part.lower().startswith("error:") or
first_part.lower().startswith("exception:")) so the intent is explicit, or
alternatively document/rename entries in error_patterns if they are meant to be
literal sentinel patterns; change the check inside the for-loop that references
error_patterns and first_part accordingly and keep the subsequent idx/threshold
logic unchanged.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 220-232: The inline import of
get_or_create_wg21_paper_author_profile inside the block that creates
WG21PaperAuthor can mask the original IntegrityError; move the import to module
level by adding "from cppa_user_tracker.services import
get_or_create_wg21_paper_author_profile" to the top imports and remove the
inline import inside the loop so get_or_create_wg21_paper_author_profile is
resolved at module import time and won’t convert an IntegrityError into an
ImportError during exception handling.
- Around line 233-242: Replace the logger.error calls in the exception handlers
with logger.exception so stack traces are captured: specifically, in the inner
except Exception as inner block that currently does stats["skipped"] += 1 and
logger.error("Error for paper_id=%s (after IntegrityError): %s", paper_id,
inner), change that to logger.exception with the same message and remove passing
the exception object; likewise change the outer except Exception as e handler
(stats["skipped"] += 1; logger.error("Error for paper_id=%s: %s", paper_id, e))
to use logger.exception, and apply the same replacement to the other similar
catch blocks earlier in the file (the blocks around the IntegrityError handling
and the other exception catches referenced in the review) so all unexpected
exceptions use logger.exception to include stack traces.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 87-89: In the except block that catches Exception as e in
run_wg21_paper_tracker (the try/except surrounding the WG21 Paper Tracker run),
remove the redundant exception object from the logger.exception call so it reads
logger.exception("WG21 Paper Tracker failed") instead of logger.exception("WG21
Paper Tracker failed: %s", e); keep the existing raise to re-raise the
exception.
In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 57-76: Test only covered different-year case; add a failing
same-year duplicate to assert the unique constraint. Import IntegrityError from
django.db at the top of the test file, then modify
test_wg21_paper_unique_together_paper_id_year to attempt creating another
WG21Paper with the same paper_id and same year (use the existing m1/year=2024)
and assert it raises IntegrityError (e.g., with pytest.raises(IntegrityError)).
Keep existing successful different-year creation (WG21Paper creation for
year=2025) unchanged.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 49-62: The test test_download_file_uses_timeout writes to a
hardcoded /tmp/out; change it to use pytest's tmp_path fixture to keep the test
sandboxed and portable: update the test signature to accept tmp_path, replace
filepath = Path("/tmp/out") with filepath = tmp_path / "out" (or similar), and
keep the rest of the assertions the same; this targets the test function
test_download_file_uses_timeout and the _download_file call so no other behavior
changes are required.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 165-225: Add a new test that creates two papers with the same
paper_id but different years (use get_or_create_paper and two mailings), call
mark_paper_downloaded using the normalized paper_id for one of them, then
refresh both papers and assert that only the intended year's paper has
is_downloaded=True while the other remains False; name it something like
test_mark_paper_downloaded_respects_year_for_same_paper_id and reuse
mark_paper_downloaded, get_or_create_paper, and get_or_create_mailing to locate
the relevant logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 92d9fb12-9816-445e-85e9-0748c89867c7
📒 Files selected for processing (34)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/services.pydocs/Schema.mddocs/operations/WG21_Cloud_Run.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (5)
wg21_paper_tracker/fetcher.py (1)
29-34: Keep the graceful fallback, but log the actual request failure.These branches intentionally return
[], so the log is the only signal maintainers get during upstream outages. Right now the exception details are discarded, which makes diagnosis much harder.🪵 Proposed logging-only fix
- except requests.RequestException: - logger.error("Failed to fetch WG21 index.") + except requests.RequestException: + logger.exception("Failed to fetch WG21 index.") return [] @@ - except requests.RequestException: - logger.error("Failed to fetch year page %s.", year) + except requests.RequestException: + logger.exception("Failed to fetch year page %s.", year) return []Based on learnings, returning
[]here is intentional, so preserving the exception context in logs is especially important.Also applies to: 65-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/fetcher.py` around lines 29 - 34, The except blocks that catch requests.RequestException in fetcher.py (the try around requests.get(f"{BASE_URL}/", timeout=30) and the similar block later) swallow the exception details; update both handlers to capture the exception (e.g. except requests.RequestException as e) and log the real error context via logger.error including the exception (either via exc_info=True or by interpolating e into the message) while preserving the existing behavior of returning [] so the graceful fallback remains.wg21_paper_tracker/cloud_run_job/main.py (1)
67-71: Consider more informative exit on missing environment variable.Returning silently after logging an error may cause the Cloud Run job to appear successful (exit code 0). Consider raising an exception or using
sys.exit(1)to signal failure.♻️ Proposed fix
+import sys + def main(): bucket_name = os.getenv("WG21_GCS_BUCKET") if not bucket_name: logger.error("WG21_GCS_BUCKET env var not set.") - return + sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 67 - 71, The code in main() logs when WG21_GCS_BUCKET is unset but then returns (exit code 0); change this to signal failure by either raising an exception or calling sys.exit(1). Specifically, update the bucket_name check in main() (the logger.error call) so after logging it calls sys.exit(1) or raises a RuntimeError with a clear message about WG21_GCS_BUCKET being missing, and ensure you import sys if you choose sys.exit.wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)
60-68: Remove commented-out code or restore try/except.The commented-out exception handling around
parse_dateshould either be removed if unnecessary, or restored ifparse_datecan raise exceptions that need handling.♻️ Clean up
def _parse_document_date(date_str: str): """Return date or None from CSV date column (e.g. YYYY-MM-DD). Invalid values return None.""" cleaned = _norm(date_str) if not cleaned: return None - # try: return parse_date(cleaned) - # except (ValueError, TypeError): - # return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 60 - 68, The _parse_document_date function currently contains commented-out try/except around parse_date; either remove the dead commented code or reintroduce error handling: decide whether parse_date can raise ValueError/TypeError in your environment and if so, wrap the call to parse_date(cleaned) in a try/except that returns None on ValueError or TypeError; otherwise delete the commented-out except block and its comments so _parse_document_date contains only the normalized check and the direct return of parse_date(cleaned).
164-171: Uselogger.exceptionfor better debugging in error handlers.Using
logger.exceptioninstead oflogger.errorwill include the full traceback, which is valuable for debugging batch import failures.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error( + logger.exception( "Error parsing document date for paper_id=%s: %s", paper_id, e, ) continueApply similar changes at lines 211, 233-237, and 240.
Also applies to: 231-240
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 164 - 171, Replace calls to logger.error inside the except Exception as e handlers with logger.exception so the traceback is logged; specifically update the block that currently does stats["skipped"] += 1 followed by logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e) (and the other similar except blocks that increment stats["skipped"]) to call logger.exception("Error parsing document date for paper_id=%s", paper_id) (or an analogous message) and keep the stats increment and continue; apply the same change to the other except handlers currently using logger.error in this file (the occurrences around the other try/except blocks that increment stats["skipped"]).wg21_paper_tracker/services.py (1)
42-46: Consider extracting year normalization to a helper.The year parsing logic is duplicated between
get_or_create_paper(lines 42-46) andmark_paper_downloaded(lines 112-116) with slightly different truthiness checks (if year:vsif year is not None:). While both currently work correctly, extracting this to a shared helper would improve maintainability.♻️ Suggested helper
def _normalize_year(year: int | str | None) -> int: """Normalize year input to int, defaulting to 0.""" if year is None: return 0 s = (year if isinstance(year, str) else str(year)).strip()[:4] return int(s) if s.isdigit() else 0Also applies to: 112-116
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/services.py` around lines 42 - 46, Extract the duplicated year-parsing logic into a shared helper (e.g., _normalize_year) and use it from both get_or_create_paper and mark_paper_downloaded; implement _normalize_year(year: int | str | None) -> int to return 0 for None/invalid input, trim strings to first 4 chars, and return int(s) if s.isdigit() else 0, then replace the inline blocks in get_or_create_paper and mark_paper_downloaded with year_val = _normalize_year(year).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alas.py`:
- Around line 13-18: Migration 0006 created a field named author_alas (likely a
typo) on model wg21paperauthorprofile; to fix, rename the model field in
models.py from author_alas to author_alias, then delete this migration file
(0006_wg21paperauthorprofile_author_alas.py) and regenerate migrations by
running python manage.py makemigrations so Django produces a migration that adds
author_alias (or a proper RenameField migration if preserving data), and commit
the regenerated migration.
In `@cppa_user_tracker/models.py`:
- Line 168: Rename the typo'd model field author_alas to author_alias in
cppa_user_tracker.models (update the Field declaration in the model class where
author_alas is defined), then regenerate the migration so the DB schema matches:
delete or replace the incorrect migration
0006_wg21paperauthorprofile_author_alas.py and run makemigrations/migrate to
create a migration that reflects the rename (or use makemigrations --name ... to
produce a rename operation), and update any references in code/tests that used
author_alas to use author_alias.
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 240-246: The except block in openai_converter.py currently appends
raw exception text into markdown_parts (using page_num and str(e)), which may
leak internal details; keep the detailed error in logs (logger.error/exception
with exc_info=True) but replace the persisted content with a generic placeholder
such as "Error processing this page" (no exception text) when appending to
markdown_parts so only non-sensitive, user-facing text is stored.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 38-44: The logic in the loop over error_patterns incorrectly calls
pattern.startswith("error:")/("exception:"), which tests the pattern string
rather than whether the content (first_part) begins with those markers; update
the condition in the for-loop that inspects first_part so you check
first_part.startswith("error:") or first_part.startswith("exception:") (or
equivalently test if idx == 0 after finding pattern in first_part) instead of
pattern.startswith(...); alternatively, if the intent was to always reject the
literal patterns regardless of position, test if pattern in {"error:",
"exception:"} and return False unconditionally—modify the block using variables
error_patterns, first_part, pattern accordingly.
In `@wg21_paper_tracker/cloud_run_job/requirements.txt`:
- Line 4: The requirements file pins Pillow too low (Pillow>=10.3.0) which is
vulnerable; update the Pillow spec in
wg21_paper_tracker/cloud_run_job/requirements.txt to "Pillow>=12.1.1" (or later)
to address GHSA-cfh3-3jmp-rvhc, then rebuild your environment/container and
re-run tests to ensure compatibility with the updated Pillow release.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 204-220: The IntegrityError handler currently recovers by fetching
WG21Paper.objects.filter(paper_id=paper_id).first(), which can pick the wrong
row when multiple years exist; change the lookup to include year (e.g.,
WG21Paper.objects.filter(paper_id=paper_id, year=year).first()) and handle the
case where year is None by using an appropriate disambiguation (such as skipping
update, logging the ambiguity, or selecting the most relevant year) so you
update the correct row in the except block around IntegrityError and adjust
stats/logging (stats["skipped"] and logger.error) accordingly; update the code
paths that set paper.url/title/document_date/mailing/subgroup/year and call
paper.save() only after a correct, unambiguous paper is found.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 64-83: The Cloud Run trigger and the "Downloaded and uploaded" log
must be gated when GCS uploads are disabled: update the logic around
run_tracker_pipeline()/total_new_papers so you only call
trigger_cloud_run_job(project_id, location, job_name) when uploads actually
occurred (e.g., check settings.WG21_GCS_BUCKET is set or, preferably, change
run_tracker_pipeline to return an explicit uploaded_papers count/flag and use
that). Also update the logger.info message to accurately reflect whether papers
were uploaded vs. only downloaded (use uploaded_papers or the WG21_GCS_BUCKET
check to choose the message), and keep using trigger_cloud_run_job and logger.*
as the referenced symbols to locate where to change behavior.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 158-160: Normalize the parsed year once when extracting the
mailing info and reuse that single normalized value for deduplication and
persistence; replace the current ad-hoc conversion (where a missing year becomes
0 in some places and None in others) by computing year_normalized =
int(m_info["year"]) if m_info["year"] else None (or equivalent) where
mailing_date/title/year are extracted, then use year_normalized everywhere
downstream (including the dedupe key construction and the persistence call) and
remove any places that treat missing years as 0 so the same canonical key is
used consistently (also update the other occurrences referenced around the other
similar blocks).
- Around line 219-225: The current branch in pipeline.py sets uploaded=True when
bucket_name is missing, causing is_downloaded to be marked true even though GCS
upload never happened; instead, keep uploaded=False (or introduce a separate
local_only flag) so records requiring GCS remain eligible for backfill.
Concretely, in the block using bucket_name and _upload_to_gcs, do not simulate
success — set uploaded based on an actual upload result (or set a distinct
local_only state) and ensure the code that flips is_downloaded only does so when
uploaded is true; apply the same fix to the similar branch around the second
occurrence (the block that mirrors lines 254-257).
In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 58-76: Update the test_wg21_paper_unique_together_paper_id_year
function to also assert that inserting a WG21Paper with the same paper_id and
the same year fails: after creating the first 2024 WG21Paper (using WG21Mailing
m1), attempt a second WG21Paper with paper_id="sd-1" and year=2024 wrapped in
pytest.raises(IntegrityError) to confirm the (paper_id, year) uniqueness
constraint, while keeping the existing 2025 insert (p2) as the successful
positive case; use WG21Mailing (m2) as needed for the positive 2025 row.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 153-165: The test and service are forcing unknown years to the
sentinel 0 which collapses distinct imports; instead preserve NULL for unknown
years: update get_or_create_paper to stop converting year=None to 0 and pass
None through so the DB stores NULL (ensure the Paper model/column is nullable),
and adjust the test test_get_or_create_paper_year_none_stored_as_zero to expect
paper.year is None (or alternatively enforce a validation at the service
boundary to raise when year is missing if you prefer rejecting None). Refer to
get_or_create_paper and the failing test name when making the change.
- Around line 76-103: The test currently sets mock_profile.return_value to a
single MagicMock so both authors appear to get the same profile; change to
return distinct profile tuples using mock_profile.side_effect = [(profile1,
True), (profile2, True)] where profile1.pk = 1 and profile2.pk = 2, and then
update the mock_get_or_create_paper_author assertions to expect calls with paper
and profile1 for position 1 and paper and profile2 for position 2 (keep
mock_get_or_create_paper_author return as before). This ensures
get_or_create_paper actually links each author to a distinct profile; reference
the test function test_get_or_create_paper_calls_author_profile_for_each_author,
mock_profile, mock_get_or_create_paper_author, get_or_create_paper, and
get_or_create_mailing when making the edits.
---
Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 67-71: The code in main() logs when WG21_GCS_BUCKET is unset but
then returns (exit code 0); change this to signal failure by either raising an
exception or calling sys.exit(1). Specifically, update the bucket_name check in
main() (the logger.error call) so after logging it calls sys.exit(1) or raises a
RuntimeError with a clear message about WG21_GCS_BUCKET being missing, and
ensure you import sys if you choose sys.exit.
In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The except blocks that catch requests.RequestException in
fetcher.py (the try around requests.get(f"{BASE_URL}/", timeout=30) and the
similar block later) swallow the exception details; update both handlers to
capture the exception (e.g. except requests.RequestException as e) and log the
real error context via logger.error including the exception (either via
exc_info=True or by interpolating e into the message) while preserving the
existing behavior of returning [] so the graceful fallback remains.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 60-68: The _parse_document_date function currently contains
commented-out try/except around parse_date; either remove the dead commented
code or reintroduce error handling: decide whether parse_date can raise
ValueError/TypeError in your environment and if so, wrap the call to
parse_date(cleaned) in a try/except that returns None on ValueError or
TypeError; otherwise delete the commented-out except block and its comments so
_parse_document_date contains only the normalized check and the direct return of
parse_date(cleaned).
- Around line 164-171: Replace calls to logger.error inside the except Exception
as e handlers with logger.exception so the traceback is logged; specifically
update the block that currently does stats["skipped"] += 1 followed by
logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
(and the other similar except blocks that increment stats["skipped"]) to call
logger.exception("Error parsing document date for paper_id=%s", paper_id) (or an
analogous message) and keep the stats increment and continue; apply the same
change to the other except handlers currently using logger.error in this file
(the occurrences around the other try/except blocks that increment
stats["skipped"]).
In `@wg21_paper_tracker/services.py`:
- Around line 42-46: Extract the duplicated year-parsing logic into a shared
helper (e.g., _normalize_year) and use it from both get_or_create_paper and
mark_paper_downloaded; implement _normalize_year(year: int | str | None) -> int
to return 0 for None/invalid input, trim strings to first 4 chars, and return
int(s) if s.isdigit() else 0, then replace the inline blocks in
get_or_create_paper and mark_paper_downloaded with year_val =
_normalize_year(year).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: eeb3955d-3988-4c0b-af91-76b020089677
📒 Files selected for processing (38)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alas.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydocs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (3)
wg21_paper_tracker/tests/test_pipeline.py (2)
50-62: Consider usingtmp_pathfixture for the timeout test.Line 53 uses a hardcoded
/tmp/outpath. While the mock prevents actual file writes, usingtmp_pathwould be more consistent with other tests and avoid the static analysis warning.♻️ Suggested fix
-def test_download_file_uses_timeout(): +def test_download_file_uses_timeout(tmp_path): """_download_file calls requests.get with DOWNLOAD_TIMEOUT.""" url = "https://example.com/f" - filepath = Path("/tmp/out") + filepath = tmp_path / "out" resp = MagicMock()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 50 - 62, The test test_download_file_uses_timeout should use the pytest tmp_path fixture instead of a hardcoded Path("/tmp/out"); modify the test signature to accept tmp_path and replace filepath = Path("/tmp/out") with filepath = tmp_path / "out" (or similar filename), keep the rest of the test unchanged so it still patches wg21_paper_tracker.pipeline.requests.get and asserts timeout == DOWNLOAD_TIMEOUT when calling _download_file.
147-166: Deeply nested context managers reduce readability.Consider using
unittest.mock.patch.multipleorcontextlib.ExitStackto flatten the nesting, or usepytest-mock'smockerfixture for cleaner syntax.♻️ Example using ExitStack
+from contextlib import ExitStack + `@pytest.mark.django_db` def test_run_tracker_pipeline_downloads_new_papers(tmp_path): """run_tracker_pipeline downloads papers for new mailings and returns count.""" from wg21_paper_tracker.models import WG21Mailing WG21Mailing.objects.create(mailing_date="2025-01", title="Previous") mailings = [...] papers = [...] - with patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings): - with patch( - "wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers - ): - with patch( - "wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path - ): - # ... more nesting - n = run_tracker_pipeline() + with ExitStack() as stack: + stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path)) + stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket")) + stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True)) + n = run_tracker_pipeline() assert n == 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 147 - 166, The test uses deeply nested patch context managers around run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs); flatten these by replacing the nested with either unittest.mock.patch.multiple or contextlib.ExitStack (or pytest-mock's mocker) to apply all patches in one block: locate the test function in tests/test_pipeline.py and replace the nested with a single patch.multiple call (or an ExitStack that enters each patch) that sets return_value/side_effect for fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, _upload_to_gcs and sets WG21_GCS_BUCKET to "test-bucket", then call run_tracker_pipeline() and assert n == 1.wg21_paper_tracker/models.py (1)
18-18: Prefer tuples forMetaconstants.These values are static config, so tuples fit better than lists here and clear the current Ruff
RUF012warnings without changing behavior.Also applies to: 45-46, 74-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/models.py` at line 18, The Meta class uses list literals for static ordering values (e.g., ordering = ["-mailing_date"]) which triggers RUF012; replace these list literals with equivalent tuples (ordering = ("-mailing_date",)) across all Meta definitions in this file (also update the other occurrences that currently use lists at the other Meta blocks referenced) so the behavior is unchanged but the constants are tuples.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/settings.py`:
- Around line 223-225: The WG21_CLOUD_RUN_JOB_NAME default makes the job name
always truthy and causes run_wg21_paper_tracker to try invoking Cloud Run even
when no job is configured; change settings so WG21_CLOUD_RUN_JOB_NAME defaults
to an empty string (no implicit "wg21-convert") and add/use an explicit boolean
flag (e.g., WG21_CLOUD_RUN_ENABLED from env) that run_wg21_paper_tracker checks
along with GCP_PROJECT_ID and WG21_GCS_BUCKET before attempting Cloud Run;
update the code paths that reference WG21_CLOUD_RUN_JOB_NAME and
run_wg21_paper_tracker to gate execution on the new WG21_CLOUD_RUN_ENABLED flag
(and leave existing GCP_PROJECT_ID/WG21_GCS_BUCKET checks intact).
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 230-261: The current logic treats any single successful page as a
successful conversion and returns partial Markdown; change the success check so
only fully converted documents are considered successful: replace the condition
that checks "if successful_pages > 0 and markdown_content.strip()" with a
stricter check that successful_pages equals total_pages (e.g., "if
successful_pages == total_pages and markdown_content.strip()") and when pages
are missing log a clear warning including successful_pages/total_pages and
return None so partial results are not persisted; update the log messages around
pdf_path.name and the extracted character count to run only on full success.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 81-110: The loop over blobs currently allows any exception from
blob.exists(), blob.download_to_filename(), convert_pdf_to_md, or
md_blob.upload_from_string() to abort the whole job; wrap the per-blob
processing (from checking md_blob.exists() through upload and file deletion) in
a try/except block that catches Exception, logs the blob name and the exception
via logger.error("... %s", blob.name, exc_info=True) and then continues to the
next blob; ensure local_pdf_path.unlink() is executed in a finally block (or
check and remove after the except) so the temp file is always cleaned up even on
errors; keep calls to convert_pdf_to_md, md_blob.exists(),
download_to_filename(), and upload_from_string() inside that try so a single bad
file is skipped without aborting the job.
In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The exception handlers in fetcher.py that catch
requests.RequestException (around the requests.get/response.raise_for_status
block) and the similar handler later (lines ~65-70) currently call logger.error
and drop the original exception details; replace those logger.error(...) calls
with logger.exception(...) so the full traceback and HTTP/timeout details are
preserved when handling failures from requests.get and related code paths.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 75-81: The current except block catches exceptions from
trigger_cloud_run_job(project_id, location, job_name) and only logs them, which
swallows failures; change the handler in run_wg21_paper_tracker.py so that after
logging the failure (logger.error(...)) it re-raises the exception (raise) so
the scheduler marks the run failed and retries/alerts instead of silently
succeeding; locate the try/except around trigger_cloud_run_job and add a
re-raise after the logger.error call.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 205-224: The format priority ordering is inverted: format_priority
currently assigns lower numbers to less-desired formats so
p_list.sort(key=lambda x: format_priority(x["type"])) picks ".adoc" over ".pdf";
fix by making the sort choose the preferred format (PDF) first — either invert
the priorities mapping in format_priority (give "pdf" the lowest value and
"adoc" a higher value) or keep the mapping but change the sort to
p_list.sort(key=..., reverse=True) so best_paper = p_list[0] yields the PDF;
update format_priority and/or the p_list.sort(...) call accordingly.
In `@wg21_paper_tracker/services.py`:
- Around line 42-57: The current code uses year_val in the get_or_create key so
placeholder rows with year=0 block later updates; change to create/get by
paper_id only (call WG21Paper.objects.get_or_create(paper_id=paper_id,
defaults={...}) without year) and then, after obtaining the paper instance, if a
valid numeric year (year_val > 0) is present and paper.year differs, set
paper.year = year_val and save() to promote the placeholder; also update the
existing year-update branch that references year transitions (the branch using
year_val and the code that tries to update year later) to rely on this
single-instance update flow.
- Around line 81-86: The current author-sync loop only upserts current authors
and never removes stale WG21PaperAuthor rows; modify the author-sync to (1)
build the desired list by pairing author_names with author_emails (emails =
author_emails or []), (2) for each pair call
get_or_create_wg21_paper_author_profile(name, email=...) and collect the profile
IDs and desired positions, (3) query existing WG21PaperAuthor rows for the paper
and delete any whose profile ID is not in the desired set (this also handles
empty author_names by deleting all), and (4) ensure you then create/upsert
remaining desired WG21PaperAuthor entries with correct positions via
get_or_create_paper_author(paper, profile, position). Use the existing helpers
get_or_create_wg21_paper_author_profile and get_or_create_paper_author and
ensure ordering is preserved.
---
Nitpick comments:
In `@wg21_paper_tracker/models.py`:
- Line 18: The Meta class uses list literals for static ordering values (e.g.,
ordering = ["-mailing_date"]) which triggers RUF012; replace these list literals
with equivalent tuples (ordering = ("-mailing_date",)) across all Meta
definitions in this file (also update the other occurrences that currently use
lists at the other Meta blocks referenced) so the behavior is unchanged but the
constants are tuples.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 50-62: The test test_download_file_uses_timeout should use the
pytest tmp_path fixture instead of a hardcoded Path("/tmp/out"); modify the test
signature to accept tmp_path and replace filepath = Path("/tmp/out") with
filepath = tmp_path / "out" (or similar filename), keep the rest of the test
unchanged so it still patches wg21_paper_tracker.pipeline.requests.get and
asserts timeout == DOWNLOAD_TIMEOUT when calling _download_file.
- Around line 147-166: The test uses deeply nested patch context managers around
run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs);
flatten these by replacing the nested with either unittest.mock.patch.multiple
or contextlib.ExitStack (or pytest-mock's mocker) to apply all patches in one
block: locate the test function in tests/test_pipeline.py and replace the nested
with a single patch.multiple call (or an ExitStack that enters each patch) that
sets return_value/side_effect for fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, _upload_to_gcs and sets WG21_GCS_BUCKET to
"test-bucket", then call run_tracker_pipeline() and assert n == 1.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e5209104-7b70-42e6-b1cc-ce64b45283f1
📒 Files selected for processing (38)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydocs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
…r=0 promotion, logging cppalliance#24
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 7
♻️ Duplicate comments (2)
cppa_user_tracker/services.py (1)
383-391:⚠️ Potential issue | 🟠 MajorDon’t collapse same-name authors when the email doesn’t match.
When
email_valis present but no candidate has that email, this falls back to an arbitrary existing profile. That merges distinct WG21 authors who share a display name and also loses the newly discovered email. If you reuse a single candidate, attach the email; if multiple candidates exist and none matches, create a new profile instead.💡 Minimal fix
if len(candidates) == 1: - return candidates[0], False + profile = candidates[0] + if email_val and not profile.emails.filter(email=email_val).exists(): + add_email( + profile, + email_val, + is_primary=not profile.emails.filter(is_active=True).exists(), + ) + return profile, False # Two or more: disambiguate by email if provided if email_val: for p in candidates: if p.emails.filter(email=email_val).exists(): return p, False + profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val) + add_email(profile, email_val, is_primary=True) + return profile, True return candidates[0], False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/services.py` around lines 383 - 391, The current logic in the profile-disambiguation block returns an arbitrary existing candidate when email_val is provided but no candidate has that email; instead, if email_val is present and exactly one candidate exists reuse that candidate and attach the new email (use the profile object referenced as p/candidates[0] and call the model method used elsewhere to add/save emails), but if multiple candidates exist and none match email_val, do not collapse them—create and return a new profile populated with the discovered email; update the branch around the candidates/email_val check (the code that inspects candidates, email_val, and uses p.emails.filter(...).exists()) to implement: 1) if len(candidates)==1 and email_val add and persist the email then return that profile, 2) if email_val and no candidate matches and len(candidates)>1 create a new profile with the email and return it, otherwise preserve the existing fallback behavior.wg21_paper_tracker/pipeline.py (1)
205-208:⚠️ Potential issue | 🟠 MajorRestore the intended source-first format ordering.
With
min(...), this mapping now makes PDF win again. That changes which artifact is persisted for any paper that has multiple formats and looks like a regression to the previously discussed policy.♻️ Proposed fix
def format_priority(ext: str) -> int: - # Prefer PDF (Cloud Run converts PDFs); then html, adoc, ps - priorities = {"pdf": 1, "html": 2, "adoc": 3, "ps": 4} + # Prefer source/markup formats first: adoc, html, ps, then pdf + priorities = {"adoc": 1, "html": 2, "ps": 3, "pdf": 4} return priorities.get(ext.lower(), 100)Based on learnings, in
wg21_paper_tracker/pipeline.py,format_priority()is intentionally source-first:adoc>html>ps>Also applies to: 223-224
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/pipeline.py` around lines 205 - 208, format_priority() currently ranks PDF highest (pdf:1) which, combined with the min(...) selection, makes PDFs win; restore the intended source-first ordering by changing the priorities mapping in format_priority (and the duplicate mapping at the other occurrence) to rank source formats first: set "adoc" highest, then "html", then "ps", then "pdf" (e.g. adoc:1, html:2, ps:3, pdf:4 with the same default), so the existing selection logic continues to pick source formats before PDF.
🧹 Nitpick comments (4)
docs/service_api/cppa_user_tracker.md (1)
46-48: Document what “the first” means.The implementation is deterministic here: it orders matches by
idand returns the lowest-id profile when email does not disambiguate. Calling that out would make the tie-breaker contract clearer for API consumers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/service_api/cppa_user_tracker.md` around lines 46 - 48, The docs for get_or_create_wg21_paper_author_profile lack a precise definition of "the first" when multiple profiles match; update the description to state the deterministic tie-breaker: when multiple profiles match and email does not disambiguate, the function orders matching WG21PaperAuthorProfile rows by id ascending and returns the lowest-id profile (i.e., the minimum id), so callers can rely on that deterministic behavior. Ensure the wording references get_or_create_wg21_paper_author_profile and WG21PaperAuthorProfile so the tie-breaker is explicit.workflow/management/commands/run_all_collectors.py (1)
23-23: Consider gating the WG21 collector in the aggregate runner.
run_wg21_paper_trackerimmediately runs the WG21 fetch pipeline, so adding it unconditionally here makes everyrun_all_collectorsrun depend on WG21 availability, and on GCP as well when Cloud Run triggering is enabled. A setting or--include-wg21flag would keep unrelated batch runs from failing in environments that are not ready for this collector.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@workflow/management/commands/run_all_collectors.py` at line 23, The aggregate runner currently always includes "run_wg21_paper_tracker", making runs depend on WG21; change this by gating its inclusion behind a configuration or CLI flag: add a Django setting (e.g., RUN_WG21_COLLECTOR default False) and/or a management command option (e.g., add_argument("--include-wg21", action="store_true")) in the Command class for run_all_collectors.py, then only append or call run_wg21_paper_tracker when the flag or setting is True; also update the command help text to document the new flag and ensure default behavior excludes the WG21 collector so unrelated runs don’t fail.cppa_user_tracker/tests/test_services.py (1)
578-648: Add one normalization regression for trimmed inputs.The implementation strips both
display_nameand🧪 Possible test addition
+@pytest.mark.django_db +def test_get_or_create_wg21_paper_author_profile_strips_inputs(): + profile, created = services.get_or_create_wg21_paper_author_profile( + display_name=" Trimmed Author ", + email=" author@example.com ", + ) + assert created is True + assert profile.display_name == "Trimmed Author" + assert profile.emails.filter(email="author@example.com").exists()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/tests/test_services.py` around lines 578 - 648, Add a test that ensures get_or_create_wg21_paper_author_profile normalizes trimmed inputs: create a WG21PaperAuthorProfile (e.g., first) and/or add an email via services.add_email(second, "match@example.com"), then call services.get_or_create_wg21_paper_author_profile with display_name and/or email containing leading/trailing whitespace (e.g., " Same Name ", " match@example.com ") and assert created is False and the returned profile.id matches the existing profile (and that profile.emails.filter(email="match@example.com").exists() when testing email matching); this covers the trimming/normalization path for get_or_create_wg21_paper_author_profile and prevents duplicate profiles.wg21_paper_tracker/tests/test_services.py (1)
221-253: Assert that the non-matching year stays untouched.These tests prove the target
(paper_id, year)flips to downloaded, but they would still pass ifmark_paper_downloaded()accidentally updated every year variant for the samepaper_id. Seed a secondWG21Paperwith the same ID and a different year, then assert it remainsFalseafter the call.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` around lines 221 - 253, Add a test case that ensures mark_paper_downloaded only updates the specific (paper_id, year) pair by creating a second WG21Paper via get_or_create_paper with the same paper_id ("p1000r0") but a different year (e.g., 2024), call mark_paper_downloaded("p1000r0", year=2025), refresh both paper instances with refresh_from_db, and assert the paper for 2025 has is_downloaded True while the paper for the other year remains False; reference the existing test functions and the mark_paper_downloaded and get_or_create_paper helpers to locate where to add this assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py`:
- Around line 13-18: The migration currently adds
WG21PaperAuthorProfile.author_alias with default="" and preserve_default=True
which leaves a permanent default in the migration state; change the AddField
call for model_name="wg21paperauthorprofile" and name="author_alias" to use
preserve_default=False so the "" default is only used as a one-time backfill for
existing rows and does not persist as the model's default.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The list_blobs prefix is wrong: raw_prefix is set to
"raw/wg21_paper_tracker/" but the uploader writes to
"raw/wg21_papers/{mailing_date}/{filename}"; update raw_prefix to
"raw/wg21_papers/" so client.list_blobs(bucket, prefix=raw_prefix) will discover
newly uploaded PDFs (leave converted_prefix unchanged), then re-run the job to
verify the converter sees files.
- Around line 69-72: The code currently logs and returns when the
WG21_GCS_BUCKET env var is missing (bucket_name), causing the Cloud Run job to
exit successfully; change this to terminate with a non-zero exit so the
scheduler detects failure: after checking os.getenv("WG21_GCS_BUCKET") and
logging via logger.error, call sys.exit(1) or raise a RuntimeError with a clear
message (e.g., include "WG21_GCS_BUCKET not set") instead of returning; update
the block that references bucket_name in main (or the module entrypoint) to
perform this non-zero exit so the job fails fast and surfaces the
misconfiguration.
In `@wg21_paper_tracker/fetcher.py`:
- Around line 95-106: The scraper currently accepts absolute hrefs directly and
sets paper_url (in the loop over first_cell.find_all("a", href=True)) which
allows off-origin links; change this to always canonicalize the href using
urllib.parse.urljoin(base=url or BASE_URL, href) to produce a resolved_url, then
parse it with urllib.parse.urlparse and only accept it if parsed.scheme is
"http" or "https" and parsed.netloc equals "www.open-std.org" before assigning
to paper_url or appending to the download queue; otherwise skip the link. Ensure
you reference the existing variables paper_pattern, href, paper_url, BASE_URL,
and url (the page base) when implementing the check.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 120-122: Replace the silent failure in
import_wg21_metadata_from_csv.py where the code currently checks
csv_path.exists() and calls logger.error + return: raise
django.core.management.base.CommandError with a clear message (e.g. "CSV file
not found: <path>") instead of returning; update the command entry-point (the
class derived from BaseCommand / its handle() method that references csv_path)
to propagate the exception. Also add a unit test that calls
django.core.management.call_command('import_wg21_metadata_from_csv', csv_path)
and asserts that CommandError is raised when the file is missing. Use the
csv_path variable and the command class/handle() name to locate the change.
In `@wg21_paper_tracker/workspace.py`:
- Around line 24-25: The helper currently forces raw storage under
get_workspace_path(_RAW_APP_SLUG) which ignores the overrideable
settings.RAW_DIR; change the computation of raw_root in workspace.py so it
prefers settings.RAW_DIR (e.g., if settings.RAW_DIR is set/non-empty use
Path(settings.RAW_DIR) as raw_root) and only falls back to
get_workspace_path(_RAW_APP_SLUG) otherwise, then compute path = raw_root /
mailing_date; reference symbols: settings.RAW_DIR, get_workspace_path,
_RAW_APP_SLUG, raw_root, path.
---
Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 383-391: The current logic in the profile-disambiguation block
returns an arbitrary existing candidate when email_val is provided but no
candidate has that email; instead, if email_val is present and exactly one
candidate exists reuse that candidate and attach the new email (use the profile
object referenced as p/candidates[0] and call the model method used elsewhere to
add/save emails), but if multiple candidates exist and none match email_val, do
not collapse them—create and return a new profile populated with the discovered
email; update the branch around the candidates/email_val check (the code that
inspects candidates, email_val, and uses p.emails.filter(...).exists()) to
implement: 1) if len(candidates)==1 and email_val add and persist the email then
return that profile, 2) if email_val and no candidate matches and
len(candidates)>1 create a new profile with the email and return it, otherwise
preserve the existing fallback behavior.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 205-208: format_priority() currently ranks PDF highest (pdf:1)
which, combined with the min(...) selection, makes PDFs win; restore the
intended source-first ordering by changing the priorities mapping in
format_priority (and the duplicate mapping at the other occurrence) to rank
source formats first: set "adoc" highest, then "html", then "ps", then "pdf"
(e.g. adoc:1, html:2, ps:3, pdf:4 with the same default), so the existing
selection logic continues to pick source formats before PDF.
---
Nitpick comments:
In `@cppa_user_tracker/tests/test_services.py`:
- Around line 578-648: Add a test that ensures
get_or_create_wg21_paper_author_profile normalizes trimmed inputs: create a
WG21PaperAuthorProfile (e.g., first) and/or add an email via
services.add_email(second, "match@example.com"), then call
services.get_or_create_wg21_paper_author_profile with display_name and/or email
containing leading/trailing whitespace (e.g., " Same Name ", "
match@example.com ") and assert created is False and the returned profile.id
matches the existing profile (and that
profile.emails.filter(email="match@example.com").exists() when testing email
matching); this covers the trimming/normalization path for
get_or_create_wg21_paper_author_profile and prevents duplicate profiles.
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile lack a
precise definition of "the first" when multiple profiles match; update the
description to state the deterministic tie-breaker: when multiple profiles match
and email does not disambiguate, the function orders matching
WG21PaperAuthorProfile rows by id ascending and returns the lowest-id profile
(i.e., the minimum id), so callers can rely on that deterministic behavior.
Ensure the wording references get_or_create_wg21_paper_author_profile and
WG21PaperAuthorProfile so the tie-breaker is explicit.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 221-253: Add a test case that ensures mark_paper_downloaded only
updates the specific (paper_id, year) pair by creating a second WG21Paper via
get_or_create_paper with the same paper_id ("p1000r0") but a different year
(e.g., 2024), call mark_paper_downloaded("p1000r0", year=2025), refresh both
paper instances with refresh_from_db, and assert the paper for 2025 has
is_downloaded True while the paper for the other year remains False; reference
the existing test functions and the mark_paper_downloaded and
get_or_create_paper helpers to locate where to add this assertion.
In `@workflow/management/commands/run_all_collectors.py`:
- Line 23: The aggregate runner currently always includes
"run_wg21_paper_tracker", making runs depend on WG21; change this by gating its
inclusion behind a configuration or CLI flag: add a Django setting (e.g.,
RUN_WG21_COLLECTOR default False) and/or a management command option (e.g.,
add_argument("--include-wg21", action="store_true")) in the Command class for
run_all_collectors.py, then only append or call run_wg21_paper_tracker when the
flag or setting is True; also update the command help text to document the new
flag and ensure default behavior excludes the WG21 collector so unrelated runs
don’t fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d5221e33-cecc-4089-99f2-28b0f07fc98c
📒 Files selected for processing (38)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydocs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
… use raw/wg21_paper_tracker/YYYY/<mailing_date>/ cppalliance#24
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 8
♻️ Duplicate comments (1)
wg21_paper_tracker/cloud_run_job/main.py (1)
77-80:⚠️ Potential issue | 🔴 CriticalFix the raw GCS prefix before this ships.
The uploader stores PDFs under
raw/wg21_papers/..., but this job is listingraw/wg21_paper_tracker/. With the current prefix, the converter never sees newly uploaded papers, so the end-to-end WG21 flow is effectively broken.🐛 Minimal fix
- raw_prefix = "raw/wg21_paper_tracker/" + raw_prefix = "raw/wg21_papers/"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The GCS listing prefix is incorrect: update the raw_prefix variable currently set to "raw/wg21_paper_tracker/" to the actual uploader path "raw/wg21_papers/" so that blobs = client.list_blobs(bucket, prefix=raw_prefix) will find newly uploaded PDFs; leave converted_prefix as-is and verify the uploader writes to "raw/wg21_papers/" if needed.
🧹 Nitpick comments (6)
wg21_paper_tracker/tests/test_workspace.py (1)
43-86: Add a regression test for themailing_date=Nonepath.
get_raw_dir()acceptsstr | None, but the suite only covers valid strings and invalid inputs. A positive case forget_raw_dir(None, 2025)would lock in the year-only behavior and catch the currentwg21_paper_tracker/workspace.pyLine 24fullmatch(None)crash.🧪 Suggested test
+def test_get_raw_dir_returns_year_dir_when_mailing_date_is_none(mock_workspace_path): + with patch("wg21_paper_tracker.workspace.settings") as mock_settings: + mock_settings.RAW_DIR = mock_workspace_path + path = get_raw_dir(None, 2025) + + expected = mock_workspace_path / "wg21_paper_tracker" / "2025" + assert path == expected + assert path.is_dir() +🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_workspace.py` around lines 43 - 86, Add a regression test that ensures get_raw_dir accepts mailing_date=None and returns the year-only directory instead of crashing: create a new test (e.g., test_get_raw_dir_accepts_none_mailing_date) that patches wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, calls get_raw_dir(None, 2025), and asserts the returned Path == mock_workspace_path / "wg21_paper_tracker" / "2025" (and that it exists/is_dir and path.name == "2025"); also update get_raw_dir in wg21_paper_tracker.workspace to guard the re.fullmatch check (avoid calling fullmatch on None) by treating None as the "no mailing_date" case that yields the year-only directory.wg21_paper_tracker/tests/test_models.py (1)
35-46: Assert the retained prefix, not just the length.This only proves the rendered title is at most 60 characters.
WG21Paper.__str__()could return the wrong slice—or even an empty title—and this test would still pass.🧪 Tighten the assertion
- assert len(str(p).split(": ", 1)[-1]) <= 60 + rendered = str(p) + assert rendered.startswith("p1: ") + assert rendered.split(": ", 1)[-1] == long_title[:60]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_models.py` around lines 35 - 46, Update the test to assert the actual retained prefix of the title, not just its length: after computing rendered_title = str(p).split(": ", 1)[-1], assert that rendered_title == long_title[:60] (or at minimum startswith long_title[:60]) so WG21Paper.__str__ is returning the correct leading slice of the original title; reference the test function test_wg21_paper_str_truncates_long_title and the WG21Paper.__str__ output extraction (str(p).split(": ", 1)[-1]) when making the change.wg21_paper_tracker/tests/test_fetcher.py (1)
91-206: Add a regression test for off-origin paper links.
fetch_papers_for_mailing()now skips resolved URLs outsidewww.open-std.org, but none of these tests hit that branch. A small fixture here would keep the ingest/SSRF guard from disappearing in a future refactor.🛡️ Suggested test
+def test_fetch_papers_for_mailing_skips_off_origin_links(): + """fetch_papers_for_mailing skips papers hosted outside open-std.org.""" + html = """ + <html><body> + <span id="mailing2025-01"></span> + <table> + <tr><td><a href="https://example.com/p1000r0.pdf">p1000r0.pdf</a></td></tr> + </table> + </body></html> + """ + with patch("wg21_paper_tracker.fetcher.requests.get") as m: + resp = MagicMock() + resp.text = html + resp.raise_for_status = MagicMock() + m.return_value = resp + result = fetch_papers_for_mailing("2025", "2025-01") + assert result == []🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_fetcher.py` around lines 91 - 206, Add a regression test that verifies fetch_papers_for_mailing skips papers whose resolved URLs are off-origin (not under www.open-std.org). Create a test like the others that patches wg21_paper_tracker.fetcher.requests.get to return HTML with the mailing anchor (id or name) and a table row containing an absolute href pointing to another host (e.g., "http://evil.com/p1234.pdf"); call fetch_papers_for_mailing("2025","2025-01") and assert the result does not include that paper (e.g., result == []). Reference fetch_papers_for_mailing and the requests.get patch used in existing tests to locate where to add this new test.wg21_paper_tracker/admin.py (1)
19-32: Uselist_select_relatedfor the related objects shown in the changelist.Both admins render foreign keys in
list_display, so the changelist will do extra queries formailing,paper, andprofileas row counts grow.⚡ Suggested tweak
`@admin.register`(WG21Paper) class WG21PaperAdmin(admin.ModelAdmin): + list_select_related = ("mailing",) list_display = ( "paper_id", "year", "title", "document_date", @@ `@admin.register`(WG21PaperAuthor) class WG21PaperAuthorAdmin(admin.ModelAdmin): + list_select_related = ("paper", "profile") list_display = ("paper", "profile", "created_at") search_fields = ("paper__paper_id", "profile__display_name") raw_id_fields = ("paper", "profile")Also applies to: 35-39
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/admin.py` around lines 19 - 32, Add list_select_related to the admin classes to prefetch related FK objects shown in list_display: in WG21PaperAdmin add list_select_related = ("mailing", "subgroup") (or include any other FK fields like "year" if it's a related model) to avoid per-row queries for mailing/subgroup, and in WG21PaperAuthorAdmin (the admin responsible for author rows) add list_select_related = ("paper", "profile") so the changelist uses joins instead of extra queries for paper/profile; place these attributes on the respective ModelAdmin classes (WG21PaperAdmin and WG21PaperAuthorAdmin).wg21_paper_tracker/tests/test_services.py (2)
157-170: Assert the placeholder row is promoted, not just stored.The docstring says this
year=Nonerow can be updated later, but the test only checks the intermediate0sentinel. Adding a secondget_or_create_paper(...)call with the real year would protect theyear=0 -> real yearpromotion path inwg21_paper_tracker/services.pyfrom regressing silently.🧪 Suggested test strengthening
`@pytest.mark.django_db` -def test_get_or_create_paper_year_none_stored_as_zero(db): +def test_get_or_create_paper_year_none_stored_as_zero(): """get_or_create_paper with year=None stores 0 so records can be updated later.""" mailing, _ = get_or_create_mailing("2025-01", "Title") paper, _ = get_or_create_paper( paper_id="n5034", url="https://example.com/n5034.html", @@ mailing=mailing, year=None, ) assert paper.year == 0 + + promoted, created = get_or_create_paper( + paper_id="n5034", + url="https://example.com/n5034.html", + title="Draft", + document_date=None, + mailing=mailing, + year=2025, + ) + assert created is False + assert promoted.pk == paper.pk + assert promoted.year == 2025🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` around lines 157 - 170, The test test_get_or_create_paper_year_none_stored_as_zero only asserts that year=None becomes the sentinel 0 but doesn't exercise the promotion path; update the test to call get_or_create_paper a second time with the same paper_id/url but with document_date or year set to the real year (e.g., year=2025), verify it returns the same Paper instance (identity or id) and assert paper.year is updated from 0 to the real year, ensuring the promotion logic in get_or_create_paper in wg21_paper_tracker/services.py is exercised.
51-51: Remove the unuseddbfixture parameters.These signatures never reference
db, so Ruff will keep reportingARG001on them.Also applies to: 77-77, 112-112, 127-127, 173-173, 199-199, 222-222, 240-240
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` at line 51, The listed test functions include an unused db fixture parameter (e.g., test_get_or_create_paper_creates_new) which triggers ARG001; remove the unused "db" parameter from each test signature so the tests only accept fixtures they actually use, and run/verify the tests to ensure no other references to "db" remain in those functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cppa_user_tracker/services.py`:
- Around line 383-391: When reusing an existing WG21PaperAuthorProfile from
candidates, ensure a newly provided email_val is backfilled into that profile:
in the single-candidate branch (when len(candidates) == 1) and in the final
tie-break return (returning candidates[0]), check if email_val is truthy and the
chosen profile does not already have that email (use
p.emails.filter(email=email_val).exists()); if missing, create the email record
(e.g., p.emails.create(email=email_val)) or otherwise add and save the profile
before returning. This ensures name-only profiles get updated when richer
metadata arrives.
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 234-245: The page-level fallback messages in openai_converter.py
currently use phrases like "Conversion failed" and "Error processing" which
trigger is_content_valid() rejection; update both places that append to
markdown_parts (the else branch that logs "Failed to convert page {page_num}
with OpenAI" and the except block that logs "Error processing page {page_num}")
to use a neutral placeholder such as "## Page {page_num}\n\n*[Content
unavailable for this page]*\n\n" (or similar wording without
"error"/"failed"/"processing") so that page 1 falling into these branches won’t
mark the assembled Markdown as a converter failure. Ensure you change both
markdown_parts.append calls shown in the diff.
In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 62-68: The table cell rendering currently treats any falsy value
as empty in pdfplumber_converter.py (the generator inside the
markdown_parts.append that builds table rows), which removes 0 and False; change
the conditional to only treat None as empty (use "if cell is not None" instead
of "if cell") so that values like 0 and False are preserved by converting them
with str(cell).
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 203-223: The code increments stats["papers_updated"] at the start
of the IntegrityError handler before attempting the lookup/update, which
overstates updates when the recovery fails or the row is skipped; move the
stats["papers_updated"] increment so it runs only after a successful recovery
save (i.e., after paper.save() in the except IntegrityError block where you find
paper via WG21Paper.objects.filter(...).first()), and ensure stats["skipped"] is
incremented when paper is None or any subsequent failure prevents the save.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 197-223: The comments around the papers_by_id grouping and
selection are misleading: format_priority() returns lower numbers for adoc/html
and the min(p_list, key=lambda x: format_priority(x["type"])) picks adoc/html
before pdf, so update the surrounding comments (references: papers_by_id,
format_priority, best_paper, min(...)) to state that adoc > html > ps > pdf is
preferred and that we intentionally pick adoc/html first (not “prioritize PDF”
or “PDF first for conversion”).
- Around line 127-145: The current logic only re-checks latest_mailing, which
lets earlier mailings with transient failures never get retried; instead,
identify and requeue any mailing that appears incomplete by checking
fetch_papers_for_mailing() results for each candidate mailing and append those
with empty/failed results to new_mailings. Concretely, change the filter logic
around new_mailings/latest_mailing: for each mailing in all_mailings with
mailing_date > latest_date (and also include any mailing equal to
latest_mailing), call fetch_papers_for_mailing(m) and if it returns an empty
list (the defined failure mode) treat that mailing as incomplete and ensure it
stays in new_mailings so it will be retried; keep using the existing symbols
new_mailings, latest_mailing, all_mailings, and fetch_papers_for_mailing to
locate and implement the change.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 50-63: Update the test to use pytest's tmp_path fixture instead of
a hard-coded POSIX path: change the test signature to accept tmp_path (def
test_download_file_uses_timeout(tmp_path):), replace filepath = Path("/tmp/out")
with filepath = tmp_path / "out", and remove the now-unused Path import; keep
the rest of the assertions and the patched call to
wg21_paper_tracker.pipeline.requests.get and the reference to DOWNLOAD_TIMEOUT
unchanged.
In `@wg21_paper_tracker/workspace.py`:
- Around line 22-25: get_raw_dir currently calls
_MAILING_DATE_RE.fullmatch(mailing_date) even when mailing_date is None, causing
a TypeError; update the validation to first check for None before applying the
regex (e.g., only call _MAILING_DATE_RE.fullmatch when mailing_date is not None)
so that the existing year-only fallback logic in get_raw_dir still executes when
mailing_date is None.
---
Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The GCS listing prefix is incorrect: update the raw_prefix
variable currently set to "raw/wg21_paper_tracker/" to the actual uploader path
"raw/wg21_papers/" so that blobs = client.list_blobs(bucket, prefix=raw_prefix)
will find newly uploaded PDFs; leave converted_prefix as-is and verify the
uploader writes to "raw/wg21_papers/" if needed.
---
Nitpick comments:
In `@wg21_paper_tracker/admin.py`:
- Around line 19-32: Add list_select_related to the admin classes to prefetch
related FK objects shown in list_display: in WG21PaperAdmin add
list_select_related = ("mailing", "subgroup") (or include any other FK fields
like "year" if it's a related model) to avoid per-row queries for
mailing/subgroup, and in WG21PaperAuthorAdmin (the admin responsible for author
rows) add list_select_related = ("paper", "profile") so the changelist uses
joins instead of extra queries for paper/profile; place these attributes on the
respective ModelAdmin classes (WG21PaperAdmin and WG21PaperAuthorAdmin).
In `@wg21_paper_tracker/tests/test_fetcher.py`:
- Around line 91-206: Add a regression test that verifies
fetch_papers_for_mailing skips papers whose resolved URLs are off-origin (not
under www.open-std.org). Create a test like the others that patches
wg21_paper_tracker.fetcher.requests.get to return HTML with the mailing anchor
(id or name) and a table row containing an absolute href pointing to another
host (e.g., "http://evil.com/p1234.pdf"); call
fetch_papers_for_mailing("2025","2025-01") and assert the result does not
include that paper (e.g., result == []). Reference fetch_papers_for_mailing and
the requests.get patch used in existing tests to locate where to add this new
test.
In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 35-46: Update the test to assert the actual retained prefix of the
title, not just its length: after computing rendered_title = str(p).split(": ",
1)[-1], assert that rendered_title == long_title[:60] (or at minimum startswith
long_title[:60]) so WG21Paper.__str__ is returning the correct leading slice of
the original title; reference the test function
test_wg21_paper_str_truncates_long_title and the WG21Paper.__str__ output
extraction (str(p).split(": ", 1)[-1]) when making the change.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 157-170: The test
test_get_or_create_paper_year_none_stored_as_zero only asserts that year=None
becomes the sentinel 0 but doesn't exercise the promotion path; update the test
to call get_or_create_paper a second time with the same paper_id/url but with
document_date or year set to the real year (e.g., year=2025), verify it returns
the same Paper instance (identity or id) and assert paper.year is updated from 0
to the real year, ensuring the promotion logic in get_or_create_paper in
wg21_paper_tracker/services.py is exercised.
- Line 51: The listed test functions include an unused db fixture parameter
(e.g., test_get_or_create_paper_creates_new) which triggers ARG001; remove the
unused "db" parameter from each test signature so the tests only accept fixtures
they actually use, and run/verify the tests to ensure no other references to
"db" remain in those functions.
In `@wg21_paper_tracker/tests/test_workspace.py`:
- Around line 43-86: Add a regression test that ensures get_raw_dir accepts
mailing_date=None and returns the year-only directory instead of crashing:
create a new test (e.g., test_get_raw_dir_accepts_none_mailing_date) that
patches wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, calls
get_raw_dir(None, 2025), and asserts the returned Path == mock_workspace_path /
"wg21_paper_tracker" / "2025" (and that it exists/is_dir and path.name ==
"2025"); also update get_raw_dir in wg21_paper_tracker.workspace to guard the
re.fullmatch check (avoid calling fullmatch on None) by treating None as the "no
mailing_date" case that yields the year-only directory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 63584f18-f1ea-46a1-95ea-5e18b05dafb3
📒 Files selected for processing (40)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (2)
wg21_paper_tracker/tests/test_services.py (1)
157-169:⚠️ Potential issue | 🟠 MajorSentinel value
0for unknown year may cause collision issues.The test asserts that
year=Noneis stored as0. Since the composite unique key is(paper_id, year), two imports of the samepaper_idwithout a resolved year will target the same row (both mapping to year=0), preventing them from being distinguishable.Consider whether this is the intended behavior. If unknown years should remain distinct, either:
- Keep
yearnullable in the model- Reject missing years at the service boundary with a validation error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` around lines 157 - 169, The test and service currently map year=None to 0 causing collisions due to the composite unique key (paper_id, year); update the behavior so unknown years remain distinct by removing the sentinel mapping: change get_or_create_paper to stop coercing document_date/year None to 0 (leave year as None) and ensure the Paper model's year field allows null (nullable) or adjust the unique constraint to exclude NULLs as appropriate, and then update tests (e.g., test_get_or_create_paper_year_none_stored_as_zero) to assert year is None (or to expect a validation error if you choose to reject missing years at the service boundary) so imports of the same paper_id without resolved years do not collide.cppa_user_tracker/services.py (1)
358-381:⚠️ Potential issue | 🟠 MajorRace condition on profile creation with non-unique
display_name.
WG21PaperAuthorProfile.display_nameis indexed but not unique. The check at line 371-375 (finding candidates) and creation at line 378 are not atomic, so concurrent calls with the samedisplay_namecan create duplicate profiles.Consider wrapping the lookup-and-create logic in
select_for_update()or usingget_or_createwith a proper unique constraint ondisplay_name(or anauthor_aliasfield if disambiguation is needed).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/services.py` around lines 358 - 381, The current get_or_create_wg21_paper_author_profile function can race when multiple requests create a WG21PaperAuthorProfile with the same non-unique display_name; make the lookup-and-create atomic by wrapping the operation in a transaction (transaction.atomic) and locking the relevant rows (e.g., use WG21PaperAuthorProfile.objects.select_for_update().filter(display_name=display_name_val) inside the transaction) or use WG21PaperAuthorProfile.objects.get_or_create with a unique constraint on the canonical key (or an author_alias) and handle IntegrityError with a retry to avoid duplicates; ensure add_email(profile, email_val, is_primary=True) still runs only for the created profile and return the existing profile when found.
🧹 Nitpick comments (8)
wg21_paper_tracker/tests/test_models.py (1)
35-46: Tighten the truncation assertion.Line 46 only checks the rendered title is short enough, so this test would still pass if
WG21Paper.__str__stopped including the title entirely. Assert the expected truncated content as well so the test locks down the actual contract.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_models.py` around lines 35 - 46, The test test_wg21_paper_str_truncates_long_title currently only checks length; update it to assert the actual truncated content from WG21Paper.__str__ instead of just length: compute rendered = str(p).split(": ", 1)[-1] and assert that rendered startswith the first 60 characters of long_title (long_title[:60]) so the title is present and correctly truncated; keep the existing setup (WG21Mailing m, WG21Paper p) and replace the len(...) assertion with this content check.wg21_paper_tracker/cloud_run_job/Dockerfile (1)
22-23: Add a.dockerignorenext to this Dockerfile to enforce proper build context.The documented build process correctly limits build context by running
docker buildfrom withinwg21_paper_tracker/cloud_run_job/(perdocs/operations/WG21_Cloud_Run.md), so the current risk is low. However, adding a.dockerignorewould defensively prevent accidental inclusion of unrelated repo content if the build is ever invoked from the repo root without proper context handling.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/Dockerfile` around lines 22 - 23, Add a .dockerignore next to the Dockerfile to prevent accidental inclusion of the whole repo when the Docker build context is wrong: create a .dockerignore in the same directory as the Dockerfile and exclude large/unrelated items (e.g., .git, node_modules, .venv, .env, build/ dist/ caches, IDE files) and any parent-folder paths, and then explicitly allow the files your Docker build needs (so the COPY --chown=app:app . . step still copies only the app sources, Dockerfile, requirements.txt/pyproject.toml or package.json and other build inputs referenced by your Dockerfile). Ensure the patterns cover common heavy folders and that required build artifacts referenced by the Dockerfile are whitelisted.wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
95-97: Minor: Redundant exception object inlogging.exception.The exception object
eis automatically included bylogging.exception. You can simplify this.♻️ Optional cleanup
- except Exception as e: - logger.exception("WG21 Paper Tracker failed: %s", e) + except Exception: + logger.exception("WG21 Paper Tracker failed.") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 95 - 97, In the except block that currently reads "except Exception as e:" remove the redundant exception object from the logging call by replacing logger.exception("WG21 Paper Tracker failed: %s", e) with logger.exception("WG21 Paper Tracker failed") so the exception is still logged (logger.exception automatically includes the traceback) and keep the subsequent raise unchanged; locate this change around the except block where the variable e is defined and logger.exception is called.wg21_paper_tracker/tests/test_pipeline.py (1)
124-165: Consider usingpytest.ExitStackorcontextlib.ExitStackto reduce nesting.The deeply nested
with patch(...)blocks (6 levels) make this test harder to read. Consider usingExitStackor multiple@patchdecorators.♻️ Example using decorators
`@pytest.mark.django_db` `@patch`("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True) `@patch`("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket") `@patch`("wg21_paper_tracker.pipeline._download_file", return_value=True) `@patch`("wg21_paper_tracker.pipeline.fetch_papers_for_mailing") `@patch`("wg21_paper_tracker.pipeline.fetch_all_mailings") `@patch`("wg21_paper_tracker.pipeline.get_raw_dir") def test_run_tracker_pipeline_downloads_new_papers( mock_get_raw_dir, mock_fetch_mailings, mock_fetch_papers, mock_download, mock_bucket, mock_upload, tmp_path ): # ... setup mocks and assertions🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 124 - 165, The test test_run_tracker_pipeline_downloads_new_papers has six nested with patch(...) blocks which reduce readability; refactor it to use contextlib.ExitStack or pytest `@patch` decorators to flatten the nesting. Locate the test function and replace the nested patches for fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET and _upload_to_gcs with either a single ExitStack that enters each patcher or with stacked `@patch` decorators (keeping the same return_value or replacement values) and adjust the test signature to accept the mock objects in the correct order.wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)
59-63: Consider using%sformatting or!sconversion for exception in f-string.Per Ruff RUF010,
str(e)in f-strings can be replaced with{e!s}for explicit string conversion.♻️ Proposed fix
except Exception as e: logger.error( - f"Docling conversion failed for {pdf_path.name}: {str(e)}", exc_info=True + f"Docling conversion failed for {pdf_path.name}: {e!s}", exc_info=True ) return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around lines 59 - 63, The exception log in the except block of docling_converter.py uses str(e) inside an f-string; update the logger.error call to use explicit string conversion {e!s} (or change to %-formatting with %s) instead of str(e) so it complies with Ruff RUF010 — locate the except block where logger.error(f"Docling conversion failed for {pdf_path.name}: {str(e)}", exc_info=True) and replace the str(e) usage with {e!s} while keeping pdf_path.name and exc_info=True unchanged.wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (1)
163-170: Uselogger.exceptionto preserve stack traces in error logs.When catching exceptions in data processing, using
logger.exceptioninstead oflogger.errorautomatically includes the stack trace, which aids debugging.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error( + logger.exception( "Error parsing document date for paper_id=%s: %s", paper_id, e, ) continueApply similar changes to lines 212, 235-239, and 242.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 163 - 170, Replace the logger.error calls inside the except Exception as e handlers in import_wg21_metadata_from_csv.py with logger.exception so the stack trace is preserved; specifically update the except block that logs "Error parsing document date for paper_id=%s" as well as the other similar except blocks referenced (around the blocks currently at lines 212, 235-239, and 242) to call logger.exception(...) while keeping the existing stats["skipped"] increment and continue behavior. Ensure the message text remains descriptive and pass the same context (e.g., paper_id) to the logger.exception call.wg21_paper_tracker/services.py (2)
152-158: Consider requiringyearinmark_paper_downloaded().Filtering by
(paper_id, year)is the right fix, but the defaultyear=Nonenow normalizes to0. A caller that forgets to pass the year silently targets only placeholder rows, which is easy to miss because the helper name still reads like a complete action. Makingyearrequired would surface that mistake at the call site.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/services.py` around lines 152 - 158, The function mark_paper_downloaded currently defaults year to None which _normalize_year converts to 0, allowing callers to silently hit placeholder rows; change the signature of mark_paper_downloaded to require year (remove the = None default) and validate/raise a clear exception if year is missing before calling _normalize_year; update any callers to pass an explicit year and keep the WG21Paper.objects.filter(paper_id=..., year=year_val).update(...) logic unchanged so we no longer risk silently updating placeholder rows.
40-50: Normalizedocument_dateat the service boundary.
paper.document_datecomes back from Django as adate, but this helper still typesdocument_dateasOptional[str]and passes it through untouched even thoughrun_tracker_pipeline()already supplies a parseddate | None. That makes the contract ambiguous, and any caller that sends"YYYY-MM-DD"will keep the row dirty even when the stored value is the same logical date. Normalize once inside the service and use that value everywhere in the update/default paths.Also applies to: 62-64, 87-89, 101-103, 113-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/services.py` around lines 40 - 50, The service should normalize the incoming document_date to a native date once at the boundary (in get_or_create_paper) and use that normalized value for all comparisons, defaults and updates; change the parameter handling in get_or_create_paper to parse/convert Optional[str] or date inputs into a single date | None variable and then use that variable in calls that set or compare paper.document_date (including create_or_update logic), and apply the same normalization pattern to the other helper functions that accept document_date (the other occurrences noted in the review) so callers supplying "YYYY-MM-DD" strings won't keep rows dirty and the contract is unambiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/operations/WG21_Cloud_Run.md`:
- Line 5: Update the wording to clarify that Cloud Run invocation is conditional
rather than automatic: change the sentence referencing run_wg21_paper_tracker
and wg21-convert to state that the Django tracker can be configured to trigger
the wg21-convert Cloud Run job, but this is optional (e.g., GCS-only deployments
or environments that use a different job name will not trigger it), and mirror
this conditional language in the intro and the flow summary so operators
understand the trigger is configurable rather than mandatory.
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile
currently imply lookup-only behavior but the function also mutates returned
profiles by attaching a provided email when the profile is missing one; update
the table description for get_or_create_wg21_paper_author_profile to explicitly
state that if an email argument is supplied and the resolved/created
WG21PaperAuthorProfile lacks an email, the function will associate that email
with the profile and return the profile plus a boolean indicating creation;
mention the side-effect so callers know it can modify existing profiles.
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 37-59: pdf_to_images currently calls convert_from_path() which
materializes all pages into PIL Images and keeps them alive across
convert_with_openai, causing memory spikes for large PDFs; change pdf_to_images
(and adjust convert_with_openai usage) to avoid loading the entire PDF by either
calling convert_from_path(..., paths_only=True, output_folder=tmp_dir) to get
per-page file paths and yield/process images one-at-a-time, or implement
batching using convert_from_path with first_page/last_page to convert and
process N pages at a time, ensuring you create a temporary output folder
(tmp_dir) and clean up intermediate files so only the current batch/page is
resident in memory; update references to pdf_to_images and convert_with_openai
to consume images/paths as a generator or in small batches instead of relying on
a full list.
In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 59-63: The migration currently sets the IntegerField "year" with
default=0 which permits invalid rows; change the field definition in the initial
migration (0001_initial.py) for the "year" field
(models.IntegerField(db_index=True, default=0)) to be required by removing
default=0 and ensuring null=False (e.g. models.IntegerField(db_index=True) with
no default and no null=True), so missing parsed years will raise errors instead
of inserting a bogus 0 and thereby keep the uniqueness constraint on (paper_id,
year) meaningful.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 201-205: In the for p in papers loop compute pid = (p["paper_id"]
or "").strip().lower() and if pid is empty, emit a warning (e.g.,
logging.warning or the module's logger) including enough context from p (title
or source) and continue so you never insert an empty key into papers_by_id; this
prevents creating the "" bucket that collapses malformed entries and affects the
downstream (paper_id, year) uniqueness and the downloaded check around the later
code (near the current line 216).
---
Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-381: The current get_or_create_wg21_paper_author_profile
function can race when multiple requests create a WG21PaperAuthorProfile with
the same non-unique display_name; make the lookup-and-create atomic by wrapping
the operation in a transaction (transaction.atomic) and locking the relevant
rows (e.g., use
WG21PaperAuthorProfile.objects.select_for_update().filter(display_name=display_name_val)
inside the transaction) or use WG21PaperAuthorProfile.objects.get_or_create with
a unique constraint on the canonical key (or an author_alias) and handle
IntegrityError with a retry to avoid duplicates; ensure add_email(profile,
email_val, is_primary=True) still runs only for the created profile and return
the existing profile when found.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 157-169: The test and service currently map year=None to 0 causing
collisions due to the composite unique key (paper_id, year); update the behavior
so unknown years remain distinct by removing the sentinel mapping: change
get_or_create_paper to stop coercing document_date/year None to 0 (leave year as
None) and ensure the Paper model's year field allows null (nullable) or adjust
the unique constraint to exclude NULLs as appropriate, and then update tests
(e.g., test_get_or_create_paper_year_none_stored_as_zero) to assert year is None
(or to expect a validation error if you choose to reject missing years at the
service boundary) so imports of the same paper_id without resolved years do not
collide.
---
Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 59-63: The exception log in the except block of
docling_converter.py uses str(e) inside an f-string; update the logger.error
call to use explicit string conversion {e!s} (or change to %-formatting with %s)
instead of str(e) so it complies with Ruff RUF010 — locate the except block
where logger.error(f"Docling conversion failed for {pdf_path.name}: {str(e)}",
exc_info=True) and replace the str(e) usage with {e!s} while keeping
pdf_path.name and exc_info=True unchanged.
In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 22-23: Add a .dockerignore next to the Dockerfile to prevent
accidental inclusion of the whole repo when the Docker build context is wrong:
create a .dockerignore in the same directory as the Dockerfile and exclude
large/unrelated items (e.g., .git, node_modules, .venv, .env, build/ dist/
caches, IDE files) and any parent-folder paths, and then explicitly allow the
files your Docker build needs (so the COPY --chown=app:app . . step still copies
only the app sources, Dockerfile, requirements.txt/pyproject.toml or
package.json and other build inputs referenced by your Dockerfile). Ensure the
patterns cover common heavy folders and that required build artifacts referenced
by the Dockerfile are whitelisted.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 163-170: Replace the logger.error calls inside the except
Exception as e handlers in import_wg21_metadata_from_csv.py with
logger.exception so the stack trace is preserved; specifically update the except
block that logs "Error parsing document date for paper_id=%s" as well as the
other similar except blocks referenced (around the blocks currently at lines
212, 235-239, and 242) to call logger.exception(...) while keeping the existing
stats["skipped"] increment and continue behavior. Ensure the message text
remains descriptive and pass the same context (e.g., paper_id) to the
logger.exception call.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block that currently reads "except Exception
as e:" remove the redundant exception object from the logging call by replacing
logger.exception("WG21 Paper Tracker failed: %s", e) with logger.exception("WG21
Paper Tracker failed") so the exception is still logged (logger.exception
automatically includes the traceback) and keep the subsequent raise unchanged;
locate this change around the except block where the variable e is defined and
logger.exception is called.
In `@wg21_paper_tracker/services.py`:
- Around line 152-158: The function mark_paper_downloaded currently defaults
year to None which _normalize_year converts to 0, allowing callers to silently
hit placeholder rows; change the signature of mark_paper_downloaded to require
year (remove the = None default) and validate/raise a clear exception if year is
missing before calling _normalize_year; update any callers to pass an explicit
year and keep the WG21Paper.objects.filter(paper_id=...,
year=year_val).update(...) logic unchanged so we no longer risk silently
updating placeholder rows.
- Around line 40-50: The service should normalize the incoming document_date to
a native date once at the boundary (in get_or_create_paper) and use that
normalized value for all comparisons, defaults and updates; change the parameter
handling in get_or_create_paper to parse/convert Optional[str] or date inputs
into a single date | None variable and then use that variable in calls that set
or compare paper.document_date (including create_or_update logic), and apply the
same normalization pattern to the other helper functions that accept
document_date (the other occurrences noted in the review) so callers supplying
"YYYY-MM-DD" strings won't keep rows dirty and the contract is unambiguous.
In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 35-46: The test test_wg21_paper_str_truncates_long_title currently
only checks length; update it to assert the actual truncated content from
WG21Paper.__str__ instead of just length: compute rendered = str(p).split(": ",
1)[-1] and assert that rendered startswith the first 60 characters of long_title
(long_title[:60]) so the title is present and correctly truncated; keep the
existing setup (WG21Mailing m, WG21Paper p) and replace the len(...) assertion
with this content check.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 124-165: The test test_run_tracker_pipeline_downloads_new_papers
has six nested with patch(...) blocks which reduce readability; refactor it to
use contextlib.ExitStack or pytest `@patch` decorators to flatten the nesting.
Locate the test function and replace the nested patches for fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET
and _upload_to_gcs with either a single ExitStack that enters each patcher or
with stacked `@patch` decorators (keeping the same return_value or replacement
values) and adjust the test signature to accept the mock objects in the correct
order.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 68568e8a-8113-4e08-9aed-2cf2b329bd64
📒 Files selected for processing (40)
config/settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (4)
cppa_user_tracker/services.py (2)
358-375:⚠️ Potential issue | 🟠 MajorConcurrent profile creation can produce duplicates.
Since
display_nameis only indexed (not unique), two concurrent calls with the same name can both seecandidates=[]and each create a new profile. Later lookups will see multiple candidates and may silently pickcandidates[0], merging data that should stay separate.Consider wrapping the creation in
select_for_update()or usingget_or_createwith a unique constraint on a canonicalized key (e.g., lowercased display_name + email hash).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/services.py` around lines 358 - 375, The get_or_create_wg21_paper_author_profile function can create duplicate WG21PaperAuthorProfile rows under concurrent calls; fix by performing the existence check and potential creation inside a DB transaction with row-level locking or by using a canonical unique key and get_or_create: wrap the candidate lookup/creation in transaction.atomic(), re-query with select_for_update() on WG21PaperAuthorProfile.objects.filter(display_name=display_name_val) to lock rows before deciding to create, or alternatively add a canonical unique field (e.g., canonical_display_name or canonical_display_name_plus_email) on WG21PaperAuthorProfile and use WG21PaperAuthorProfile.objects.get_or_create(canonical_field=..., defaults={...}) so concurrent creates cannot produce duplicates; ensure you update the function to use the chosen approach and remove the race by referencing display_name, email, candidates and WG21PaperAuthorProfile in the change.
393-405:⚠️ Potential issue | 🟠 MajorAttaching an unmatched email to an arbitrary same-name profile can merge distinct authors.
When multiple
display_namematches exist and none already ownsemail_val, the fallback pickscandidates[0]and adds the new email there. This silently merges distinct authors whenever two people share a name. The safer fallback is to create a fresh profile for that email instead of mutating an arbitrary candidate.Suggested fix
# Two or more: disambiguate by email if provided if email_val: for p in candidates: if p.emails.filter(email=email_val).exists(): return p, False - profile = candidates[0] - if email_val and not profile.emails.filter(email=email_val).exists(): - add_email( - profile, - email_val, - is_primary=not profile.emails.filter(is_active=True).exists(), - ) - return profile, False + # No existing profile owns this email; create a new one + profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val) + add_email(profile, email_val, is_primary=True) + return profile, True + # No email provided; return first match + return candidates[0], False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/services.py` around lines 393 - 405, When multiple candidates match display_name but none own email_val, do not pick candidates[0] and attach the email; instead create a fresh profile for that email and attach the email to it. Replace the fallback block that sets profile = candidates[0] and calls add_email(...) with logic that constructs a new profile (using the existing profile-creation helper in this module or Profile model factory), call add_email(new_profile, email_val, is_primary=...), and return the new profile (and the created flag, e.g. True) rather than mutating an arbitrary candidate; keep references to candidates, email_val and add_email to locate and update the code.wg21_paper_tracker/cloud_run_job/main.py (2)
77-80:⚠️ Potential issue | 🔴 CriticalUse the uploader's raw GCS prefix.
Line 77 still points at
raw/wg21_paper_tracker/, but the uploader stores PDFs underraw/wg21_papers/.... With the current prefix, newly uploaded papers are never discovered, so the conversion stage never runs.🐛 Minimal fix
- raw_prefix = "raw/wg21_paper_tracker/" + raw_prefix = "raw/wg21_papers/"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The GCS prefix used to list new uploads is incorrect: update the raw_prefix variable in main.py from "raw/wg21_paper_tracker/" to the uploader's actual prefix "raw/wg21_papers/" so client.list_blobs(bucket, prefix=raw_prefix) will discover newly uploaded PDFs; verify converted_prefix remains "converted/wg21_papers/" and re-run the listing logic (the change affects the raw_prefix symbol used before calling client.list_blobs).
39-45:⚠️ Potential issue | 🟠 MajorDon't reject valid output just because it mentions
error:orexception:.Line 41 bypasses the proximity check, so a legitimate paper that contains those tokens anywhere in the first 1000 characters is treated as a converter failure. That can trigger unnecessary PDFPlumber/OpenAI fallbacks and still drop valid conversions for papers about error handling or exception safety.
💡 Minimal fix
for pattern in error_patterns: if pattern in first_part: - if pattern in ("error:", "exception:"): - return False idx = first_part.find(pattern) if idx < 100: return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The current loop over error_patterns incorrectly treats "error:" and "exception:" as immediate failures (the `if pattern in ("error:", "exception:"): return False` branch), bypassing the proximity check; remove that special-case and apply the same proximity logic to all patterns by finding `idx = first_part.find(pattern)` and returning False only if `idx` is within the intended proximity threshold (use 1000 chars as per review or the existing threshold variable), i.e., replace the special-case with the same `idx` check for `error_patterns`, referencing the loop variables `error_patterns`, `first_part`, `pattern`, and `idx`.
🧹 Nitpick comments (5)
wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py (1)
97-102: Minor style suggestion: use f-string conversion flag.Static analysis suggests using the explicit conversion flag
!sinstead of wrapping withstr().♻️ Optional style improvement
except Exception as e: logger.error( - f"PDFPlumber conversion failed for {pdf_path.name}: {str(e)}", + f"PDFPlumber conversion failed for {pdf_path.name}: {e!s}", exc_info=True, ) return NoneThe same applies to line 81 (
str(e)→{e!s}).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py` around lines 97 - 102, The log message uses str(e) which static analysis flags; update the exception formatting in the PDFPlumber conversion catch blocks to use the f-string conversion flag (!s) instead of str(), i.e., replace occurrences of str(e) with {e!s} in the logger.error calls that reference pdf_path.name and the earlier similar logger call; keep the rest of the logger.error parameters (exc_info=True) unchanged and ensure you update both places where the exception variable e is interpolated.wg21_paper_tracker/migrations/0001_initial.py (1)
28-31: Minor:db_index=Trueis redundant onmailing_datesinceunique=Truealready creates an index.PostgreSQL automatically creates a B-tree index for unique constraints. This is a minor redundancy and doesn't affect functionality.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/migrations/0001_initial.py` around lines 28 - 31, Remove the redundant db_index=True from the mailing_date field definition (the models.CharField named "mailing_date" with max_length=7 and unique=True) in the migration so the unique constraint remains but no duplicate index is requested; edit the tuple for "mailing_date" in 0001_initial.py to drop db_index=True and leave unique=True so PostgreSQL creates only the unique index.wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
48-97: Add command-level tests for the new orchestration branches.This command now owns the dry-run short circuit, Cloud Run gating, and failure propagation, but there’s no coverage here for those branches. A small
call_command()suite with mockedrun_tracker_pipeline()/trigger_cloud_run_job()would make the intended behavior much safer to change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 48 - 97, Add unit tests that exercise the management command's branches by calling the command via Django's call_command and mocking run_tracker_pipeline and trigger_cloud_run_job: test dry-run by passing --dry-run and asserting run_tracker_pipeline/trigger_cloud_run_job are not called; test no-new-papers by mocking run_tracker_pipeline to return 0 and asserting trigger_cloud_run_job is not called; test Cloud Run trigger by mocking run_tracker_pipeline to return >0 and configuring settings (GCP_PROJECT_ID, WG21_CLOUD_RUN_JOB_NAME, WG21_GCS_BUCKET, WG21_CLOUD_RUN_ENABLED=True) and asserting trigger_cloud_run_job is called with (project_id, location, job_name); and test failure propagation by having trigger_cloud_run_job raise and asserting the command raises the exception. Use mocks/patch for run_tracker_pipeline and trigger_cloud_run_job and override_settings for the required WG21_* settings to isolate behavior.cppa_user_tracker/tests/test_services.py (1)
638-648: Assert the email-association side effect in this fallback case.This branch only verifies that the first profile is returned. It would still pass if
get_or_create_wg21_paper_author_profile()stopped attachingnomatch@example.comto that returned profile, even though the public API now documents that side effect.🧪 Tighten the assertion
def test_get_or_create_wg21_paper_author_profile_two_candidates_email_matches_none_returns_first(): """get_or_create_wg21_paper_author_profile returns first when email provided but no match.""" first = WG21PaperAuthorProfile.objects.create(display_name="Other Name") second = WG21PaperAuthorProfile.objects.create(display_name="Other Name") services.add_email(second, "other@example.com", is_primary=True) profile, created = services.get_or_create_wg21_paper_author_profile( display_name="Other Name", email="nomatch@example.com", ) assert created is False assert profile.id == first.id + assert first.emails.filter(email="nomatch@example.com").exists()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_user_tracker/tests/test_services.py` around lines 638 - 648, The test must also assert the documented side effect that nomatch@example.com is attached to the returned profile: after calling services.get_or_create_wg21_paper_author_profile(...), add an assertion that the returned profile has that email associated (e.g. assert services.get_emails(profile) contains "nomatch@example.com" or assert profile.emails.filter(address="nomatch@example.com").exists()), referencing WG21PaperAuthorProfile and services.get_or_create_wg21_paper_author_profile to locate where to add the assertion.wg21_paper_tracker/tests/test_pipeline.py (1)
146-165: Assert the upload side effect in the happy-path test.Right now this still passes if
run_tracker_pipeline()increments its count before_upload_to_gcs()or stops calling upload entirely. Since GCS upload is the new behavior here, make the mock observable.🧪 Tighten the assertion
- with patch( - "wg21_paper_tracker.pipeline._upload_to_gcs", - return_value=True, - ): - n = run_tracker_pipeline() + with patch( + "wg21_paper_tracker.pipeline._upload_to_gcs", + return_value=True, + ) as upload_mock: + n = run_tracker_pipeline() assert n == 1 + upload_mock.assert_called_once()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The test currently patches _upload_to_gcs but doesn't assert it was called; update the happy-path test to make the mock observable by assigning the patch to a variable (e.g., mock_upload) when patching "wg21_paper_tracker.pipeline._upload_to_gcs", run run_tracker_pipeline(), then assert mock_upload was called the expected number of times and/or with the expected args (e.g., mock_upload.assert_called_once() or mock_upload.assert_called_with(...)) to ensure run_tracker_pipeline() actually performs the GCS upload side effect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 182-239: The IntegrityError except block can be triggered by a
race creating the mailing, so before attempting to recover/update the paper you
must ensure `mailing` is re-resolved; inside the `except IntegrityError as e:`
(where `get_or_create_paper` and `get_or_create_mailing` are used) fetch or
recreate the mailing (e.g., re-run `get_or_create_mailing(mailing_date,
mailing_title)` or lookup by those fields) and assign it to the `mailing`
variable before using it to update `paper` in the recovery path (ensure you
handle failure to find/create mailing and only then proceed to lookup/update
`WG21Paper`), so the recovery no longer relies on an uninitialized `mailing`
when `get_or_create_mailing` raced.
---
Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-375: The get_or_create_wg21_paper_author_profile function can
create duplicate WG21PaperAuthorProfile rows under concurrent calls; fix by
performing the existence check and potential creation inside a DB transaction
with row-level locking or by using a canonical unique key and get_or_create:
wrap the candidate lookup/creation in transaction.atomic(), re-query with
select_for_update() on
WG21PaperAuthorProfile.objects.filter(display_name=display_name_val) to lock
rows before deciding to create, or alternatively add a canonical unique field
(e.g., canonical_display_name or canonical_display_name_plus_email) on
WG21PaperAuthorProfile and use
WG21PaperAuthorProfile.objects.get_or_create(canonical_field=...,
defaults={...}) so concurrent creates cannot produce duplicates; ensure you
update the function to use the chosen approach and remove the race by
referencing display_name, email, candidates and WG21PaperAuthorProfile in the
change.
- Around line 393-405: When multiple candidates match display_name but none own
email_val, do not pick candidates[0] and attach the email; instead create a
fresh profile for that email and attach the email to it. Replace the fallback
block that sets profile = candidates[0] and calls add_email(...) with logic that
constructs a new profile (using the existing profile-creation helper in this
module or Profile model factory), call add_email(new_profile, email_val,
is_primary=...), and return the new profile (and the created flag, e.g. True)
rather than mutating an arbitrary candidate; keep references to candidates,
email_val and add_email to locate and update the code.
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The GCS prefix used to list new uploads is incorrect:
update the raw_prefix variable in main.py from "raw/wg21_paper_tracker/" to the
uploader's actual prefix "raw/wg21_papers/" so client.list_blobs(bucket,
prefix=raw_prefix) will discover newly uploaded PDFs; verify converted_prefix
remains "converted/wg21_papers/" and re-run the listing logic (the change
affects the raw_prefix symbol used before calling client.list_blobs).
- Around line 39-45: The current loop over error_patterns incorrectly treats
"error:" and "exception:" as immediate failures (the `if pattern in ("error:",
"exception:"): return False` branch), bypassing the proximity check; remove that
special-case and apply the same proximity logic to all patterns by finding `idx
= first_part.find(pattern)` and returning False only if `idx` is within the
intended proximity threshold (use 1000 chars as per review or the existing
threshold variable), i.e., replace the special-case with the same `idx` check
for `error_patterns`, referencing the loop variables `error_patterns`,
`first_part`, `pattern`, and `idx`.
---
Nitpick comments:
In `@cppa_user_tracker/tests/test_services.py`:
- Around line 638-648: The test must also assert the documented side effect that
nomatch@example.com is attached to the returned profile: after calling
services.get_or_create_wg21_paper_author_profile(...), add an assertion that the
returned profile has that email associated (e.g. assert
services.get_emails(profile) contains "nomatch@example.com" or assert
profile.emails.filter(address="nomatch@example.com").exists()), referencing
WG21PaperAuthorProfile and services.get_or_create_wg21_paper_author_profile to
locate where to add the assertion.
In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 97-102: The log message uses str(e) which static analysis flags;
update the exception formatting in the PDFPlumber conversion catch blocks to use
the f-string conversion flag (!s) instead of str(), i.e., replace occurrences of
str(e) with {e!s} in the logger.error calls that reference pdf_path.name and the
earlier similar logger call; keep the rest of the logger.error parameters
(exc_info=True) unchanged and ensure you update both places where the exception
variable e is interpolated.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 48-97: Add unit tests that exercise the management command's
branches by calling the command via Django's call_command and mocking
run_tracker_pipeline and trigger_cloud_run_job: test dry-run by passing
--dry-run and asserting run_tracker_pipeline/trigger_cloud_run_job are not
called; test no-new-papers by mocking run_tracker_pipeline to return 0 and
asserting trigger_cloud_run_job is not called; test Cloud Run trigger by mocking
run_tracker_pipeline to return >0 and configuring settings (GCP_PROJECT_ID,
WG21_CLOUD_RUN_JOB_NAME, WG21_GCS_BUCKET, WG21_CLOUD_RUN_ENABLED=True) and
asserting trigger_cloud_run_job is called with (project_id, location, job_name);
and test failure propagation by having trigger_cloud_run_job raise and asserting
the command raises the exception. Use mocks/patch for run_tracker_pipeline and
trigger_cloud_run_job and override_settings for the required WG21_* settings to
isolate behavior.
In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 28-31: Remove the redundant db_index=True from the mailing_date
field definition (the models.CharField named "mailing_date" with max_length=7
and unique=True) in the migration so the unique constraint remains but no
duplicate index is requested; edit the tuple for "mailing_date" in
0001_initial.py to drop db_index=True and leave unique=True so PostgreSQL
creates only the unique index.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The test currently patches _upload_to_gcs but doesn't
assert it was called; update the happy-path test to make the mock observable by
assigning the patch to a variable (e.g., mock_upload) when patching
"wg21_paper_tracker.pipeline._upload_to_gcs", run run_tracker_pipeline(), then
assert mock_upload was called the expected number of times and/or with the
expected args (e.g., mock_upload.assert_called_once() or
mock_upload.assert_called_with(...)) to ensure run_tracker_pipeline() actually
performs the GCS upload side effect.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b12acd34-fd02-4c42-8618-7660b0726e29
📒 Files selected for processing (42)
.github/workflows/actions.ymlconfig/settings.pyconfig/test_settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
…g race recovery, logging and Ruf cppalliance#24
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
wg21_paper_tracker/cloud_run_job/main.py (2)
77-80:⚠️ Potential issue | 🔴 CriticalUse the uploader’s raw prefix here.
main()listsraw/wg21_paper_tracker/, but the pipeline uploads PDFs underraw/wg21_papers/.... With the current prefix, this job never sees newly uploaded files, so the conversion flow is effectively disconnected.🐛 Minimal fix
- raw_prefix = "raw/wg21_paper_tracker/" + raw_prefix = "raw/wg21_papers/" converted_prefix = "converted/wg21_papers/"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The list_blobs call in main() is using the wrong raw prefix ("raw/wg21_paper_tracker/") so newly uploaded PDFs never get picked up; update the raw_prefix variable used by client.list_blobs to the uploader’s prefix ("raw/wg21_papers/") (and keep converted_prefix as-is) so client.list_blobs(bucket, prefix=raw_prefix) enumerates the actual uploaded files.
39-45:⚠️ Potential issue | 🟠 MajorDon’t reject valid content just because it mentions
error:orexception:later in the page.Line 41 currently returns
Falsefor any occurrence of those substrings anywhere in the first 1000 characters. That can discard legitimate papers whose opening text contains headings or examples likeException:/error:. The special-case should only reject when those markers appear at the start of the content.💡 Tighten the heuristic
for pattern in error_patterns: - if pattern in first_part: - if pattern in ("error:", "exception:"): - return False - idx = first_part.find(pattern) - if idx < 100: - return False + idx = first_part.find(pattern) + if idx == -1: + continue + if pattern in ("error:", "exception:"): + if idx == 0: + return False + continue + if idx < 100: + return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The loop over error_patterns is rejecting content whenever "error:" or "exception:" appears anywhere early in first_part; tighten it so those two markers only reject if they start the content: replace the inner special-case with a start check (e.g., if pattern in ("error:", "exception:"): if first_part.lstrip().lower().startswith(pattern): return False) while leaving the existing idx-based rejection (idx < 100) for other patterns; operate on the same variables error_patterns, first_part and idx.
🧹 Nitpick comments (5)
wg21_paper_tracker/migrations/0001_initial.py (1)
28-31: Consider removing redundantdb_index=Trueon unique field.The
mailing_datefield hasunique=True, which automatically creates a B-tree index in PostgreSQL. The explicitdb_index=Trueis redundant. Based on project learning, unique constraints already provide indexing.♻️ Proposed fix
( "mailing_date", - models.CharField(db_index=True, max_length=7, unique=True), + models.CharField(max_length=7, unique=True), ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/migrations/0001_initial.py` around lines 28 - 31, The migration defines the "mailing_date" models.CharField with both db_index=True and unique=True; remove the redundant db_index=True from that field declaration (leave unique=True) so the unique constraint creates the index implicitly; update the migration operation that contains the "mailing_date" CharField (in the initial migration class) to drop the db_index argument and regenerate or confirm migrations if necessary.wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)
241-243: Uselogging.exceptionhere as well.Same issue -
logging.errorloses the traceback context.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error("Error for paper_id=%s: %s", paper_id, e) + logger.exception("Error for paper_id=%s", paper_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 241 - 243, The except block in import_wg21_metadata_from_csv.py is logging exceptions with logger.error which drops traceback; replace the logger.error("Error for paper_id=%s: %s", paper_id, e) call in the except Exception as e handler with logger.exception("Error for paper_id=%s", paper_id) (or logger.exception with an explicit message) so the full traceback is recorded; keep the stats["skipped"] += 1 line unchanged and do not re-raise unless intended.
163-170: Uselogging.exceptionfor full traceback visibility.The static analysis correctly flags that
logging.errorwith manual%sformatting loses the traceback. Uselogging.exceptionto capture the full stack trace for debugging.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error( - "Error parsing document date for paper_id=%s: %s", + logger.exception( + "Error parsing document date for paper_id=%s", paper_id, - e, ) continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 163 - 170, The except block currently logs the error with logger.error losing the traceback; replace the logger.error call inside the except Exception as e in import_wg21_metadata_from_csv.py with logger.exception(...) so the full stack trace is recorded (keep the same message and paper_id argument), leaving stats["skipped"] increment and the continue intact; reference symbols: logger, stats, paper_id, and the except Exception as e block where document date parsing is handled.wg21_paper_tracker/tests/test_pipeline.py (1)
146-165: Consider usingpytest-mockto reduce nesting depth.The deeply nested
with patch(...)blocks work correctly but reduce readability. Usingpytest-mock'smockerfixture would flatten this significantly.♻️ Example with mocker fixture
`@pytest.mark.django_db` def test_run_tracker_pipeline_downloads_new_papers(tmp_path, mocker): """run_tracker_pipeline downloads papers for new mailings and returns count.""" from wg21_paper_tracker.models import WG21Mailing WG21Mailing.objects.create(mailing_date="2025-01", title="Previous") mailings = [...] papers = [...] mocker.patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings) mocker.patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers) mocker.patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path) mocker.patch("wg21_paper_tracker.pipeline._download_file", return_value=True) mocker.patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket") mocker.patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True) n = run_tracker_pipeline() assert n == 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The nested with-patch blocks in the test should be flattened by using the pytest-mock "mocker" fixture: in the test function (e.g., test_run_tracker_pipeline_downloads_new_papers) replace the multiple with patch(...) contexts for wg21_paper_tracker.pipeline.fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs with mocker.patch calls that return the same values (mailings, papers, tmp_path, True, "test-bucket", True respectively), then call run_tracker_pipeline() and assert the result remains 1; this reduces nesting while keeping the same mocked behavior.wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
95-97: Remove redundant exception object fromlogging.exception.
logging.exceptionautomatically includes the exception info from the current context. Passingeis redundant.♻️ Proposed fix
except Exception as e: - logger.exception("WG21 Paper Tracker failed: %s", e) + logger.exception("WG21 Paper Tracker failed.") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 95 - 97, In the except block inside the RunWG21PaperTracker command (in run_wg21_paper_tracker.py) remove the redundant exception object from the logger.exception call: replace logger.exception("WG21 Paper Tracker failed: %s", e) with logger.exception("WG21 Paper Tracker failed") so the current exception context is logged automatically; leave the subsequent raise in place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile
currently say that when multiple profiles share the same display_name and none
match the supplied email the function "returns the first", but tests and
implementation actually create a new WG21PaperAuthorProfile in that case; update
the description to explicitly state that when multiple name-matching profiles
exist and the provided email matches none, the function will create and return a
new WG21PaperAuthorProfile (and return created=True), and also note the
side-effect that if email is supplied and the resolved or newly created profile
lacks that email it will be associated with the profile.
In `@wg21_paper_tracker/models.py`:
- Line 12: Remove the redundant db_index on the mailing_date field: since
mailing_date is declared with unique=True (models.CharField(..., unique=True)),
drop the db_index=True argument so the field is defined as mailing_date =
models.CharField(max_length=7, unique=True); update any related migrations
accordingly so only the unique index remains.
In `@wg21_paper_tracker/pipeline.py`:
- Around line 231-270: The code assumes each scraped paper dict has required
keys and may KeyError during selection or creation; first filter p_list to only
entries that have a valid "type" key usable by format_priority and required
fields ("url","title","subgroup","authors","filename") with correct types,
logging and skipping any malformed entries (use .get and isinstance checks);
then compute best_paper from the filtered list (so min(..., key=lambda x:
format_priority(x["type"])) cannot KeyError). When building filename and calling
get_or_create_paper, access fields via .get and validate/normalize author list
and subgroup before passing them into get_or_create_paper, and if any required
field is missing or invalid log a warning with pid and continue instead of
indexing.
In `@wg21_paper_tracker/services.py`:
- Around line 154-158: In mark_paper_downloaded, the sanitized paper_id can be
empty and should be rejected before proceeding; after the existing paper_id =
(paper_id or "").strip().lower() line, add a validation that raises ValueError
(e.g., "paper_id is required") if paper_id == "", so malformed/blank IDs never
reach _normalize_year or the update query; keep the existing year validation and
use the same normalized year flow afterward.
---
Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The list_blobs call in main() is using the wrong raw prefix
("raw/wg21_paper_tracker/") so newly uploaded PDFs never get picked up; update
the raw_prefix variable used by client.list_blobs to the uploader’s prefix
("raw/wg21_papers/") (and keep converted_prefix as-is) so
client.list_blobs(bucket, prefix=raw_prefix) enumerates the actual uploaded
files.
- Around line 39-45: The loop over error_patterns is rejecting content whenever
"error:" or "exception:" appears anywhere early in first_part; tighten it so
those two markers only reject if they start the content: replace the inner
special-case with a start check (e.g., if pattern in ("error:", "exception:"):
if first_part.lstrip().lower().startswith(pattern): return False) while leaving
the existing idx-based rejection (idx < 100) for other patterns; operate on the
same variables error_patterns, first_part and idx.
---
Nitpick comments:
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 241-243: The except block in import_wg21_metadata_from_csv.py is
logging exceptions with logger.error which drops traceback; replace the
logger.error("Error for paper_id=%s: %s", paper_id, e) call in the except
Exception as e handler with logger.exception("Error for paper_id=%s", paper_id)
(or logger.exception with an explicit message) so the full traceback is
recorded; keep the stats["skipped"] += 1 line unchanged and do not re-raise
unless intended.
- Around line 163-170: The except block currently logs the error with
logger.error losing the traceback; replace the logger.error call inside the
except Exception as e in import_wg21_metadata_from_csv.py with
logger.exception(...) so the full stack trace is recorded (keep the same message
and paper_id argument), leaving stats["skipped"] increment and the continue
intact; reference symbols: logger, stats, paper_id, and the except Exception as
e block where document date parsing is handled.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block inside the RunWG21PaperTracker command
(in run_wg21_paper_tracker.py) remove the redundant exception object from the
logger.exception call: replace logger.exception("WG21 Paper Tracker failed: %s",
e) with logger.exception("WG21 Paper Tracker failed") so the current exception
context is logged automatically; leave the subsequent raise in place.
In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 28-31: The migration defines the "mailing_date" models.CharField
with both db_index=True and unique=True; remove the redundant db_index=True from
that field declaration (leave unique=True) so the unique constraint creates the
index implicitly; update the migration operation that contains the
"mailing_date" CharField (in the initial migration class) to drop the db_index
argument and regenerate or confirm migrations if necessary.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The nested with-patch blocks in the test should be
flattened by using the pytest-mock "mocker" fixture: in the test function (e.g.,
test_run_tracker_pipeline_downloads_new_papers) replace the multiple with
patch(...) contexts for wg21_paper_tracker.pipeline.fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET,
and _upload_to_gcs with mocker.patch calls that return the same values
(mailings, papers, tmp_path, True, "test-bucket", True respectively), then call
run_tracker_pipeline() and assert the result remains 1; this reduces nesting
while keeping the same mocked behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f538d9c1-4cc2-44ea-98cf-95ec72879f2b
📒 Files selected for processing (42)
.github/workflows/actions.ymlconfig/settings.pyconfig/test_settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
docs/service_api/cppa_user_tracker.md (1)
46-48:⚠️ Potential issue | 🟡 MinorDocumentation doesn't match implementation for the multiple-match + unmatched-email case.
The description says "otherwise returns the first" when multiple profiles exist and no email matches. However, reviewing
cppa_user_tracker/services.py(lines 394-400), when multiple candidates exist with the samedisplay_nameand the supplied email matches none, the implementation creates a new profile rather than returning the first candidate.📘 Suggested doc fix
-| `get_or_create_wg21_paper_author_profile` | `display_name: str`, `email: str \| None = None` | `tuple[WG21PaperAuthorProfile, bool]` | Resolve by display_name (optional email for disambiguation). If no profile exists, creates one and adds email if provided. If one exists, or multiple exist and one matches the email, returns that profile; otherwise returns the first. **Side effect:** if `email` is supplied and the resolved or created profile does not already have that email, the function associates it with the profile (so existing profiles may be updated). Returns the profile and a boolean indicating creation. Use when linking paper authors so that same name + same email link to the same profile. | +| `get_or_create_wg21_paper_author_profile` | `display_name: str`, `email: str \| None = None` | `tuple[WG21PaperAuthorProfile, bool]` | Resolve by display_name (optional email for disambiguation). If no profile exists, creates one and adds email if provided. If one exists, returns it (and adds email if needed). If multiple exist and one matches the email, returns that profile. If multiple exist and no email is provided, returns the first. If multiple exist and the supplied email matches none, creates a new profile with that email. **Side effect:** if `email` is supplied and the resolved or created profile does not already have that email, the function associates it with the profile (so existing profiles may be updated). Returns the profile and a boolean indicating creation. Use when linking paper authors so that same name + same email link to the same profile. |🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/service_api/cppa_user_tracker.md` around lines 46 - 48, The documentation for get_or_create_wg21_paper_author_profile is incorrect about the multiple-match + unmatched-email case: update the description to state that when multiple profiles share the same display_name and the supplied email does not match any existing profile, the implementation creates a new profile (and returns it with created=True) rather than returning the first existing candidate; also ensure the "Side effect" text clarifies that supplying an email may either associate it with an existing resolved profile or lead to creation of a new profile when no email match is found.
🧹 Nitpick comments (7)
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)
241-243: Uselogger.exceptionfor better tracebacks.Same as above—
logger.exceptionwill capture the full stack trace.🧹 Suggested fix
except Exception as e: stats["skipped"] += 1 - logger.error("Error for paper_id=%s: %s", paper_id, e) + logger.exception("Error for paper_id=%s", paper_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 241 - 243, In the except block inside import_wg21_metadata_from_csv.py (where stats["skipped"] is incremented and the current line is logger.error("Error for paper_id=%s: %s", paper_id, e)), replace logger.error with logger.exception so the full traceback is captured; keep the stats["skipped"] increment and the contextual message including paper_id to preserve existing metrics and logs while providing the full exception stack trace for debugging.
163-170: Uselogger.exceptionfor better tracebacks.When catching exceptions,
logger.exceptionautomatically includes the stack trace, which aids debugging.🧹 Suggested fix
except Exception as e: stats["skipped"] += 1 - logger.error( - "Error parsing document date for paper_id=%s: %s", + logger.exception( + "Error parsing document date for paper_id=%s", paper_id, - e, ) continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 163 - 170, The except block in import_wg21_metadata_from_csv.py that handles date parsing (inside the function handling CSV import where stats["skipped"] is incremented) should use logger.exception instead of logger.error so the stack trace is captured; replace the logger.error call in the except Exception as e block with logger.exception("Error parsing document date for paper_id=%s", paper_id) (you can still increment stats["skipped"] and continue as before).wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
95-96: Remove redundant exception object fromlogging.exception.
logging.exceptionautomatically includes exception info, so passingeis redundant.🧹 Suggested fix
except Exception as e: - logger.exception("WG21 Paper Tracker failed: %s", e) + logger.exception("WG21 Paper Tracker failed.") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 95 - 96, The except block in run_wg21_paper_tracker.py currently calls logger.exception("WG21 Paper Tracker failed: %s", e) which redundantly passes the exception object; change the call in the except handler to logger.exception("WG21 Paper Tracker failed") (i.e., remove the "%s" placeholder and the e argument) so the exception info is logged via logging.exception's built-in traceback handling; apply the same change if similar logger.exception usages exist in the same command.wg21_paper_tracker/tests/test_pipeline.py (1)
146-165: Consider usingExitStackto reduce nesting depth.The seven nested
with patch(...)statements make this test harder to read.contextlib.ExitStackcan flatten this.♻️ Suggested refactor
+from contextlib import ExitStack + `@pytest.mark.django_db` def test_run_tracker_pipeline_downloads_new_papers(tmp_path): """run_tracker_pipeline downloads papers for new mailings and returns count.""" from wg21_paper_tracker.models import WG21Mailing WG21Mailing.objects.create(mailing_date="2025-01", title="Previous") mailings = [...] papers = [...] - with patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings): - with patch( - "wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers - ): - with patch( - "wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path - ): - with patch( - "wg21_paper_tracker.pipeline._download_file", return_value=True - ): - with patch( - "wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", - "test-bucket", - ): - with patch( - "wg21_paper_tracker.pipeline._upload_to_gcs", - return_value=True, - ): - n = run_tracker_pipeline() + with ExitStack() as stack: + stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path)) + stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True)) + stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket")) + stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True)) + n = run_tracker_pipeline() assert n == 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The test currently nests seven with-patch blocks around run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, _upload_to_gcs) which is hard to read; refactor the test to use contextlib.ExitStack to enter all patches in a single flattened block, creating Patchers for "wg21_paper_tracker.pipeline.fetch_all_mailings", "wg21_paper_tracker.pipeline.fetch_papers_for_mailing", "wg21_paper_tracker.pipeline.get_raw_dir", "wg21_paper_tracker.pipeline._download_file", "wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", and "wg21_paper_tracker.pipeline._upload_to_gcs", set their return_value or value as before, and then call run_tracker_pipeline() inside the ExitStack context before asserting n == 1.wg21_paper_tracker/tests/test_services.py (1)
49-70: Remove redundantdbfixture parameter.The
@pytest.mark.django_dbmarker already enables database access. Thedbparameter is unnecessary and triggers static analysis warnings across multiple tests.♻️ Suggested change (apply to all similar tests)
`@pytest.mark.django_db` `@patch`("wg21_paper_tracker.services.get_or_create_wg21_paper_author_profile") -def test_get_or_create_paper_creates_new(mock_profile, db): +def test_get_or_create_paper_creates_new(mock_profile):Apply the same change to:
test_get_or_create_paper_calls_author_profile_for_each_author,test_get_or_create_paper_normalizes_paper_id_lowercase,test_get_or_create_paper_gets_existing_and_updates,test_get_or_create_paper_year_none_stored_as_zero,test_get_or_create_paper_same_paper_id_different_year_creates_two,test_get_or_create_paper_sets_author_order,test_mark_paper_downloaded_requires_year,test_mark_paper_downloaded_sets_flag,test_mark_paper_downloaded_normalizes_paper_id.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_services.py` around lines 49 - 70, Remove the redundant `db` fixture parameter from the affected test function signatures (it’s unnecessary because `@pytest.mark.django_db` already provides DB access). Update the tests named test_get_or_create_paper_creates_new, test_get_or_create_paper_calls_author_profile_for_each_author, test_get_or_create_paper_normalizes_paper_id_lowercase, test_get_or_create_paper_gets_existing_and_updates, test_get_or_create_paper_year_none_stored_as_zero, test_get_or_create_paper_same_paper_id_different_year_creates_two, test_get_or_create_paper_sets_author_order, test_mark_paper_downloaded_requires_year, test_mark_paper_downloaded_sets_flag, and test_mark_paper_downloaded_normalizes_paper_id by removing the trailing ", db" parameter from their definitions so only their mocked/needed parameters (e.g., mock_profile) remain; keep the `@pytest.mark.django_db` marker in place and run tests to confirm no behavior changes.wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)
11-20: Consider removing unusedInputFormatimport.
InputFormatis imported but never used in the module. If it's not needed for future expansion, removing it would clean up the code.♻️ Suggested change
try: from docling.document_converter import DocumentConverter # type: ignore[import-untyped] - from docling.datamodel.base_models import InputFormat # type: ignore[import-untyped] DOCLING_AVAILABLE = True except ImportError: DocumentConverter = None # type: ignore[assignment,misc] - InputFormat = None # type: ignore[assignment,misc] DOCLING_AVAILABLE = False logger.warning("Docling not available. Install with: pip install docling")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around lines 11 - 20, The module imports InputFormat but never uses it; remove the unused import to clean up the code by deleting "from docling.datamodel.base_models import InputFormat" and any related type: ignore comments, leaving the try/except to only import DocumentConverter and set DocumentConverter = None, InputFormat = None assignments should be removed as well (keep DocumentConverter, DOCLING_AVAILABLE, and the logger.warning unchanged).wg21_paper_tracker/services.py (1)
80-109: Potential race condition in placeholder upgrade.When
year_val > 0, the code first queries for an exact match, then queries for a placeholder (year=0). Two concurrent requests with the samepaper_idbut a real year could both find the placeholder and both attempt to promote it—one succeeds, the other hits anIntegrityErroron the(paper_id, year)unique constraint.Consider using
select_for_update()on the placeholder lookup to serialize concurrent promotions:💡 Suggested fix
placeholder = WG21Paper.objects.filter(paper_id=paper_id, year=0).first() + placeholder = WG21Paper.objects.select_for_update().filter( + paper_id=paper_id, year=0 + ).first()Alternatively, wrap the promotion in a try/except for
IntegrityErrorand re-fetch the existing row if it was concurrently created.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/services.py` around lines 80 - 109, The placeholder-to-real-year promotion can race: when year_val > 0, two threads can both fetch the placeholder (WG21Paper with year=0) and attempt to set year=year_val causing an IntegrityError on the unique (paper_id, year) constraint; fix by serializing or handling the race—either acquire a DB row lock when reading the placeholder using select_for_update() on the WG21Paper placeholder query (inside a transaction) before updating, or wrap the promotion block in a try/except catching IntegrityError, and on IntegrityError re-query WG21Paper for the exact (paper_id, year_val) row and use that (or call _update_paper) so only one caller creates/promotes the record.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The documentation for
get_or_create_wg21_paper_author_profile is incorrect about the multiple-match +
unmatched-email case: update the description to state that when multiple
profiles share the same display_name and the supplied email does not match any
existing profile, the implementation creates a new profile (and returns it with
created=True) rather than returning the first existing candidate; also ensure
the "Side effect" text clarifies that supplying an email may either associate it
with an existing resolved profile or lead to creation of a new profile when no
email match is found.
---
Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 11-20: The module imports InputFormat but never uses it; remove
the unused import to clean up the code by deleting "from
docling.datamodel.base_models import InputFormat" and any related type: ignore
comments, leaving the try/except to only import DocumentConverter and set
DocumentConverter = None, InputFormat = None assignments should be removed as
well (keep DocumentConverter, DOCLING_AVAILABLE, and the logger.warning
unchanged).
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 241-243: In the except block inside
import_wg21_metadata_from_csv.py (where stats["skipped"] is incremented and the
current line is logger.error("Error for paper_id=%s: %s", paper_id, e)), replace
logger.error with logger.exception so the full traceback is captured; keep the
stats["skipped"] increment and the contextual message including paper_id to
preserve existing metrics and logs while providing the full exception stack
trace for debugging.
- Around line 163-170: The except block in import_wg21_metadata_from_csv.py that
handles date parsing (inside the function handling CSV import where
stats["skipped"] is incremented) should use logger.exception instead of
logger.error so the stack trace is captured; replace the logger.error call in
the except Exception as e block with logger.exception("Error parsing document
date for paper_id=%s", paper_id) (you can still increment stats["skipped"] and
continue as before).
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-96: The except block in run_wg21_paper_tracker.py currently
calls logger.exception("WG21 Paper Tracker failed: %s", e) which redundantly
passes the exception object; change the call in the except handler to
logger.exception("WG21 Paper Tracker failed") (i.e., remove the "%s" placeholder
and the e argument) so the exception info is logged via logging.exception's
built-in traceback handling; apply the same change if similar logger.exception
usages exist in the same command.
In `@wg21_paper_tracker/services.py`:
- Around line 80-109: The placeholder-to-real-year promotion can race: when
year_val > 0, two threads can both fetch the placeholder (WG21Paper with year=0)
and attempt to set year=year_val causing an IntegrityError on the unique
(paper_id, year) constraint; fix by serializing or handling the race—either
acquire a DB row lock when reading the placeholder using select_for_update() on
the WG21Paper placeholder query (inside a transaction) before updating, or wrap
the promotion block in a try/except catching IntegrityError, and on
IntegrityError re-query WG21Paper for the exact (paper_id, year_val) row and use
that (or call _update_paper) so only one caller creates/promotes the record.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The test currently nests seven with-patch blocks around
run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, _upload_to_gcs) which is
hard to read; refactor the test to use contextlib.ExitStack to enter all patches
in a single flattened block, creating Patchers for
"wg21_paper_tracker.pipeline.fetch_all_mailings",
"wg21_paper_tracker.pipeline.fetch_papers_for_mailing",
"wg21_paper_tracker.pipeline.get_raw_dir",
"wg21_paper_tracker.pipeline._download_file",
"wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", and
"wg21_paper_tracker.pipeline._upload_to_gcs", set their return_value or value as
before, and then call run_tracker_pipeline() inside the ExitStack context before
asserting n == 1.
In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 49-70: Remove the redundant `db` fixture parameter from the
affected test function signatures (it’s unnecessary because
`@pytest.mark.django_db` already provides DB access). Update the tests named
test_get_or_create_paper_creates_new,
test_get_or_create_paper_calls_author_profile_for_each_author,
test_get_or_create_paper_normalizes_paper_id_lowercase,
test_get_or_create_paper_gets_existing_and_updates,
test_get_or_create_paper_year_none_stored_as_zero,
test_get_or_create_paper_same_paper_id_different_year_creates_two,
test_get_or_create_paper_sets_author_order,
test_mark_paper_downloaded_requires_year, test_mark_paper_downloaded_sets_flag,
and test_mark_paper_downloaded_normalizes_paper_id by removing the trailing ",
db" parameter from their definitions so only their mocked/needed parameters
(e.g., mock_profile) remain; keep the `@pytest.mark.django_db` marker in place and
run tests to confirm no behavior changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 3fb307e5-8b22-4b80-933d-1df3e7d059ca
📒 Files selected for processing (42)
.github/workflows/actions.ymlconfig/settings.pyconfig/test_settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
…ne validation, mailing race recovery, logging, and API docs cppalliance#24
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
wg21_paper_tracker/cloud_run_job/main.py (2)
77-80:⚠️ Potential issue | 🔴 CriticalStill using the wrong raw GCS prefix.
This job lists
raw/wg21_paper_tracker/, but the uploader writes PDFs underraw/wg21_papers/. With the current prefix, the conversion job never sees newly uploaded papers, so the end-to-end pipeline is effectively broken.💡 Suggested fix
- raw_prefix = "raw/wg21_paper_tracker/" + raw_prefix = "raw/wg21_papers/"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The code sets raw_prefix = "raw/wg21_paper_tracker/" but the uploader stores PDFs under "raw/wg21_papers/", so client.list_blobs(bucket, prefix=raw_prefix) never finds new files; change the raw_prefix value to "raw/wg21_papers/" (leave converted_prefix as-is) so the list_blobs call in main.py uses the correct GCS prefix and the conversion job sees uploaded PDFs.
39-45:⚠️ Potential issue | 🟠 MajorDon't reject any paper that merely mentions
error:orexception:.Line 41 now marks content invalid whenever either token appears anywhere in the first 1000 characters. WG21 papers on exceptions/error handling can legitimately contain those strings near the top, so a successful conversion gets discarded and falls through to slower fallbacks or an empty result.
💡 Suggested fix
for pattern in error_patterns: - if pattern in first_part: - if pattern in ("error:", "exception:"): - return False - idx = first_part.find(pattern) - if idx < 100: - return False + idx = first_part.find(pattern) + if idx != -1 and idx < 100: + return False return True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The current loop over error_patterns wrongfully rejects content when pattern is "error:" or "exception:"; change the branch inside the for-loop that checks pattern in ("error:", "exception:") to skip rejecting (i.e., continue) instead of returning False so papers mentioning those tokens near the top are not discarded; keep the existing idx = first_part.find(pattern) and idx < 100 check for all other patterns (use variables error_patterns, first_part, pattern, idx to locate the code).
🧹 Nitpick comments (5)
wg21_paper_tracker/tests/test_workspace.py (1)
43-86: Add a regression test for the year-only raw path.
get_raw_dir()now supportsmailing_date=None, but this suite only covers dated and invalid inputs. A smallget_raw_dir(None, 2025)case would lock in the<raw>/wg21_paper_tracker/2025/fallback that was just added.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_workspace.py` around lines 43 - 86, Add a regression test that calls get_raw_dir(None, 2025) to verify the year-only fallback path is created: patch wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, call get_raw_dir with mailing_date=None and year 2025, and assert the returned Path equals the expected RAW_DIR / "wg21_paper_tracker" / "2025" and that it exists/is a directory (and optionally assert path.name == "2025"); reference the existing test functions (e.g., test_get_raw_dir_returns_mailing_date_subdir, test_get_raw_dir_creates_parents) to place the new test alongside them in tests/test_workspace.py.wg21_paper_tracker/tests/test_pipeline.py (1)
146-164: Consider usingExitStackor pytest-mock'smockerfixture to flatten nested context managers.The deeply nested
with patch(...)blocks are functional but reduce readability. This is a minor style suggestion.♻️ Example using contextlib.ExitStack
from contextlib import ExitStack with ExitStack() as stack: stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings)) stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers)) stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path)) stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True)) stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket")) stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True)) n = run_tracker_pipeline()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 164, Replace the deeply nested with-patch blocks in the test around run_tracker_pipeline() by flattening them using contextlib.ExitStack (or pytest-mock's mocker fixture) to improve readability; enter_context for each patch targeting wg21_paper_tracker.pipeline.fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs, then call run_tracker_pipeline() inside the single ExitStack context so all patches are active without nested indentation.wg21_paper_tracker/pipeline.py (3)
299-305: Moveparse_dateimport to module level.Importing inside the loop executes the import machinery repeatedly. Moving it to the top of the file with other imports improves clarity and avoids repeated lookup overhead.
♻️ Suggested fix
Add to imports at top of file:
from django.utils.dateparse import parse_dateThen remove line 300 and use
parse_datedirectly:doc_date_str = best_paper.get("document_date") - from django.utils.dateparse import parse_date - doc_date = None if doc_date_str: try:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/pipeline.py` around lines 299 - 305, The local import of parse_date inside the loop should be moved to the module-level imports to avoid repeated import overhead: add "from django.utils.dateparse import parse_date" at the top of the file alongside other imports, then remove the inline import inside the block that processes best_paper (the lines around doc_date_str = best_paper.get("document_date") and the try/if that calls parse_date) so the code uses the module-level parse_date directly.
94-100: Uselogging.exceptionfor final download failure to capture full traceback.After exhausting retries, the traceback would help diagnose connection issues, SSL errors, etc.
♻️ Suggested fix
else: - logger.error( + logger.exception( "Failed to download %s after %d attempts: %s", url, DOWNLOAD_MAX_RETRIES, e, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/pipeline.py` around lines 94 - 100, Replace the final logger.error call that logs the failed download (the block referencing logger, url, DOWNLOAD_MAX_RETRIES and exception variable e) with a call that records the full traceback — e.g. use logger.exception (or logger.error(..., exc_info=True)) in the download retry/failure handler so the exception trace is captured when the download in the pipeline download/retry logic fails after DOWNLOAD_MAX_RETRIES.
49-51: Uselogging.exceptionto preserve traceback for GCS upload failures.
logging.errordiscards the stack trace; switching tologging.exception(or passingexc_info=True) preserves the full traceback, which is valuable for debugging GCS permission or network issues.♻️ Suggested fix
except Exception as e: - logger.error("Failed to upload to GCS: %s", e) + logger.exception("Failed to upload to GCS: %s", e) return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/pipeline.py` around lines 49 - 51, The except block that catches exceptions during GCS upload uses logger.error("Failed to upload to GCS: %s", e) which discards the traceback; update that handler in pipeline.py (the except block around the GCS upload logic) to call logger.exception("Failed to upload to GCS") or logger.error("Failed to upload to GCS", exc_info=True) so the full traceback is preserved for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 186-200: convert_page_with_openai currently makes a single
requests.post call and immediately returns None on any exception; add a bounded
retry loop around the requests.post/response.raise_for_status/json parsing (in
function convert_page_with_openai) to handle transient failures (timeouts,
requests.exceptions like Timeout/ConnectionError, HTTP 429 and 5xx responses)
with exponential backoff (e.g., 2 retries with delays like 1s then 2s) before
falling back to returning None; ensure you still call
response.raise_for_status() after each attempt, break and return
markdown_content on success, and keep the existing logger.error with exc_info if
all retries fail.
In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 17-28: The Dockerfile in wg21_paper_tracker/cloud_run_job assumes
the build context is that subdirectory (uses COPY requirements.txt . and CMD
["python","main.py"]) but docs tell users to build from the repo root, causing
wrong files to be copied and the entrypoint to fail; fix by either updating the
documentation to instruct building with the cloud_run_job directory as context
(e.g., docker build -f wg21_paper_tracker/cloud_run_job/Dockerfile
wg21_paper_tracker/cloud_run_job/) or modify the Dockerfile to use
repo-root-aware paths (replace COPY requirements.txt . with COPY
wg21_paper_tracker/cloud_run_job/requirements.txt . and change CMD
["python","main.py"] to CMD
["python","wg21_paper_tracker/cloud_run_job/main.py"]) so both build contexts
work correctly.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 152-160: The code currently computes year from the raw CSV strings
(mailing_date or document_date) before calling _parse_document_date, which
allows invalid dates like "2025-99-99" to produce a spurious year; move the call
to document_date = _parse_document_date(row.get("date", "")) before computing
year and then derive year only from the validated document_date
(document_date[:4]) or, if that is None, from mailing_date but only when
MAILING_DATE_PATTERN.match(mailing_date) succeeds; update the year_str/year
assignment that references mailing_date, MAILING_DATE_PATTERN, document_date,
and paper_id accordingly so year is None for invalid parsed dates.
In `@wg21_paper_tracker/services.py`:
- Around line 80-109: Promote placeholder rows (WG21Paper with year=0) inside a
try/except to handle concurrent unique constraint violations: import
IntegrityError from django.db, wrap the placeholder modification and
placeholder.save() in a try block, on IntegrityError re-fetch the canonical
record with WG21Paper.objects.filter(paper_id=paper_id, year=year_val).first()
(or use update_or_create) and then update that canonical record via
_update_paper or field assignments, setting created=False; ensure fallback logic
covers both the successful placeholder.save() path and the IntegrityError
recovery path so no IntegrityError bubbles up.
---
Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The code sets raw_prefix = "raw/wg21_paper_tracker/" but
the uploader stores PDFs under "raw/wg21_papers/", so client.list_blobs(bucket,
prefix=raw_prefix) never finds new files; change the raw_prefix value to
"raw/wg21_papers/" (leave converted_prefix as-is) so the list_blobs call in
main.py uses the correct GCS prefix and the conversion job sees uploaded PDFs.
- Around line 39-45: The current loop over error_patterns wrongfully rejects
content when pattern is "error:" or "exception:"; change the branch inside the
for-loop that checks pattern in ("error:", "exception:") to skip rejecting
(i.e., continue) instead of returning False so papers mentioning those tokens
near the top are not discarded; keep the existing idx = first_part.find(pattern)
and idx < 100 check for all other patterns (use variables error_patterns,
first_part, pattern, idx to locate the code).
---
Nitpick comments:
In `@wg21_paper_tracker/pipeline.py`:
- Around line 299-305: The local import of parse_date inside the loop should be
moved to the module-level imports to avoid repeated import overhead: add "from
django.utils.dateparse import parse_date" at the top of the file alongside other
imports, then remove the inline import inside the block that processes
best_paper (the lines around doc_date_str = best_paper.get("document_date") and
the try/if that calls parse_date) so the code uses the module-level parse_date
directly.
- Around line 94-100: Replace the final logger.error call that logs the failed
download (the block referencing logger, url, DOWNLOAD_MAX_RETRIES and exception
variable e) with a call that records the full traceback — e.g. use
logger.exception (or logger.error(..., exc_info=True)) in the download
retry/failure handler so the exception trace is captured when the download in
the pipeline download/retry logic fails after DOWNLOAD_MAX_RETRIES.
- Around line 49-51: The except block that catches exceptions during GCS upload
uses logger.error("Failed to upload to GCS: %s", e) which discards the
traceback; update that handler in pipeline.py (the except block around the GCS
upload logic) to call logger.exception("Failed to upload to GCS") or
logger.error("Failed to upload to GCS", exc_info=True) so the full traceback is
preserved for debugging.
In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-164: Replace the deeply nested with-patch blocks in the test
around run_tracker_pipeline() by flattening them using contextlib.ExitStack (or
pytest-mock's mocker fixture) to improve readability; enter_context for each
patch targeting wg21_paper_tracker.pipeline.fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET,
and _upload_to_gcs, then call run_tracker_pipeline() inside the single ExitStack
context so all patches are active without nested indentation.
In `@wg21_paper_tracker/tests/test_workspace.py`:
- Around line 43-86: Add a regression test that calls get_raw_dir(None, 2025) to
verify the year-only fallback path is created: patch
wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, call
get_raw_dir with mailing_date=None and year 2025, and assert the returned Path
equals the expected RAW_DIR / "wg21_paper_tracker" / "2025" and that it
exists/is a directory (and optionally assert path.name == "2025"); reference the
existing test functions (e.g., test_get_raw_dir_returns_mailing_date_subdir,
test_get_raw_dir_creates_parents) to place the new test alongside them in
tests/test_workspace.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 896ea723-15e9-4f30-b95e-bd13b4771210
📒 Files selected for processing (42)
.github/workflows/actions.ymlconfig/settings.pyconfig/test_settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
wg21_paper_tracker/cloud_run_job/main.py (1)
39-45:⚠️ Potential issue | 🟠 MajorDon't reject every early
error:/exception:literal unconditionally.Lines 41-42 invalidate content whenever either token appears anywhere in the first 1,000 characters. WG21 papers can legitimately contain diagnostics or code snippets with
error:/exception:near the top, so a valid conversion can be thrown away and force a worse fallback or an empty upload. Gate those markers by position instead of rejecting them unconditionally.Suggested fix
content_lower = content_stripped.lower() first_part = content_lower[:1000] + if first_part.startswith(("error:", "exception:")): + return False + for pattern in error_patterns: - if pattern in first_part: - if pattern in ("error:", "exception:"): - return False - idx = first_part.find(pattern) - if idx < 100: - return False + if pattern in {"error:", "exception:"}: + continue + idx = first_part.find(pattern) + if idx != -1 and idx < 100: + return False return True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The loop over error_patterns in main.py currently rejects the input unconditionally when it finds the literal "error:" or "exception:"; change this to a positional check instead: inside the for pattern in error_patterns loop, compute idx = first_part.find(pattern) and only return False if idx != -1 and idx is within the early-position threshold (e.g., idx < 1000) for the "error:" and "exception:" patterns; keep the existing early-position check for other patterns (the existing idx < 100 behavior) but unify the logic so no pattern is rejected solely because it appears anywhere in first_part without checking its index. Ensure you update the branches around error_patterns, first_part, and the idx checks accordingly.
🧹 Nitpick comments (5)
wg21_paper_tracker/cloud_run_job/main.py (1)
9-11: Relative imports alone won't fix import brittleness; the underlying issue requires clarification of execution model.The current
from converters...imports work fine for script execution from thecloud_run_jobdirectory (which is the apparent deployment scenario based on the Dockerfile). However, the suggested fix of using.converters...relative imports won't work without also makingcloud_run_joba proper Python package (i.e., adding an__init__.pyfile).If the goal is to make this code robustly importable in multiple ways, consider either:
- Keeping the current approach and documenting the expected execution directory, or
- Adding
__init__.pytocloud_run_job/and updating imports to use relative paths likefrom .converters...The latter approach would require verifying that treating
cloud_run_jobas a package doesn't conflict with other parts of the codebase.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 9 - 11, The current top-level imports in main.py (from converters.docling_converter, from converters.pdfplumber_converter, from converters.openai_converter) are brittle depending on how the module is run; either keep the absolute-style imports and add documentation/instructions that the process must be started with cloud_run_job as the working package root, or convert to package-relative imports (from .converters.docling_converter, .pdfplumber_converter, .openai_converter) and make cloud_run_job a proper package by adding an __init__.py (then verify no other code breaks by treating cloud_run_job as a package). Ensure whichever approach you choose is applied consistently and update deployment/run docs or CI to match.wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)
95-97: Minor: Redundant exception object inlogging.exception.
logging.exceptionautomatically includes the exception info, so passingeis redundant.♻️ Proposed fix
except Exception as e: - logger.exception("WG21 Paper Tracker failed: %s", e) + logger.exception("WG21 Paper Tracker failed.") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around lines 95 - 97, In the except block in run_wg21_paper_tracker.py where logger.exception("WG21 Paper Tracker failed: %s", e) is called, remove the redundant exception object from the logging call and just call logger.exception("WG21 Paper Tracker failed") so the exception info is logged automatically (leave the surrounding except Exception as e / raise as-is).wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)
162-168: Consider usinglogger.exceptionfor better debugging.When catching exceptions,
logger.exceptionautomatically includes the stack trace, which aids debugging. The currentlogger.errorcall loses this context.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error( + logger.exception( "Error parsing document date for paper_id=%s: %s", paper_id, e, ) continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 162 - 168, The except block that increments stats["skipped"] uses logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e) which omits the stack trace; replace this logger.error call with logger.exception(...) in the same except Exception as e handler so the error message for paper_id and the full traceback are logged (keep the stats["skipped"] increment and the surrounding exception handler intact).
240-242: Uselogger.exceptionfor consistent error logging.Same as above—using
logger.exceptionpreserves the stack trace for the outer exception handler.♻️ Proposed fix
except Exception as e: stats["skipped"] += 1 - logger.error("Error for paper_id=%s: %s", paper_id, e) + logger.exception("Error for paper_id=%s", paper_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py` around lines 240 - 242, Replace the logger.error call in the exception handler with logger.exception so the stack trace is preserved; specifically, in import_wg21_metadata_from_csv.py where the except Exception as e block increments stats["skipped"] and logs the error for paper_id, change the logging call that currently uses logger.error("Error for paper_id=%s: %s", paper_id, e) to use logger.exception with a descriptive message that includes paper_id (e.g., logger.exception("Error for paper_id=%s", paper_id)) so the full traceback is captured.wg21_paper_tracker/pipeline.py (1)
49-51: Consider usinglogging.exceptionfor better tracebacks.When logging errors from caught exceptions,
logging.exceptionautomatically includes the traceback, which aids debugging.♻️ Proposed fix
except Exception as e: - logger.error("Failed to upload to GCS: %s", e) + logger.exception("Failed to upload to GCS: %s", e) return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wg21_paper_tracker/pipeline.py` around lines 49 - 51, Replace the current exception logging call that uses logger.error in the except block with logger.exception so the traceback is captured; locate the except Exception as e block in pipeline.py (the GCS upload handler) and change logger.error("Failed to upload to GCS: %s", e) to logger.exception("Failed to upload to GCS") (or logger.error(..., exc_info=True)) and keep the existing return False behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 39-45: The loop over error_patterns in main.py currently rejects
the input unconditionally when it finds the literal "error:" or "exception:";
change this to a positional check instead: inside the for pattern in
error_patterns loop, compute idx = first_part.find(pattern) and only return
False if idx != -1 and idx is within the early-position threshold (e.g., idx <
1000) for the "error:" and "exception:" patterns; keep the existing
early-position check for other patterns (the existing idx < 100 behavior) but
unify the logic so no pattern is rejected solely because it appears anywhere in
first_part without checking its index. Ensure you update the branches around
error_patterns, first_part, and the idx checks accordingly.
---
Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 9-11: The current top-level imports in main.py (from
converters.docling_converter, from converters.pdfplumber_converter, from
converters.openai_converter) are brittle depending on how the module is run;
either keep the absolute-style imports and add documentation/instructions that
the process must be started with cloud_run_job as the working package root, or
convert to package-relative imports (from .converters.docling_converter,
.pdfplumber_converter, .openai_converter) and make cloud_run_job a proper
package by adding an __init__.py (then verify no other code breaks by treating
cloud_run_job as a package). Ensure whichever approach you choose is applied
consistently and update deployment/run docs or CI to match.
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 162-168: The except block that increments stats["skipped"] uses
logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
which omits the stack trace; replace this logger.error call with
logger.exception(...) in the same except Exception as e handler so the error
message for paper_id and the full traceback are logged (keep the
stats["skipped"] increment and the surrounding exception handler intact).
- Around line 240-242: Replace the logger.error call in the exception handler
with logger.exception so the stack trace is preserved; specifically, in
import_wg21_metadata_from_csv.py where the except Exception as e block
increments stats["skipped"] and logs the error for paper_id, change the logging
call that currently uses logger.error("Error for paper_id=%s: %s", paper_id, e)
to use logger.exception with a descriptive message that includes paper_id (e.g.,
logger.exception("Error for paper_id=%s", paper_id)) so the full traceback is
captured.
In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block in run_wg21_paper_tracker.py where
logger.exception("WG21 Paper Tracker failed: %s", e) is called, remove the
redundant exception object from the logging call and just call
logger.exception("WG21 Paper Tracker failed") so the exception info is logged
automatically (leave the surrounding except Exception as e / raise as-is).
In `@wg21_paper_tracker/pipeline.py`:
- Around line 49-51: Replace the current exception logging call that uses
logger.error in the except block with logger.exception so the traceback is
captured; locate the except Exception as e block in pipeline.py (the GCS upload
handler) and change logger.error("Failed to upload to GCS: %s", e) to
logger.exception("Failed to upload to GCS") (or logger.error(...,
exc_info=True)) and keep the existing return False behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0c92301a-595c-4725-92d2-d15ad8dab74f
📒 Files selected for processing (42)
.github/workflows/actions.ymlconfig/settings.pyconfig/test_settings.pycppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.pycppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.pycppa_user_tracker/models.pycppa_user_tracker/services.pycppa_user_tracker/tests/test_services.pydev-24error:docs/Schema.mddocs/operations/WG21_Cloud_Run.mddocs/service_api/cppa_user_tracker.mdrequirements.txtwg21_paper_tracker/__init__.pywg21_paper_tracker/admin.pywg21_paper_tracker/apps.pywg21_paper_tracker/cloud_run_job/Dockerfilewg21_paper_tracker/cloud_run_job/converters/__init__.pywg21_paper_tracker/cloud_run_job/converters/docling_converter.pywg21_paper_tracker/cloud_run_job/converters/openai_converter.pywg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.pywg21_paper_tracker/cloud_run_job/main.pywg21_paper_tracker/cloud_run_job/requirements.txtwg21_paper_tracker/fetcher.pywg21_paper_tracker/management/__init__.pywg21_paper_tracker/management/commands/__init__.pywg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.pywg21_paper_tracker/management/commands/run_wg21_paper_tracker.pywg21_paper_tracker/migrations/0001_initial.pywg21_paper_tracker/migrations/__init__.pywg21_paper_tracker/models.pywg21_paper_tracker/pipeline.pywg21_paper_tracker/services.pywg21_paper_tracker/tests/__init__.pywg21_paper_tracker/tests/test_commands.pywg21_paper_tracker/tests/test_fetcher.pywg21_paper_tracker/tests/test_models.pywg21_paper_tracker/tests/test_pipeline.pywg21_paper_tracker/tests/test_services.pywg21_paper_tracker/tests/test_workspace.pywg21_paper_tracker/workspace.pyworkflow/management/commands/run_all_collectors.py
Summary
Introduces the WG21 paper tracker: fetches mailings from open-std.org, downloads papers, uploads to GCS, and triggers the Cloud Run conversion job. Includes tests and several robustness improvements.
Changes
New app:
wg21_paper_trackeridorname); case-insensitivepaper_id.WG21Mailing,WG21Paper(withyearas int,unique_togetheron(paper_id, year)),WG21PaperAuthor; single migration0001_initial.get_or_create_mailing,get_or_create_paper(bypaper_id+year),mark_paper_downloaded; paper_id/year normalized (lowercase, int year).run_wg21_paper_tracker: runs pipeline, triggers Cloud Run if new papers;--dry-run; docstrings added.import_wg21_metadata_from_csv: imports metadata from CSV; handles duplicatepaper_id(IntegrityError) by updating existing row; derivesyearfrommailing_date.WG21Paperlist_display and list_filter includeyear.Summary by CodeRabbit
New Features
New Fields
Documentation
Tests
Chores