Update discord_activity tracker app #142
leostar0412 wants to merge 3 commits into cppalliance:develop from
Conversation
📝 Walkthrough

This PR introduces a comprehensive Discord data export and sync pipeline. It adds per-day channel exports via DiscordChatExporter, GitHub markdown publishing, Pinecone integration for message indexing, JSON backfill importing, new management commands, and supporting utilities for workspace path resolution and CLI bootstrapping.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI Command
    participant Exporter as DiscordChatExporter
    participant DB as Database
    participant GitHub as GitHub API
    participant Pinecone as Pinecone API
    CLI->>Exporter: run export_guild_by_day.py<br/>(per-day or chunked)
    Exporter-->>CLI: JSON files (channels/messages)
    CLI->>DB: persist_exporter_channel_payloads<br/>(batch import)
    DB-->>CLI: records created/updated
    alt markdown export enabled
        CLI->>CLI: generate markdown<br/>from DB messages
        CLI->>GitHub: push_discord_markdown_to_github<br/>(commit to branch)
        GitHub-->>CLI: success/failure
    end
    alt pinecone sync enabled
        CLI->>Pinecone: run_cppa_pinecone_sync<br/>(preprocess & index)
        Pinecone-->>CLI: documents indexed
    end
    CLI-->>CLI: summary report
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
@coderabbitai full review
✅ Actions performed

Full review triggered.
Actionable comments posted: 6
🧹 Nitpick comments (11)
discord_activity_tracker/sync/backfill_paths.py (1)
44-47: `_as_date` returns `None` for `date` inputs.

If `d` is a `date` object (not `datetime`), `isinstance(d, datetime)` is `False` and the function returns `None`. Since `datetime` is a subclass of `date`, not vice versa, this could cause issues if callers ever pass `date` objects directly. Currently callers pass results from `parse_iso_datetime`, which returns `datetime`, so this works in practice.

♻️ Handle both date and datetime inputs

```diff
 def _as_date(d: Optional[datetime]) -> Optional[date]:
     if d is None:
         return None
-    return d.date() if isinstance(d, datetime) else None
+    if isinstance(d, datetime):
+        return d.date()
+    if isinstance(d, date):
+        return d
+    return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/sync/backfill_paths.py` around lines 44 - 47, _as_date currently returns None for inputs that are date instances (only checks isinstance(d, datetime)); update _as_date to accept both datetime and date by returning d.date() when d is a datetime and returning d when d is a date, while still returning None for None inputs so callers like parse_iso_datetime continue to work; adjust the function signature/logic in _as_date to use isinstance checks against datetime and date accordingly.

discord_activity_tracker/sync/importer.py (2)
70-76: Minor: Duplicate message conversion.

`convert_exporter_message_to_dict(messages[-1])` on line 71 re-converts the last message that was already converted into `converted` on line 66. Consider reusing:

♻️ Reuse converted message

```diff
 converted = [convert_exporter_message_to_dict(msg) for msg in messages]
 processed = await _process_messages_in_batches(channel, converted)
-if messages:
-    last_msg = convert_exporter_message_to_dict(messages[-1])
+if converted:
+    last_msg = converted[-1]
     last_time = parse_datetime(last_msg.get("created_at"))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/sync/importer.py` around lines 70 - 76, The code re-converts the last message redundantly; instead of calling convert_exporter_message_to_dict(messages[-1]) again, reuse the already converted item in converted (e.g. use converted[-1] as last_msg), parse its "created_at" with parse_datetime, and pass that last_time into sync_to_async(update_channel_last_activity)(channel, last_time); update the block that sets last_msg/last_time to reference converted[-1] to avoid duplicate work.
87-90: Consider narrowing exception handling or logging the full traceback.

The broad `except Exception` catch (flagged by Ruff BLE001) continues processing after per-channel failures, which is reasonable for resilience. However, the current logging only captures `str(e)`, losing the traceback that would help diagnose issues.

♻️ Log with traceback for debugging

```diff
 except Exception as e:
     ch_name = channel_data.get("channel", {}).get("name")
-    logger.error("Failed to persist channel %s: %s", ch_name, e)
+    logger.exception("Failed to persist channel %s: %s", ch_name, e)
     continue
```

Using `logger.exception()` automatically includes the traceback at ERROR level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/sync/importer.py` around lines 87 - 90, The catch-all except block that handles per-channel failures should log the full traceback instead of just str(e); in the except handler for the channel persistence (the block using channel_data and ch_name), replace the current logger.error(...) call with logger.exception(...) (or logger.error(..., exc_info=True)) so the stack trace is recorded while still continuing the loop; keep the broad except if you want resilience but ensure the log message includes ch_name and the traceback for debugging.

discord_activity_tracker/management/commands/run_discord_exporter.py (2)
398-404: Consider batching asyncio.run calls for efficiency.

Currently `asyncio.run()` is called for each JSON file, creating a new event loop per iteration. While this allows immediate cleanup (`json_path.unlink()`), it adds overhead. The current approach is acceptable for reliability (partial progress on failure), but if performance becomes a concern, batching could help.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/management/commands/run_discord_exporter.py` around lines 398 - 404, The loop is calling asyncio.run(...) for every JSON file which creates a new event loop each iteration; instead accumulate channel payloads (e.g., collect channel_data into a list) and call asyncio.run(persist_exporter_channel_payloads(all_channel_data, expected_guild_id=guild_id)) once to process them in a single event loop, then unlink the corresponding json_path files after the batched call succeeds; alternatively create one event loop outside the per-file loop and schedule persist_exporter_channel_payloads tasks for each batch to avoid repeated asyncio.run invocations while still ensuring json_path.unlink() runs only after successful persistence.
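The batching idea from the comment above can be sketched roughly as follows. This is a hedged illustration, not the command's real code: `persist_payloads` stands in for `persist_exporter_channel_payloads`, `import_json_files` for the loop in `run_discord_exporter.py`, and the real call would also pass `expected_guild_id`.

```python
import asyncio
import json
from pathlib import Path


async def persist_payloads(payloads):
    """Stand-in for the real async DB persistence call."""
    return len(payloads)


def import_json_files(json_paths):
    # Read every exported file first, collecting payloads in memory.
    payloads = [json.loads(Path(p).read_text()) for p in json_paths]
    # One asyncio.run() (one event loop) for the whole batch,
    # instead of one per file.
    persisted = asyncio.run(persist_payloads(payloads))
    # Clean up only after the batched call succeeded, so a failure
    # leaves the JSON files in place for a retry.
    for p in json_paths:
        Path(p).unlink()
    return persisted
```

The trade-off noted in the comment still applies: batching loses per-file partial progress, so a mid-batch failure retries the whole batch.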
447-448: Move import outside of loop for efficiency.

The `parse_datetime` import is inside the loop over JSON files, causing repeated import overhead:

♻️ Move import to top of method

```diff
 def _import_json_files(
     self,
     dry_run: bool,
     guild_id: int,
     since_override: Optional[datetime],
     until_cutoff: Optional[datetime],
 ):
     """Import pre-exported JSON files from raw/ into the database."""
     self.stdout.write("\n=== Importing JSON Files ===")
+
+    from discord_activity_tracker.sync.utils import parse_datetime
     temp_dir = get_raw_dir()
     # ... later in the loop:
-    from discord_activity_tracker.sync.utils import parse_datetime
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/management/commands/run_discord_exporter.py` around lines 447 - 448, The import of parse_datetime inside the JSON-files loop causes repeated overhead; move the import statement for parse_datetime from inside the loop to the top of the surrounding function (e.g., the command's handle or run loop in run_discord_exporter.py) so it's executed once, then update the loop to call parse_datetime directly; reference the symbol parse_datetime and the loop in run_discord_exporter.py to locate and adjust the code.

discord_activity_tracker/offline_scripts/export_guild_by_day.py (2)
622-623: Variable shadowing may cause confusion.
`start_dt` and `end_dt` are reused here but represent different values than the outer scope (lines 570-571). Consider using distinct names like `ch_start_dt` and `ch_end_dt` to avoid confusion:

♻️ Suggested rename

```diff
-start_dt = datetime.strptime(last_date, "%Y-%m-%d")
-end_dt = datetime.strptime(today_str, "%Y-%m-%d")
-if start_dt > end_dt:
+ch_start_dt = datetime.strptime(last_date, "%Y-%m-%d")
+ch_end_dt = datetime.strptime(today_str, "%Y-%m-%d")
+if ch_start_dt > ch_end_dt:
     continue
 dates_from_last = []
-d = start_dt
-while d <= end_dt:
+d = ch_start_dt
+while d <= ch_end_dt:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py` around lines 622 - 623, The variables start_dt and end_dt are being shadowed here (assigned from last_date and today_str) which conflicts with outer-scope start_dt/end_dt; rename them to distinct identifiers (e.g., ch_start_dt and ch_end_dt) and update every subsequent use in this local block to the new names so you don't accidentally reference or overwrite the outer variables — change the assignments from datetime.strptime(last_date, "%Y-%m-%d") and datetime.strptime(today_str, "%Y-%m-%d") and all later local references accordingly.
690-695: Add exception handling for thread pool futures.

If `export_channel_days` raises an unexpected exception, `future.result()` will propagate it. Consider wrapping in try-except:

♻️ Defensive error handling

```diff
 with ThreadPoolExecutor(max_workers=PARALLEL) as executor:
     futures = {executor.submit(export_channel_days, a): a for a in work}
     for future in as_completed(futures):
-        cid, cname, failed = future.result()
-        if failed:
-            failed_channels.append(cid)
+        try:
+            cid, cname, failed = future.result()
+            if failed:
+                failed_channels.append(cid)
+        except Exception as e:
+            args = futures[future]
+            cid = args[0]
+            print(f"ERROR: Channel {cid} raised exception: {e}", file=sys.stderr)
+            failed_channels.append(cid)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py` around lines 690 - 695, The futures loop currently calls future.result() without catching exceptions, so if export_channel_days raises an exception it will propagate; wrap the result retrieval in a try-except around the call in the ThreadPoolExecutor loop that iterates over as_completed(futures) (referencing export_channel_days, futures, failed_channels), catch Exception as e, log or record the exception with context (include the channel id/name from futures[future] or the returned cid/cname if available), append the channel id to failed_channels for failed tasks, and continue to the next future so one failing thread doesn't kill the whole export.

discord_activity_tracker/sync/dce_cli.py (2)
38-47: Silent exception swallowing hides configuration issues.

The bare `except Exception: pass` silently ignores any errors when reading Django settings, including misconfigurations. Consider logging at debug level:

♻️ Suggested improvement

```diff
 try:
     from django.conf import settings

     if getattr(settings, "configured", False):
         v = (getattr(settings, "DISCORD_CHAT_EXPORTER_VERSION", "") or "").strip()
         if v:
             return v
-except Exception:
-    pass
+except Exception as e:
+    logger.debug("Could not read DISCORD_CHAT_EXPORTER_VERSION from Django settings: %s", e)
 return DEFAULT_PINNED_VERSION
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/sync/dce_cli.py` around lines 38 - 47, The try/except in the settings-read block silently swallows all exceptions (hiding config errors); update the handler in the function that returns DEFAULT_PINNED_VERSION to catch Exception but log the caught exception at debug level (using the module logger) including context (e.g., mention reading DISCORD_CHAT_EXPORTER_VERSION from django.conf.settings) before returning DEFAULT_PINNED_VERSION; reference the getattr(settings, "configured"), DISCORD_CHAT_EXPORTER_VERSION and DEFAULT_PINNED_VERSION symbols to locate where to add the logger.debug call.
102-112: URL scheme validation would address static analysis concern.

While the URLs are constructed from constants (GitHub API), adding explicit HTTPS validation would silence S310 and add defense in depth:

♻️ Optional: Add URL scheme validation

```diff
 def _http_get_json(url: str) -> dict:
+    if not url.startswith("https://"):
+        raise ValueError(f"Only HTTPS URLs are allowed: {url}")
     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
     with urllib.request.urlopen(req, timeout=120) as resp:
         return json.loads(resp.read().decode("utf-8"))

 def _download_file(url: str, dest: Path) -> None:
+    if not url.startswith("https://"):
+        raise ValueError(f"Only HTTPS URLs are allowed: {url}")
     dest.parent.mkdir(parents=True, exist_ok=True)
     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/sync/dce_cli.py` around lines 102 - 112, The functions _http_get_json and _download_file currently open URLs without validating the scheme; add a check at the start of each (or a small shared helper) that parses the url (via urllib.parse.urlparse) and verifies parsed.scheme == "https" (raise a ValueError or custom exception if not), so both _http_get_json(url: str) and _download_file(url: str, dest: Path) only proceed when the URL is HTTPS; keep using USER_AGENT and existing timeouts after validation to preserve behavior.

discord_activity_tracker/management/commands/run_discord_activity_tracker.py (1)
178-178: Add type annotation for `since_date` parameter.

For consistency with other typed parameters and better IDE support:

♻️ Suggested fix

```diff
-    since_date,
+    since_date: Optional[datetime] = None,
```

You'll need to import `Optional` and `datetime` at the top if not already present for this signature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/management/commands/run_discord_activity_tracker.py` at line 178, The parameter since_date in the command handler needs a type annotation (e.g., Optional[datetime.datetime]) for consistency; update the function signature that declares since_date (likely Command.handle or handle method in run_discord_activity_tracker.py) to use since_date: Optional[datetime.datetime], and add the imports "from typing import Optional" and "import datetime" at the top of the file if they aren't already present.

discord_activity_tracker/management/commands/run_discord_channel_export.py (1)
99-106: Consider capturing subprocess output for better error diagnostics.

When the script fails, only the exit code is reported. Capturing stderr would provide more context:

♻️ Optional improvement for debugging

```diff
 cwd = script_path.parent
 cmd = [sys.executable, str(script_path)]
 logger.info("Running %s with cwd=%s", cmd, cwd)
-proc = subprocess.run(cmd, cwd=cwd, env=env, check=False)
+proc = subprocess.run(cmd, cwd=cwd, env=env, check=False, capture_output=True, text=True)
 if proc.returncode != 0:
+    if proc.stderr:
+        logger.error("export_guild_by_day.py stderr: %s", proc.stderr[:2000])
     raise CommandError(
         f"export_guild_by_day.py exited with code {proc.returncode}"
     )
+if proc.stdout:
+    self.stdout.write(proc.stdout)
 self.stdout.write(self.style.SUCCESS("export_guild_by_day.py finished"))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/management/commands/run_discord_channel_export.py` around lines 99 - 106, The subprocess invocation in run_discord_channel_export.py currently only checks proc.returncode and raises CommandError with the exit code; modify the subprocess.run call (the proc creation) to capture output (e.g., capture_output=True or stdout=PIPE/stderr=PIPE) and on non-zero return code include proc.stderr (and optionally proc.stdout) in the CommandError or an additional logger.error message so the error contains the script's stderr for diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@discord_activity_tracker/github_publish.py`:
- Around line 21-24: The current assignment to branch calls .strip() after the
truthy-or-default logic which lets whitespace-only settings become an empty
string; instead, first retrieve the raw value from getattr(settings,
"DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH), call .strip() on that result,
then if the stripped value is falsy set branch = DEFAULT_BRANCH; update the code
that assigns branch (the variable named branch using
settings.DISCORD_MARKDOWN_REPO_BRANCH and DEFAULT_BRANCH) so the fallback
happens after trimming whitespace.
- Around line 56-64: The code calls get_github_token(use="write") which can
raise and currently lets exceptions escape, breaking the function's boolean
success/failure contract; wrap the call to get_github_token in a try/except,
handle exceptions (log or process via existing logger) and return False on
failure instead of propagating the exception, then only call
upload_folder_to_github when a token was successfully retrieved—reference
get_github_token and upload_folder_to_github to locate where to add the
try/except and the early False return.
In
`@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`:
- Around line 74-85: The --until argument is parsed into until_dt but never used
when calling sync_all_channels (only since_date is passed), causing confusion;
either remove the parser.add_argument(...) entry for
"--until"/"--to-date"/"--end-time" and its validation/usage so the CLI does not
advertise an unsupported option, or update the sync_all_channels function
signature to accept an until_date parameter and thread until_dt through the call
at the site that invokes sync_all_channels (update the function definition of
sync_all_channels and all callers accordingly), and adjust the help text for
parser.add_argument or sync_all_channels docstrings to reflect actual support;
refer to the parser.add_argument block that defines "--until", the until_dt
variable where it is parsed/validated, and the sync_all_channels invocation to
implement the chosen fix.
In `@discord_activity_tracker/preprocessors/discord_preprocessor.py`:
- Line 70: The current assignment to min_len (int(getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", 50) or 50)) treats an explicit 0 as missing; change
it to preserve a user-set zero by fetching the raw setting and only applying the
default when the attribute is not set or is None (e.g. val = getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", None); min_len = int(val) if val is not None else
50), so that settings.PINECONE_MIN_TEXT_LENGTH == 0 remains valid; update the
code around the min_len assignment accordingly and add a brief comment
clarifying the intent.
In `@discord_activity_tracker/sync/dce_cli.py`:
- Around line 115-132: The function _extract_cli_zip_bundle currently calls
zipfile.ZipFile.extractall which is vulnerable to zip-slip; instead iterate over
zf.infolist() and for each member validate its name is not absolute and does not
contain '..', compute the destination path by joining bundle with the member
name and ensure dest.resolve() is within bundle.resolve() (e.g., startswith
bundle.resolve() + os.sep), then create parent dirs and write the file from
zf.open(member) to the destination; keep the existing checks for target_name
("DiscordChatExporter.Cli.exe") and raise DiscordChatExporterCliNotFoundError if
exe is missing.
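As an illustration of the zip-slip-safe extraction loop this comment describes, here is a minimal, self-contained sketch. The name `safe_extract` and its signature are hypothetical; the project's `_extract_cli_zip_bundle` would additionally locate `DiscordChatExporter.Cli.exe` and raise `DiscordChatExporterCliNotFoundError` when it is missing.

```python
import os
import zipfile
from pathlib import Path


def safe_extract(zip_path: Path, bundle: Path) -> None:
    """Extract zip members into `bundle`, rejecting path-traversal names."""
    bundle_root = bundle.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            name = member.filename
            # Reject absolute paths and any '..' component outright.
            if os.path.isabs(name) or ".." in Path(name).parts:
                raise ValueError(f"unsafe zip member: {name}")
            dest = (bundle / name).resolve()
            # Belt and braces: the resolved target must stay inside bundle.
            if os.path.commonpath([str(bundle_root), str(dest)]) != str(bundle_root):
                raise ValueError(f"zip member escapes bundle dir: {name}")
            if member.is_dir():
                dest.mkdir(parents=True, exist_ok=True)
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(member) as src, open(dest, "wb") as out:
                out.write(src.read())
```

Python 3.12+ also offers `zipfile.ZipFile.extractall` hardening via the `filter` argument on `tarfile` only; for zip archives an explicit member check like the above remains the portable approach.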
In `@discord_activity_tracker/tests/test_bulk_services.py`:
- Around line 295-312: The test_idempotent function stops after a single call to
bulk_process_message_batch and never verifies idempotency; update it to call
bulk_process_message_batch(messages, channel) a second time and then assert that
no duplicate records were created (e.g., check the relevant model/query counts
for the message_id 5001 and any related reaction/attachment records remain
unchanged), referencing bulk_process_message_batch and the message_id used in
the messages fixture to locate the correct assertions.
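The shape of the requested second-call assertion can be illustrated with an in-memory stand-in. The dictionary `store` and this simplified `bulk_process_message_batch` are hypothetical; the real test would call the project's service twice and compare Django model query counts for message_id 5001 and related records.

```python
# In-memory stand-in: the real bulk_process_message_batch upserts Django models.
store = {}


def bulk_process_message_batch(messages, channel):
    """Upsert each message keyed by (channel, message_id)."""
    for msg in messages:
        store[(channel, msg["message_id"])] = msg


def test_idempotent():
    messages = [{"message_id": 5001, "content": "hello"}]
    bulk_process_message_batch(messages, "general")
    count_after_first = len(store)
    # Second call with the identical batch must not create duplicate records.
    bulk_process_message_batch(messages, "general")
    assert len(store) == count_after_first
```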
---
Nitpick comments:
In
`@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`:
- Line 178: The parameter since_date in the command handler needs a type
annotation (e.g., Optional[datetime.datetime]) for consistency; update the
function signature that declares since_date (likely Command.handle or handle
method in run_discord_activity_tracker.py) to use since_date:
Optional[datetime.datetime], and add the imports "from typing import Optional"
and "import datetime" at the top of the file if they aren't already present.
In `@discord_activity_tracker/management/commands/run_discord_channel_export.py`:
- Around line 99-106: The subprocess invocation in run_discord_channel_export.py
currently only checks proc.returncode and raises CommandError with the exit
code; modify the subprocess.run call (the proc creation) to capture output
(e.g., capture_output=True or stdout=PIPE/stderr=PIPE) and on non-zero return
code include proc.stderr (and optionally proc.stdout) in the CommandError or an
additional logger.error message so the error contains the script's stderr for
diagnostics.
In `@discord_activity_tracker/management/commands/run_discord_exporter.py`:
- Around line 398-404: The loop is calling asyncio.run(...) for every JSON file
which creates a new event loop each iteration; instead accumulate channel
payloads (e.g., collect channel_data into a list) and call
asyncio.run(persist_exporter_channel_payloads(all_channel_data,
expected_guild_id=guild_id)) once to process them in a single event loop, then
unlink the corresponding json_path files after the batched call succeeds;
alternatively create one event loop outside the per-file loop and schedule
persist_exporter_channel_payloads tasks for each batch to avoid repeated
asyncio.run invocations while still ensuring json_path.unlink() runs only after
successful persistence.
- Around line 447-448: The import of parse_datetime inside the JSON-files loop
causes repeated overhead; move the import statement for parse_datetime from
inside the loop to the top of the surrounding function (e.g., the command's
handle or run loop in run_discord_exporter.py) so it's executed once, then
update the loop to call parse_datetime directly; reference the symbol
parse_datetime and the loop in run_discord_exporter.py to locate and adjust the
code.
In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py`:
- Around line 622-623: The variables start_dt and end_dt are being shadowed here
(assigned from last_date and today_str) which conflicts with outer-scope
start_dt/end_dt; rename them to distinct identifiers (e.g., ch_start_dt and
ch_end_dt) and update every subsequent use in this local block to the new names
so you don't accidentally reference or overwrite the outer variables — change
the assignments from datetime.strptime(last_date, "%Y-%m-%d") and
datetime.strptime(today_str, "%Y-%m-%d") and all later local references
accordingly.
- Around line 690-695: The futures loop currently calls future.result() without
catching exceptions, so if export_channel_days raises the exception it will
propagate; wrap the result retrieval in a try-except around the call in the
ThreadPoolExecutor loop that iterates over as_completed(futures) (referencing
export_channel_days, futures, failed_channels), catch Exception as e, log or
record the exception with context (include the channel id/name from
futures[future] or the returned cid/cname if available), append the channel id
to failed_channels for failed tasks, and continue to the next future so one
failing thread doesn't kill the whole export.
In `@discord_activity_tracker/sync/backfill_paths.py`:
- Around line 44-47: _as_date currently returns None for inputs that are date
instances (only checks isinstance(d, datetime)); update _as_date to accept both
datetime and date by returning d.date() when d is a datetime and returning d
when d is a date, while still returning None for None inputs so callers like
parse_iso_datetime continue to work; adjust the function signature/logic in
_as_date to use isinstance checks against datetime and date accordingly.
In `@discord_activity_tracker/sync/dce_cli.py`:
- Around line 38-47: The try/except in the settings-read block silently swallows
all exceptions (hiding config errors); update the handler in the function that
returns DEFAULT_PINNED_VERSION to catch Exception but log the caught exception
at debug level (using the module logger) including context (e.g., mention
reading DISCORD_CHAT_EXPORTER_VERSION from django.conf.settings) before
returning DEFAULT_PINNED_VERSION; reference the getattr(settings, "configured"),
DISCORD_CHAT_EXPORTER_VERSION and DEFAULT_PINNED_VERSION symbols to locate where
to add the logger.debug call.
- Around line 102-112: The functions _http_get_json and _download_file currently
open URLs without validating the scheme; add a check at the start of each (or a
small shared helper) that parses the url (via urllib.parse.urlparse) and
verifies parsed.scheme == "https" (raise a ValueError or custom exception if
not), so both _http_get_json(url: str) and _download_file(url: str, dest: Path)
only proceed when the URL is HTTPS; keep using USER_AGENT and existing timeouts
after validation to preserve behavior.
In `@discord_activity_tracker/sync/importer.py`:
- Around line 70-76: The code re-converts the last message redundantly; instead
of calling convert_exporter_message_to_dict(messages[-1]) again, reuse the
already converted item in converted (e.g. use converted[-1] as last_msg), parse
its "created_at" with parse_datetime, and pass that last_time into
sync_to_async(update_channel_last_activity)(channel, last_time); update the
block that sets last_msg/last_time to reference converted[-1] to avoid duplicate
work.
- Around line 87-90: The catch-all except block that handles per-channel
failures should log the full traceback instead of just str(e); in the except
handler for the channel persistence (the block using channel_data and ch_name),
replace the current logger.error(...) call with logger.exception(...) (or
logger.error(..., exc_info=True)) so the stack trace is recorded while still
continuing the loop; keep the broad except if you want resilience but ensure the
log message includes ch_name and the traceback for debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0406c40b-fda4-49e8-9af2-fbc1637cd56c
📒 Files selected for processing (23)
- .env.example
- .gitignore
- config/settings.py
- config/test_settings.py
- discord_activity_tracker/.gitignore
- discord_activity_tracker/github_publish.py
- discord_activity_tracker/management/commands/backfill_discord_json.py
- discord_activity_tracker/management/commands/debug_discord_export.py
- discord_activity_tracker/management/commands/run_discord_activity_tracker.py
- discord_activity_tracker/management/commands/run_discord_channel_export.py
- discord_activity_tracker/management/commands/run_discord_exporter.py
- discord_activity_tracker/offline_scripts/export_guild_by_day.py
- discord_activity_tracker/preprocessors/__init__.py
- discord_activity_tracker/preprocessors/discord_preprocessor.py
- discord_activity_tracker/sync/backfill_paths.py
- discord_activity_tracker/sync/chat_exporter.py
- discord_activity_tracker/sync/dce_cli.py
- discord_activity_tracker/sync/importer.py
- discord_activity_tracker/tests/fixtures/backfill_minimal.json
- discord_activity_tracker/tests/test_backfill_discord_json.py
- discord_activity_tracker/tests/test_backfill_paths.py
- discord_activity_tracker/tests/test_bulk_services.py
- discord_activity_tracker/workspace.py
```python
branch = (
    getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
    or DEFAULT_BRANCH
).strip()
```
Normalize branch fallback after trimming whitespace.
A whitespace-only branch setting currently becomes "" after .strip(), which can cause upload calls to target an invalid branch ref.
Suggested fix
```diff
-branch = (
-    getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
-    or DEFAULT_BRANCH
-).strip()
+branch_raw = (
+    getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
+    or DEFAULT_BRANCH
+)
+branch = branch_raw.strip() or DEFAULT_BRANCH
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/github_publish.py` around lines 21 - 24, The current
assignment to branch calls .strip() after the truthy-or-default logic which lets
whitespace-only settings become an empty string; instead, first retrieve the raw
value from getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH),
call .strip() on that result, then if the stripped value is falsy set branch =
DEFAULT_BRANCH; update the code that assigns branch (the variable named branch
using settings.DISCORD_MARKDOWN_REPO_BRANCH and DEFAULT_BRANCH) so the fallback
happens after trimming whitespace.
```python
token = get_github_token(use="write")
result = upload_folder_to_github(
    local_folder=local_folder,
    owner=owner,
    repo=repo,
    commit_message="chore: update Discord archive markdown",
    branch=branch,
    token=token,
)
```
Preserve the function’s boolean failure contract on token errors.
get_github_token(use="write") can raise (e.g., missing token). Right now that exception escapes, despite this function being structured as a bool success/failure API.
Suggested fix
```diff
-token = get_github_token(use="write")
-result = upload_folder_to_github(
-    local_folder=local_folder,
-    owner=owner,
-    repo=repo,
-    commit_message="chore: update Discord archive markdown",
-    branch=branch,
-    token=token,
-)
+try:
+    token = get_github_token(use="write")
+    result = upload_folder_to_github(
+        local_folder=local_folder,
+        owner=owner,
+        repo=repo,
+        commit_message="chore: update Discord archive markdown",
+        branch=branch,
+        token=token,
+    )
+except Exception as e:
+    logger.error("Discord markdown upload setup failed: %s", e)
+    return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/github_publish.py` around lines 56 - 64, The code
calls get_github_token(use="write") which can raise and currently lets
exceptions escape, breaking the function's boolean success/failure contract;
wrap the call to get_github_token in a try/except, handle exceptions (log or
process via existing logger) and return False on failure instead of propagating
the exception, then only call upload_folder_to_github when a token was
successfully retrieved—reference get_github_token and upload_folder_to_github to
locate where to add the try/except and the early False return.
```python
parser.add_argument(
    "--until",
    "--to-date",
    "--end-time",
    type=str,
    default=None,
    dest="until",
    help="Upper bound for message sync (same formats as --since). "
    "Note: the Discord client currently fetches forward from the sync cursor; "
    "this value is passed for API consistency where supported. "
    "--to-date and --end-time are deprecated aliases for --until.",
)
```
until_dt is parsed but never passed to sync_all_channels.
The `--until` argument is parsed and validated (lines 97-105), but only `since_date` is passed to `sync_all_channels` at line 145. Based on context snippet 2, `sync_all_channels` only accepts a `since_date` parameter, so there is no `until_date` parameter to pass.
Either:

- remove `--until` from this command if it's not supported by the underlying sync function, or
- update `sync_all_channels` to accept an `until_date` parameter if the feature is intended.
The help text at lines 82-83 acknowledges this limitation ("the Discord client currently fetches forward from the sync cursor"), but the argument still creates user confusion.
Also applies to: 145-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`
around lines 74 - 85, The --until argument is parsed into until_dt but never
used when calling sync_all_channels (only since_date is passed), causing
confusion; either remove the parser.add_argument(...) entry for
"--until"/"--to-date"/"--end-time" and its validation/usage so the CLI does not
advertise an unsupported option, or update the sync_all_channels function
signature to accept an until_date parameter and thread until_dt through the call
at the site that invokes sync_all_channels (update the function definition of
sync_all_channels and all callers accordingly), and adjust the help text for
parser.add_argument or sync_all_channels docstrings to reflect actual support;
refer to the parser.add_argument block that defines "--until", the until_dt
variable where it is parsed/validated, and the sync_all_channels invocation to
implement the chosen fix.
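For reference, the alias wiring itself is sound and can be checked in isolation; this standalone sketch only demonstrates that all three option strings land in the same `dest`:

```python
import argparse

parser = argparse.ArgumentParser()
# Same shape as the reviewed add_argument call: three option strings, one dest.
parser.add_argument("--until", "--to-date", "--end-time",
                    type=str, default=None, dest="until")

args = parser.parse_args(["--to-date", "2026-01-01"])
print(args.until)  # 2026-01-01 -- the deprecated alias fills the same attribute
```

Whichever resolution is chosen, the parsed value should either reach `sync_all_channels` or the argument should be removed; dead CLI surface is the real bug here.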
```python
    ``failed_ids``.
    """
    normalized_failed = _normalize_failed_ids(failed_ids)
    min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)
```
Fallback logic may override explicit zero setting.
The expression int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50) will use 50 if the setting is explicitly 0 (since 0 or 50 evaluates to 50). If 0 is a valid value meaning "no minimum length", this could be unexpected.
If 0 should disable the minimum check:

```diff
-    min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)
+    raw = getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50)
+    min_len = int(raw) if raw is not None else 50
```

If 0 should indeed fall back to 50, the current code is correct, but consider adding a comment.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)
+    raw = getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50)
+    min_len = int(raw) if raw is not None else 50
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/preprocessors/discord_preprocessor.py` at line 70,
The current assignment to min_len (int(getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", 50) or 50)) treats an explicit 0 as missing; change
it to preserve a user-set zero by fetching the raw setting and only applying the
default when the attribute is not set or is None (e.g. val = getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", None); min_len = int(val) if val is not None else
50), so that settings.PINECONE_MIN_TEXT_LENGTH == 0 remains valid; update the
code around the min_len assignment accordingly and add a brief comment
clarifying the intent.
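The pitfall is easy to demonstrate on its own — `or` falls back on every falsy value, not just missing ones:

```python
value = 0  # an explicit, user-set zero

# The reviewed pattern: 0 is falsy, so the default silently wins.
print(int(value or 50))  # 50

# None-check variant: only a truly missing value falls back.
print(int(value) if value is not None else 50)  # 0
```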
```python
def _extract_cli_zip_bundle(zip_path: Path, tools_dir: Path) -> Path:
    """
    Extract a RID-specific CLI zip (flat layout) so DiscordChatExporter.Cli.exe
    sits next to its dependencies under tools/DiscordChatExporter-cli/.
    """
    bundle = tools_dir / DCE_CLI_BUNDLE_SUBDIR
    if bundle.exists():
        shutil.rmtree(bundle)
    bundle.mkdir(parents=True, exist_ok=True)
    target_name = "DiscordChatExporter.Cli.exe"
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(bundle)
    exe = bundle / target_name
    if not exe.is_file():
        raise DiscordChatExporterCliNotFoundError(
            f"No {target_name} found inside {zip_path} after extract"
        )
    return exe
```
Potential zip-slip vulnerability in _extract_cli_zip_bundle.
Using zipfile.extractall() without validating member paths could allow a malicious archive to write files outside the target directory via path traversal (e.g., ../../malicious.exe).
🔒 Proposed fix with path validation
```diff
 def _extract_cli_zip_bundle(zip_path: Path, tools_dir: Path) -> Path:
     """
     Extract a RID-specific CLI zip (flat layout) so DiscordChatExporter.Cli.exe
     sits next to its dependencies under tools/DiscordChatExporter-cli/.
     """
     bundle = tools_dir / DCE_CLI_BUNDLE_SUBDIR
     if bundle.exists():
         shutil.rmtree(bundle)
     bundle.mkdir(parents=True, exist_ok=True)
     target_name = "DiscordChatExporter.Cli.exe"
     with zipfile.ZipFile(zip_path, "r") as zf:
+        for member in zf.namelist():
+            member_path = (bundle / member).resolve()
+            if not str(member_path).startswith(str(bundle.resolve())):
+                raise DiscordChatExporterCliNotFoundError(
+                    f"Zip contains unsafe path: {member}"
+                )
         zf.extractall(bundle)
     exe = bundle / target_name
```
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/sync/dce_cli.py` around lines 115 - 132, The
function _extract_cli_zip_bundle currently calls zipfile.ZipFile.extractall
which is vulnerable to zip-slip; instead iterate over zf.infolist() and for each
member validate its name is not absolute and does not contain '..', compute the
destination path by joining bundle with the member name and ensure
dest.resolve() is within bundle.resolve() (e.g., startswith bundle.resolve() +
os.sep), then create parent dirs and write the file from zf.open(member) to the
destination; keep the existing checks for target_name
("DiscordChatExporter.Cli.exe") and raise DiscordChatExporterCliNotFoundError if
exe is missing.
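The defence can be prototyped with nothing but the standard library (the helper name and paths here are illustrative, not the project's API):

```python
import io
import zipfile
from pathlib import Path

def safe_extract(zf: zipfile.ZipFile, dest: Path) -> None:
    """Reject any member that would resolve outside dest, then extract."""
    dest = dest.resolve()
    for member in zf.namelist():
        target = (dest / member).resolve()
        if not target.is_relative_to(dest):  # Python 3.9+
            raise ValueError(f"unsafe path in archive: {member}")
    zf.extractall(dest)

# A hostile archive carrying a traversal entry is refused before extraction.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../evil.txt", "payload")
with zipfile.ZipFile(buf) as zf:
    try:
        safe_extract(zf, Path("bundle"))
    except ValueError as e:
        print("blocked:", e)  # blocked: unsafe path in archive: ../evil.txt
```

Note that `is_relative_to` compares resolved absolute paths, so it also catches absolute member names, not just `..` segments.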
```python
@pytest.mark.django_db
def test_idempotent(channel):
    """Running same batch twice should not create duplicates."""
    now = datetime(2026, 2, 17, 12, 0, 0, tzinfo=timezone.utc)
    messages = [
        {
            "message_id": 5001,
            "author": _user(1001, "alice"),
            "content": "Test",
            "message_created_at": now,
            "message_edited_at": None,
            "reply_to_message_id": None,
            "attachment_urls": [],
            "reactions": [{"emoji": "\U0001f44d", "count": 1}],
        },
    ]

    bulk_process_message_batch(messages, channel)
```
Test function test_idempotent appears incomplete.
The test calls bulk_process_message_batch(messages, channel) on line 312 but the file ends without the second call and assertions to verify idempotency. Based on the docstring "Running same batch twice should not create duplicates", the test should run the batch twice and assert no duplicates were created.
Proposed completion
```diff
     bulk_process_message_batch(messages, channel)
+    bulk_process_message_batch(messages, channel)
+
+    assert DiscordMessage.objects.filter(message_id=5001).count() == 1
+    assert DiscordReaction.objects.filter(emoji="\U0001f44d").count() == 1
```
Verify each finding against the current code and only fix it if needed.
In `@discord_activity_tracker/tests/test_bulk_services.py` around lines 295 - 312,
The test_idempotent function stops after a single call to
bulk_process_message_batch and never verifies idempotency; update it to call
bulk_process_message_batch(messages, channel) a second time and then assert that
no duplicate records were created (e.g., check the relevant model/query counts
for the message_id 5001 and any related reaction/attachment records remain
unchanged), referencing bulk_process_message_batch and the message_id used in
the messages fixture to locate the correct assertions.
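The shape of the missing second half can be sketched without Django (the dict here is a stand-in for the `message_id` uniqueness the real models enforce — names are hypothetical):

```python
store: dict[int, str] = {}  # keyed like the message_id unique constraint

def bulk_upsert(messages: list[dict]) -> None:
    for m in messages:
        store[m["message_id"]] = m["content"]  # insert-or-update, never a duplicate

batch = [{"message_id": 5001, "content": "Test"}]
bulk_upsert(batch)
bulk_upsert(batch)  # the second run must not grow the store

print(len(store))  # 1
```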