
Update discord_activity tracker app #142

Open
leostar0412 wants to merge 3 commits into cppalliance:develop from leostar0412:feature/141

Conversation


@leostar0412 leostar0412 commented Apr 21, 2026

Summary by CodeRabbit

  • New Features

    • Added Discord guild message export with flexible per-day or multi-day chunking and date-range filtering
    • Added backfill import to load previously exported Discord messages from JSON files
    • Added optional GitHub repository publishing for Discord exports
    • Added Pinecone search indexing integration for Discord messages
    • Enhanced management commands with date filtering and granular control via skip options
  • Tests

    • Added test coverage for backfill import and export path handling


coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive Discord data export and sync pipeline. It adds per-day channel exports via DiscordChatExporter, GitHub markdown publishing, Pinecone integration for message indexing, JSON backfill importing, new management commands, and supporting utilities for workspace path resolution and CLI bootstrapping.

Changes

  • Configuration & Environment (.env.example, .gitignore, discord_activity_tracker/.gitignore, config/settings.py, config/test_settings.py): Added Discord-related environment variables for the GitHub markdown repo target, per-day export settings (timezone, chunk sizing, channel names, script override), DiscordChatExporter version pinning, and Pinecone identifiers. Updated gitignore patterns to exclude build artifacts (tools/, vendor/) and local config files.
  • Management Commands (discord_activity_tracker/management/commands/run_discord_activity_tracker.py, discord_activity_tracker/management/commands/run_discord_exporter.py, discord_activity_tracker/management/commands/run_discord_channel_export.py, discord_activity_tracker/management/commands/backfill_discord_json.py, discord_activity_tracker/management/commands/debug_discord_export.py): Extended CLI support for date-window filtering (--since/--until), skip flags for the sync/markdown/GitHub/Pinecone steps, and new commands to run DiscordChatExporter by day and backfill JSON files. Updated help text and argument metadata for clarity.
  • Sync & Import Infrastructure (discord_activity_tracker/sync/dce_cli.py, discord_activity_tracker/sync/backfill_paths.py, discord_activity_tracker/sync/importer.py, discord_activity_tracker/sync/chat_exporter.py): Added DiscordChatExporter CLI bootstrapping with GitHub release download and a source-build fallback, backfill path utilities for sorting and date-window filtering of exported JSON files, channel payload persistence into the database, and JSON parsing improvements for encoding tolerance.
  • GitHub Publishing (discord_activity_tracker/github_publish.py): New module providing Discord markdown export to GitHub, with configuration validation and token-based repository upload via existing GitHub utilities.
  • Export Scripts & Utilities (discord_activity_tracker/offline_scripts/export_guild_by_day.py, discord_activity_tracker/workspace.py): Standalone per-day/per-chunk Discord guild export script with resume logic, thread-pooled parallel exports, timezone-aware date conversion, and retry handling. New workspace path utilities for tools/, script/, and discussion export directories.
  • Preprocessors (discord_activity_tracker/preprocessors/__init__.py, discord_activity_tracker/preprocessors/discord_preprocessor.py): New Pinecone preprocessor for Discord messages with markup cleaning, mention/emoji stripping, author prefixing, and attachment URL extraction, supporting full sync, incremental sync, and retry modes.
  • Tests & Fixtures (discord_activity_tracker/tests/test_backfill_discord_json.py, discord_activity_tracker/tests/test_backfill_paths.py, discord_activity_tracker/tests/test_bulk_services.py, discord_activity_tracker/tests/fixtures/backfill_minimal.json): Added tests for backfill JSON importing and backfill path ordering/filtering, and refactored bulk service tests from classes to module-level functions. New minimal fixture for backfill data validation.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Command
    participant Exporter as DiscordChatExporter
    participant DB as Database
    participant GitHub as GitHub API
    participant Pinecone as Pinecone API

    CLI->>Exporter: run export_guild_by_day.py<br/>(per-day or chunked)
    Exporter-->>CLI: JSON files (channels/messages)
    
    CLI->>DB: persist_exporter_channel_payloads<br/>(batch import)
    DB-->>CLI: records created/updated
    
    alt markdown export enabled
        CLI->>CLI: generate markdown<br/>from DB messages
        CLI->>GitHub: push_discord_markdown_to_github<br/>(commit to branch)
        GitHub-->>CLI: success/failure
    end
    
    alt pinecone sync enabled
        CLI->>Pinecone: run_cppa_pinecone_sync<br/>(preprocess & index)
        Pinecone-->>CLI: documents indexed
    end
    
    CLI-->>CLI: summary report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #74: Adds Pinecone integration that directly uses the Discord preprocessor and sync settings introduced in this PR.
  • PR #114: Uses the GitHub publishing functions (upload_folder_to_github, get_github_token) that this PR now calls from the new github_publish.py module.

Suggested reviewers

  • snowfox1003
  • jonathanMLDev

Poem

🐰 A rabbit's tale of exports fine,
Discord messages in perfect line,
To GitHub they hop, to Pinecone they fly,
Each channel and day gets indexed on high!
With chunks and with timezones, retries galore,
The sync pipeline bounces from shore to shore! 🚀

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 52.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title "Update discord_activity tracker app" is vague and generic, describing the general subject area but not conveying the specific nature or scope of the changes. Resolution: consider a more specific title that highlights the main feature or change, such as "Add Discord export and Pinecone sync capabilities" or "Implement Discord channel exporter with markdown publishing."

✅ Passed checks (3 passed)

  • Description Check: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Linked Issues check: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai Bot commented Apr 21, 2026

Only repository collaborators, contributors, or members can run CodeRabbit commands.

@snowfox1003 snowfox1003 self-requested a review April 23, 2026 16:03
@leostar0412 leostar0412 self-assigned this Apr 23, 2026
@leostar0412
Collaborator Author

@coderabbitai full review


coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (11)
discord_activity_tracker/sync/backfill_paths.py (1)

44-47: _as_date returns None for date inputs.

If d is a date object (not datetime), isinstance(d, datetime) is False and the function returns None. Since datetime is a subclass of date, not vice versa, this could cause issues if callers ever pass date objects directly.

Currently callers pass results from parse_iso_datetime which returns datetime, so this works in practice.

♻️ Handle both date and datetime inputs
 def _as_date(d: Optional[datetime]) -> Optional[date]:
     if d is None:
         return None
-    return d.date() if isinstance(d, datetime) else None
+    if isinstance(d, datetime):
+        return d.date()
+    if isinstance(d, date):
+        return d
+    return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/backfill_paths.py` around lines 44 - 47,
_as_date currently returns None for inputs that are date instances (only checks
isinstance(d, datetime)); update _as_date to accept both datetime and date by
returning d.date() when d is a datetime and returning d when d is a date, while
still returning None for None inputs so callers like parse_iso_datetime continue
to work; adjust the function signature/logic in _as_date to use isinstance
checks against datetime and date accordingly.
discord_activity_tracker/sync/importer.py (2)

70-76: Minor: Duplicate message conversion.

convert_exporter_message_to_dict(messages[-1]) on line 71 re-converts the last message that was already converted in converted on line 66. Consider reusing:

♻️ Reuse converted message
             converted = [convert_exporter_message_to_dict(msg) for msg in messages]

             processed = await _process_messages_in_batches(channel, converted)

-            if messages:
-                last_msg = convert_exporter_message_to_dict(messages[-1])
+            if converted:
+                last_msg = converted[-1]
                 last_time = parse_datetime(last_msg.get("created_at"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/importer.py` around lines 70 - 76, The code
re-converts the last message redundantly; instead of calling
convert_exporter_message_to_dict(messages[-1]) again, reuse the already
converted item in converted (e.g. use converted[-1] as last_msg), parse its
"created_at" with parse_datetime, and pass that last_time into
sync_to_async(update_channel_last_activity)(channel, last_time); update the
block that sets last_msg/last_time to reference converted[-1] to avoid duplicate
work.

87-90: Consider narrowing exception handling or logging full traceback.

The broad except Exception catch (flagged by Ruff BLE001) continues processing after per-channel failures, which is reasonable for resilience. However, the current logging only captures str(e), losing the traceback that would help diagnose issues.

♻️ Log with traceback for debugging
         except Exception as e:
             ch_name = channel_data.get("channel", {}).get("name")
-            logger.error("Failed to persist channel %s: %s", ch_name, e)
+            logger.exception("Failed to persist channel %s: %s", ch_name, e)
             continue

Using logger.exception() automatically includes the traceback at ERROR level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/importer.py` around lines 87 - 90, The
catch-all except block that handles per-channel failures should log the full
traceback instead of just str(e); in the except handler for the channel
persistence (the block using channel_data and ch_name), replace the current
logger.error(...) call with logger.exception(...) (or logger.error(...,
exc_info=True)) so the stack trace is recorded while still continuing the loop;
keep the broad except if you want resilience but ensure the log message includes
ch_name and the traceback for debugging.
discord_activity_tracker/management/commands/run_discord_exporter.py (2)

398-404: Consider batching asyncio.run calls for efficiency.

Currently asyncio.run() is called for each JSON file, creating a new event loop per iteration. While this allows immediate cleanup (json_path.unlink()), it adds overhead.

The current approach is acceptable for reliability (partial progress on failure) but if performance becomes a concern, batching could help.
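One possible shape for the batched approach, sketched with stand-in names: persist_payloads and import_json_files here are hypothetical; the real command calls persist_exporter_channel_payloads with an expected_guild_id argument.

```python
import asyncio
import json
from pathlib import Path

async def persist_payloads(payloads: list[dict]) -> int:
    # Stand-in for persist_exporter_channel_payloads: pretend to write
    # each channel payload to the database and report how many were handled.
    return len(payloads)

def import_json_files(json_paths: list[Path]) -> int:
    # Parse everything up front, then persist in a single event loop
    # instead of calling asyncio.run() once per file.
    payloads = [json.loads(p.read_text(encoding="utf-8")) for p in json_paths]
    count = asyncio.run(persist_payloads(payloads))
    # Delete files only after the batched persistence succeeded,
    # so a failure still leaves the inputs on disk for a retry.
    for p in json_paths:
        p.unlink(missing_ok=True)
    return count
```

The trade-off noted above still applies: a single batch loses per-file partial progress on failure, so the batch size may need a cap in practice.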

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/management/commands/run_discord_exporter.py` around
lines 398 - 404, The loop is calling asyncio.run(...) for every JSON file which
creates a new event loop each iteration; instead accumulate channel payloads
(e.g., collect channel_data into a list) and call
asyncio.run(persist_exporter_channel_payloads(all_channel_data,
expected_guild_id=guild_id)) once to process them in a single event loop, then
unlink the corresponding json_path files after the batched call succeeds;
alternatively create one event loop outside the per-file loop and schedule
persist_exporter_channel_payloads tasks for each batch to avoid repeated
asyncio.run invocations while still ensuring json_path.unlink() runs only after
successful persistence.

447-448: Move import outside of loop for efficiency.

The parse_datetime import is inside the loop over JSON files, causing repeated import overhead:

♻️ Move import to top of method
     def _import_json_files(
         self,
         dry_run: bool,
         guild_id: int,
         since_override: Optional[datetime],
         until_cutoff: Optional[datetime],
     ):
         """Import pre-exported JSON files from raw/ into the database."""
         self.stdout.write("\n=== Importing JSON Files ===")
+        
+        from discord_activity_tracker.sync.utils import parse_datetime
 
         temp_dir = get_raw_dir()
         # ... later in the loop:
-                        from discord_activity_tracker.sync.utils import parse_datetime
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/management/commands/run_discord_exporter.py` around
lines 447 - 448, The import of parse_datetime inside the JSON-files loop causes
repeated overhead; move the import statement for parse_datetime from inside the
loop to the top of the surrounding function (e.g., the command's handle or run
loop in run_discord_exporter.py) so it's executed once, then update the loop to
call parse_datetime directly; reference the symbol parse_datetime and the loop
in run_discord_exporter.py to locate and adjust the code.
discord_activity_tracker/offline_scripts/export_guild_by_day.py (2)

622-623: Variable shadowing may cause confusion.

start_dt and end_dt are reused here but represent different values than the outer scope (lines 570-571). Consider using distinct names like ch_start_dt and ch_end_dt to avoid confusion:

♻️ Suggested rename
-            start_dt = datetime.strptime(last_date, "%Y-%m-%d")
-            end_dt = datetime.strptime(today_str, "%Y-%m-%d")
-            if start_dt > end_dt:
+            ch_start_dt = datetime.strptime(last_date, "%Y-%m-%d")
+            ch_end_dt = datetime.strptime(today_str, "%Y-%m-%d")
+            if ch_start_dt > ch_end_dt:
                 continue
             dates_from_last = []
-            d = start_dt
-            while d <= end_dt:
+            d = ch_start_dt
+            while d <= ch_end_dt:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py` around lines
622 - 623, The variables start_dt and end_dt are being shadowed here (assigned
from last_date and today_str) which conflicts with outer-scope start_dt/end_dt;
rename them to distinct identifiers (e.g., ch_start_dt and ch_end_dt) and update
every subsequent use in this local block to the new names so you don't
accidentally reference or overwrite the outer variables — change the assignments
from datetime.strptime(last_date, "%Y-%m-%d") and datetime.strptime(today_str,
"%Y-%m-%d") and all later local references accordingly.

690-695: Add exception handling for thread pool futures.

If export_channel_days raises an unexpected exception, future.result() will propagate it. Consider wrapping in try-except:

♻️ Defensive error handling
         with ThreadPoolExecutor(max_workers=PARALLEL) as executor:
             futures = {executor.submit(export_channel_days, a): a for a in work}
             for future in as_completed(futures):
-                cid, cname, failed = future.result()
-                if failed:
-                    failed_channels.append(cid)
+                try:
+                    cid, cname, failed = future.result()
+                    if failed:
+                        failed_channels.append(cid)
+                except Exception as e:
+                    args = futures[future]
+                    cid = args[0]
+                    print(f"ERROR: Channel {cid} raised exception: {e}", file=sys.stderr)
+                    failed_channels.append(cid)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py` around lines
690 - 695, The futures loop currently calls future.result() without catching
exceptions, so if export_channel_days raises, the exception will propagate;
wrap the result retrieval in a try-except around the call in the
ThreadPoolExecutor loop that iterates over as_completed(futures) (referencing
export_channel_days, futures, failed_channels), catch Exception as e, log or
record the exception with context (include the channel id/name from
futures[future] or the returned cid/cname if available), append the channel id
to failed_channels for failed tasks, and continue to the next future so one
failing thread doesn't kill the whole export.
discord_activity_tracker/sync/dce_cli.py (2)

38-47: Silent exception swallowing hides configuration issues.

The bare except Exception: pass silently ignores any errors when reading Django settings, including misconfigurations. Consider logging at debug level:

♻️ Suggested improvement
     try:
         from django.conf import settings
 
         if getattr(settings, "configured", False):
             v = (getattr(settings, "DISCORD_CHAT_EXPORTER_VERSION", "") or "").strip()
             if v:
                 return v
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("Could not read DISCORD_CHAT_EXPORTER_VERSION from Django settings: %s", e)
     return DEFAULT_PINNED_VERSION
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/dce_cli.py` around lines 38 - 47, The
try/except in the settings-read block silently swallows all exceptions (hiding
config errors); update the handler in the function that returns
DEFAULT_PINNED_VERSION to catch Exception but log the caught exception at debug
level (using the module logger) including context (e.g., mention reading
DISCORD_CHAT_EXPORTER_VERSION from django.conf.settings) before returning
DEFAULT_PINNED_VERSION; reference the getattr(settings, "configured"),
DISCORD_CHAT_EXPORTER_VERSION and DEFAULT_PINNED_VERSION symbols to locate where
to add the logger.debug call.

102-112: URL scheme validation would address static analysis concern.

While the URLs are constructed from constants (GitHub API), adding explicit HTTPS validation would silence S310 and add defense in depth:

♻️ Optional: Add URL scheme validation
 def _http_get_json(url: str) -> dict:
+    if not url.startswith("https://"):
+        raise ValueError(f"Only HTTPS URLs are allowed: {url}")
     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
     with urllib.request.urlopen(req, timeout=120) as resp:
         return json.loads(resp.read().decode("utf-8"))
 
 
 def _download_file(url: str, dest: Path) -> None:
+    if not url.startswith("https://"):
+        raise ValueError(f"Only HTTPS URLs are allowed: {url}")
     dest.parent.mkdir(parents=True, exist_ok=True)
     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/dce_cli.py` around lines 102 - 112, The
functions _http_get_json and _download_file currently open URLs without
validating the scheme; add a check at the start of each (or a small shared
helper) that parses the url (via urllib.parse.urlparse) and verifies
parsed.scheme == "https" (raise a ValueError or custom exception if not), so
both _http_get_json(url: str) and _download_file(url: str, dest: Path) only
proceed when the URL is HTTPS; keep using USER_AGENT and existing timeouts after
validation to preserve behavior.
discord_activity_tracker/management/commands/run_discord_activity_tracker.py (1)

178-178: Add type annotation for since_date parameter.

For consistency with other typed parameters and better IDE support:

♻️ Suggested fix
-        since_date,
+        since_date: Optional[datetime] = None,

You'll need to import Optional and datetime at the top if not already present for this signature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`
at line 178, The parameter since_date in the command handler needs a type
annotation (e.g., Optional[datetime.datetime]) for consistency; update the
function signature that declares since_date (likely Command.handle or handle
method in run_discord_activity_tracker.py) to use since_date:
Optional[datetime.datetime], and add the imports "from typing import Optional"
and "import datetime" at the top of the file if they aren't already present.
discord_activity_tracker/management/commands/run_discord_channel_export.py (1)

99-106: Consider capturing subprocess output for better error diagnostics.

When the script fails, only the exit code is reported. Capturing stderr would provide more context:

♻️ Optional improvement for debugging
     cwd = script_path.parent
     cmd = [sys.executable, str(script_path)]
     logger.info("Running %s with cwd=%s", cmd, cwd)
-    proc = subprocess.run(cmd, cwd=cwd, env=env, check=False)
+    proc = subprocess.run(cmd, cwd=cwd, env=env, check=False, capture_output=True, text=True)
     if proc.returncode != 0:
+        if proc.stderr:
+            logger.error("export_guild_by_day.py stderr: %s", proc.stderr[:2000])
         raise CommandError(
             f"export_guild_by_day.py exited with code {proc.returncode}"
         )
+    if proc.stdout:
+        self.stdout.write(proc.stdout)
     self.stdout.write(self.style.SUCCESS("export_guild_by_day.py finished"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/management/commands/run_discord_channel_export.py`
around lines 99 - 106, The subprocess invocation in
run_discord_channel_export.py currently only checks proc.returncode and raises
CommandError with the exit code; modify the subprocess.run call (the proc
creation) to capture output (e.g., capture_output=True or
stdout=PIPE/stderr=PIPE) and on non-zero return code include proc.stderr (and
optionally proc.stdout) in the CommandError or an additional logger.error
message so the error contains the script's stderr for diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@discord_activity_tracker/github_publish.py`:
- Around line 21-24: The current assignment to branch calls .strip() after the
truthy-or-default logic which lets whitespace-only settings become an empty
string; instead, first retrieve the raw value from getattr(settings,
"DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH), call .strip() on that result,
then if the stripped value is falsy set branch = DEFAULT_BRANCH; update the code
that assigns branch (the variable named branch using
settings.DISCORD_MARKDOWN_REPO_BRANCH and DEFAULT_BRANCH) so the fallback
happens after trimming whitespace.
- Around line 56-64: The code calls get_github_token(use="write") which can
raise and currently lets exceptions escape, breaking the function's boolean
success/failure contract; wrap the call to get_github_token in a try/except,
handle exceptions (log or process via existing logger) and return False on
failure instead of propagating the exception, then only call
upload_folder_to_github when a token was successfully retrieved—reference
get_github_token and upload_folder_to_github to locate where to add the
try/except and the early False return.

In
`@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`:
- Around line 74-85: The --until argument is parsed into until_dt but never used
when calling sync_all_channels (only since_date is passed), causing confusion;
either remove the parser.add_argument(...) entry for
"--until"/"--to-date"/"--end-time" and its validation/usage so the CLI does not
advertise an unsupported option, or update the sync_all_channels function
signature to accept an until_date parameter and thread until_dt through the call
at the site that invokes sync_all_channels (update the function definition of
sync_all_channels and all callers accordingly), and adjust the help text for
parser.add_argument or sync_all_channels docstrings to reflect actual support;
refer to the parser.add_argument block that defines "--until", the until_dt
variable where it is parsed/validated, and the sync_all_channels invocation to
implement the chosen fix.

In `@discord_activity_tracker/preprocessors/discord_preprocessor.py`:
- Line 70: The current assignment to min_len (int(getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", 50) or 50)) treats an explicit 0 as missing; change
it to preserve a user-set zero by fetching the raw setting and only applying the
default when the attribute is not set or is None (e.g. val = getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", None); min_len = int(val) if val is not None else
50), so that settings.PINECONE_MIN_TEXT_LENGTH == 0 remains valid; update the
code around the min_len assignment accordingly and add a brief comment
clarifying the intent.
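The None-sentinel pattern the prompt describes, as a standalone sketch (resolve_min_text_length is a hypothetical helper name):

```python
def resolve_min_text_length(raw, default: int = 50) -> int:
    # Apply the default only when the setting is absent (None);
    # an explicit 0 is a deliberate user choice and must survive.
    return int(raw) if raw is not None else default
```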

In `@discord_activity_tracker/sync/dce_cli.py`:
- Around line 115-132: The function _extract_cli_zip_bundle currently calls
zipfile.ZipFile.extractall which is vulnerable to zip-slip; instead iterate over
zf.infolist() and for each member validate its name is not absolute and does not
contain '..', compute the destination path by joining bundle with the member
name and ensure dest.resolve() is within bundle.resolve() (e.g., startswith
bundle.resolve() + os.sep), then create parent dirs and write the file from
zf.open(member) to the destination; keep the existing checks for target_name
("DiscordChatExporter.Cli.exe") and raise DiscordChatExporterCliNotFoundError if
exe is missing.
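A zip-slip-safe extraction loop along the lines the prompt describes might look like this sketch; safe_extract is a hypothetical helper, not the module's actual function, and the DiscordChatExporter.Cli.exe check is omitted:

```python
import os
import zipfile
from pathlib import Path

def safe_extract(zip_path: Path, dest: Path) -> None:
    # Validate each member instead of calling extractall(), so a crafted
    # archive cannot write outside dest via absolute or ".."-laden names.
    dest = dest.resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            name = member.filename
            if os.path.isabs(name) or ".." in Path(name).parts:
                raise ValueError(f"Unsafe member path: {name}")
            target = (dest / name).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"Member escapes destination: {name}")
            if member.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(member) as src, open(target, "wb") as out:
                out.write(src.read())
```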

In `@discord_activity_tracker/tests/test_bulk_services.py`:
- Around line 295-312: The test_idempotent function stops after a single call to
bulk_process_message_batch and never verifies idempotency; update it to call
bulk_process_message_batch(messages, channel) a second time and then assert that
no duplicate records were created (e.g., check the relevant model/query counts
for the message_id 5001 and any related reaction/attachment records remain
unchanged), referencing bulk_process_message_batch and the message_id used in
the messages fixture to locate the correct assertions.

---

Nitpick comments:
In
`@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`:
- Line 178: The parameter since_date in the command handler needs a type
annotation (e.g., Optional[datetime.datetime]) for consistency; update the
function signature that declares since_date (likely Command.handle or handle
method in run_discord_activity_tracker.py) to use since_date:
Optional[datetime.datetime], and add the imports "from typing import Optional"
and "import datetime" at the top of the file if they aren't already present.

In `@discord_activity_tracker/management/commands/run_discord_channel_export.py`:
- Around line 99-106: The subprocess invocation in run_discord_channel_export.py
currently only checks proc.returncode and raises CommandError with the exit
code; modify the subprocess.run call (the proc creation) to capture output
(e.g., capture_output=True or stdout=PIPE/stderr=PIPE) and on non-zero return
code include proc.stderr (and optionally proc.stdout) in the CommandError or an
additional logger.error message so the error contains the script's stderr for
diagnostics.

In `@discord_activity_tracker/management/commands/run_discord_exporter.py`:
- Around line 398-404: The loop is calling asyncio.run(...) for every JSON file
which creates a new event loop each iteration; instead accumulate channel
payloads (e.g., collect channel_data into a list) and call
asyncio.run(persist_exporter_channel_payloads(all_channel_data,
expected_guild_id=guild_id)) once to process them in a single event loop, then
unlink the corresponding json_path files after the batched call succeeds;
alternatively create one event loop outside the per-file loop and schedule
persist_exporter_channel_payloads tasks for each batch to avoid repeated
asyncio.run invocations while still ensuring json_path.unlink() runs only after
successful persistence.
- Around line 447-448: The import of parse_datetime inside the JSON-files loop
causes repeated overhead; move the import statement for parse_datetime from
inside the loop to the top of the surrounding function (e.g., the command's
handle or run loop in run_discord_exporter.py) so it's executed once, then
update the loop to call parse_datetime directly; reference the symbol
parse_datetime and the loop in run_discord_exporter.py to locate and adjust the
code.

In `@discord_activity_tracker/offline_scripts/export_guild_by_day.py`:
- Around line 622-623: The variables start_dt and end_dt are being shadowed here
(assigned from last_date and today_str) which conflicts with outer-scope
start_dt/end_dt; rename them to distinct identifiers (e.g., ch_start_dt and
ch_end_dt) and update every subsequent use in this local block to the new names
so you don't accidentally reference or overwrite the outer variables — change
the assignments from datetime.strptime(last_date, "%Y-%m-%d") and
datetime.strptime(today_str, "%Y-%m-%d") and all later local references
accordingly.
- Around line 690-695: The futures loop currently calls future.result() without
catching exceptions, so if export_channel_days raises the exception it will
propagate; wrap the result retrieval in a try-except around the call in the
ThreadPoolExecutor loop that iterates over as_completed(futures) (referencing
export_channel_days, futures, failed_channels), catch Exception as e, log or
record the exception with context (include the channel id/name from
futures[future] or the returned cid/cname if available), append the channel id
to failed_channels for failed tasks, and continue to the next future so one
failing thread doesn't kill the whole export.

In `@discord_activity_tracker/sync/backfill_paths.py`:
- Around line 44-47: _as_date currently returns None for inputs that are date
instances (only checks isinstance(d, datetime)); update _as_date to accept both
datetime and date by returning d.date() when d is a datetime and returning d
when d is a date, while still returning None for None inputs so callers like
parse_iso_datetime continue to work; adjust the function signature/logic in
_as_date to use isinstance checks against datetime and date accordingly.

In `@discord_activity_tracker/sync/dce_cli.py`:
- Around line 38-47: The try/except in the settings-read block silently swallows
all exceptions (hiding config errors); update the handler in the function that
returns DEFAULT_PINNED_VERSION to catch Exception but log the caught exception
at debug level (using the module logger) including context (e.g., mention
reading DISCORD_CHAT_EXPORTER_VERSION from django.conf.settings) before
returning DEFAULT_PINNED_VERSION; reference the getattr(settings, "configured"),
DISCORD_CHAT_EXPORTER_VERSION and DEFAULT_PINNED_VERSION symbols to locate where
to add the logger.debug call.
- Around line 102-112: The functions _http_get_json and _download_file currently
open URLs without validating the scheme; add a check at the start of each (or a
small shared helper) that parses the url (via urllib.parse.urlparse) and
verifies parsed.scheme == "https" (raise a ValueError or custom exception if
not), so both _http_get_json(url: str) and _download_file(url: str, dest: Path)
only proceed when the URL is HTTPS; keep using USER_AGENT and existing timeouts
after validation to preserve behavior.

In `@discord_activity_tracker/sync/importer.py`:
- Around line 70-76: The code re-converts the last message redundantly; instead
of calling convert_exporter_message_to_dict(messages[-1]) again, reuse the
already converted item in converted (e.g. use converted[-1] as last_msg), parse
its "created_at" with parse_datetime, and pass that last_time into
sync_to_async(update_channel_last_activity)(channel, last_time); update the
block that sets last_msg/last_time to reference converted[-1] to avoid duplicate
work.
- Around line 87-90: The catch-all except block that handles per-channel
failures should log the full traceback instead of just str(e); in the except
handler for the channel persistence (the block using channel_data and ch_name),
replace the current logger.error(...) call with logger.exception(...) (or
logger.error(..., exc_info=True)) so the stack trace is recorded while still
continuing the loop; keep the broad except if you want resilience but ensure the
log message includes ch_name and the traceback for debugging.
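The difference between `logger.error(str(e))` and `logger.exception(...)` can be seen in a small sketch; the channel data and the in-memory log capture below are illustrative:

```python
import io
import logging

logger = logging.getLogger("discord_import_demo")
logger.setLevel(logging.ERROR)
buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))

channels = {"general": {"ok": True}, "broken": None}
imported = []

for ch_name, channel_data in channels.items():
    try:
        # Stand-in for the per-channel persistence step; len(None) raises.
        imported.append((ch_name, len(channel_data)))
    except Exception:
        # logger.exception == logger.error(..., exc_info=True): the full
        # traceback is recorded with ch_name, and the loop continues.
        logger.exception("Failed to import channel %r", ch_name)

log_text = buf.getvalue()
```

The broad `except` keeps the importer resilient, but the log now contains the `Traceback` block needed to debug the failing channel instead of just the exception message.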

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0406c40b-fda4-49e8-9af2-fbc1637cd56c

📥 Commits

Reviewing files that changed from the base of the PR and between cf8f184 and 70805b5.

📒 Files selected for processing (23)
  • .env.example
  • .gitignore
  • config/settings.py
  • config/test_settings.py
  • discord_activity_tracker/.gitignore
  • discord_activity_tracker/github_publish.py
  • discord_activity_tracker/management/commands/backfill_discord_json.py
  • discord_activity_tracker/management/commands/debug_discord_export.py
  • discord_activity_tracker/management/commands/run_discord_activity_tracker.py
  • discord_activity_tracker/management/commands/run_discord_channel_export.py
  • discord_activity_tracker/management/commands/run_discord_exporter.py
  • discord_activity_tracker/offline_scripts/export_guild_by_day.py
  • discord_activity_tracker/preprocessors/__init__.py
  • discord_activity_tracker/preprocessors/discord_preprocessor.py
  • discord_activity_tracker/sync/backfill_paths.py
  • discord_activity_tracker/sync/chat_exporter.py
  • discord_activity_tracker/sync/dce_cli.py
  • discord_activity_tracker/sync/importer.py
  • discord_activity_tracker/tests/fixtures/backfill_minimal.json
  • discord_activity_tracker/tests/test_backfill_discord_json.py
  • discord_activity_tracker/tests/test_backfill_paths.py
  • discord_activity_tracker/tests/test_bulk_services.py
  • discord_activity_tracker/workspace.py

Comment on lines +21 to +24
branch = (
getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
or DEFAULT_BRANCH
).strip()

⚠️ Potential issue | 🟠 Major

Normalize branch fallback after trimming whitespace.

A whitespace-only branch setting currently becomes "" after .strip(), which can cause upload calls to target an invalid branch ref.

Suggested fix
-    branch = (
-        getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
-        or DEFAULT_BRANCH
-    ).strip()
+    branch_raw = (
+        getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH)
+        or DEFAULT_BRANCH
+    )
+    branch = branch_raw.strip() or DEFAULT_BRANCH
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/github_publish.py` around lines 21 - 24, The current
assignment to branch calls .strip() after the truthy-or-default logic which lets
whitespace-only settings become an empty string; instead, first retrieve the raw
value from getattr(settings, "DISCORD_MARKDOWN_REPO_BRANCH", DEFAULT_BRANCH),
call .strip() on that result, then if the stripped value is falsy set branch =
DEFAULT_BRANCH; update the code that assigns branch (the variable named branch
using settings.DISCORD_MARKDOWN_REPO_BRANCH and DEFAULT_BRANCH) so the fallback
happens after trimming whitespace.
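The ordering pitfall is easy to demonstrate in isolation; `DEFAULT_BRANCH = "main"` is assumed here for illustration:

```python
DEFAULT_BRANCH = "main"

def branch_wrong(raw):
    # Fallback before strip: whitespace-only input slips through as "".
    return (raw or DEFAULT_BRANCH).strip()

def branch_fixed(raw):
    # Strip first, then fall back: "   " and None both yield the default.
    return (raw or "").strip() or DEFAULT_BRANCH

assert branch_wrong("   ") == ""          # invalid branch ref
assert branch_fixed("   ") == "main"
assert branch_fixed(None) == "main"
assert branch_fixed(" develop ") == "develop"
```

The suggested fix above applies exactly this reordering: trim the raw setting, then let an empty result fall back to `DEFAULT_BRANCH`.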

Comment on lines +56 to +64
token = get_github_token(use="write")
result = upload_folder_to_github(
local_folder=local_folder,
owner=owner,
repo=repo,
commit_message="chore: update Discord archive markdown",
branch=branch,
token=token,
)

⚠️ Potential issue | 🟠 Major

Preserve the function’s boolean failure contract on token errors.

get_github_token(use="write") can raise (e.g., missing token). Right now that exception escapes, despite this function being structured as a bool success/failure API.

Suggested fix
-    token = get_github_token(use="write")
-    result = upload_folder_to_github(
-        local_folder=local_folder,
-        owner=owner,
-        repo=repo,
-        commit_message="chore: update Discord archive markdown",
-        branch=branch,
-        token=token,
-    )
+    try:
+        token = get_github_token(use="write")
+        result = upload_folder_to_github(
+            local_folder=local_folder,
+            owner=owner,
+            repo=repo,
+            commit_message="chore: update Discord archive markdown",
+            branch=branch,
+            token=token,
+        )
+    except Exception as e:
+        logger.error("Discord markdown upload setup failed: %s", e)
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/github_publish.py` around lines 56 - 64, The code
calls get_github_token(use="write") which can raise and currently lets
exceptions escape, breaking the function's boolean success/failure contract;
wrap the call to get_github_token in a try/except, handle exceptions (log or
process via existing logger) and return False on failure instead of propagating
the exception, then only call upload_folder_to_github when a token was
successfully retrieved—reference get_github_token and upload_folder_to_github to
locate where to add the try/except and the early False return.

Comment on lines +74 to 85
parser.add_argument(
"--until",
"--to-date",
"--end-time",
type=str,
default=None,
dest="until",
help="Upper bound for message sync (same formats as --since). "
"Note: the Discord client currently fetches forward from the sync cursor; "
"this value is passed for API consistency where supported. "
"--to-date and --end-time are deprecated aliases for --until.",
)

⚠️ Potential issue | 🟡 Minor

until_dt is parsed but never passed to sync_all_channels.

The --until argument is parsed and validated (lines 97-105), but only since_date is passed to sync_all_channels at line 145. Based on context snippet 2, sync_all_channels only accepts since_date parameter, so there's no until_date parameter to pass.

Either:

  1. Remove --until from this command if it's not supported by the underlying sync function, or
  2. Update sync_all_channels to accept an until_date parameter if the feature is intended.

The help text at lines 82-83 acknowledges this limitation ("the Discord client currently fetches forward from the sync cursor"), but the argument still creates user confusion.

Also applies to: 145-146

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/management/commands/run_discord_activity_tracker.py`
around lines 74 - 85, The --until argument is parsed into until_dt but never
used when calling sync_all_channels (only since_date is passed), causing
confusion; either remove the parser.add_argument(...) entry for
"--until"/"--to-date"/"--end-time" and its validation/usage so the CLI does not
advertise an unsupported option, or update the sync_all_channels function
signature to accept an until_date parameter and thread until_dt through the call
at the site that invokes sync_all_channels (update the function definition of
sync_all_channels and all callers accordingly), and adjust the help text for
parser.add_argument or sync_all_channels docstrings to reflect actual support;
refer to the parser.add_argument block that defines "--until", the until_dt
variable where it is parsed/validated, and the sync_all_channels invocation to
implement the chosen fix.

``failed_ids``.
"""
normalized_failed = _normalize_failed_ids(failed_ids)
min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)

⚠️ Potential issue | 🟡 Minor

Fallback logic may override explicit zero setting.

The expression int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50) will use 50 if the setting is explicitly 0 (since 0 or 50 evaluates to 50). If 0 is a valid value meaning "no minimum length", this could be unexpected.

If 0 should disable the minimum check:

-    min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)
+    raw = getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50)
+    min_len = int(raw) if raw is not None else 50

If 0 should indeed fall back to 50, the current code is correct but consider adding a comment.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    min_len = int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50)
+    raw = getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50)
+    min_len = int(raw) if raw is not None else 50
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/preprocessors/discord_preprocessor.py` at line 70,
The current assignment to min_len (int(getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", 50) or 50)) treats an explicit 0 as missing; change
it to preserve a user-set zero by fetching the raw setting and only applying the
default when the attribute is not set or is None (e.g. val = getattr(settings,
"PINECONE_MIN_TEXT_LENGTH", None); min_len = int(val) if val is not None else
50), so that settings.PINECONE_MIN_TEXT_LENGTH == 0 remains valid; update the
code around the min_len assignment accordingly and add a brief comment
clarifying the intent.
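The falsy-zero trap is worth seeing in two lines; the `Settings` stand-in below is illustrative, not Django's settings object:

```python
class Settings:
    PINECONE_MIN_TEXT_LENGTH = 0  # explicitly disable the minimum

settings = Settings()

# `or` treats 0 as missing and silently restores the default:
assert int(getattr(settings, "PINECONE_MIN_TEXT_LENGTH", 50) or 50) == 50

# Comparing against None preserves an explicit zero:
raw = getattr(settings, "PINECONE_MIN_TEXT_LENGTH", None)
min_len = int(raw) if raw is not None else 50
assert min_len == 0
```

The same pattern applies to any integer setting where `0` is a meaningful value.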

Comment on lines +115 to +132
def _extract_cli_zip_bundle(zip_path: Path, tools_dir: Path) -> Path:
"""
Extract a RID-specific CLI zip (flat layout) so DiscordChatExporter.Cli.exe
sits next to its dependencies under tools/DiscordChatExporter-cli/.
"""
bundle = tools_dir / DCE_CLI_BUNDLE_SUBDIR
if bundle.exists():
shutil.rmtree(bundle)
bundle.mkdir(parents=True, exist_ok=True)
target_name = "DiscordChatExporter.Cli.exe"
with zipfile.ZipFile(zip_path, "r") as zf:
zf.extractall(bundle)
exe = bundle / target_name
if not exe.is_file():
raise DiscordChatExporterCliNotFoundError(
f"No {target_name} found inside {zip_path} after extract"
)
return exe

⚠️ Potential issue | 🟠 Major

Potential zip-slip vulnerability in _extract_cli_zip_bundle.

Using zipfile.extractall() without validating member paths could allow a malicious archive to write files outside the target directory via path traversal (e.g., ../../malicious.exe).

🔒 Proposed fix with path validation
 def _extract_cli_zip_bundle(zip_path: Path, tools_dir: Path) -> Path:
     """
     Extract a RID-specific CLI zip (flat layout) so DiscordChatExporter.Cli.exe
     sits next to its dependencies under tools/DiscordChatExporter-cli/.
     """
     bundle = tools_dir / DCE_CLI_BUNDLE_SUBDIR
     if bundle.exists():
         shutil.rmtree(bundle)
     bundle.mkdir(parents=True, exist_ok=True)
     target_name = "DiscordChatExporter.Cli.exe"
     with zipfile.ZipFile(zip_path, "r") as zf:
+        for member in zf.namelist():
+            member_path = (bundle / member).resolve()
+            if not str(member_path).startswith(str(bundle.resolve())):
+                raise DiscordChatExporterCliNotFoundError(
+                    f"Zip contains unsafe path: {member}"
+                )
         zf.extractall(bundle)
     exe = bundle / target_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/sync/dce_cli.py` around lines 115 - 132, The
function _extract_cli_zip_bundle currently calls zipfile.ZipFile.extractall
which is vulnerable to zip-slip; instead iterate over zf.infolist() and for each
member validate its name is not absolute and does not contain '..', compute the
destination path by joining bundle with the member name and ensure
dest.resolve() is within bundle.resolve() (e.g., startswith bundle.resolve() +
os.sep), then create parent dirs and write the file from zf.open(member) to the
destination; keep the existing checks for target_name
("DiscordChatExporter.Cli.exe") and raise DiscordChatExporterCliNotFoundError if
exe is missing.
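The member-by-member extraction described in the prompt can be sketched as a standalone helper; `safe_extract` and the in-memory malicious archive below are illustrative, and the real fix would keep the existing `DiscordChatExporterCliNotFoundError` handling:

```python
import os
import tempfile
import zipfile
from pathlib import Path

def safe_extract(zip_path: Path, dest: Path) -> None:
    """Extract member-by-member, rejecting paths that escape dest (zip-slip)."""
    dest = dest.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            target = (dest / member.filename).resolve()
            # Absolute members and ../ traversal both fail this check.
            if os.path.commonpath([dest, target]) != str(dest):
                raise ValueError(f"Unsafe path in archive: {member.filename}")
            if member.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(member) as src, open(target, "wb") as out:
                out.write(src.read())

# Build a malicious zip and verify it is rejected before anything is written.
with tempfile.TemporaryDirectory() as tmp:
    evil = Path(tmp) / "evil.zip"
    with zipfile.ZipFile(evil, "w") as zf:
        zf.writestr("../escape.txt", "pwned")
    try:
        safe_extract(evil, Path(tmp) / "out")
        rejected = False
    except ValueError:
        rejected = True
    assert rejected
```

Resolving both paths and comparing their common prefix covers `../` traversal and absolute member names in one check, which is what the plain `extractall()` call skips.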

Comment on lines +295 to +312
@pytest.mark.django_db
def test_idempotent(channel):
"""Running same batch twice should not create duplicates."""
now = datetime(2026, 2, 17, 12, 0, 0, tzinfo=timezone.utc)
messages = [
{
"message_id": 5001,
"author": _user(1001, "alice"),
"content": "Test",
"message_created_at": now,
"message_edited_at": None,
"reply_to_message_id": None,
"attachment_urls": [],
"reactions": [{"emoji": "\U0001f44d", "count": 1}],
},
]

bulk_process_message_batch(messages, channel)

⚠️ Potential issue | 🟡 Minor

Test function test_idempotent appears incomplete.

The test calls bulk_process_message_batch(messages, channel) on line 312 but the file ends without the second call and assertions to verify idempotency. Based on the docstring "Running same batch twice should not create duplicates", the test should run the batch twice and assert no duplicates were created.

Proposed completion
     bulk_process_message_batch(messages, channel)
+    bulk_process_message_batch(messages, channel)
+
+    assert DiscordMessage.objects.filter(message_id=5001).count() == 1
+    assert DiscordReaction.objects.filter(emoji="\U0001f44d").count() == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@discord_activity_tracker/tests/test_bulk_services.py` around lines 295 - 312,
The test_idempotent function stops after a single call to
bulk_process_message_batch and never verifies idempotency; update it to call
bulk_process_message_batch(messages, channel) a second time and then assert that
no duplicate records were created (e.g., check the relevant model/query counts
for the message_id 5001 and any related reaction/attachment records remain
unchanged), referencing bulk_process_message_batch and the message_id used in
the messages fixture to locate the correct assertions.
