I’m mapping the project from the repo summary first, then I’ll trace the main user flows from the client into the server so I can answer where the core logic lives and what processes text behind the scenes. I’m loading the relevant project-analysis skills now so the architecture walkthrough stays grounded in the repo’s frontend and backend patterns.
Read skill frontend-patterns, lines 1 to 240
Read skill backend-patterns, lines 1 to 240
Read skill python-patterns, lines 1 to 240
I have the general frontend/backend guidance loaded. Next I’m reading the repo’s own entry points and route files so I can explain the real request flow instead of giving a generic architecture summary.
Read memory
I’ve confirmed the app is a Next.js client talking to a FastAPI server, but I still need the concrete call chain. I’m reading the main pages, API client, server router, and processing modules now to trace exactly what happens to uploaded text.
The client pages confirm the user flow. I’m now tracing the internal stores and server modules, because that is where the real behavior sits: request dispatch, state polling, extraction, translation, glossary generation, Mongo persistence, and export.
I’ve got the main path from upload to glossary generation. I’m filling the remaining gaps now: auxiliary routes, preprocessing/postprocessing, chunking, export, and the exact client API surface so I can tell you which pieces are real APIs versus internal services.
Searched for regex @router\.(get|post|patch|delete)\( (**/src/server/api/routes/**), 23 results
I now have the backend processing modules. I’m pulling the remaining route definitions and the client’s exported API functions so I can give you a clean map of “screen -> store -> API route -> backend service/module.”
Searched for text export async function (**/src/client/src/lib/api.ts), 24 results
I have enough to explain the system. I’m grabbing precise locations for the core entry points now so the walkthrough points you to the right files to read first instead of making you hunt through the repo.
Searched for regex def create_app|include_router|startup_event|uvicorn.run (**/src/server/app/main.py), 7 results
Searched for regex export async function createBatch|export async function startBatch|export async function getGlossary|export async function exportGlossary|export async function getSystemMetrics|export async function applyRules (**/src/client/src/lib/api.ts), 6 results
Searched for regex createBatch: async|fetchBatches: async|startBatch: async|fetchGlossary: async|fetchPartialGlossary: async|applyRulesToGlossary: async|checkApiHealth: async (**/src/client/src/store/**), 7 results
Searched for regex async def create_new_batch|async def list_batches|async def start_batch_processing|async def get_glossary|async def export_glossary|async def get_combined_glossary|async def export_all_glossaries|async def update_term|async def search_terms (**/src/server/api/routes/batches.py), 9 results
Searched for regex def create_batch|def process_single_batch|def get_batch_info|def get_all_batch_info|def get_batch_glossary|def cancel_batch (**/src/server/modules/batch_processor.py), 6 results
Searched for regex def extract_text|def extract_from_pdf|def extract_from_docx|def extract_from_txt|def extract_from_json (**/src/server/modules/extractor.py), 5 results
Searched for regex def translate_text|def translate_chunk|def translate_with_retry (**/src/server/modules/translator.py), 3 results
Searched for regex def create_prompt|def generate_glossary_chunk|def generate_glossary|def call_gemma_api|def poll_nvcf_result (**/src/server/modules/glossary_engine.py), 5 results
Searched for regex class TextQualityAnalyzer|def detect_language|class BilingualSeparator|class TextSanitizer (**/src/server/modules/preprocessor.py), 4 results
Searched for regex class JSONRepair|def repair_and_parse|class GlossaryValidator|def deduplicate_entries (**/src/server/modules/postprocessor.py), 4 results
Searched for regex def init_batch_in_mongo|def update_batch_status_mongo|def push_chunk_entries|def save_batch_to_mongo|def get_batch_from_mongo|def get_batch_glossary_from_mongo|def update_term_in_mongo|def search_batch_terms (**/src/server/modules/mongo_connector.py), 8 results
Searched for regex @router.get\("/metrics|@router.post\("/postprocessor/apply|@router.get\("/health (**/src/server/api/routes/**), 3 results
One last gap is the live-state layer. I’m reading the Redis helper so I can distinguish transient runtime state from persisted Mongo data and explain why the UI can show partial progress while a batch is still running.
Searched for regex def set_batch_status|def set_batch_data|def get_batch_data|def set_batch_error|def set_chunk_status|def get_chunk_summary|def is_batch_active|def load_from_redis (**/src/server/utils/redis_client.py), 8 results
I found two architecture details that matter a lot: the docs are a bit stale, and the runtime is more in-process than the high-level docs suggest. I’m reading the LLM provider abstraction now so I can accurately explain the optional post-processing path too.
Searched for regex def generate_with_llm|provider|gemma|openai (**/src/server/utils/llm_provider.py), 69 results
Searched for regex useBatchStore|useGlossaryStore|usePostProcessorStore|useMetricsStore (**/src/client/src/components/**), 10 results
The call chain is complete now. I’m checking the combined-glossary UI once so I can tell you where the post-processing rules fit in from the user’s point of view, not just on the server side.
Big Picture
This project is a 2-tier app, not a monolith with many hidden layers.
The frontend is a Next.js App Router app that renders pages and manages client state with Zustand stores, then calls the Python backend directly through fetch wrappers in api.ts. The backend is a FastAPI app created in main.py, with routers for batches, metrics, health, and post-processors mounted in main.py.
There is no separate Next.js API layer in this repo. The browser talks straight to FastAPI using the base URL defined in constants.ts.
One important correction to the docs: the actual frontend state layer is mostly Zustand, not React Context. The real state entry points are batchStore.ts, glossaryStore.ts, settingsStore.ts, metricsStore.ts, and postProcessorStore.ts.
Also, the runtime is more in-process than the docs imply. Batch execution is started with FastAPI BackgroundTasks in batches.py, and live batch state is primarily kept in memory with optional Redis sync in redis_client.py. MongoDB is the durable store for batch/glossary data in mongo_connector.py.
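To make that wiring concrete, here is a minimal sketch. The symbol names (create_app, include_router, uvicorn.run) and the route modules come from the search hits above; the import paths, prefixes, and arguments are my assumptions, not the repo’s exact code.

```python
# Hypothetical sketch of src/server/app/main.py; only the symbol names
# (create_app, include_router, uvicorn.run) are confirmed by the searches.
from fastapi import FastAPI

from api.routes import batches, health, metrics, postprocessors  # assumed import path


def create_app() -> FastAPI:
    app = FastAPI(title="Glossary Service")  # title is an assumption
    # Routers for batches, metrics, and post-processors, mounted in main.py.
    app.include_router(batches.router, prefix="/api/v1")
    app.include_router(metrics.router, prefix="/api/v1")
    app.include_router(postprocessors.router, prefix="/api/v1")
    app.include_router(health.router)  # /health is unversioned in the route list
    return app


app = create_app()

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)  # port is an assumption
```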
What Happens When The User Uses It
From the user side, the main screens are a handful of App Router page.tsx routes: the upload screen, the batch list and batch detail pages, and the combined-glossary view.
The end-to-end flow is:
- User uploads a source file and optionally a reference file on page.tsx.
- That page calls the batch store in batchStore.ts.
- The store calls the client API function createBatch in api.ts.
- The browser sends POST /api/v1/batches to FastAPI, handled by batches.py.
- FastAPI calls create_batch in batch_processor.py.
- That function stores metadata and raw file bytes in runtime state, marks the batch pending, and initializes a Mongo document via mongo_connector.py.
At that point the batch exists, but processing has not started yet.
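A hedged sketch of that create path: the function names (create_new_batch, create_batch, init_batch_in_mongo) are confirmed by the searches, but the request and response shapes are assumptions.

```python
# Hypothetical sketch of POST /api/v1/batches ->
# batches.create_new_batch -> batch_processor.create_batch.
from fastapi import APIRouter, File, UploadFile

from modules import batch_processor

router = APIRouter(prefix="/batches", tags=["batches"])


@router.post("")
async def create_new_batch(
    source_file: UploadFile = File(...),
    reference_file: UploadFile | None = File(None),  # optional Hindi reference
):
    source_bytes = await source_file.read()
    ref_bytes = await reference_file.read() if reference_file else None
    # Stores metadata + raw bytes in runtime state, marks the batch
    # "pending", and initializes the Mongo document (init_batch_in_mongo).
    batch_id = batch_processor.create_batch(
        source_name=source_file.filename,
        source_bytes=source_bytes,
        reference_bytes=ref_bytes,
    )
    return {"batch_id": batch_id, "status": "pending"}
```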
When the user clicks Start:
- The client calls startBatch from api.ts.
- FastAPI handles POST /api/v1/batches/{id}/start in batches.py.
- FastAPI schedules process_single_batch from batch_processor.py.
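Since this is plain FastAPI BackgroundTasks rather than a queue, the start handler plausibly looks like this. start_batch_processing, process_single_batch, and get_batch_info are real names from the searches; everything else is assumed.

```python
# Hypothetical sketch of POST /api/v1/batches/{id}/start in batches.py.
from fastapi import APIRouter, BackgroundTasks, HTTPException

from modules import batch_processor

router = APIRouter(prefix="/batches")


@router.post("/{batch_id}/start")
async def start_batch_processing(batch_id: str, background_tasks: BackgroundTasks):
    if batch_processor.get_batch_info(batch_id) is None:
        raise HTTPException(status_code=404, detail="unknown batch")
    # In-process execution: no Celery/queue, just a FastAPI background task.
    background_tasks.add_task(batch_processor.process_single_batch, batch_id)
    return {"batch_id": batch_id, "status": "started"}
```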
That processing pipeline is the real heart of the app:
- Extract text from uploaded files using extractor.py.
- Decide how to get Hindi reference text:
  - Path A: the user supplied a Hindi reference file.
  - Path B: the source file is bilingual, so Hindi is split out with preprocessor.py.
  - Path C: the source is mostly English, so Bhashini translation becomes the fallback Hindi reference.
- Translate English chunks to Hindi via Bhashini in translator.py.
- Run glossary extraction against Original + Reference + Bhashini output using the Gemma prompt pipeline in glossary_engine.py.
- Repair malformed LLM JSON with postprocessor.py.
- Persist chunk-level and final results into Mongo with mongo_connector.py.
- Update live progress/status through the in-memory/Redis helper in redis_client.py.
The status progression is: pending -> extracting -> translating -> auditing -> done or error
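Stitching the confirmed function names together, the pipeline skeleton plausibly reads like this. Treat it as a map of the flow, not the repo’s actual code: control flow, arguments, and the chunker call are assumptions.

```python
# Hypothetical skeleton of batch_processor.process_single_batch.
from modules import chunker, extractor, glossary_engine
from modules import mongo_connector, postprocessor, translator
from utils import redis_client


def process_single_batch(batch_id: str) -> None:
    try:
        redis_client.set_batch_status(batch_id, "extracting")
        source_text = extractor.extract_text(batch_id)  # dispatches on PDF/DOCX/TXT/JSON

        redis_client.set_batch_status(batch_id, "translating")
        # Path B would split Hindi out via preprocessor's BilingualSeparator;
        # Path C (sketched here) falls back to Bhashini translation.
        hindi_reference = translator.translate_text(source_text)

        redis_client.set_batch_status(batch_id, "auditing")
        for chunk in chunker.chunk_text(source_text):  # hypothetical chunker call
            raw_json = glossary_engine.generate_glossary_chunk(chunk, hindi_reference)
            entries = postprocessor.repair_and_parse(raw_json)
            mongo_connector.push_chunk_entries(batch_id, entries)  # enables partial UI

        mongo_connector.save_batch_to_mongo(batch_id)
        redis_client.set_batch_status(batch_id, "done")
    except Exception as exc:
        redis_client.set_batch_error(batch_id, str(exc))
```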
While this is running, the UI polls:
- the batch list/detail using batch store calls from batchStore.ts
- the glossary endpoint, including partial glossary during processing, from glossaryStore.ts
That is why the batch detail page can show partial terms before the job finishes. The detail page explicitly fetches partial glossary while processing in page.tsx.
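That partial-progress behavior falls out of the persistence design: chunk entries land in Mongo as they are produced, so the glossary route can serve them mid-run. A hedged sketch, assuming the confirmed names get_glossary, get_batch_glossary_from_mongo, and is_batch_active:

```python
# Hypothetical sketch of GET /api/v1/batches/{id}/glossary in batches.py.
from fastapi import APIRouter

from modules import mongo_connector
from utils import redis_client

router = APIRouter(prefix="/batches")


@router.get("/{batch_id}/glossary")
async def get_glossary(batch_id: str):
    # push_chunk_entries() appends terms chunk-by-chunk during processing,
    # so Mongo already holds a partial glossary before the batch is done.
    entries = mongo_connector.get_batch_glossary_from_mongo(batch_id)
    return {
        "batch_id": batch_id,
        "partial": redis_client.is_batch_active(batch_id),
        "terms": entries,
    }
```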
Which APIs Are Called
From frontend to backend, the important real HTTP calls are:
- POST /api/v1/batches -> create batch in batches.py
- GET /api/v1/batches -> list batches in batches.py
- GET /api/v1/batches/{id} -> batch detail in batches.py
- POST /api/v1/batches/{id}/start -> start processing in batches.py
- GET /api/v1/batches/{id}/glossary -> final or partial glossary in batches.py
- PATCH /api/v1/batches/{id}/terms/{term_id} -> term edits in batches.py
- GET /api/v1/batches/{id}/terms -> server-side search/filter in batches.py
- GET /api/v1/batches/{id}/export -> per-batch export in batches.py
- GET /api/v1/batches/glossaries/combined -> combined glossary in batches.py
- GET /api/v1/batches/glossaries/export-all -> combined export in batches.py
- GET /api/v1/metrics -> system metrics in metrics.py
- GET /health -> health check in health.py
- POST /api/v1/postprocessor/apply -> apply LLM post-processing rules in postprocessors.py
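If you want to poke at this contract without the UI, a plain-requests smoke test against these routes would look roughly like this. Field names, the base URL, and the polling shape are my assumptions.

```python
# Hedged smoke test of the HTTP contract listed above.
import time

import requests

BASE = "http://localhost:8000"  # assumed dev base URL

with open("sample.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/api/v1/batches", files={"source_file": f})
batch_id = resp.json()["batch_id"]  # assumed response field

requests.post(f"{BASE}/api/v1/batches/{batch_id}/start")

# Poll detail + (partial) glossary the way the UI does.
while True:
    status = requests.get(f"{BASE}/api/v1/batches/{batch_id}").json().get("status")
    glossary = requests.get(f"{BASE}/api/v1/batches/{batch_id}/glossary").json()
    print(status, len(glossary.get("terms", [])))
    if status in ("done", "error"):
        break
    time.sleep(2)
```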
From backend to external services, the important non-local calls are:
- Bhashini translation API from translator.py
- NVIDIA NVCF Gemma endpoint from glossary_engine.py
- Optional rule-based post-processing also goes through Gemma via llm_provider.py
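translator.py does expose a translate_with_retry function, but its body is not shown above, so the following is only an illustration of the usual retry-with-backoff pattern around an external call, with a placeholder URL and an assumed response shape:

```python
# Illustrative retry-with-backoff wrapper; NOT the repo's implementation.
import time

import requests

BHASHINI_URL = "https://example.invalid/bhashini/translate"  # placeholder URL


def translate_with_retry(text: str, retries: int = 3, backoff: float = 1.0) -> str:
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            resp = requests.post(BHASHINI_URL, json={"input": text}, timeout=60)
            resp.raise_for_status()
            return resp.json()["output"]  # assumed response shape
        except (requests.RequestException, KeyError) as exc:
            last_error = exc
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"Bhashini translation failed after {retries} tries") from last_error
```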
So yes, there are two kinds of “API” in play:
- internal HTTP APIs between browser and FastAPI
- external service APIs from FastAPI to Bhashini and Gemma
Which Files “Handle The Text”
If your question is “where is the actual text work happening?”, these are the key files:
- Text extraction from PDF/DOCX/TXT/JSON: extractor.py
- Language detection and bilingual split: preprocessor.py
- Text chunking: chunker.py
- Bhashini translation: translator.py
- Gemma prompt and glossary generation: glossary_engine.py
- LLM JSON cleanup and dedup logic: postprocessor.py (a generic sketch of the repair step follows this list)
- Mongo persistence and term update/search: mongo_connector.py
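The repo’s actual repair logic is repair_and_parse in postprocessor.py; as a generic illustration of the technique (not the repo’s code), a minimal LLM-JSON repair pass looks like:

```python
# Generic LLM-JSON repair: strip markdown fences, slice to the outermost
# JSON value, then parse. Illustrative only.
import json
import re


def repair_and_parse(raw: str) -> list[dict]:
    text = raw.strip()
    # Drop ```json ... ``` fences that LLMs often wrap around output.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Slice from the first bracket/brace to the last one.
    start = min((i for i in (text.find("["), text.find("{")) if i != -1), default=0)
    end = max(text.rfind("]"), text.rfind("}")) + 1
    parsed = json.loads(text[start:end])
    # Normalize to a list of entries.
    return parsed if isinstance(parsed, list) else [parsed]
```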
If you want to improve extraction quality, start with extractor.py.
If you want to improve translation behavior, start with translator.py.
If you want to change what terms get extracted, the most important file is glossary_engine.py.
If you want to improve noisy OCR/bilingual input handling, focus on preprocessor.py.
If you want to change how results are cleaned up after the LLM responds, use postprocessor.py.
Where You Should Work
If you are trying to change UI/UX:
- Start in the App Router page.tsx files and the feature components under src/client/src/components
If you are trying to change the frontend-backend contract:
- Start in api.ts and batches.py
If you are trying to change processing flow:
- Start in batch_processor.py
If you are trying to change persistence or term review/search:
- Start in mongo_connector.py
If you are trying to change progress handling or live status:
- Start in redis_client.py
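The "memory-first, Redis-second" behavior is worth internalizing before you touch this file. A minimal sketch of the pattern, assuming the confirmed names set_batch_status and get_batch_data but inventing the bodies:

```python
# Hypothetical sketch of the memory-first/Redis-second pattern in
# redis_client.py; the function names are real, the bodies are assumptions.
import redis

_local_state: dict[str, dict] = {}  # primary runtime source of truth
_redis: redis.Redis | None = None  # optional mirror, None when Redis is absent

try:
    _redis = redis.Redis(host="localhost", port=6379, decode_responses=True)
    _redis.ping()
except redis.RedisError:
    _redis = None


def set_batch_status(batch_id: str, status: str) -> None:
    _local_state.setdefault(batch_id, {})["status"] = status
    if _redis is not None:
        try:  # best-effort sync; a Redis failure never breaks processing
            _redis.hset(f"batch:{batch_id}", "status", status)
        except redis.RedisError:
            pass


def get_batch_data(batch_id: str) -> dict:
    # Memory wins; Redis would only backfill after a restart (load_from_redis).
    return _local_state.get(batch_id, {})
```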
If you are trying to change combined glossary export or post-processing:
- UI: TotalGlossariesSection.tsx and PostProcessorSection.tsx
- Server: postprocessors.py and llm_provider.py
Do not work in repomix-output.xml. That file is only a packed snapshot of the repo for analysis, not the source of truth.
Important Observations
A few things matter when you reason about this codebase:
- The “worker” is not a separate queue system like Celery. Processing is started as an in-process FastAPI background task in batches.py.
- Live state is memory-first, Redis-second, as implemented in redis_client.py. So Redis is helpful, but not the primary runtime source.
- MongoDB is the long-term source of truth for saved batches and glossary terms, and batch info overlays live Redis/in-memory status on top of Mongo in batch_processor.py.
- The combined glossary endpoints currently do not truly deduplicate, even though the UI text suggests unique terms. The dedup logic is commented out in batches.py; a generic sketch of what that dedup would look like follows.
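For reference, deduplicate_entries already exists in postprocessor.py; a generic shape for that kind of term-level dedup (assumed, not the repo’s implementation) is:

```python
# Illustrative term-level dedup, keeping the first occurrence of each term.
def deduplicate_entries(entries: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique: list[dict] = []
    for entry in entries:
        # Normalize the English term so "Voltage" and "voltage " collapse.
        key = entry.get("term", "").strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```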
If you want, I can next give you one of these:
- a screen-to-API map
- a backend sequence diagram for one batch
- a “read these 10 files first” onboarding order