    knowledge-base/
      alpha.md
      beta.md

`alpha.md`:

    # Alpha
    OpenViking repeat add alpha v1. Stable text marker ALPHA_REPEAT.

`beta.md`:

    # Beta
    OpenViking repeat add beta v1. Stable text marker BETA_REPEAT.

Then run the same import three times:

    from openviking import AsyncHTTPClient

    client = AsyncHTTPClient(
        url="http://127.0.0.1:<port>",
        account="codeask",
        user="codeask",
        agent_id="codeask",
        timeout=300,
    )
    await client.initialize()

    to_uri = "viking://resources/codeask/wiki/repeat-add-incremental-smoke"

    # 1. Initial import
    await client.add_resource(
        path="/tmp/.../knowledge-base",
        to=to_uri,
        reason="repeat add smoke first_full",
        instruction="Index this tiny wiki fixture for incremental add_resource testing.",
        wait=True,
        timeout=240,
        strict=False,
        preserve_structure=True,
    )

    # 2. Repeat import with no file changes
    await client.add_resource(
        path="/tmp/.../knowledge-base",
        to=to_uri,
        reason="repeat add smoke second_no_change",
        instruction="Index this tiny wiki fixture for incremental add_resource testing.",
        wait=True,
        timeout=240,
        strict=False,
        preserve_structure=True,
    )

    # 3. Modify only alpha.md, then import again
    # alpha.md content changed, beta.md unchanged
    await client.add_resource(
        path="/tmp/.../knowledge-base",
        to=to_uri,
        reason="repeat add smoke third_one_file_changed",
        instruction="Index this tiny wiki fixture for incremental add_resource testing.",
        wait=True,
        timeout=240,
        strict=False,
        preserve_structure=True,
    )

## Actual Result
All three calls succeeded with no queue errors, but the returned `queue_status` reported the same `Embedding.processed` count each time:

    first_full:
      Semantic.processed = 1
      Embedding.processed = 8
      error_count = 0
      elapsed = 8.67s

    second_no_change:
      Semantic.processed = 1
      Embedding.processed = 8
      error_count = 0
      elapsed = 3.60s

    third_one_file_changed:
      Semantic.processed = 1
      Embedding.processed = 8
      error_count = 0
      elapsed = 3.56s

Full returned queue summaries:

    {
      "first": {
        "Semantic": {
          "processed": 1,
          "requeue_count": 0,
          "error_count": 0
        },
        "Embedding": {
          "processed": 8,
          "requeue_count": 0,
          "error_count": 0
        }
      },
      "second_no_change": {
        "Semantic": {
          "processed": 1,
          "requeue_count": 0,
          "error_count": 0
        },
        "Embedding": {
          "processed": 8,
          "requeue_count": 0,
          "error_count": 0
        }
      },
      "third_one_file_changed": {
        "Semantic": {
          "processed": 1,
          "requeue_count": 0,
          "error_count": 0
        },
        "Embedding": {
          "processed": 8,
          "requeue_count": 0,
          "error_count": 0
        }
      }
    }

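
The comparison I am making across the three runs can be written down as a small check. This is only a sketch for reading the summaries in this report: the helper name `runs_without_embedding_savings` is my own and not part of OpenViking, and it assumes nothing about what `Embedding.processed` actually counts internally.

```python
from typing import Dict, List

# Embedding and Semantic counts copied from the queue summaries in this report.
summaries: Dict[str, Dict[str, Dict[str, int]]] = {
    "first": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
    "second_no_change": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
    "third_one_file_changed": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
}


def runs_without_embedding_savings(
    runs: Dict[str, Dict[str, Dict[str, int]]], baseline: str = "first"
) -> List[str]:
    """Return later runs whose Embedding.processed is not below the baseline run."""
    base = runs[baseline]["Embedding"]["processed"]
    return [
        name
        for name, queues in runs.items()
        if name != baseline and queues["Embedding"]["processed"] >= base
    ]


flagged = runs_without_embedding_savings(summaries)
print(flagged)
```

Both repeat runs are flagged, which is the behavior this issue is asking about.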
## Expected Result
For a repeated `add_resource` with the same directory and the same `to` URI:

1. If no file content changed:
   - Ideally, no embedding provider calls should be made.
   - `Embedding.processed` should be 0, or there should be a separate skipped/no-op metric.
   - The task should still complete successfully.
2. If only one file changed:
   - Only the changed file and any necessarily affected directory metadata should be re-vectorized.
   - Unchanged sibling files should not be re-embedded.
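
To make the expectation concrete, here is a minimal sketch of the kind of per-file content-hash comparison I have in mind. It is not OpenViking's implementation; `hash_files` and `diff_manifests` are hypothetical names, and the choice of SHA-256 over raw file bytes is my assumption.

```python
import hashlib
from pathlib import Path
from typing import Dict, Set, Tuple


def hash_files(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (changed_or_added, removed, unchanged) relative paths."""
    # A file needs re-embedding if it is new or its digest differs.
    changed = {path for path, digest in new.items() if old.get(path) != digest}
    removed = set(old) - set(new)
    unchanged = set(new) - changed
    return changed, removed, unchanged
```

With the fixture from this report, editing only `alpha.md` between two `hash_files` calls would put `alpha.md` in the first set and `beta.md` in the third, so only one file would need new vectors.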
## Why This Matters
We are integrating OpenViking as a wiki knowledge base indexer. A normal integration pattern is:
- use one stable `to` URI per feature directory;
- periodically refresh the same directory;
- optionally use `watch_interval` or an application-side scheduled sweep.

If a no-op refresh still performs full Embedding queue work, then scheduled refresh/watch becomes expensive for large wiki/code directories, especially with local embedding models.

This also makes it hard for downstream systems to estimate indexing cost and progress, because `Embedding.processed` does not distinguish actual embedding calls from skipped/no-op work.
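
Until this is clarified, an application-side sweep could skip the call entirely when nothing on disk changed. The following is a rough sketch of that workaround under my own assumptions: `directory_fingerprint` and `refresh_if_changed` are hypothetical helpers, and `refresh` stands for any async callable supplied by the application.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import Awaitable, Callable, Optional


def directory_fingerprint(root: Path) -> str:
    """Hash every file's relative path and bytes into one digest for the directory."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


async def refresh_if_changed(
    root: Path,
    previous: Optional[str],
    refresh: Callable[[], Awaitable[object]],
) -> str:
    """Await refresh() only when the directory fingerprint differs from the previous one."""
    current = directory_fingerprint(root)
    if current != previous:
        # If refresh() raises, the caller never receives the new fingerprint,
        # so the next sweep will try again.
        await refresh()
    return current
```

In our integration, `refresh` would wrap the same `client.add_resource(...)` call used in the reproduction, and the returned fingerprint would be stored between sweeps. This only avoids no-op refreshes on the caller's side; it does not help when a single file changes in a large directory.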
## Questions
1. Is this behavior expected in v0.3.22?
2. Does `Embedding.processed` count actual embedding provider calls, or can it include skipped/no-op queue messages?
3. Is there a supported way to tell whether a repeated `add_resource` actually reused existing vectors?
4. Is `watch_interval` expected to avoid re-embedding unchanged files, or does it simply re-run `add_resource` on schedule?
5. Is there a recommended API for efficient no-op refresh of a stable local directory?
## Related Work
- #659 introduced `add_resource` incremental update behavior.
- #709 added resource watch scheduling/status tracking.
- #890 looks related to skipping re-embedding when the content hash is unchanged.
- #1800 looks related to skipping unchanged sibling subtrees during incremental update.

The current behavior suggests that repeated same-directory imports still perform full Embedding queue work, or at least report full Embedding queue work, even when content is unchanged.
### Your Question

## Summary

When calling `add_resource` repeatedly with the same local directory and the same stable `to` URI, OpenViking v0.3.22 still reports the same `Embedding.processed` count even when no files changed.

This makes `watch_interval` / scheduled refresh expensive for large directories, because a no-op refresh appears to enqueue and process the same amount of embedding work as the initial import.

I expected the incremental behavior from #659 to skip unchanged files/directories, or at least expose whether `Embedding.processed` represents actual embedding provider calls vs. no-op/skipped queue messages.

## Environment

- OpenViking `0.3.22`
- `AsyncHTTPClient`
- `bge-m3:latest`
- `embedding.text_source = "content_only"`
- `embedding.max_concurrent = 1`
- `add_resource(..., wait=True, strict=False, preserve_structure=True, to=<same_uri>)`

## Reproduction

1. Create a clean OpenViking workspace and start `openviking-server`.
2. Create a local directory: