[Question]: Repeated add_resource to the same directory/URI still reports full Embedding queue work on v0.3.22 #2383

@hzh000sunny

Your Question

Summary

When calling add_resource repeatedly with the same local directory and the same stable to URI, OpenViking v0.3.22 still reports the same Embedding.processed count even when no files changed.

This makes watch_interval / scheduled refresh expensive for large directories, because a no-op refresh appears to enqueue and process the same amount of embedding work as the initial import.

I expected the incremental behavior from #659 to skip unchanged files/directories, or at least expose whether Embedding.processed represents actual embedding provider calls vs. no-op/skipped queue messages.

Environment

  • OpenViking version: 0.3.22
  • Install source: PyPI package
  • Mode: HTTP server
  • Client: Python AsyncHTTPClient
  • Embedding provider: Ollama
  • Embedding model: bge-m3:latest
  • Config highlights:
    • embedding.text_source = "content_only"
    • embedding.max_concurrent = 1
    • local filesystem workspace
  • Resource type: local directory with two Markdown files
  • API call:
    • add_resource(..., wait=True, strict=False, preserve_structure=True, to=<same_uri>)

Reproduction

Create a clean OpenViking workspace and start openviking-server.

Create a local directory:

knowledge-base/
  alpha.md
  beta.md

alpha.md:

# Alpha

OpenViking repeat add alpha v1. Stable text marker ALPHA_REPEAT.

beta.md:

# Beta

OpenViking repeat add beta v1. Stable text marker BETA_REPEAT.
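
The fixture above can also be generated by a short script, which keeps the file contents identical across reproduction attempts. This is a minimal sketch: the `write_fixture` helper and the choice of a temporary root directory are illustrative assumptions, not part of OpenViking.

```python
import tempfile
from pathlib import Path

# File contents copied from the fixture described above.
ALPHA = "# Alpha\n\nOpenViking repeat add alpha v1. Stable text marker ALPHA_REPEAT.\n"
BETA = "# Beta\n\nOpenViking repeat add beta v1. Stable text marker BETA_REPEAT.\n"


def write_fixture(root: Path) -> Path:
    """Create knowledge-base/ with alpha.md and beta.md under root (hypothetical helper)."""
    kb = root / "knowledge-base"
    kb.mkdir(parents=True, exist_ok=True)
    (kb / "alpha.md").write_text(ALPHA, encoding="utf-8")
    (kb / "beta.md").write_text(BETA, encoding="utf-8")
    return kb
```

Calling `write_fixture(Path(tempfile.mkdtemp()))` returns the directory to pass as `path` to add_resource.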

Then run the same import three times:

from openviking import AsyncHTTPClient

client = AsyncHTTPClient(
    url="http://127.0.0.1:<port>",
    account="codeask",
    user="codeask",
    agent_id="codeask",
    timeout=300,
)
await client.initialize()

to_uri = "viking://resources/codeask/wiki/repeat-add-incremental-smoke"

# 1. Initial import
await client.add_resource(
    path="/tmp/.../knowledge-base",
    to=to_uri,
    reason="repeat add smoke first_full",
    instruction="Index this tiny wiki fixture for incremental add_resource testing.",
    wait=True,
    timeout=240,
    strict=False,
    preserve_structure=True,
)

# 2. Repeat import with no file changes
await client.add_resource(
    path="/tmp/.../knowledge-base",
    to=to_uri,
    reason="repeat add smoke second_no_change",
    instruction="Index this tiny wiki fixture for incremental add_resource testing.",
    wait=True,
    timeout=240,
    strict=False,
    preserve_structure=True,
)

# 3. Modify only alpha.md, then import again
# alpha.md content changed, beta.md unchanged
await client.add_resource(
    path="/tmp/.../knowledge-base",
    to=to_uri,
    reason="repeat add smoke third_one_file_changed",
    instruction="Index this tiny wiki fixture for incremental add_resource testing.",
    wait=True,
    timeout=240,
    strict=False,
    preserve_structure=True,
)
                                                                                                                                                                                                                                                             
## Actual Result

All three calls succeeded with no queue errors, but the returned queue_status showed the same Embedding work each time:

first_full:
  Semantic.processed = 1
  Embedding.processed = 8
  error_count = 0
  elapsed = 8.67s

second_no_change:
  Semantic.processed = 1
  Embedding.processed = 8
  error_count = 0
  elapsed = 3.60s

third_one_file_changed:
  Semantic.processed = 1
  Embedding.processed = 8
  error_count = 0
  elapsed = 3.56s

Full returned queue summaries:
{
  "first": {
    "Semantic": {
      "processed": 1,
      "requeue_count": 0,
      "error_count": 0
    },
    "Embedding": {
      "processed": 8,
      "requeue_count": 0,
      "error_count": 0
    }
  },
  "second_no_change": {
    "Semantic": {
      "processed": 1,
      "requeue_count": 0,
      "error_count": 0
    },
    "Embedding": {
      "processed": 8,
      "requeue_count": 0,
      "error_count": 0
    }
  },
  "third_one_file_changed": {
    "Semantic": {
      "processed": 1,
      "requeue_count": 0,
      "error_count": 0
    },
    "Embedding": {
      "processed": 8,
      "requeue_count": 0,
      "error_count": 0
    }
  }
}
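
To make the comparison explicit, the summaries above can be reduced to one number per run. This is a minimal sketch that assumes the summaries keep the shape shown above; `embedding_processed` and `runs_without_reduction` are hypothetical helper names, not OpenViking APIs.

```python
from typing import Any, Mapping

# Counts copied from the returned queue summaries above.
SUMMARIES = {
    "first": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
    "second_no_change": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
    "third_one_file_changed": {
        "Semantic": {"processed": 1, "requeue_count": 0, "error_count": 0},
        "Embedding": {"processed": 8, "requeue_count": 0, "error_count": 0},
    },
}


def embedding_processed(summary: Mapping[str, Any]) -> int:
    """Read Embedding.processed from one run summary, defaulting to 0 if absent."""
    return int(summary.get("Embedding", {}).get("processed", 0))


def runs_without_reduction(
    summaries: Mapping[str, Mapping[str, Any]], baseline: str
) -> list[str]:
    """Return later runs whose Embedding.processed is not below the baseline run's."""
    base = embedding_processed(summaries[baseline])
    return [
        label
        for label, summary in summaries.items()
        if label != baseline and embedding_processed(summary) >= base
    ]


print(runs_without_reduction(SUMMARIES, "first"))
# prints ['second_no_change', 'third_one_file_changed']
```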

## Expected Result

For a repeated add_resource with the same directory and same to URI:

1. If no file content changed:
    - Ideally no embedding provider calls should be made.
    - Embedding.processed should be 0, or there should be a separate skipped/noop metric.
    - The task should still complete successfully.

2. If only one file changed:
    - Only the changed file and necessary affected directory metadata should be re-vectorized.
    - Unchanged sibling files should not be re-embedded.
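
The expectation above can be illustrated with a content-hash manifest. This sketches the general technique under stated assumptions and is not a description of how OpenViking implements incremental updates; `hash_tree` and `changed_files` are hypothetical names.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return paths that are new, or whose hash differs from the previous manifest."""
    return sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
```

With the two-file fixture, an unchanged directory yields an empty list, and editing only alpha.md yields `["alpha.md"]`. Deleted files are not reported by this sketch and would need a separate check.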

## Why This Matters

We are integrating OpenViking as a wiki knowledge base indexer. A normal integration pattern is:

- use one stable to URI per feature directory;
- periodically refresh the same directory;
- optionally use watch_interval or an application-side scheduled sweep.

If a no-op refresh still performs full Embedding queue work, then scheduled refresh/watch becomes expensive for large wiki/code directories, especially with local embedding models.

This also makes it hard for downstream systems to estimate indexing cost and progress, because Embedding.processed does not distinguish actual embedding calls from skipped/no-op work.
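
One application-side mitigation for the no-change case is to skip the call entirely when the directory has not changed. This is a minimal sketch of that idea, not a supported OpenViking feature: `directory_fingerprint`, `refresh_if_changed`, and the `import_dir` callable are hypothetical, and `import_dir` is assumed to wrap the application's own add_resource call.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import Awaitable, Callable, Optional


def directory_fingerprint(root: Path) -> str:
    """Fold relative paths and file contents into a single digest for the tree."""
    digest = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            digest.update(p.relative_to(root).as_posix().encode("utf-8"))
            digest.update(b"\0")  # separator between path and content hash
            digest.update(hashlib.sha256(p.read_bytes()).digest())
    return digest.hexdigest()


async def refresh_if_changed(
    root: Path,
    last_fingerprint: Optional[str],
    import_dir: Callable[[Path], Awaitable[object]],
) -> tuple[str, bool]:
    """Call import_dir only when the tree changed; return (fingerprint, imported)."""
    current = directory_fingerprint(root)
    if current == last_fingerprint:
        return current, False  # nothing changed, skip the import
    await import_dir(root)
    return current, True
```

The caller stores the returned fingerprint between sweeps. This only avoids the fully unchanged case; a single edited file still triggers a full import on the application side.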

## Questions

1. Is this behavior expected in v0.3.22?
2. Does Embedding.processed count actual embedding provider calls, or can it include skipped/no-op queue messages?
3. Is there a supported way to tell whether repeated add_resource actually reused existing vectors?
4. Is watch_interval expected to avoid re-embedding unchanged files, or does it simply re-run add_resource on schedule?
5. Is there a recommended API for efficient no-op refresh of a stable local directory?

## Related Work

- #659 introduced add_resource incremental update behavior.
- #709 added resource watch scheduling/status tracking.
- #890 looks related to skipping re-embedding when content hash is unchanged.
- #1800 looks related to skipping unchanged sibling subtrees during incremental update.

The current behavior seems to indicate that repeated same-directory imports still perform full Embedding queue work, or at least report full Embedding queue work, even when content is unchanged.


