feat(retrieve): add a hybrid retrieval channel with PG full-text search, zhparser Chinese word segmentation, and RRF fusion #37
Open
LHkeeper666 wants to merge 5 commits into
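Introduces KeywordSearchChannel as a third retrieval channel. It uses PostgreSQL tsvector/tsquery for exact keyword matching, compensating for the weak recall of vector retrieval on exact queries such as proper nouns and model numbers. Adds HybridFusionPostProcessor, which fuses the vector and keyword results with RRF or a weighted sum and runs after deduplication and before Rerank.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>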
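Migrates the full-text search configuration from simple (per-character splitting) to zhparser (Chinese word segmentation) to improve recall precision for Chinese keyword search. The upgrade script now installs the zhparser extension and creates the text search configuration; the trigger and the backfill of existing rows switch to the new tokenizer as well.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>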
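plainto_tsquery joins the segmented tokens with &, so every token must match at once, which is too strict for short Chinese phrases. Switches to to_tsquery with the | operator for OR semantics, so a row is returned when any one token matches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>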
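1. The upgrade script adds ADD MAPPING FOR n,v,a,i,e,l WITH simple, fixing to_tsvector returning an empty result because the zhparser token types were unmapped.
2. Inserts a space at every boundary between Chinese text and Latin letters or digits, so mixed text is no longer treated as a single token that makes to_tsquery return an empty result.
3. Removes the startup diagnostics and the per-request verification query; the normal info-level statistics logging is kept.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>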
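The query-side changes described in the commits above (spacing out mixed-script text, then joining tokens with | for OR semantics) can be sketched roughly as follows. This is a minimal illustration, not the code in KeywordSearchChannel: the class and method names are hypothetical, tokens are assumed to be whitespace-delimited, and a run of Chinese text stays a single operand that the server-side zhparser configuration is assumed to segment further.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names and sanitizing rules are assumptions, not the PR's code.
class OrTsQuerySketch {

    // Insert a space wherever a CJK ideograph touches an ASCII letter or digit,
    // so mixed text such as "RTX4090显卡" is not handled as one token.
    static String splitMixedScript(String text) {
        return text
            .replaceAll("(?<=[\\u4e00-\\u9fff])(?=[A-Za-z0-9])", " ")
            .replaceAll("(?<=[A-Za-z0-9])(?=[\\u4e00-\\u9fff])", " ");
    }

    // Build a to_tsquery input string in which any single token may match.
    static String toOrQuery(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : splitMixedScript(text).trim().split("\\s+")) {
            // Drop characters that carry meaning in tsquery syntax.
            String token = raw.replaceAll("[&|!():*<>'\\\\]", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return String.join(" | ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(toOrQuery("RTX4090显卡"));
    }
}
```

The resulting string would be bound as a parameter to to_tsquery rather than concatenated into SQL. Blank input yields an empty string, which the caller should treat as "skip the keyword channel".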
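Overview
Adds two-stage hybrid retrieval: Stage 1 runs vector retrieval and keyword retrieval in parallel, and Stage 2 fuses and ranks the results in the post-processor chain.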
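For readers unfamiliar with Reciprocal Rank Fusion, the Stage 2 idea can be sketched as below. This is an illustrative sketch only and not the PR's RRFFusionStrategy: the class name is hypothetical, chunks are assumed to be identified by string IDs, and the smoothing constant k = 60 is a commonly used default that is assumed here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of RRF; the PR's actual strategy class may differ.
class RrfFusionSketch {

    // Assumed default for the RRF smoothing constant.
    static final int DEFAULT_K = 60;

    // Each input list is ordered best-first. A chunk's fused score is the sum,
    // over every list that contains it, of 1 / (k + rank), with rank starting at 1.
    static List<String> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken by ID so the order is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("A", "B", "C");
        List<String> keywordHits = List.of("B", "D");
        System.out.println(fuse(List.of(vectorHits, keywordHits), DEFAULT_K));
    }
}
```

In this example B comes first because it appears in both lists, followed by A, D, and C. RRF uses only ranks, so the raw scores of the two channels do not need to be normalized onto a common scale.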
File list
Added
channel/KeywordSearchChannel.java
postprocessor/HybridFusionPostProcessor.java
fusion/FusionStrategy.java
fusion/RRFFusionStrategy.java
fusion/WeightedSumFusionStrategy.java
resources/database/upgrade_v1.2_to_v1.3.sql
Modified
SearchChannelProperties.java
SystemSettingsVO.java
RAGSettingsController.java
SystemSettingsPage.tsx
settingsService.ts
schema_pg.sql
Configuration options
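Database migration
psql -U postgres -d ragent < resources/database/upgrade_v1.2_to_v1.3.sql
The migration script does the following:
Adds a tsv tsvector column and a GIN index to t_knowledge_vector.
Installs the zhparser extension and runs CREATE TEXT SEARCH CONFIGURATION.
Runs ADD MAPPING FOR n,v,a,i,e,l WITH simple: the token types that zhparser emits must be mapped to the simple dictionary, otherwise PG discards every token and to_tsvector returns an empty result.
Deployment note: the Docker image must have zhparser compiled in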
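The project currently starts pgvector/pgvector:pg16 with docker run. The keyword channel depends on the zhparser extension, which has to be compiled into the PG image and cannot be injected through a volume mount.
Option 1: build a custom image (recommended)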
Create Dockerfile.pg in the project root:
Build the custom image and start a container from it:
docker build -t pgvector-zhparser:pg16 -f Dockerfile.pg .
docker run -d \
  --name postgres \
  -e POSTGRES_DB=ragent \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  pgvector-zhparser:pg16
For non-Docker environments, follow the steps above to compile and install SCWS and zhparser on the host.
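Option 2: do not enable the keyword channel
If you would rather not add a compile step, note that KeywordSearchChannel is wired conditionally via @ConditionalOnProperty(name = "rag.vector.type", havingValue = "pg"). When the zhparser extension is missing, the channel is skipped automatically and existing vector retrieval is unaffected. The channel can also be turned off manually: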