Storage Backends and Vector Database Configuration
Relevant source files
The main source files referenced in this chapter:
- api/.env.example
- api/app.py
- api/app_factory.py
- api/configs/feature/__init__.py
- api/configs/middleware/__init__.py
- api/configs/observability/__init__.py
- api/configs/observability/otel/otel_config.py
- api/configs/packaging/__init__.py
- api/controllers/console/datasets/datasets.py
- api/core/plugin/backwards_invocation/model.py
- api/core/rag/datasource/keyword/jieba/jieba.py
- api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py
- api/core/rag/datasource/vdb/vector_factory.py
- api/core/rag/datasource/vdb/vector_type.py
- api/core/rag/retrieval/router/multi_dataset_function_call_router.py
- api/core/rag/retrieval/router/multi_dataset_react_route.py
- api/core/rag/splitter/fixed_text_splitter.py
- api/core/rag/splitter/text_splitter.py
- api/extensions/ext_compress.py
- api/extensions/ext_otel.py
- api/extensions/ext_storage.py
- api/extensions/otel/instrumentation.py
- api/extensions/storage/storage_type.py
- api/factories/variable_factory.py
- api/providers/vdb/vdb-couchbase/src/dify_vdb_couchbase/couchbase_vector.py
- api/providers/vdb/vdb-elasticsearch/src/dify_vdb_elasticsearch/elasticsearch_vector.py
- api/providers/vdb/vdb-huawei-cloud/src/dify_vdb_huawei_cloud/huawei_cloud_vector.py
- api/providers/vdb/vdb-lindorm/src/dify_vdb_lindorm/lindorm_vector.py
- api/providers/vdb/vdb-milvus/src/dify_vdb_milvus/milvus_vector.py
- api/providers/vdb/vdb-opensearch/src/dify_vdb_opensearch/opensearch_vector.py
- api/providers/vdb/vdb-oracle/src/dify_vdb_oracle/oraclevector.py
- api/providers/vdb/vdb-pgvector/src/dify_vdb_pgvector/pgvector.py
- api/providers/vdb/vdb-relyt/src/dify_vdb_relyt/relyt_vector.py
- api/providers/vdb/vdb-tablestore/src/dify_vdb_tablestore/tablestore_vector.py
- api/providers/vdb/vdb-tidb-vector/src/dify_vdb_tidb_vector/tidb_vector.py
- api/providers/vdb/vdb-upstash/src/dify_vdb_upstash/upstash_vector.py
- api/providers/vdb/vdb-vastbase/src/dify_vdb_vastbase/vastbase_vector.py
- api/pyproject.toml
- api/tests/unit_tests/configs/test_dify_config.py
- api/tests/unit_tests/core/rag/splitter/__init__.py
- api/tests/unit_tests/core/rag/splitter/test_text_splitter.py
- api/tests/unit_tests/core/workflow/graph_engine/test_table_runner.py
- api/uv.lock
- docker/.env.example
- docker/README.md
- docker/docker-compose-template.yaml
- docker/docker-compose.middleware.yaml
- docker/docker-compose.yaml
- docker/envs/core-services/shared.env.example
- docker/envs/infrastructure/nginx.env.example
- docker/envs/security.env.example
- docker/nginx/conf.d/default.conf.template
- web/app/components/app/configuration/config-var/index.tsx
- web/app/components/app/configuration/config-var/var-item.tsx
- web/app/components/workflow/nodes/_base/components/variable/__tests__/output-var-list.spec.tsx
- web/app/components/workflow/nodes/_base/components/variable/output-var-list.tsx
- web/app/components/workflow/nodes/_base/components/variable/var-list.tsx
- web/app/components/workflow/nodes/_base/hooks/use-output-var-list.ts
- web/app/components/workflow/nodes/loop/components/loop-variables/item.tsx
- web/app/components/workflow/nodes/start/components/var-item.tsx
- web/app/components/workflow/nodes/start/components/var-list.tsx
- web/app/components/workflow/nodes/variable-assigner/components/var-group-item.tsx
- web/app/components/workflow/nodes/variable-assigner/components/var-list/index.tsx
- web/app/components/workflow/panel/chat-variable-panel/components/variable-modal.tsx
- web/app/components/workflow/panel/chat-variable-panel/type.ts
- web/app/components/workflow/panel/env-panel/variable-modal.tsx
- web/package.json
- web/utils/var.ts
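Purpose and Scope
This document describes Dify's storage backend configuration (for file storage) and vector database configuration (for knowledge-base embeddings). It covers the system architecture, the supported backends (more than 23 vector databases and more than 12 storage providers), the configuration methods, and the factory pattern used to initialize these systems at runtime.
Dify uses a pluggable architecture for both file storage and vector search, so developers can switch providers by changing environment variables docker/.env.example:151-207.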
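Storage Backend Architecture
Overview
Dify uses a pluggable storage backend system to store user-uploaded files, documents, and generated assets. The system supports multiple cloud providers as well as local storage behind a unified interface, with Apache OpenDAL as the primary abstraction layer api/.env.example:111-115.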
Storage Backend Selection Flow
Sources: api/extensions/ext_storage.py:22-86, api/extensions/storage/storage_type.py:4-19, api/configs/middleware/__init__.py:70-77
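Storage Configuration and Types
The storage system uses Pydantic-based configuration models (under api/configs/middleware/storage/) to validate and parse environment variables. The Storage class in api/extensions/ext_storage.py is the entry point and instantiates the concrete provider through a factory pattern.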
| STORAGE_TYPE value | Provider | Implementation class | Config file |
|---|---|---|---|
| opendal | Apache OpenDAL | OpenDALStorage | opendal_storage_config.py |
| s3 | AWS S3 | AwsS3Storage | amazon_s3_storage_config.py |
| azure-blob | Azure | AzureBlobStorage | azure_blob_storage_config.py |
| aliyun-oss | Alibaba Cloud | AliyunOssStorage | aliyun_oss_storage_config.py |
| google-storage | Google Cloud | GoogleCloudStorage | google_cloud_storage_config.py |
| tencent-cos | Tencent Cloud | TencentCosStorage | tencent_cos_storage_config.py |
| huawei-obs | Huawei Cloud | HuaweiObsStorage | huawei_obs_storage_config.py |
| baidu-obs | Baidu Cloud | BaiduObsStorage | baidu_obs_storage_config.py |
| volcengine-tos | Volcengine | VolcengineTosStorage | volcengine_tos_storage_config.py |
| oci-storage | Oracle Cloud | OracleOCIStorage | oci_storage_config.py |
| supabase | Supabase | SupabaseStorage | supabase_storage_config.py |
| clickzetta-volume | ClickZetta | ClickZettaVolumeStorage | clickzetta_volume_storage_config.py |
| local | Local filesystem (deprecated) | OpenDALStorage(scheme='fs') | - |
Sources: api/extensions/ext_storage.py:22-86, api/extensions/storage/storage_type.py:4-19, api/configs/middleware/__init__.py:53-67
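The mapping in the table above can be illustrated with a small dispatch sketch. This is not the code in api/extensions/ext_storage.py: the enum covers only a subset of the values, the helper name describe_backend is hypothetical, and class names are returned as strings because the real provider classes are not importable here.

```python
from enum import Enum


class StorageType(str, Enum):
    """Subset of the STORAGE_TYPE values from the table (illustrative only)."""

    OPENDAL = "opendal"
    S3 = "s3"
    AZURE_BLOB = "azure-blob"
    LOCAL = "local"


# Class names are plain strings here; a real factory would return instances.
_BACKENDS = {
    StorageType.OPENDAL: "OpenDALStorage",
    StorageType.S3: "AwsS3Storage",
    StorageType.AZURE_BLOB: "AzureBlobStorage",
    # Deprecated alias: served by OpenDAL's filesystem scheme.
    StorageType.LOCAL: "OpenDALStorage(scheme='fs')",
}


def describe_backend(storage_type: str) -> str:
    """Return the backend name for a STORAGE_TYPE string (hypothetical helper)."""
    kind = StorageType(storage_type)  # raises ValueError for unknown values
    return _BACKENDS[kind]


if __name__ == "__main__":
    print(describe_backend("s3"))     # AwsS3Storage
    print(describe_backend("local"))  # OpenDALStorage(scheme='fs')
```

Validating the string through an enum first means an unsupported STORAGE_TYPE fails early with a clear error instead of silently falling through to a default backend.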
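OpenDAL Integration
OpenDAL provides a unified interface to more than 40 storage services. When STORAGE_TYPE=opendal, the scheme is selected with OPENDAL_SCHEME api/.env.example:111-115.
OpenDAL Initialization Pattern
The OpenDALStorage class initializes an opendal.Operator with a retry layer and dynamic keyword arguments extracted from environment variables.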
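```python
# api/extensions/storage/opendal_storage.py (abridged; the real constructor may do more)
import opendal
from extensions.storage.base_storage import BaseStorage

class OpenDALStorage(BaseStorage):
    def __init__(self, scheme: str, **kwargs):
        # Build the OpenDAL Operator for the selected scheme and wrap it
        # in a retry layer (opendal.layers.RetryLayer).
        retry_layer = opendal.layers.RetryLayer()
        self.op = opendal.Operator(scheme=scheme, **kwargs).layer(retry_layer)
```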
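The system parses environment variables that start with OPENDAL_<SCHEME>_, strips the prefix, and converts the remainder to lowercase key names that are passed to the OpenDAL operator.
Sources: api/pyproject.toml:191, api/configs/middleware/storage/opendal_storage_config.py, api/.env.example:113-115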
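The prefix-stripping rule described above can be sketched as follows. The function name opendal_kwargs_from_env and the sample variable values are assumptions made for illustration; the actual helper in the Dify codebase may differ in name and in edge-case handling.

```python
from typing import Dict, Mapping


def opendal_kwargs_from_env(scheme: str, environ: Mapping[str, str]) -> Dict[str, str]:
    """Collect OPENDAL_<SCHEME>_* variables as lowercase keyword arguments.

    Hypothetical helper: it only illustrates the naming convention.
    """
    prefix = f"OPENDAL_{scheme.upper()}_"
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


if __name__ == "__main__":
    # Illustrative values, not taken from a real deployment.
    env = {
        "STORAGE_TYPE": "opendal",
        "OPENDAL_SCHEME": "fs",
        "OPENDAL_FS_ROOT": "storage",
        "OPENDAL_S3_BUCKET": "my-bucket",
    }
    print(opendal_kwargs_from_env("fs", env))  # {'root': 'storage'}
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the sketch easy to test; in practice os.environ can be passed directly.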
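Vector Database Architecture
Overview
Dify supports more than 23 vector database implementations. Each implementation is registered as an entry point and instantiated through VectorFactory api/core/rag/datasource/vdb/vector_factory.py.
Vector Database Initialization Flow
Sources: api/core/rag/datasource/vdb/vector_type.py:4-37, api/configs/middleware/__init__.py:86-101, api/pyproject.toml:203-241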
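As a rough illustration of the register-then-instantiate idea, the sketch below uses an in-process registry as a stand-in for packaging entry points. Every name in it (register, create_vector_store, BaseVectorStore, the Fake* classes) is hypothetical and does not reflect the actual VectorFactory API.

```python
from typing import Callable, Dict


class BaseVectorStore:
    """Minimal stand-in for a vector store implementation."""

    def __init__(self, collection_name: str) -> None:
        self.collection_name = collection_name


# Maps a vector store name to the class that implements it.
_REGISTRY: Dict[str, type] = {}


def register(vector_type: str) -> Callable[[type], type]:
    """Class decorator that records an implementation under a name."""

    def decorator(cls: type) -> type:
        _REGISTRY[vector_type] = cls
        return cls

    return decorator


@register("weaviate")
class FakeWeaviateStore(BaseVectorStore):
    pass


@register("milvus")
class FakeMilvusStore(BaseVectorStore):
    pass


def create_vector_store(vector_store: str, collection_name: str) -> BaseVectorStore:
    """Look up the registered class and instantiate it."""
    try:
        cls = _REGISTRY[vector_store]
    except KeyError:
        raise ValueError(f"Vector store {vector_store!r} is not supported") from None
    return cls(collection_name)


if __name__ == "__main__":
    store = create_vector_store("milvus", "demo_collection")
    print(type(store).__name__)  # FakeMilvusStore
```

With real entry points the registry would be populated from installed package metadata instead of decorators, so a provider package can be added without touching the factory code.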
Supported Implementations
Dify manages vector databases through a workspace-based plugin architecture: each provider is a separate package under providers/vdb/* api/pyproject.toml:56-58.
| Database | Package | Config class |
|---|---|---|
| Weaviate | dify-vdb-weaviate | WeaviateConfig |
| Milvus | dify-vdb-milvus | MilvusConfig |
| PGVector | dify-vdb-pgvector | PGVectorConfig |
| Qdrant | dify-vdb-qdrant | QdrantConfig |
| Elasticsearch | dify-vdb-elasticsearch | ElasticsearchConfig |
| TiDB Vector | dify-vdb-tidb-vector | TiDBVectorConfig |
| OceanBase | dify-vdb-oceanbase | OceanBaseVectorConfig |
| Chroma | dify-vdb-chroma | ChromaConfig |
| Oracle | dify-vdb-oracle | OracleConfig |
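Sources: api/pyproject.toml:62-91, api/configs/middleware/__init__.py:22-51
Configuration Patterns
The vector database is selected with the VECTOR_STORE environment variable api/.env.example:205. Each database has its own configuration block: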
- Weaviate: WEAVIATE_ENDPOINT, WEAVIATE_API_KEY api/.env.example:210-211.
- Milvus: MILVUS_URI, MILVUS_TOKEN docker/.env.example:186.
- PGVector: PGVECTOR_HOST, PGVECTOR_PORT docker/docker-compose.yaml:32.
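A minimal sketch of checking such a configuration before startup is shown below. The REQUIRED_VARS table is an assumption built only from the variables named in the list above (API keys and tokens are treated as optional, which may not match every deployment), and missing_vector_settings is a hypothetical helper rather than part of Dify.

```python
from typing import Dict, List, Mapping, Tuple

# Assumed minimum settings per VECTOR_STORE value (illustrative, not exhaustive).
REQUIRED_VARS: Dict[str, Tuple[str, ...]] = {
    "weaviate": ("WEAVIATE_ENDPOINT",),
    "milvus": ("MILVUS_URI",),
    "pgvector": ("PGVECTOR_HOST", "PGVECTOR_PORT"),
}


def missing_vector_settings(environ: Mapping[str, str]) -> List[str]:
    """Return required variables that are unset or empty for the chosen store."""
    store = environ.get("VECTOR_STORE", "")
    if store not in REQUIRED_VARS:
        raise ValueError(f"Unknown or unset VECTOR_STORE: {store!r}")
    return [name for name in REQUIRED_VARS[store] if not environ.get(name)]


if __name__ == "__main__":
    env = {"VECTOR_STORE": "pgvector", "PGVECTOR_HOST": "db"}
    print(missing_vector_settings(env))  # ['PGVECTOR_PORT']
```

In the real application this kind of validation is handled by the Pydantic configuration classes; the sketch only shows the relationship between VECTOR_STORE and the provider-specific variables.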
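Data Flow: From Document to Vector Store
The following diagram bridges high-level document ingestion and the concrete code entities responsible for persistence.
Sources: api/core/rag/datasource/vdb/vector_factory.py, api/controllers/console/datasets/datasets.py:25, api/models/dataset.py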
Configuration Summary
| Middleware | Config class | Default port | Key environment variables |
|---|---|---|---|
| PostgreSQL | DatabaseConfig | 5432 | DB_HOST, DB_USERNAME, DB_PASSWORD, DB_DATABASE |
| Redis | RedisConfig | 6379 | REDIS_HOST, REDIS_PORT, REDIS_PASSWORD |
| S3 | S3StorageConfig | 443 | S3_ENDPOINT, S3_BUCKET_NAME, S3_ACCESS_KEY |
| Milvus | MilvusConfig | 19530 | MILVUS_URI, MILVUS_TOKEN |
| Weaviate | WeaviateConfig | 8080 | WEAVIATE_ENDPOINT, WEAVIATE_API_KEY |
| Qdrant | QdrantConfig | 6333 | QDRANT_URL, QDRANT_API_KEY |
Sources: api/configs/middleware/__init__.py:123-153, api/configs/middleware/cache/redis_config.py, api/.env.example:46-101, api/.env.example:117-124, api/.env.example:209-211