7.1 · Document Ingestion Pipeline

Source references

backend/open_webui/models/memories.py
backend/open_webui/retrieval/loaders/datalab_marker.py
backend/open_webui/retrieval/loaders/main.py
backend/open_webui/retrieval/loaders/mineru.py
backend/open_webui/retrieval/loaders/paddleocr_vl.py
backend/open_webui/retrieval/utils.py
backend/open_webui/routers/files.py
backend/open_webui/routers/knowledge.py
backend/open_webui/routers/memories.py
backend/open_webui/routers/retrieval.py

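Module tags

Retrieval, recall, and knowledge systems
Tools, memory, and model invocation
APIs and service contracts
UI and interaction
System architecture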

Original DeepWiki page: https://deepwiki.com/open-webui/open-webui/7.1-document-ingestion-pipeline

Translation time: 2026-06-09T16:09:14.606Z

Translation model: deepseek-chat

Source character count: 12464

Project: Open WebUI (open-webui)

---

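Document Ingestion Pipeline

Purpose and Scope

The document ingestion pipeline is the entry point through which all document-based content enters Open WebUI's RAG (Retrieval-Augmented Generation) system. It handles file uploads from multiple sources, validates format and size, stores files through a configurable storage provider, and prepares them for content extraction and embedding.

For content extraction from ingested documents, see Content Extraction Engines. For splitting the extracted text, see Text Splitting and Chunking. For the complete chat-time retrieval flow, see Message Input System and File Upload and Processing.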

Supported File Formats

Open WebUI supports a wide range of document formats through several loader implementations:

Document Formats

PDF: .pdf, supported via PyPDFLoader, Tika, Docling, Marker, and MinerU. backend/open_webui/retrieval/loaders/main.py:9-18
Microsoft Office: .doc, .docx, .xls, .xlsx, .ppt, .pptx backend/open_webui/retrieval/loaders/main.py:9-18
OpenDocument: .odt, .ods, .odp backend/open_webui/retrieval/loaders/main.py:9-18
Text: .txt, .md, .rst, .xml, .html, .csv backend/open_webui/retrieval/loaders/main.py:9-18
Email: .msg, .eml (Outlook and other email message files) backend/open_webui/retrieval/loaders/main.py:9-18
E-books: .epub backend/open_webui/retrieval/loaders/main.py:9-18

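Source Code Formats

Programming languages are broadly supported through TextLoader. The system recognizes code files by matching against a predefined list of extensions, including .py, .js, .ts, .go, .java, .cpp, .rs, and others. backend/open_webui/retrieval/loaders/main.py:33-87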
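
The extension check described above can be sketched as follows. This is a minimal illustration, not the project's code: the set `KNOWN_SOURCE_EXTENSIONS` and the function `is_source_code_file` are hypothetical names, and the set contains only the extensions named in this section, whereas the list in loaders/main.py is longer.

```python
from pathlib import Path

# Hypothetical subset of source-code extensions (assumption: the real list
# in loaders/main.py contains many more entries and may be organized differently).
KNOWN_SOURCE_EXTENSIONS = {"py", "js", "ts", "go", "java", "cpp", "rs"}


def is_source_code_file(filename: str) -> bool:
    """Return True when the filename's extension is in the known set."""
    # Normalize: take the last suffix, lowercase it, and drop the leading dot.
    ext = Path(filename).suffix.lower().lstrip(".")
    return ext in KNOWN_SOURCE_EXTENSIONS
```

A file without an extension, such as `Makefile`, yields an empty suffix and is therefore not treated as source code in this sketch.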

Image Formats

.png, .jpeg, .jpg, .webp, .gif, .tiff backend/open_webui/retrieval/loaders/datalab_marker.py:57-63
Images can be processed by OCR engines such as Marker, MinerU, or Azure Document Intelligence. backend/open_webui/retrieval/loaders/main.py:239-260

Web Content

URLs are loaded directly through a web loader. backend/open_webui/retrieval/utils.py:76-81
YouTube video transcripts are fetched through YoutubeLoader. backend/open_webui/retrieval/utils.py:69-74

File Upload Sources

Title: Document ingestion sources and validation

graph TB
    subgraph "Upload sources"
        DirectUpload["Direct file upload<br/>/api/files"]
        WebURL["Web URL ingestion<br/>get_content_from_url"]
        YouTubeURL["YouTube URL<br/>YoutubeLoader"]
        GoogleDrive["Google Drive integration<br/>ENABLE_GOOGLE_DRIVE_INTEGRATION"]
        OneDrive["OneDrive integration<br/>ENABLE_ONEDRIVE_INTEGRATION"]
    end

    subgraph "Validation and storage"
        Validator["File validator<br/>- Size check (FILE_MAX_SIZE)<br/>- Count check (FILE_MAX_COUNT)<br/>- Extension check (ALLOWED_FILE_EXTENSIONS)"]
        StorageProvider["Storage provider<br/>open_webui.storage.provider.Storage"]
    end

    subgraph "File registry"
        FilesModel["Files database table<br/>open_webui.models.files.Files"]
        FileMetadata["File metadata<br/>- id<br/>- filename<br/>- user_id<br/>- hash<br/>- meta"]
    end

    DirectUpload --> Validator
    WebURL --> Validator
    YouTubeURL --> Validator
    GoogleDrive --> Validator
    OneDrive --> Validator

    Validator --> StorageProvider
    StorageProvider --> FilesModel
    FilesModel --> FileMetadata

    FileMetadata --> ContentExtraction["Content extraction<br/>process_file"]

Sources: backend/open_webui/routers/files.py:177-187, backend/open_webui/retrieval/utils.py:176-180, backend/open_webui/storage/provider.py, backend/open_webui/models/files.py:32-38

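Web URL Processing

When a user supplies a URL, the system routes the request through dedicated logic:

YouTube: is_youtube_url detects the link, and YoutubeLoader extracts the transcript. backend/open_webui/retrieval/utils.py:65-76
Standard web pages: get_content_from_url validates the URL (blocking private IPs) and fetches the content. backend/open_webui/retrieval/utils.py:176-185
Binary content: if the URL points to a document (for example, a PDF), _extract_text_from_binary_response downloads it to a temporary file and processes it through the standard Loader pipeline. backend/open_webui/retrieval/utils.py:127-161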
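
The first two routing branches above can be sketched roughly as below. All names here (`YOUTUBE_HOSTS`, `looks_like_youtube_url`, `is_private_literal_ip`, `route_url`) are hypothetical and are not the functions in retrieval/utils.py. The sketch also assumes a simplified private-address check that only inspects literal IPs in the URL; a production validator would presumably also resolve hostnames before deciding. The binary-content branch is not covered.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption: a small illustrative host list, not the project's detection rule.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def looks_like_youtube_url(url: str) -> bool:
    """Return True when the URL's host is a known YouTube host."""
    host = (urlparse(url).hostname or "").lower()
    return host in YOUTUBE_HOSTS


def is_private_literal_ip(url: str) -> bool:
    """Return True when the host is a literal private, loopback, or link-local IP."""
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address; this sketch does not resolve DNS names.
        return False
    return ip.is_private or ip.is_loopback or ip.is_link_local


def route_url(url: str) -> str:
    """Classify a URL as 'youtube' or 'web', rejecting unsafe targets."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme in {url!r}")
    if is_private_literal_ip(url):
        raise ValueError(f"refusing to fetch private address {url!r}")
    return "youtube" if looks_like_youtube_url(url) else "web"
```

Rejecting private addresses before any fetch is the usual guard against server-side request forgery, which is presumably why the pipeline validates URLs before loading them.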

Cloud Drive Integration

Google Drive: enabled via ENABLE_GOOGLE_DRIVE_INTEGRATION. src/lib/apis/retrieval/index.ts:55
OneDrive: enabled via ENABLE_ONEDRIVE_INTEGRATION. src/lib/apis/retrieval/index.ts:56

File Upload Workflow

Title: Document ingestion sequence

sequenceDiagram
    participant User
    participant Frontend as files/index.ts
    participant API as routers/files.py
    participant Storage as StorageProvider
    participant DB as models/files.py
    participant Retrieval as routers/retrieval.py

    User->>Frontend: Upload file
    Frontend->>API: POST /api/v1/files/ (upload_file)

    API->>Storage: Storage.save_file(file_path, file)
    Storage-->>API: file_path

    API->>DB: Files.insert_new_file(user.id, file_form)
    DB-->>API: file_item

    Note over API, Retrieval: If process=True
    API->>Retrieval: process_file(file_id)

    Retrieval->>Retrieval: extract_content()
    Retrieval->>Retrieval: store_vector_embeddings()

    API-->>Frontend: FileModelResponse

Sources: backend/open_webui/routers/files.py:177-209, backend/open_webui/routers/files.py:105-155, src/lib/apis/files/index.ts:4-39
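
The sequence above (save to storage, register in the database, optionally process) can be approximated with in-memory stand-ins. This is a sketch under stated assumptions: `FileRecord`, `InMemoryStorage`, `InMemoryFiles`, and this `upload_file` signature are hypothetical, and computing the hash from the raw uploaded bytes is an assumption made for illustration only.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FileRecord:
    """Illustrative file registry row (fields mirror the metadata in the diagram)."""
    id: str
    user_id: str
    filename: str
    path: str
    hash: str
    meta: dict = field(default_factory=dict)


class InMemoryStorage:
    """Stand-in for the storage provider: keeps blobs in a dict."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def save_file(self, path: str, data: bytes) -> str:
        self.blobs[path] = data
        return path


class InMemoryFiles:
    """Stand-in for the files table."""

    def __init__(self) -> None:
        self.rows: Dict[str, FileRecord] = {}

    def insert_new_file(self, record: FileRecord) -> FileRecord:
        self.rows[record.id] = record
        return record


def upload_file(
    user_id: str,
    filename: str,
    data: bytes,
    storage: InMemoryStorage,
    files: InMemoryFiles,
    process: bool = True,
    process_file: Optional[Callable[[str], None]] = None,
) -> FileRecord:
    """Store the bytes, register the file, then optionally trigger processing."""
    file_id = str(uuid.uuid4())
    path = storage.save_file(f"{file_id}_{filename}", data)
    record = files.insert_new_file(
        FileRecord(
            id=file_id,
            user_id=user_id,
            filename=filename,
            path=path,
            hash=hashlib.sha256(data).hexdigest(),  # assumption: hash of raw bytes
            meta={"size": len(data)},
        )
    )
    if process and process_file is not None:
        process_file(record.id)  # corresponds to the process=True branch
    return record
```

Passing the processing step in as a callable keeps the upload path testable without an extraction engine or vector database.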

Validation Rules

File uploads are subject to limits defined in RAGConfig, which administrators can update. src/lib/components/admin/Settings/Documents.svelte:229-239

Check	Setting	Description
Maximum file size	`FILE_MAX_SIZE`	Maximum size of each file. `src/lib/components/admin/Settings/Documents.svelte:233`
Maximum file count	`FILE_MAX_COUNT`	Maximum number of files per upload. `src/lib/components/admin/Settings/Documents.svelte:234`
Allowed extensions	`ALLOWED_FILE_EXTENSIONS`	List of permitted extensions. `src/lib/components/admin/Settings/Documents.svelte:237`
Image compression	`FILE_IMAGE_COMPRESSION_WIDTH`	Width to which large images are automatically scaled. `src/lib/components/admin/Settings/Documents.svelte:235`
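
The three upload checks in the table could be combined along the following lines. This is an illustrative sketch only: `validate_upload` and `UploadValidationError` are hypothetical names, the parameters merely mirror the settings above, the size unit is whatever the caller passes consistently, and treating an empty extension list as "allow everything" is an assumption.

```python
from pathlib import Path
from typing import Optional, Sequence, Tuple


class UploadValidationError(ValueError):
    """Raised when an upload batch violates a configured limit."""


def validate_upload(
    files: Sequence[Tuple[str, int]],
    max_size: Optional[int] = None,
    max_count: Optional[int] = None,
    allowed_extensions: Sequence[str] = (),
) -> None:
    """Validate (filename, size) pairs; raise UploadValidationError on failure."""
    # Count check: applies to the whole batch.
    if max_count is not None and len(files) > max_count:
        raise UploadValidationError(f"too many files: {len(files)} > {max_count}")

    # Normalize the allow-list so ".PDF" and "pdf" compare equal.
    allowed = {ext.lower().lstrip(".") for ext in allowed_extensions}

    for name, size in files:
        # Size check: applies per file; None means no limit.
        if max_size is not None and size > max_size:
            raise UploadValidationError(f"{name}: size {size} exceeds {max_size}")
        # Extension check: skipped when the allow-list is empty (assumption).
        ext = Path(name).suffix.lower().lstrip(".")
        if allowed and ext not in allowed:
            raise UploadValidationError(f"{name}: extension {ext!r} not allowed")
```

For example, `validate_upload([("a.pdf", 10)], max_size=100, max_count=2, allowed_extensions=["pdf"])` returns without raising, while the same call with `("a.exe", 10)` raises `UploadValidationError`.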

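Storage Provider Architecture

Files are stored through the Storage abstraction. The provider is selected by the STORAGE_PROVIDER setting. backend/open_webui/routers/files.py:51

Local storage: files are stored in UPLOAD_DIR. backend/open_webui/config.py:110
Cloud providers: Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage are supported through the Storage class. backend/open_webui/storage/provider.py
Cache management: if STORAGE_LOCAL_CACHE is disabled, _cleanup_local_cache deletes the temporary local copy once the file has been uploaded to cloud storage. backend/open_webui/routers/files.py:91-103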
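
A provider abstraction of this kind can be sketched as follows. Nothing here reproduces storage/provider.py: the class names, the `get_storage_provider` helper, the provider name strings, and the `cloud://` return value are all assumptions, and the "cloud" provider is a fake that writes to a dict so the example runs without credentials. It only illustrates how a disabled local cache leads to removal of the staged local copy after upload.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict


class StorageProvider(ABC):
    """Hypothetical minimal interface shared by all providers."""

    @abstractmethod
    def save_file(self, filename: str, data: bytes) -> str:
        """Persist the bytes and return a path or URI for later retrieval."""


class LocalStorageProvider(StorageProvider):
    def __init__(self, upload_dir: str) -> None:
        self.upload_dir = upload_dir

    def save_file(self, filename: str, data: bytes) -> str:
        os.makedirs(self.upload_dir, exist_ok=True)
        path = os.path.join(self.upload_dir, filename)
        with open(path, "wb") as f:
            f.write(data)
        return path


class FakeCloudStorageProvider(StorageProvider):
    """Stand-in for S3/GCS/Azure: stages locally, then 'uploads' to a dict."""

    def __init__(self, upload_dir: str, keep_local_cache: bool) -> None:
        self.local = LocalStorageProvider(upload_dir)
        self.keep_local_cache = keep_local_cache
        self.remote: Dict[str, bytes] = {}

    def save_file(self, filename: str, data: bytes) -> str:
        local_path = self.local.save_file(filename, data)
        self.remote[filename] = data  # simulated upload
        if not self.keep_local_cache:
            os.remove(local_path)  # mirrors cleanup of the temporary local copy
        return f"cloud://{filename}"


def get_storage_provider(
    name: str, upload_dir: str, keep_local_cache: bool = True
) -> StorageProvider:
    """Pick a provider from a setting value (names are assumptions)."""
    if name == "local":
        return LocalStorageProvider(upload_dir)
    if name in {"s3", "gcs", "azure"}:
        return FakeCloudStorageProvider(upload_dir, keep_local_cache)
    raise ValueError(f"unknown storage provider: {name!r}")
```

Callers depend only on `save_file`, so switching between local and cloud storage is a configuration change rather than a code change.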

Document Ingestion Logic

Title: Logical flow from upload to vector database

graph TD
    Start["File uploaded or URL provided"] --> Detect["Detect source type"]

    Detect --> DirectFile["Direct file upload"]
    Detect --> URLType["URL detection"]

    URLType --> IsYouTube{"Is it a YouTube URL?"}
    IsYouTube -->|Yes| YouTubeLoader["YoutubeLoader<br/>Extract transcript"]
    IsYouTube -->|No| WebLoader["get_web_loader()<br/>Fetch content"]

    DirectFile --> SaveStorage["Storage.save_file()"]
    SaveStorage --> SaveDB["Files.insert_new_file()"]

    SaveDB --> ProcessFile["process_file()"]
    YouTubeLoader --> ProcessFile
    WebLoader --> ProcessFile

    ProcessFile --> SelectEngine["Loader(engine=...)"]

    SelectEngine --> EngineChoice{"Engine type"}

    EngineChoice -->|"default"| PyLoader["Python loaders<br/>PyPDFLoader, etc."]
    EngineChoice -->|"tika"| TikaEngine["TikaLoader"]
    EngineChoice -->|"docling"| DoclingEngine["DoclingLoader"]
    EngineChoice -->|"datalab_marker"| MarkerEngine["DatalabMarkerLoader"]
    EngineChoice -->|"mineru"| MinerUEngine["MinerULoader"]
    EngineChoice -->|"mistral_ocr"| MistralOCREngine["MistralLoader"]

    PyLoader --> ExtractContent["loader.load()"]
    TikaEngine --> ExtractContent
    DoclingEngine --> ExtractContent
    MarkerEngine --> ExtractContent
    MinerUEngine --> ExtractContent
    MistralOCREngine --> ExtractContent

    ExtractContent --> TextSplit["Text splitting (7.3)"]
    TextSplit --> Embedding["Generate embeddings (7.4)"]
    Embedding --> VectorDB["ASYNC_VECTOR_DB_CLIENT.upsert()"]

Sources: backend/open_webui/routers/retrieval.py:416-460, backend/open_webui/retrieval/loaders/main.py:231-241, backend/open_webui/retrieval/utils.py:86-124

Loader Factory and Selection

The Loader class acts as a factory, choosing an implementation based on the engine parameter and the file extension. backend/open_webui/retrieval/loaders/main.py:231-241

Engine value	Implementation	Description
`""` (default)	Standard loaders	Uses `PyPDFLoader`, `Docx2txtLoader`, or `TextLoader`, or falls back to `ExcelLoader`/`PptxLoader`. `backend/open_webui/retrieval/loaders/main.py:261-283`
`"tika"`	`TikaLoader`	Communicates with an Apache Tika server via `requests.put`. `backend/open_webui/retrieval/loaders/main.py:138-163`
`"docling"`	`DoclingLoader`	Uses IBM's Docling API to handle complex layouts. `backend/open_webui/retrieval/loaders/main.py:179-210`
`"datalab_marker"`	`DatalabMarkerLoader`	Integrates with the DataLab Marker API, with polling support. `backend/open_webui/retrieval/loaders/datalab_marker.py:13-40`
`"mineru"`	`MinerULoader`	Integrates with MinerU, supporting cloud and local API modes. `backend/open_webui/retrieval/loaders/mineru.py:14-54`
`"mistral_ocr"`	`MistralLoader`	Integrates with the Mistral OCR API. `backend/open_webui/retrieval/loaders/main.py:23`
`"external"`	`ExternalDocumentLoader`	Uses a configurable external document loader service. `backend/open_webui/retrieval/loaders/main.py:21`
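
The two-level selection in the table (engine first, file extension second) can be sketched as a lookup. The mapping and function below are hypothetical: `select_loader_name` returns loader names as strings rather than instantiating anything, the extension-to-loader pairs are assumptions drawn from the names in the table, and the real factory presumably applies further conditions, such as per-engine extension restrictions and configuration checks.

```python
from pathlib import Path

# Engine values and implementation names taken from the table above.
ENGINE_LOADERS = {
    "tika": "TikaLoader",
    "docling": "DoclingLoader",
    "datalab_marker": "DatalabMarkerLoader",
    "mineru": "MinerULoader",
    "mistral_ocr": "MistralLoader",
    "external": "ExternalDocumentLoader",
}

# Assumption: an illustrative extension mapping for the default engine.
DEFAULT_LOADERS_BY_EXTENSION = {
    "pdf": "PyPDFLoader",
    "docx": "Docx2txtLoader",
    "xls": "ExcelLoader",
    "xlsx": "ExcelLoader",
    "ppt": "PptxLoader",
    "pptx": "PptxLoader",
}


def select_loader_name(engine: str, filename: str) -> str:
    """Return the loader name for an engine value and a filename."""
    # A recognized engine takes precedence over the file extension.
    if engine in ENGINE_LOADERS:
        return ENGINE_LOADERS[engine]
    # Default engine (or an unrecognized value): choose by extension,
    # treating anything unmapped as plain text (assumption).
    ext = Path(filename).suffix.lower().lstrip(".")
    return DEFAULT_LOADERS_BY_EXTENSION.get(ext, "TextLoader")
```

Keeping the choice in data tables rather than nested conditionals makes it easy to see which engine handles which input, at the cost of hiding any per-engine preconditions.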

User Memory Ingestion

Beyond documents, Open WebUI also supports a "memory" system that ingests user-specific facts into the vector database. backend/open_webui/routers/memories.py:62-100

Add: add_memory inserts the fact into the database through Memories.insert_new_memory. backend/open_webui/routers/memories.py:84
Vectorize: the fact is vectorized with the configured EMBEDDING_FUNCTION. backend/open_webui/routers/memories.py:86
Store: the vector is stored in a user-specific collection (for example, user-memory-{user_id}) through ASYNC_VECTOR_DB_CLIENT.upsert. backend/open_webui/routers/memories.py:88-98

Sources: backend/open_webui/routers/memories.py:62-102, backend/open_webui/models/memories.py:7-8
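
The add, vectorize, and store steps for memories can be sketched with in-memory stand-ins. This is a synchronous toy, not the router code: `toy_embedding`, `InMemoryVectorStore`, the `memories_db` dict, the item layout, and this `add_memory` signature are all hypothetical, and the embedding is a deterministic hash-based placeholder with no semantic meaning. Only the collection naming pattern `user-memory-{user_id}` comes from the text above.

```python
import hashlib
import uuid
from typing import Callable, Dict, List


def toy_embedding(text: str, dims: int = 8) -> List[float]:
    """Placeholder embedding: hash bytes scaled to [0, 1] (not semantic)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dims]]


class InMemoryVectorStore:
    """Stand-in for a vector database client with an upsert operation."""

    def __init__(self) -> None:
        self.collections: Dict[str, Dict[str, dict]] = {}

    def upsert(self, collection_name: str, items: List[dict]) -> None:
        collection = self.collections.setdefault(collection_name, {})
        for item in items:
            collection[item["id"]] = item  # insert or overwrite by id


def add_memory(
    user_id: str,
    content: str,
    memories_db: Dict[str, dict],
    store: InMemoryVectorStore,
    embed: Callable[[str], List[float]] = toy_embedding,
) -> str:
    """Record the fact, embed it, and upsert it into the user's collection."""
    memory_id = str(uuid.uuid4())
    # Step 1: insert the fact into the relational store.
    memories_db[memory_id] = {"user_id": user_id, "content": content}
    # Steps 2 and 3: vectorize and upsert into a per-user collection.
    store.upsert(
        f"user-memory-{user_id}",
        [{"id": memory_id, "text": content, "vector": embed(content)}],
    )
    return memory_id
```

Using one collection per user keeps each user's memories isolated at query time, since a search only ever targets that user's collection.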