Supported Data Sources
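Purpose and Scope
This document lists every data source that Onyx can connect to for document indexing and retrieval. It covers the data source enumeration, metadata, configuration requirements, authentication methods, and backend implementation details. For the connector framework and lifecycle, see Connector Framework Overview. For credential management, see Credential Management. For the admin UI used to configure these connectors, see Connector Management UI.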
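Data Source Enumeration
All supported data sources are defined in the ValidSources enum. This enum is the single source of truth for connector types across the system.
Sources: web/src/lib/types.ts:466-526
The ValidSources enum contains more than 60 data sources, grouped into the following categories:
- Knowledge bases and wikis (Confluence, Notion, BookStack, etc.)
- Cloud storage (Google Drive, Dropbox, S3, etc.)
- Ticketing and task management (Jira, Zendesk, Linear, etc.)
- Messaging platforms (Slack, Teams, Gmail, etc.)
- Code repositories (GitHub, GitLab, Bitbucket)
- Sales platforms (Salesforce, HubSpot, Gong)
- Generic sources (Web, File, Ingestion API)
- Special sources (FederatedSlack, CraftFile, UserFile)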
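Data Source Registration Flow
The diagram below shows how a source string in the UI maps to a backend connector implementation class.
Data source mapping logic
Sources: web/src/lib/types.ts:466-559, web/src/lib/sources.ts:77-451, web/src/lib/connectors/connectors.tsx:145-148, backend/onyx/connectors/factory.py:91-101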
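Data Source Categories
Data sources are organized by the categories defined in the SourceCategory enum. SOURCE_METADATA_MAP associates each data source with its category, icon, display name, and documentation.
Category Breakdown
The system groups connectors by functional area to simplify setup for administrators.
Data source category mapping
Sources: web/src/lib/sources.ts:95-451, web/src/components/icons/icons.tsx:1-97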
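The relationship between a source, its category, and its display metadata can be sketched in Python as follows. This is an illustration only: the real SOURCE_METADATA_MAP is TypeScript in web/src/lib/sources.ts, and the category values, the SourceMetadata field names, and the sources_by_category helper below are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceCategory(str, Enum):
    # Hypothetical category values; the real enum lives in the frontend code.
    KNOWLEDGE_BASE = "Knowledge Base & Wikis"
    STORAGE = "Cloud Storage"
    MESSAGING = "Messaging"


@dataclass(frozen=True)
class SourceMetadata:
    display_name: str
    category: SourceCategory
    docs_url: Optional[str] = None  # assumed optional link to documentation


# A tiny stand-in for the real map, keyed by the source string.
SOURCE_METADATA_MAP = {
    "confluence": SourceMetadata("Confluence", SourceCategory.KNOWLEDGE_BASE),
    "google_drive": SourceMetadata("Google Drive", SourceCategory.STORAGE),
    "slack": SourceMetadata("Slack", SourceCategory.MESSAGING),
    "teams": SourceMetadata("Teams", SourceCategory.MESSAGING),
}


def sources_by_category(metadata_map):
    """Group source keys by category, preserving insertion order."""
    grouped = {}
    for source, meta in metadata_map.items():
        grouped.setdefault(meta.category, []).append(source)
    return grouped


grouped = sources_by_category(SOURCE_METADATA_MAP)
```

Keeping category on the metadata entry, rather than in a separate map, means a new source only has to be registered in one place to appear under the right heading.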
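Connector Configuration
Every configurable data source has an entry in connectorConfigs that defines the fields required to set up the connector. Configurations use a type-safe schema and are validated with Yup.
Configuration Schema
The ConnectionConfiguration interface defines how the admin form is generated for each data source.
Configuration object structure
Sources: web/src/lib/connectors/connectors.tsx:114-143, web/src/lib/connectors/connectors.tsx:17-112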
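A rough Python analogue of a field-driven configuration with validation is sketched below. The real schema is a TypeScript interface validated with Yup; FieldSpec, the validate function, the field types, and the example field names are all hypothetical and exist only to show the idea of deriving validation from field definitions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FieldSpec:
    name: str
    label: str
    type: str = "text"  # one of "text", "list", "checkbox" in this sketch
    optional: bool = False


@dataclass
class ConnectionConfiguration:
    description: str
    values: list[FieldSpec] = field(default_factory=list)


_EXPECTED_TYPES = {"text": str, "list": list, "checkbox": bool}


def validate(config: ConnectionConfiguration, submitted: dict[str, Any]) -> list[str]:
    """Return human-readable errors; an empty list means the form is valid."""
    errors = []
    for spec in config.values:
        value = submitted.get(spec.name)
        # Treat missing, empty-string, and empty-list values as "not provided".
        if value is None or value == "" or value == []:
            if not spec.optional:
                errors.append(f"{spec.label} is required")
            continue
        if not isinstance(value, _EXPECTED_TYPES[spec.type]):
            errors.append(f"{spec.label} must be of type {spec.type}")
    return errors


# Illustrative field definitions, not the actual Confluence form.
example_config = ConnectionConfiguration(
    description="Configure an example connector",
    values=[
        FieldSpec("wiki_base", "Wiki Base URL"),
        FieldSpec("space", "Space Key", optional=True),
        FieldSpec("is_cloud", "Is Cloud", type="checkbox"),
    ],
)

ok_errors = validate(example_config, {"wiki_base": "https://example.test/wiki", "is_cloud": True})
bad_errors = validate(example_config, {"is_cloud": "yes"})
```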
Implementation Details: Google Drive
The Google Drive connector supports recursive folder traversal, permission sync, and multiple file types.
Data Flow: File Retrieval
The connector uses crawl_folders_for_files (backend/onyx/connectors/google_drive/file_retrieval.py:36) to traverse the folder hierarchy. Depending on the credential type, it then uses either get_all_files_for_oauth (backend/onyx/connectors/google_drive/file_retrieval.py:38) or get_all_files_in_my_drive_and_shared (backend/onyx/connectors/google_drive/file_retrieval.py:40-41).
Google Drive traversal logic
Sources: backend/onyx/connectors/google_drive/connector.py:36-49, backend/onyx/connectors/google_drive/doc_conversion.py:33-34, backend/onyx/connectors/google_drive/file_retrieval.py:106-128
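The general shape of a folder crawl can be sketched as a breadth-first walk that yields files and queues subfolders. This is not the body of crawl_folders_for_files; the list_children callable, the item dictionaries, and FAKE_DRIVE are assumptions standing in for Drive API calls.

```python
from collections import deque
from typing import Callable, Iterator


def crawl_folders(
    list_children: Callable[[str], list[dict]], root_id: str
) -> Iterator[dict]:
    """Yield every file reachable from root_id, visiting each folder once."""
    seen_folders = {root_id}
    queue = deque([root_id])
    while queue:
        folder_id = queue.popleft()
        for item in list_children(folder_id):
            if item["is_folder"]:
                # Guard against cycles and folders reachable by several paths.
                if item["id"] not in seen_folders:
                    seen_folders.add(item["id"])
                    queue.append(item["id"])
            else:
                yield item


# In-memory stand-in for the remote hierarchy, including a cycle back to root.
FAKE_DRIVE = {
    "root": [{"id": "f1", "is_folder": True}, {"id": "a.pdf", "is_folder": False}],
    "f1": [{"id": "b.docx", "is_folder": False}, {"id": "root", "is_folder": True}],
}

files = [item["id"] for item in crawl_folders(lambda fid: FAKE_DRIVE.get(fid, []), "root")]
```

An explicit queue with a visited set avoids recursion limits on deep hierarchies and terminates even when the folder graph contains cycles.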
Document Conversion
Files are converted into Onyx Document objects by convert_drive_item_to_document (backend/onyx/connectors/google_drive/doc_conversion.py:33). For Google Docs, the connector extracts sections with get_document_sections (backend/onyx/connectors/google_drive/doc_conversion.py:27). For binary files (PDF, DOCX, PPTX), it downloads the content with MediaIoBaseDownload (backend/onyx/connectors/google_drive/doc_conversion.py:10) and processes it with local extractors such as read_pdf_file (backend/onyx/connectors/google_drive/doc_conversion.py:43).
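The branching described above, one path for native Google Docs and another for downloaded binaries, amounts to a dispatch on MIME type. The sketch below uses stand-in extractors and a hypothetical convert_item function; it does not call the Drive API or any real PDF parser.

```python
from typing import Optional

GOOGLE_DOC_MIME = "application/vnd.google-apps.document"
PDF_MIME = "application/pdf"


def _sections_from_google_doc(payload: bytes) -> list[str]:
    # Stand-in for section extraction: split the text on blank lines.
    return [part for part in payload.decode("utf-8").split("\n\n") if part.strip()]


def _text_from_binary(payload: bytes) -> list[str]:
    # Stand-in for a local extractor such as a PDF reader.
    return [payload.decode("utf-8", errors="replace")]


EXTRACTORS = {
    GOOGLE_DOC_MIME: _sections_from_google_doc,
    PDF_MIME: _text_from_binary,
}


def convert_item(item: dict, payload: bytes) -> Optional[dict]:
    """Return a document-like dict, or None for unsupported MIME types."""
    extractor = EXTRACTORS.get(item["mimeType"])
    if extractor is None:
        return None
    return {"id": item["id"], "title": item["name"], "sections": extractor(payload)}


doc = convert_item(
    {"id": "1", "name": "Spec", "mimeType": GOOGLE_DOC_MIME}, b"Intro\n\nDetails"
)
skipped = convert_item({"id": "2", "name": "clip", "mimeType": "video/mp4"}, b"")
```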
Implementation Details: Confluence
The Confluence connector supports both Cloud and Server/Data Center deployments. It uses OnyxConfluence (backend/onyx/connectors/confluence/onyx_confluence.py:110), a wrapper around the atlassian-python-api library.
Checkpointing
ConfluenceConnector implements CheckpointedConnector (backend/onyx/connectors/confluence/connector.py:121). It stores next_page_url in ConfluenceCheckpoint (backend/onyx/connectors/confluence/connector.py:108-109) so that indexing can resume from where it was interrupted.
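Resuming from a stored page URL can be illustrated with a small simulation. The Checkpoint dataclass, the PAGES table, and both functions are hypothetical; only the idea of persisting next_page_url comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    has_more: bool = True
    next_page_url: Optional[str] = None


# Fake paginated API: URL -> (documents on that page, URL of the next page).
PAGES = {
    "/pages?start=0": (["doc-1", "doc-2"], "/pages?start=2"),
    "/pages?start=2": (["doc-3"], None),
}
FIRST_PAGE_URL = "/pages?start=0"


def load_from_checkpoint(checkpoint: Checkpoint) -> tuple[list[str], Checkpoint]:
    """Fetch one page and return its documents plus the updated checkpoint."""
    url = checkpoint.next_page_url or FIRST_PAGE_URL
    docs, next_url = PAGES[url]
    return docs, Checkpoint(has_more=next_url is not None, next_page_url=next_url)


def run(checkpoint: Checkpoint) -> list[str]:
    """Drain pages until the checkpoint reports nothing more to fetch."""
    indexed = []
    while checkpoint.has_more:
        docs, checkpoint = load_from_checkpoint(checkpoint)
        indexed.extend(docs)
    return indexed


# Simulate an interruption after the first page, then resume from the saved state.
first_docs, saved = load_from_checkpoint(Checkpoint())
resumed_docs = run(saved)
```

Because each step returns a fresh checkpoint, the caller can persist it after every page and restart without refetching pages that were already indexed.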
CQL Filtering
The connector builds compound CQL (Confluence Query Language) strings to filter by space, page ID, or label (backend/onyx/connectors/confluence/connector.py:170-181).
Sources: backend/onyx/connectors/confluence/connector.py:120-154, backend/onyx/connectors/confluence/onyx_confluence.py:110-157
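A minimal sketch of assembling such a query from optional filters follows. The build_cql function, its parameter names, and the quoting helper are assumptions; the fields space, id, label, and type and the `not in` operator are standard CQL, but the exact clauses the connector emits may differ.

```python
from typing import Optional, Sequence


def _quote(value: str) -> str:
    # Escape backslashes and double quotes before wrapping in quotes.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_cql(
    space: Optional[str] = None,
    page_id: Optional[str] = None,
    labels_to_skip: Optional[Sequence[str]] = None,
) -> str:
    """Combine the provided filters into one AND-joined CQL string."""
    clauses = ['type = "page"']
    if space:
        clauses.append(f"space = {_quote(space)}")
    if page_id:
        clauses.append(f"id = {_quote(page_id)}")
    if labels_to_skip:
        quoted = ", ".join(_quote(label) for label in labels_to_skip)
        clauses.append(f"label not in ({quoted})")
    return " and ".join(clauses)


query = build_cql(space="ENG", labels_to_skip=["archived", "draft"])
```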
Implementation Details: SharePoint
The SharePoint connector uses the Microsoft Graph API and the office365-rest-python-client library. It supports document retrieval for indexing as well as permission sync.
Authentication
SharePoint supports several authentication methods, including client secret and certificate-based authentication. The load_credentials method (backend/onyx/connectors/sharepoint/connector.py:221-255) initializes msal.ConfidentialClientApplication (backend/onyx/connectors/sharepoint/connector.py:21) and GraphClient (backend/onyx/connectors/sharepoint/connector.py:26).
Document Processing
The connector iterates over SharePoint sites and drives, building items with DriveItemData.from_graph_json (backend/onyx/connectors/sharepoint/connector.py:174). It processes file content with extract_text_and_images (backend/onyx/connectors/sharepoint/connector.py:76) and maps permissions with get_sharepoint_external_access (backend/onyx/connectors/sharepoint/connector.py:74).
Sources: backend/onyx/connectors/sharepoint/connector.py:21-34, backend/onyx/connectors/sharepoint/connector.py:155-172, backend/onyx/connectors/sharepoint/connector.py:221-255
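MSAL's ConfidentialClientApplication accepts its client_credential argument either as a secret string or as a dictionary carrying a private key and certificate thumbprint. The helper below sketches how a loader might choose between the two; build_client_credential and its parameter names are hypothetical and are not taken from load_credentials.

```python
from typing import Optional, Union


def build_client_credential(
    client_secret: Optional[str] = None,
    private_key_pem: Optional[str] = None,
    thumbprint: Optional[str] = None,
) -> Union[str, dict]:
    """Prefer certificate credentials when both key and thumbprint are given."""
    if private_key_pem and thumbprint:
        return {"private_key": private_key_pem, "thumbprint": thumbprint}
    if client_secret:
        return client_secret
    raise ValueError("Provide a client secret, or a private key with its thumbprint")


secret_cred = build_client_credential(client_secret="s3cret")
cert_cred = build_client_credential(private_key_pem="-----BEGIN...", thumbprint="AB12")

# The result would then be passed to MSAL, for example:
#   msal.ConfidentialClientApplication(
#       client_id, authority=authority_url, client_credential=cert_cred
#   )
# (client_id and authority_url are placeholders; msal is not imported here.)
```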
Implementation Details: Slack
The Slack connector indexes messages and threads in public and private channels.
Message Retrieval
It uses OnyxSlackWebClient (backend/onyx/connectors/slack/connector.py:63) to talk to the Slack API. The get_channel_messages function (backend/onyx/connectors/slack/connector.py:146) makes paginated calls to conversations_history, while get_thread (backend/onyx/connectors/slack/connector.py:176) retrieves the replies to a specific message.
Permission Sync
Slack permissions are synced through get_channel_access (backend/onyx/connectors/slack/connector.py:58), which maps Slack user IDs to external access records.
Sources: backend/onyx/connectors/slack/connector.py:146-174, backend/onyx/connectors/slack/connector.py:176-184
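Cursor-based pagination over conversations_history can be sketched as a generator that follows response_metadata.next_cursor until it is empty. The fetch_page callable and FAKE_PAGES are stand-ins for a real Slack client; the function below shares a name with the connector's helper only for readability and is not its implementation.

```python
from typing import Callable, Iterator, Optional


def get_channel_messages(
    fetch_page: Callable[..., dict], channel_id: str
) -> Iterator[list[dict]]:
    """Yield one batch of messages per API page until no cursor remains."""
    cursor: Optional[str] = None
    while True:
        response = fetch_page(channel=channel_id, cursor=cursor)
        yield response["messages"]
        # Slack signals the last page with a missing or empty next_cursor.
        cursor = response.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break


# Fake two-page history keyed by cursor value.
FAKE_PAGES = {
    None: {"messages": [{"ts": "1"}, {"ts": "2"}], "response_metadata": {"next_cursor": "c2"}},
    "c2": {"messages": [{"ts": "3"}], "response_metadata": {"next_cursor": ""}},
}


def fake_fetch_page(channel: str, cursor: Optional[str]) -> dict:
    return FAKE_PAGES[cursor]


batches = list(get_channel_messages(fake_fetch_page, "C123"))
```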
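Implementation Details: Salesforce
The Salesforce connector performs both full and incremental syncs by exporting data into a local SQLite database for processing.
Sync Strategy
The connector uses OnyxSalesforce (backend/onyx/connectors/salesforce/onyx_salesforce.py:30) for API interaction. During the initial sync, it bulk-exports object types to CSV with fetch_all_csvs_in_parallel (backend/onyx/connectors/salesforce/connector.py:31) and loads them into OnyxSalesforceSQLite (backend/onyx/connectors/salesforce/connector.py:32).
Document Generation
Documents are created by joining parent objects with their child objects (for example, an Account with its Opportunities) in the local database (backend/onyx/connectors/salesforce/connector.py:172-180).
Sources: backend/onyx/connectors/salesforce/connector.py:163-182, backend/onyx/connectors/salesforce/doc_conversion.py:27-28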
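The CSV-to-SQLite-to-document flow can be illustrated with the standard library alone. The table layout, the sample CSV data, and the load_csv and build_documents helpers are assumptions for this sketch and do not mirror the OnyxSalesforceSQLite schema.

```python
import csv
import io
import sqlite3

ACCOUNT_CSV = "Id,Name\n001,Acme\n002,Globex\n"
OPPORTUNITY_CSV = "Id,AccountId,Name\n006A,001,Renewal\n006B,001,Upsell\n"


def load_csv(conn: sqlite3.Connection, table: str, text: str) -> None:
    """Create a TEXT-column table from the CSV header and insert every row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    column_defs = ", ".join(f"{name} TEXT" for name in columns)
    conn.execute(f"CREATE TABLE {table} ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[name] for name in columns) for row in rows],
    )


def build_documents(conn: sqlite3.Connection) -> list[dict]:
    """Join each parent account with its child opportunities."""
    query = (
        "SELECT a.Id, a.Name, o.Name FROM account a "
        "LEFT JOIN opportunity o ON o.AccountId = a.Id "
        "ORDER BY a.Id, o.Id"
    )
    docs: dict[str, dict] = {}
    for account_id, account_name, opp_name in conn.execute(query):
        doc = docs.setdefault(
            account_id, {"id": account_id, "title": account_name, "children": []}
        )
        if opp_name is not None:  # LEFT JOIN yields NULL for childless parents
            doc["children"].append(opp_name)
    return list(docs.values())


conn = sqlite3.connect(":memory:")
load_csv(conn, "account", ACCOUNT_CSV)
load_csv(conn, "opportunity", OPPORTUNITY_CSV)
documents = build_documents(conn)
```

Doing the join locally keeps the number of API round trips independent of the number of parent-child relationships, which is the usual motivation for staging bulk exports in a local database.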
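Backend Connector Registry
The backend uses the factory pattern to instantiate the correct connector class for a given DocumentSource.
Sources: backend/onyx/connectors/factory.py:1-185
Connector Class Loading
The identify_connector_class function (backend/onyx/connectors/factory.py:91-101) looks up the class in CONNECTOR_CLASS_MAP, which is defined in registry.py. It uses _load_connector_class (backend/onyx/connectors/factory.py:36-54) to import the module dynamically and cache the class object.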
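Input Type Validation
Before instantiation, the factory verifies that the connector class implements the interface required for its InputType (for example, LoadConnector for LOAD_STATE) (backend/onyx/connectors/factory.py:57-88).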
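A check of this kind can be sketched with issubclass against a mapping from input type to required base class. Only the LOAD_STATE and LoadConnector pairing comes from the text above; the POLL entry, the enum values, REQUIRED_INTERFACE, and the example connector are assumptions.

```python
from enum import Enum


class InputType(str, Enum):
    LOAD_STATE = "load_state"
    POLL = "poll"  # assumed second input type for the example


class LoadConnector:
    def load_from_state(self):
        raise NotImplementedError


class PollConnector:
    def poll_source(self, start, end):
        raise NotImplementedError


# Hypothetical mapping from input type to the interface it requires.
REQUIRED_INTERFACE = {
    InputType.LOAD_STATE: LoadConnector,
    InputType.POLL: PollConnector,
}


def validate_connector_class(connector_cls: type, input_type: InputType) -> None:
    """Raise TypeError if the class lacks the interface for the input type."""
    required = REQUIRED_INTERFACE[input_type]
    if not issubclass(connector_cls, required):
        raise TypeError(
            f"{connector_cls.__name__} does not implement {required.__name__}, "
            f"which is required for input type {input_type.value}"
        )


class ExampleFileConnector(LoadConnector):
    def load_from_state(self):
        return iter([])


validate_connector_class(ExampleFileConnector, InputType.LOAD_STATE)  # passes silently
```

Validating before construction surfaces a misconfigured connector as a clear error instead of a missing-method failure partway through an indexing run.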