Page: Graphiti · 3.4 Deduplication and Resolution · DeepWiki full-text translation

3.4 · Deduplication and Resolution

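Temporal knowledge graphs and dynamic fact memory · This chapter covers the module relationships, the supporting source code, and the key implementation points.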

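Project: Graphiti · Chapter: 3.4 · Status: full-text translation · Modules: Graph and Relationships; Interface and Interaction; System Architecture; Testing, Release, and Operations
Source references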
  • graphiti_core/prompts/dedupe_edges.py
  • graphiti_core/prompts/dedupe_nodes.py
  • graphiti_core/prompts/extract_edges.py
  • graphiti_core/prompts/extract_nodes.py
  • graphiti_core/prompts/summarize_nodes.py
  • graphiti_core/utils/maintenance/edge_operations.py
  • graphiti_core/utils/maintenance/node_operations.py
  • tests/utils/maintenance/test_edge_operations.py
  • tests/utils/maintenance/test_node_operations.py
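Module tags
  • Graph and Relationships
  • Interface and Interaction
  • System Architecture
  • Testing, Release, and Operations
  • Retrieval, Recall, and Indexing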

Chapter text

Deduplication and Resolution

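This document explains how Graphiti resolves duplicate entities and relationships during episode ingestion. Nodes are deduplicated with a three-tier strategy (exact matching, fuzzy similarity, and LLM reasoning). Edges are checked for duplicates and contradictions, and outdated facts are invalidated based on time.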

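Overview

During episode ingestion, extracted nodes and edges must be resolved against the entities already in the graph to prevent duplicates and keep the graph consistent. The deduplication system runs in two phases:

  1. Node resolution: resolve_extracted_nodes (graphiti_core/utils/maintenance/node_operations.py:31-31) first matches extracted entity nodes against existing nodes with deterministic heuristics, and escalates to LLM resolution when those cannot decide.
  2. Edge resolution: resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225) checks extracted edges for duplicates and contradictions, and invalidates outdated information based on time.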

The system prefers fast deterministic methods for performance and calls the LLM only when necessary. This minimizes API cost while maintaining high accuracy.

Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/edge_operations.py:225-225

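Node Deduplication: Three-Tier Strategy

Architecture Overview

Node deduplication follows a tiered escalation strategy in which each tier handles the cases the previous tier could not resolve. The main entry point for this logic is resolve_extracted_nodes (graphiti_core/utils/maintenance/node_operations.py:31-31).

Tiered resolution flow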

[Figure 1: Graphiti · Architecture Overview]

Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/dedup_helpers.py:161-175, graphiti_core/utils/maintenance/node_operations.py:217-222
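The escalation described above can be sketched in a few lines. This is only an illustration of the control flow, not Graphiti's implementation: resolve_in_tiers, the tier signatures, and the toy tiers are all hypothetical names introduced here. Cheap tiers are tried per name in order, and whatever is left over goes to the LLM tier in a single batch.

```python
from typing import Callable, Optional

# Hypothetical signatures (not Graphiti's API): a cheap tier maps a name to a
# matched UUID or None; the LLM tier receives the whole unresolved batch.
CheapTier = Callable[[str], Optional[str]]
LlmTier = Callable[[list[str]], dict[str, Optional[str]]]


def resolve_in_tiers(
    names: list[str], cheap_tiers: list[CheapTier], llm_tier: LlmTier
) -> dict[str, Optional[str]]:
    """Try each cheap tier in order; batch the leftovers into one LLM call."""
    resolved: dict[str, Optional[str]] = {}
    unresolved: list[str] = []
    for name in names:
        for tier in cheap_tiers:
            match = tier(name)
            if match is not None:
                resolved[name] = match
                break
        else:
            # No cheap tier produced a match: defer to the LLM tier.
            unresolved.append(name)
    if unresolved:
        resolved.update(llm_tier(unresolved))
    return resolved


# Toy tiers for demonstration only.
llm_batches: list[list[str]] = []


def exact_tier(name: str) -> Optional[str]:
    return {"alice smith": "uuid-1"}.get(name)


def fuzzy_tier(name: str) -> Optional[str]:
    return "uuid-2" if name.startswith("robert") else None


def stub_llm(batch: list[str]) -> dict[str, Optional[str]]:
    llm_batches.append(list(batch))  # record what reached the "LLM"
    return {name: None for name in batch}


result = resolve_in_tiers(
    ["alice smith", "robert jones", "joe"], [exact_tier, fuzzy_tier], stub_llm
)
```

In this toy run only "joe" reaches the stubbed LLM tier, which is the cost-saving property the tiered design aims for.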

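Tier 1: Exact String Matching

The first tier normalizes entity names and compares them case-insensitively with whitespace normalized. This happens in _resolve_with_similarity (graphiti_core/utils/maintenance/dedup_helpers.py:196-196).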

Normalization function        Operation                         Example
_normalize_string_exact()     Lowercase, collapse whitespace    " Alice Smith " → "alice smith"
_normalize_name_for_fuzzy()   Remove punctuation, lowercase     "Alice-Smith!" → "alice smith"

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:39-49, graphiti_core/utils/maintenance/dedup_helpers.py:52-64
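The two normalizations in the table can be approximated as follows. This is a minimal sketch that reproduces the table's examples. The function names and the exact character class treated as punctuation are assumptions made here, and may differ from what dedup_helpers.py actually does.

```python
import re


def normalize_exact(name: str) -> str:
    """Lowercase the name and collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", name.lower()).strip()


def normalize_for_fuzzy(name: str) -> str:
    """Like normalize_exact, but punctuation is replaced with spaces first.

    Assumption: anything other than a-z, 0-9, and space counts as punctuation.
    """
    without_punctuation = re.sub(r"[^a-z0-9 ]", " ", normalize_exact(name))
    return re.sub(r"\s+", " ", without_punctuation).strip()
```

Replacing punctuation with a space instead of deleting it keeps "Alice-Smith" as two tokens, which matters once the name is split into shingles in the fuzzy tier.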

The exact-match logic checks whether exactly one candidate node has the same normalized name:

normalized_key = _normalize_string_exact(extracted_node.name)
candidates = indexes.normalized_existing[normalized_key]

if len(candidates) == 1:
    # Single exact match - resolve immediately
    state.resolved_nodes[idx] = candidates[0]
    state.uuid_map[extracted_node.uuid] = candidates[0].uuid

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:214-222
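To make the lookup above concrete, here is a self-contained sketch of an index keyed by normalized name. Node is a hypothetical stand-in for EntityNode, and build_index and resolve_exact are illustrative names. The excerpt above shows only the single-candidate case, so leaving ambiguous names unresolved is an assumption of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:  # hypothetical stand-in for EntityNode
    uuid: str
    name: str


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def build_index(existing: list[Node]) -> defaultdict[str, list[Node]]:
    """Group existing nodes by normalized name."""
    index: defaultdict[str, list[Node]] = defaultdict(list)
    for node in existing:
        index[normalize(node.name)].append(node)
    return index


def resolve_exact(extracted: Node, index: dict[str, list[Node]]) -> Optional[Node]:
    """Return the unique exact match, or None when there is none or several."""
    candidates = index.get(normalize(extracted.name), [])
    return candidates[0] if len(candidates) == 1 else None


existing = [Node("e1", "Alice Smith"), Node("e2", "Bob"), Node("e3", "bob")]
index = build_index(existing)
```

With this data, " alice  SMITH " resolves to e1, while "Bob" stays unresolved because two existing nodes share the normalized key.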

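Tier 2: Fuzzy Similarity with MinHash and LSH

For entities that have no exact match, the system uses probabilistic hashing in _resolve_with_similarity (graphiti_core/utils/maintenance/dedup_helpers.py:196-196) to find near-duplicates.

Fuzzy resolution pipeline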

[Figure 2: Graphiti · Tier 2: Fuzzy Similarity with MinHash and LSH]

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:88-140
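The general MinHash and LSH technique can be sketched with the standard library alone. The parameters here (3-character shingles, 32 hash seeds, bands of 4, BLAKE2b as the hash) and all function names are assumptions chosen for illustration and are not taken from dedup_helpers.py. The band keys have the same shape as the lsh_buckets keys, (band index, band values).

```python
import hashlib


def shingles(name: str, n: int = 3) -> set[str]:
    """Character n-grams of the name with spaces removed."""
    compact = name.replace(" ", "")
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def _seeded_hash(shingle: str, seed: int) -> int:
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(sh: set[str], num_perm: int = 32) -> tuple[int, ...]:
    """For each seed, keep the smallest hash over all shingles."""
    if not sh:
        return ()
    return tuple(min(_seeded_hash(s, seed) for s in sh) for seed in range(num_perm))


def lsh_bands(sig: tuple[int, ...], band_size: int = 4) -> list[tuple[int, tuple[int, ...]]]:
    """Split a signature into (band index, band values) bucket keys."""
    return [(i // band_size, sig[i : i + band_size]) for i in range(0, len(sig), band_size)]


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, used to confirm LSH candidates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Names that share at least one band key land in the same bucket and become candidates. The exact Jaccard score on the shingle sets then decides whether a candidate clears the similarity threshold.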

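Entropy-based gating: low-entropy names (for example, "Joe") skip fuzzy matching and escalate directly to LLM reasoning to avoid false positives. This is controlled by _has_high_entropy (graphiti_core/utils/maintenance/dedup_helpers.py:79-79), which computes the Shannon entropy of the name's characters.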

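Constant                    Value   Purpose
_NAME_ENTROPY_THRESHOLD     1.5     Minimum Shannon entropy for fuzzy matching
_MIN_NAME_LENGTH            6       Minimum name length for trusting a fuzzy match
_FUZZY_JACCARD_THRESHOLD    0.9     Minimum Jaccard similarity for resolution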

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:31-36, graphiti_core/utils/maintenance/dedup_helpers.py:79-85, graphiti_core/utils/maintenance/dedup_helpers.py:52-76
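A rough sketch of the entropy gate follows, using the threshold values from the table. The function names are illustrative, and the way the length check and the entropy check are combined (both must pass) is an assumption made here, as is ignoring spaces when counting characters. The real _has_high_entropy may combine them differently.

```python
import math
from collections import Counter

NAME_ENTROPY_THRESHOLD = 1.5  # value from the constants table
MIN_NAME_LENGTH = 6           # value from the constants table


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits of the character distribution (spaces ignored)."""
    chars = text.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    return -sum(
        (count / total) * math.log2(count / total) for count in Counter(chars).values()
    )


def is_reliable_for_fuzzy(name: str) -> bool:
    """Assumed gate: the name must be long enough and varied enough."""
    return len(name) >= MIN_NAME_LENGTH and shannon_entropy(name) >= NAME_ENTROPY_THRESHOLD
```

A name such as "joe" fails the gate on length alone, and a long but repetitive string fails on entropy, so both would be passed on to the LLM tier under these assumptions.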

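Tier 3: LLM-Based Resolution

Entities that are still unresolved after the first and second tiers are sent in a batch to the LLM for semantic reasoning by the _resolve_with_llm function (graphiti_core/utils/maintenance/node_operations.py:29-29).

LLM resolution sequence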

[Figure 3: Graphiti · Tier 3: LLM-Based Resolution]

Sources: graphiti_core/utils/maintenance/node_operations.py:29-29, graphiti_core/prompts/dedupe_nodes.py:117-117, graphiti_core/llm_client/llm_client.py:42-42

The LLM prompt receives extracted_nodes, existing_nodes, and the original episode_content to provide context for disambiguation (graphiti_core/prompts/dedupe_nodes.py:117-179).

Response model: NodeDuplicate (graphiti_core/prompts/dedupe_nodes.py:25-34)

from pydantic import BaseModel


class NodeDuplicate(BaseModel):
    id: int  # integer ID of the new entity
    name: str  # the most complete name
    duplicate_candidate_id: int  # candidate ID of the matching existing entity, or -1
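One way to consume such a response is to translate the integer IDs back into UUIDs. The sketch below is hypothetical and is not the logic of _resolve_with_llm. Resolution is a plain dataclass that mirrors the three fields above so the example runs without pydantic. Ignoring out-of-range ids and treating any invalid candidate ID like -1 are defensive assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Resolution:  # hypothetical stand-in mirroring NodeDuplicate's fields
    id: int
    name: str
    duplicate_candidate_id: int


def apply_resolutions(
    extracted_uuids: list[str],
    candidate_uuids: list[str],
    resolutions: list[Resolution],
) -> dict[str, str]:
    """Map each extracted UUID to the UUID that should represent it."""
    uuid_map: dict[str, str] = {}
    for res in resolutions:
        if not 0 <= res.id < len(extracted_uuids):
            continue  # the model returned an id that was never sent: ignore it
        source = extracted_uuids[res.id]
        if 0 <= res.duplicate_candidate_id < len(candidate_uuids):
            uuid_map[source] = candidate_uuids[res.duplicate_candidate_id]
        else:
            uuid_map[source] = source  # -1 (or invalid): keep as a new entity
    return uuid_map


mapping = apply_resolutions(
    ["n0", "n1"],
    ["e0", "e1"],
    [
        Resolution(0, "Alice Smith", 1),
        Resolution(1, "Bob", -1),
        Resolution(7, "Ghost", 0),
    ],
)
```

Validating ids before use matters because the model's output is untrusted. An index that was never offered should not be able to alias one entity to another.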

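Edge Deduplication and Contradiction Detection

Integrated Resolution

Edge resolution is managed by resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225), which combines duplicate detection and contradiction identification into a single LLM call using the resolve_edge prompt (graphiti_core/prompts/dedupe_edges.py:43-43).

Edge resolution flow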

[Figure 4: Graphiti · Integrated Resolution]

Sources: graphiti_core/utils/maintenance/edge_operations.py:107-107, graphiti_core/prompts/dedupe_edges.py:43-91, graphiti_core/utils/maintenance/edge_operations.py:225-225

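Fast Path: Exact Fact Match

In resolve_extracted_edge (graphiti_core/utils/maintenance/edge_operations.py:107-107), the system first checks whether the new fact exactly matches, after string normalization, an existing edge between the same pair of nodes. If a match is found, the LLM call is short-circuited and the new episode's UUID is appended to the existing edge (tests/utils/maintenance/test_edge_operations.py:108-152).

Sources: graphiti_core/utils/maintenance/edge_operations.py:107-152, tests/utils/maintenance/test_edge_operations.py:108-152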
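The fast path can be illustrated with a small sketch. Edge is a hypothetical stand-in for the real edge model, and try_fast_path and normalize_fact are names invented here. The normalization (lowercase, collapsed whitespace) is an assumption modelled on the node normalization described earlier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Edge:  # hypothetical stand-in for the real edge model
    uuid: str
    source_node_uuid: str
    target_node_uuid: str
    fact: str
    episodes: list[str] = field(default_factory=list)


def normalize_fact(fact: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(fact.lower().split())


def try_fast_path(new_edge: Edge, existing: list[Edge], episode_uuid: str) -> Optional[Edge]:
    """Return the matching existing edge (with the episode attached), or None."""
    key = normalize_fact(new_edge.fact)
    for edge in existing:
        same_endpoints = (
            edge.source_node_uuid == new_edge.source_node_uuid
            and edge.target_node_uuid == new_edge.target_node_uuid
        )
        if same_endpoints and normalize_fact(edge.fact) == key:
            if episode_uuid not in edge.episodes:
                edge.episodes.append(episode_uuid)
            return edge  # duplicate found: no LLM call needed
    return None  # fall through to LLM-based resolution


known = Edge("k1", "alice", "acme", "Alice works at Acme", ["ep-1"])
hit = try_fast_path(Edge("x1", "alice", "acme", "alice  works at ACME"), [known], "ep-2")
miss = try_fast_path(Edge("x2", "alice", "globex", "Alice works at Acme"), [known], "ep-3")
```

Only the first call matches. The second has the same fact text but different endpoints, so it would go on to the LLM step.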

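Time-Based Contradiction Resolution

When the LLM identifies a contradiction (through contradicted_facts in the EdgeDuplicate response, graphiti_core/prompts/dedupe_edges.py:24-33), the system applies time-based invalidation. If the new fact is more recent according to the valid_at timestamps, the existing fact's invalid_at is set to the new fact's valid_at, which effectively retires the old information.

Sources: graphiti_core/prompts/dedupe_edges.py:24-33, graphiti_core/prompts/dedupe_edges.py:79-84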
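The invalidation rule above can be written out as a short sketch. Fact and invalidate_contradicted are hypothetical names. Skipping facts that lack a valid_at, and leaving alone facts that already carry an invalid_at, are assumptions of this example, not behavior confirmed by the source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:  # hypothetical stand-in for an edge with temporal fields
    text: str
    valid_at: Optional[datetime] = None
    invalid_at: Optional[datetime] = None


def invalidate_contradicted(new_fact: Fact, contradicted: list[Fact]) -> list[Fact]:
    """Retire older contradicted facts at the moment the new fact became valid."""
    changed: list[Fact] = []
    for old in contradicted:
        if new_fact.valid_at is None or old.valid_at is None:
            continue  # assumption: without timestamps, ordering is unknown
        if old.valid_at < new_fact.valid_at and old.invalid_at is None:
            old.invalid_at = new_fact.valid_at
            changed.append(old)
    return changed


older = Fact("Alice works at Acme", valid_at=datetime(2023, 1, 1, tzinfo=timezone.utc))
later = Fact("Alice works at Initech", valid_at=datetime(2025, 1, 1, tzinfo=timezone.utc))
new = Fact("Alice works at Globex", valid_at=datetime(2024, 6, 1, tzinfo=timezone.utc))
changed = invalidate_contradicted(new, [older, later])
```

Here only the 2023 fact is retired. The 2025 fact is newer than the incoming one, so it is left untouched.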

Data Structures

DedupResolutionState

Tracks resolution progress for a batch of nodes (graphiti_core/utils/maintenance/dedup_helpers.py:161-168).

@dataclass
class DedupResolutionState:
    resolved_nodes: list[EntityNode | None]
    uuid_map: dict[str, str]
    unresolved_indices: list[int]
    duplicate_pairs: list[tuple[EntityNode, EntityNode]] = field(default_factory=list)

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:161-168

DedupCandidateIndexes

Stores the precomputed lookup structures used for exact and fuzzy matching (graphiti_core/utils/maintenance/dedup_helpers.py:150-158).

@dataclass
class DedupCandidateIndexes:
    existing_nodes: list[EntityNode]
    nodes_by_uuid: dict[str, EntityNode]
    normalized_existing: defaultdict[str, list[EntityNode]]
    shingles_by_candidate: dict[str, set[str]]
    lsh_buckets: defaultdict[tuple[int, tuple[int, ...]], list[str]]

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:150-158

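Integration with Episode Processing

Deduplication is a core part of the processing pipeline. After nodes are extracted by extract_nodes (graphiti_core/utils/maintenance/node_operations.py:69-69), they are resolved. The resulting uuid_map is then used to update source_node_uuid and target_node_uuid on the extracted edges, before the edges themselves are resolved by resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225).

Sources: graphiti_core/utils/maintenance/node_operations.py:69-148, graphiti_core/utils/maintenance/edge_operations.py:225-232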
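The remapping step between node resolution and edge resolution can be sketched as below. ExtractedEdge and remap_endpoints are hypothetical names for illustration. The only behavior taken from the text is that uuid_map rewrites source_node_uuid and target_node_uuid before edges are resolved.

```python
from dataclasses import dataclass


@dataclass
class ExtractedEdge:  # hypothetical stand-in for an extracted edge
    source_node_uuid: str
    target_node_uuid: str
    fact: str


def remap_endpoints(edges: list[ExtractedEdge], uuid_map: dict[str, str]) -> list[ExtractedEdge]:
    """Point each edge at the canonical node UUIDs chosen during node resolution."""
    for edge in edges:
        # UUIDs missing from the map are kept unchanged.
        edge.source_node_uuid = uuid_map.get(edge.source_node_uuid, edge.source_node_uuid)
        edge.target_node_uuid = uuid_map.get(edge.target_node_uuid, edge.target_node_uuid)
    return edges


uuid_map = {"new-alice": "existing-alice"}
edges = remap_endpoints(
    [ExtractedEdge("new-alice", "new-acme", "Alice works at Acme")], uuid_map
)
```

Doing this before edge resolution means the "same pair of nodes" comparison in the edge fast path sees canonical UUIDs, not the temporary ones assigned at extraction time.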