Deduplication and Resolution
Relevant source files
The main source files referenced in this chapter:
- graphiti_core/prompts/dedupe_edges.py
- graphiti_core/prompts/dedupe_nodes.py
- graphiti_core/prompts/extract_edges.py
- graphiti_core/prompts/extract_nodes.py
- graphiti_core/prompts/summarize_nodes.py
- graphiti_core/utils/maintenance/edge_operations.py
- graphiti_core/utils/maintenance/node_operations.py
- tests/utils/maintenance/test_edge_operations.py
- tests/utils/maintenance/test_node_operations.py
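This document explains how Graphiti resolves duplicate entities and relationships during episode ingestion. Node deduplication uses a three-tier strategy (exact matching, fuzzy similarity, and LLM reasoning). Edge resolution handles deduplication and contradiction detection together and supports time-based invalidation.
Overview
During episode ingestion, extracted nodes and edges must be resolved against the entities already in the graph to prevent duplicates and keep the graph consistent. Deduplication runs in two stages: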
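- Node resolution: resolve_extracted_nodes graphiti_core/utils/maintenance/node_operations.py:31-31 first matches extracted entity nodes against existing nodes using deterministic heuristics, then escalates to LLM resolution for any node the heuristics cannot settle.
- Edge resolution: resolve_extracted_edges graphiti_core/utils/maintenance/edge_operations.py:225-225 checks extracted edges for duplicates and contradictions, and applies time-based invalidation to outdated information.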
The system prefers fast deterministic methods and calls the LLM only when necessary, which keeps API costs low while maintaining high accuracy.
Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/edge_operations.py:225-225
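Node Deduplication: Three-Tier Strategy
Architecture Overview
Node deduplication follows a tiered escalation strategy in which each tier handles only the cases the previous tier could not resolve. The main entry point for this logic is resolve_extracted_nodes graphiti_core/utils/maintenance/node_operations.py:31-31.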
Diagram: Tiered resolution flow
Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/dedup_helpers.py:161-175, graphiti_core/utils/maintenance/node_operations.py:217-222
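The escalation pattern described above can be sketched in a few lines. This is a minimal illustration, not Graphiti's code: `resolve_in_tiers`, `exact_tier`, and `stub_llm_tier` are hypothetical names, entities are reduced to plain strings, and the LLM tier is replaced by a hard-coded alias table.

```python
from typing import Callable, Optional

# A tier takes an extracted name and returns the matching existing name, or None.
Tier = Callable[[str], Optional[str]]


def resolve_in_tiers(names: list[str], tiers: list[Tier]) -> dict[str, Optional[str]]:
    """Run each tier only on the names that earlier tiers left unresolved."""
    resolved: dict[str, Optional[str]] = {name: None for name in names}
    unresolved = list(names)
    for tier in tiers:
        still_unresolved = []
        for name in unresolved:
            match = tier(name)
            if match is None:
                still_unresolved.append(name)
            else:
                resolved[name] = match
        unresolved = still_unresolved
        if not unresolved:
            break  # nothing left to escalate
    return resolved


existing = ["Alice Smith", "Acme Corporation"]


def exact_tier(name: str) -> Optional[str]:
    # Cheap deterministic check: case-insensitive, whitespace-collapsed equality.
    key = " ".join(name.lower().split())
    return next((e for e in existing if e.lower() == key), None)


def stub_llm_tier(name: str) -> Optional[str]:
    # Stand-in for an LLM call (assumption): a fixed alias table.
    return {"ACME Corp.": "Acme Corporation"}.get(name)


result = resolve_in_tiers(["alice  smith", "ACME Corp.", "Bob"], [exact_tier, stub_llm_tier])
```

Here the first name is settled by the cheap tier, the second only by the stubbed expensive tier, and "Bob" stays unresolved, which in this sketch means it would be treated as a new entity.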
Tier 1: Exact String Matching
The first tier normalizes entity names and compares them case-insensitively with whitespace collapsed. This happens in _resolve_with_similarity graphiti_core/utils/maintenance/dedup_helpers.py:196-196.
| Normalization function | Operation | Example |
|---|---|---|
| _normalize_string_exact() | Lowercase, collapse whitespace | " Alice Smith " → "alice smith" |
| _normalize_name_for_fuzzy() | Remove punctuation, lowercase | "Alice-Smith!" → "alice smith" |
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:39-49, graphiti_core/utils/maintenance/dedup_helpers.py:52-64
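The two normalizations in the table can be approximated as follows. This is a sketch under assumptions, not the library's implementation: the function names drop the leading underscore to mark them as illustrative, and the punctuation rule here keeps only ASCII letters, digits, and spaces, which may differ from what Graphiti actually keeps.

```python
import re


def normalize_exact(name: str) -> str:
    """Lowercase and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", name.lower()).strip()


def normalize_for_fuzzy(name: str) -> str:
    """Start from the exact form, then replace other characters with spaces."""
    # Assumption: only ASCII letters, digits, and spaces survive.
    cleaned = re.sub(r"[^a-z0-9 ]", " ", normalize_exact(name))
    return " ".join(cleaned.split())


exact_form = normalize_exact(" Alice   Smith ")
fuzzy_form = normalize_for_fuzzy("Alice-Smith!")
```

Both calls yield "alice smith", matching the examples in the table.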
The exact-match logic checks whether exactly one candidate node has the same normalized name:
```python
normalized_key = _normalize_string_exact(extracted_node.name)
candidates = indexes.normalized_existing[normalized_key]
if len(candidates) == 1:
    # Single exact match: resolve immediately
    state.resolved_nodes[idx] = candidates[0]
    state.uuid_map[extracted_node.uuid] = candidates[0].uuid
```
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:214-222
Tier 2: Fuzzy Similarity with MinHash and LSH
For entities that have no exact match, the system uses probabilistic hashing in _resolve_with_similarity graphiti_core/utils/maintenance/dedup_helpers.py:196-196 to find near-duplicates.
Diagram: Fuzzy resolution pipeline
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:88-140
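To make the fuzzy tier more concrete, the sketch below shows the general technique: character shingles, Jaccard similarity, a MinHash signature, and LSH banding. It is an illustration under assumptions, not Graphiti's code. The 3-character shingle size, 32 permutations, band size of 4, and the use of `hashlib.blake2b` as the hash family are all choices made here for the example.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the text with spaces removed."""
    compact = text.replace(" ", "")
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed: int, item: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], permutations: int = 32) -> tuple[int, ...]:
    """For each seed, keep the smallest hash value over all shingles."""
    if not items:
        return ()
    return tuple(min(_seeded_hash(seed, s) for s in items) for seed in range(permutations))


def lsh_bands(signature: tuple[int, ...], band_size: int = 4) -> list[tuple[int, ...]]:
    """Split the signature into bands; names sharing a band land in the same bucket."""
    return [signature[i : i + band_size] for i in range(0, len(signature), band_size)]


a, b = shingles("alice smith"), shingles("alice smyth")
similarity = jaccard(a, b)  # 5 shared shingles out of 11 distinct ones
bands = lsh_bands(minhash_signature(a))
```

In this scheme the bands only nominate candidates cheaply; the decision to merge would still rest on the exact Jaccard score of the shingle sets, and "alice smith" versus "alice smyth" scores well below a 0.9 threshold.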
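Entropy-based gating: short or low-entropy names (for example "Joe") skip fuzzy matching and escalate directly to LLM reasoning to avoid false positives. This is controlled by _has_high_entropy graphiti_core/utils/maintenance/dedup_helpers.py:79-79, which computes the Shannon entropy of the name's characters.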
| Constant | Value | Purpose |
|---|---|---|
| _NAME_ENTROPY_THRESHOLD | 1.5 | Minimum Shannon entropy for fuzzy matching |
| _MIN_NAME_LENGTH | 6 | Minimum name length for a fuzzy match to be trusted |
| _FUZZY_JACCARD_THRESHOLD | 0.9 | Minimum Jaccard similarity for resolution |
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:31-36, graphiti_core/utils/maintenance/dedup_helpers.py:79-85, graphiti_core/utils/maintenance/dedup_helpers.py:52-76
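The gate can be illustrated with a character-level Shannon entropy, H = -Σ p(c) · log2 p(c). The sketch below reuses the two threshold values from the table, but how the length and entropy checks are combined is an assumption made for this example, and `trust_fuzzy_match` is a hypothetical name.

```python
import math
from collections import Counter

NAME_ENTROPY_THRESHOLD = 1.5  # value from the table above
MIN_NAME_LENGTH = 6  # value from the table above


def shannon_entropy(text: str) -> float:
    """Entropy in bits of the character distribution, ignoring spaces."""
    compact = text.replace(" ", "")
    if not compact:
        return 0.0
    total = len(compact)
    counts = Counter(compact)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def trust_fuzzy_match(normalized_name: str) -> bool:
    """Assumed combination: the name must be long enough and varied enough."""
    if len(normalized_name) < MIN_NAME_LENGTH:
        return False
    return shannon_entropy(normalized_name) >= NAME_ENTROPY_THRESHOLD


short_name_ok = trust_fuzzy_match("joe")
full_name_ok = trust_fuzzy_match("alice smith")
repetitive_ok = trust_fuzzy_match("aaaaaaaa")
```

Under these assumptions "joe" is rejected by the length check, "aaaaaaaa" by the entropy check, and "alice smith" passes both, so only the last would be handed to fuzzy matching.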
Tier 3: LLM-Based Resolution
Entities still unresolved after the first two tiers are sent in a batch to the LLM for semantic reasoning by the _resolve_with_llm function graphiti_core/utils/maintenance/node_operations.py:29-29.
Diagram: LLM resolution sequence
Sources: graphiti_core/utils/maintenance/node_operations.py:29-29, graphiti_core/prompts/dedupe_nodes.py:117-117, graphiti_core/llm_client/llm_client.py:42-42
The LLM prompt receives extracted_nodes, existing_nodes, and the original episode_content, which provides context for disambiguation graphiti_core/prompts/dedupe_nodes.py:117-179.
Response model: NodeDuplicate graphiti_core/prompts/dedupe_nodes.py:25-34
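```python
from pydantic import BaseModel


class NodeDuplicate(BaseModel):
    id: int  # integer ID of the new entity
    name: str  # most complete name
    duplicate_candidate_id: int  # candidate ID of the matching existing entity, or -1
```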
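One way such responses might be turned back into resolutions is sketched below. It is not the library's code: `NodeDuplicateResult` is a standard-library stand-in with the same three fields, `apply_llm_results` is a hypothetical helper, entities are plain strings, and the defensive handling of out-of-range IDs is an assumption about what a caller would want.

```python
from dataclasses import dataclass


@dataclass
class NodeDuplicateResult:
    """Stdlib mirror of the three NodeDuplicate fields (illustrative only)."""

    id: int
    name: str
    duplicate_candidate_id: int


def apply_llm_results(
    extracted: list[str],
    candidates: list[str],
    results: list[NodeDuplicateResult],
) -> dict[str, str]:
    """Map each extracted name to an existing candidate, or to itself if new."""
    resolved: dict[str, str] = {}
    for result in results:
        if not 0 <= result.id < len(extracted):
            continue  # the model referenced an entity that was never sent
        name = extracted[result.id]
        if 0 <= result.duplicate_candidate_id < len(candidates):
            resolved[name] = candidates[result.duplicate_candidate_id]
        else:
            resolved[name] = name  # -1 or an invalid ID: keep as a new entity
    return resolved


mapping = apply_llm_results(
    extracted=["NYC", "Bob"],
    candidates=["New York City"],
    results=[
        NodeDuplicateResult(id=0, name="New York City", duplicate_candidate_id=0),
        NodeDuplicateResult(id=1, name="Bob", duplicate_candidate_id=-1),
        NodeDuplicateResult(id=7, name="Ghost", duplicate_candidate_id=0),
    ],
)
```

Validating both IDs before use matters because model output is not guaranteed to stay within the ranges that were presented in the prompt.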
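Edge Deduplication and Contradiction Detection
Integrated Resolution
Edge resolution is managed by resolve_extracted_edges graphiti_core/utils/maintenance/edge_operations.py:225-225, which combines duplicate detection and contradiction identification into a single LLM call using the resolve_edge prompt graphiti_core/prompts/dedupe_edges.py:43-43.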
Diagram: Edge resolution flow
Sources: graphiti_core/utils/maintenance/edge_operations.py:107-107, graphiti_core/prompts/dedupe_edges.py:43-91, graphiti_core/utils/maintenance/edge_operations.py:225-225
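Fast Path: Exact Fact Match
In resolve_extracted_edge graphiti_core/utils/maintenance/edge_operations.py:107-107, the system first checks whether the new fact exactly matches, after string normalization, an existing edge between the same pair of nodes. If a match is found, the LLM call is skipped and the new episode's UUID is appended to the existing edge tests/utils/maintenance/test_edge_operations.py:108-152.
Sources: graphiti_core/utils/maintenance/edge_operations.py:107-152, tests/utils/maintenance/test_edge_operations.py:108-152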
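The short-circuit can be sketched as follows. This is a simplified illustration rather than the function in edge_operations.py: the `Edge` dataclass, `normalize_fact`, and `try_fast_path` are hypothetical, and the normalization (lowercase plus whitespace collapsing) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Edge:
    """Simplified edge used only for this example."""

    source: str
    target: str
    fact: str
    episodes: list[str] = field(default_factory=list)


def normalize_fact(fact: str) -> str:
    return " ".join(fact.lower().split())


def try_fast_path(new_edge: Edge, existing_edges: list[Edge], episode_uuid: str) -> Optional[Edge]:
    """Return the matching existing edge, or None if the LLM path is needed."""
    for edge in existing_edges:
        same_endpoints = (edge.source, edge.target) == (new_edge.source, new_edge.target)
        if same_endpoints and normalize_fact(edge.fact) == normalize_fact(new_edge.fact):
            if episode_uuid not in edge.episodes:
                edge.episodes.append(episode_uuid)  # record the new supporting episode
            return edge
    return None


known = Edge("alice", "acme", "Alice works at Acme", episodes=["ep-1"])
hit = try_fast_path(Edge("alice", "acme", "alice  works at ACME"), [known], "ep-2")
miss = try_fast_path(Edge("alice", "acme", "Alice left Acme"), [known], "ep-3")
```

The first call reuses the existing edge and records "ep-2" on it; the second returns None, which in this sketch signals that the pair should go on to the LLM for duplicate and contradiction checks.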
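Time-Based Contradiction Resolution
When the LLM identifies a contradiction (through contradicted_facts in the EdgeDuplicate response graphiti_core/prompts/dedupe_edges.py:24-33), the system applies time-based invalidation. If the new fact is more recent according to its valid_at timestamp, the existing fact's invalid_at is set to the new fact's valid_at, which effectively retires the old information.
Sources: graphiti_core/prompts/dedupe_edges.py:24-33, graphiti_core/prompts/dedupe_edges.py:79-84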
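The invalidation rule can be expressed as a small function. This is a sketch, not Graphiti's implementation: the `Fact` dataclass and `invalidate_contradicted` are hypothetical, and skipping facts with a missing timestamp is an assumption, since the text above only covers the case where both timestamps are known.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    """Simplified stand-in for an edge with temporal bounds."""

    text: str
    valid_at: Optional[datetime] = None
    invalid_at: Optional[datetime] = None


def invalidate_contradicted(new_fact: Fact, contradicted: list[Fact]) -> list[Fact]:
    """Close the validity window of older facts that the new fact contradicts."""
    invalidated: list[Fact] = []
    if new_fact.valid_at is None:
        return invalidated  # assumption: without a timestamp, nothing is retired
    for old in contradicted:
        if old.valid_at is None or old.valid_at >= new_fact.valid_at:
            continue  # the old fact is not older than the new one
        old.invalid_at = new_fact.valid_at
        invalidated.append(old)
    return invalidated


old_job = Fact("Alice works at Acme", valid_at=datetime(2020, 1, 1, tzinfo=timezone.utc))
later_job = Fact("Alice works at Initech", valid_at=datetime(2025, 6, 1, tzinfo=timezone.utc))
new_job = Fact("Alice works at Globex", valid_at=datetime(2023, 6, 1, tzinfo=timezone.utc))
retired = invalidate_contradicted(new_job, [old_job, later_job])
```

Only the 2020 fact is retired here: its invalid_at becomes the new fact's valid_at, while the 2025 fact is left alone because it is more recent than the incoming one.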
Data Structures
DedupResolutionState
Tracks resolution progress for a batch of nodes graphiti_core/utils/maintenance/dedup_helpers.py:161-168.
```python
from dataclasses import dataclass, field

from graphiti_core.nodes import EntityNode


@dataclass
class DedupResolutionState:
    resolved_nodes: list[EntityNode | None]
    uuid_map: dict[str, str]
    unresolved_indices: list[int]
    duplicate_pairs: list[tuple[EntityNode, EntityNode]] = field(default_factory=list)
```
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:161-168
DedupCandidateIndexes
Stores the precomputed lookup structures used for exact and fuzzy matching graphiti_core/utils/maintenance/dedup_helpers.py:150-158.
```python
from collections import defaultdict
from dataclasses import dataclass

from graphiti_core.nodes import EntityNode


@dataclass
class DedupCandidateIndexes:
    existing_nodes: list[EntityNode]
    nodes_by_uuid: dict[str, EntityNode]
    normalized_existing: defaultdict[str, list[EntityNode]]
    shingles_by_candidate: dict[str, set[str]]
    lsh_buckets: defaultdict[tuple[int, tuple[int, ...]], list[str]]
```
Sources: graphiti_core/utils/maintenance/dedup_helpers.py:150-158
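As an illustration of why normalized_existing maps each key to a list, the sketch below builds a comparable index over plain name strings. `build_normalized_index` is a hypothetical helper, and the normalization is the simple lowercase-and-collapse form assumed for this example.

```python
from collections import defaultdict


def build_normalized_index(existing_names: list[str]) -> defaultdict[str, list[str]]:
    """Group existing names by their normalized form."""
    index: defaultdict[str, list[str]] = defaultdict(list)
    for name in existing_names:
        index[" ".join(name.lower().split())].append(name)
    return index


index = build_normalized_index(["Alice Smith", "alice  smith", "Bob Jones"])
unique_hit = index.get("bob jones")
ambiguous_hit = index.get("alice smith")
no_hit = index.get("carol")
```

A key with exactly one entry can be resolved immediately, while a key with several entries is ambiguous and is better left for a later tier. Using `.get` for lookups avoids the side effect of a defaultdict creating empty entries for names that were never indexed.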
Integration with Episode Processing
Deduplication is a core part of the processing pipeline. After nodes are extracted by extract_nodes graphiti_core/utils/maintenance/node_operations.py:69-69, they are resolved. The resulting uuid_map is then used to update the source_node_uuid and target_node_uuid of the extracted edges before the edges themselves are resolved by resolve_extracted_edges graphiti_core/utils/maintenance/edge_operations.py:225-225.
Sources: graphiti_core/utils/maintenance/node_operations.py:69-148, graphiti_core/utils/maintenance/edge_operations.py:225-232
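The remapping step between node and edge resolution can be sketched like this. `ExtractedEdge` and `remap_edge_endpoints` are hypothetical names for illustration; only the two field names source_node_uuid and target_node_uuid come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedEdge:
    """Simplified edge carrying only the fields needed for remapping."""

    source_node_uuid: str
    target_node_uuid: str
    fact: str


def remap_edge_endpoints(edges: list[ExtractedEdge], uuid_map: dict[str, str]) -> list[ExtractedEdge]:
    """Point each edge at the resolved node UUIDs; unmapped UUIDs stay unchanged."""
    for edge in edges:
        edge.source_node_uuid = uuid_map.get(edge.source_node_uuid, edge.source_node_uuid)
        edge.target_node_uuid = uuid_map.get(edge.target_node_uuid, edge.target_node_uuid)
    return edges


# "tmp-1" was extracted as a new node but resolved to the existing node "node-42".
uuid_map = {"tmp-1": "node-42"}
edges = remap_edge_endpoints([ExtractedEdge("tmp-1", "tmp-2", "Alice works at Acme")], uuid_map)
```

Doing this before edge resolution matters because the fast-path and duplicate checks compare edges between the same pair of nodes, which only works once both endpoints refer to the canonical node UUIDs.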