About NebulaGraph

NebulaGraph is a reliable, distributed, linearly scalable, high-performance graph database solution. It excels at handling super-large datasets with hundreds of billions of vertices and trillions of edges while keeping query latency at the millisecond level.

Deploying a NebulaGraph Cluster

Deploying a NebulaGraph cluster with Docker Compose

Reference: https://docs.nebula-graph.com.cn/

With Docker Compose you can quickly deploy NebulaGraph services from ready-made configuration files. This approach is recommended only for trying out NebulaGraph features.

Host Linux system: Ubuntu 20.04.6

Installing Docker

Install using the Apt repository

Reference: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository

  1. Set up Docker's Apt repository.
# Add Docker's official GPG key:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl gnupg
$ sudo install -m 0755 -d /etc/apt/keyrings
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
$ sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
$ echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
  2. Install the latest Docker packages.
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  3. Verify that the Docker Engine installation succeeded.
$ sudo docker run hello-world

Installing Docker Compose

Reference: https://docs.docker.com/compose/install/standalone/

  1. Download and install the Compose standalone binary.
$ sudo curl -SL https://github.com/docker/compose/releases/download/v2.20.3/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
  2. Verify that the Compose standalone installation succeeded.
$ docker-compose

Deploying NebulaGraph

  1. Clone the desired branch of the nebula-docker-compose repository via Git and change into the working directory.
$ git clone -b release-3.6 https://github.com/vesoft-inc/nebula-docker-compose.git

Note: this command failed with GnuTLS recv error (-54): Error in the pull function (unresolved). Changing release-3.6 to release-3.5 then hit a connection timeout (also unresolved); changing release-3.6 to release-3.4 cloned the repository successfully (root cause unknown).

$ cd nebula-docker-compose
  2. Start the NebulaGraph services.
$ docker-compose up -d

Connecting to NebulaGraph (two ways)

  1. Connect via Nebula Console from outside the containers.

    Because the container configuration file also pins the Graph service's externally mapped port to 9669, you can connect directly on the default port.

    • Download the NebulaGraph Console binary for the required version.

    • Optional: rename the file to nebula-console for convenience.

    • Grant execute permission on the nebula-console file: sudo chmod 111 nebula-console

    • Run nebula-console to connect to NebulaGraph:

      $ ./nebula-console -addr 127.0.0.1 -port 9669 -u root -p nebula
  2. Log in to the container that ships with NebulaGraph Console, then connect to the Graph service.

    $ docker exec -it nebuladockercompose_console_1 /bin/sh
    / # ./usr/local/bin/nebula-console -u root -p nebula --address=graphd --port=9669

Checking NebulaGraph service status and ports

NebulaGraph serves clients on port 9669 by default. To change the port, edit the docker-compose.yaml file inside the nebula-docker-compose directory, then restart the NebulaGraph services.

$ docker-compose ps

Stopping the NebulaGraph services

$ sudo docker-compose down

On Combining LLMs with KGs

Overview

Key takeaways

The demo explores how LLMs and KGs can be combined, and the concrete implementations have already been merged into langchain and llama_index. It covers roughly three directions (the first uses an LLM to help build a KG; the latter two use a KG to augment an LLM):

  • Construct KG via LLM: extract knowledge (triplets) from existing documents according to fixed rules and store it in a graph database
  • Text2Cypher: for a user's question, the LLM translates the natural-language query into a Cypher query against the graph database; the query result is then synthesized by the LLM into the final answer (similar to the separate write-up on wiring the Hubble frontend to an LLM)
  • Graph RAG: for a user's question, the LLM extracts the entities it involves and uses them to fetch a question-relevant sub-KG from the KG as context, which the LLM then interprets to produce the answer (two usage scenarios: build the KG from existing documents with a framework/tool such as llama_index and use it, or work against an existing KG; the demo implements both)
    • Versus Vector RAG: the difference from Graph RAG is that the documents (the raw documents the KG was built from) are split into chunks, and vector search selects the K most semantically relevant chunks as context
    • Graph + Vector RAG as an experiment: combine Graph RAG with Vector RAG (and compare it against Vector RAG alone and Graph RAG alone)

Experimental findings

The original author ran comparative experiments on the demo; the results indicate:

  • Pure-KG-based versus the rest:
    • Text2Cypher and Graph RAG both give precise answers, and their token cost is far lower than that of Vector RAG and Graph + Vector RAG
      • Regarding token cost, the author's experiment document only compares the token counts of the responses, not of the prompts
      • The pure-KG-based answers are shorter and more concise (the author presumably means their information density is higher)
    • Graph + Vector RAG gives a more comprehensive and complete answer when the knowledge a question touches is spread across different passages of the documents (i.e., it is likely better suited to complex questions)
  • Text2Cypher versus Graph RAG:
    • Text2Cypher answers from the query result alone, while Graph RAG answers from all related context; hence Text2Cypher does better when the answer itself is a specific, fragmentary piece of information, and Graph RAG does better otherwise
    • Prefer Graph RAG when:
      • potentially related information needs to be considered
      • the KG schema is relatively complex
      • the data quality of the KG itself is low
      • the question contains multiple Starting Entities

Detailed reproduction steps

  1. Set environment variables.
import os

os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxx'
os.environ['NEBULA_USER'] = 'root'
os.environ['NEBULA_PASSWORD'] = 'nebula'
os.environ['NEBULA_ADDRESS'] = '127.0.0.1:9669'
  2. Create the graph space and define the schema (in nebula-console).
CREATE SPACE guardians(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
USE guardians;
CREATE TAG entity(name string);
CREATE EDGE relationship(relationship string);
CREATE TAG INDEX entity_index ON entity(name(256));
  3. Set up and prepare the LLM service.
from llama_index import LLMPredictor
from llama_index.llms import OpenAI
from llama_index import ServiceContext

llm = OpenAI(model='text-davinci-002', temperature=0)
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
  4. Set up and prepare the graph store (backed by NebulaGraph).
from llama_index.graph_stores import NebulaGraphStore
from llama_index.storage.storage_context import StorageContext

os.environ['NEBULA_USER'] = 'root'
os.environ['NEBULA_PASSWORD'] = 'nebula'
os.environ['NEBULA_ADDRESS'] = '127.0.0.1:9669'
space_name = 'guardians'
edge_types, rel_prop_names = ['relationship'], ['relationship']
tags = ['entity']
graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
  5. Prepare the sample document data.
from llama_index import download_loader

WikipediaReader = download_loader('WikipediaReader')
loader = WikipediaReader()
documents = loader.load_data(pages=['Guardians of the Galaxy Vol. 3'], auto_suggest=False)
  6. Extract knowledge from the documents and store it in the graph database (for Cypher queries & Graph RAG).

This is the core step of Construct KG via LLM.

from llama_index import KnowledgeGraphIndex

kg_index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)
  7. Build a vector index over the documents (for Vector RAG).
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(
    documents=documents,
    service_context=service_context,
)
  8. Persist the data to disk so it can be reloaded later (optional).
from llama_index import load_index_from_storage
from llama_index.storage.storage_context import StorageContext

kg_index.storage_context.persist(persist_dir='./storage_graph')
vector_index.storage_context.persist(persist_dir='./storage_vector')

storage_context = StorageContext.from_defaults(
    persist_dir='./storage_graph',
    graph_store=graph_store,
)
kg_index = load_index_from_storage(
    storage_context=storage_context,
    service_context=service_context,
    max_triplets_per_chunk=10,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

storage_context_vector = StorageContext.from_defaults(persist_dir='./storage_vector')
vector_index = load_index_from_storage(
    service_context=service_context,
    storage_context=storage_context_vector,
)
  9. Build Text2Cypher.

This is the core step of Text2Cypher.

from llama_index.query_engine import KnowledgeGraphQueryEngine
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore

nl2kg_query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)
  10. Build Graph RAG.

This is the core step of Graph RAG.

kg_rag_query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode='keyword',
    response_mode='tree_summarize',
)

# OR
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)
kg_rag_query_engine = RetrieverQueryEngine.from_args(
    retriever=graph_rag_retriever,
    service_context=service_context,
)
  11. Build Vector RAG.
vector_rag_query_engine = vector_index.as_query_engine()
  12. Build Graph + Vector RAG.
from llama_index import QueryBundle
from llama_index.schema import NodeWithScore
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever, KGTableRetriever

from typing import List


class CustomRetriever(BaseRetriever):
    """Custom retriever that performs both Vector search and Knowledge Graph search."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        kg_retriever: KGTableRetriever,
        mode: str = "OR",
    ) -> None:
        """Init params."""
        self._vector_retriever = vector_retriever
        self._kg_retriever = kg_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        kg_nodes = self._kg_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        kg_ids = {n.node.node_id for n in kg_nodes}

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in kg_nodes})

        retrieve_ids = vector_ids.intersection(kg_ids) if self._mode == "AND" else vector_ids.union(kg_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes


from llama_index import get_response_synthesizer
from llama_index.query_engine import RetrieverQueryEngine

# create custom retriever
vector_retriever = VectorIndexRetriever(index=vector_index)
kg_retriever = KGTableRetriever(
    index=kg_index,
    retriever_mode="keyword",
    include_text=False,
)
custom_retriever = CustomRetriever(vector_retriever, kg_retriever)

# create response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    response_mode="tree_summarize",
)

graph_vector_rag_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)
  13. Ask questions and get answers.
separator = '\n' + '#'*50 + '\n'

response_nl2kg = nl2kg_query_engine.query("Tell me about Peter Quill.")
print(response_nl2kg, end=separator)

graph_query = nl2kg_query_engine.generate_query("Tell me about Peter Quill?")
print(graph_query.replace("WHERE", "\n WHERE").replace("RETURN", "\nRETURN"), end=separator)

response_graph_rag = kg_rag_query_engine.query("Tell me about Peter Quill.")
print(response_graph_rag, end=separator)

response_vector_rag = vector_rag_query_engine.query("Tell me about Peter Quill.")
print(response_vector_rag, end=separator)

response_graph_vector_rag = graph_vector_rag_query_engine.query("Tell me about Peter Quill.")
print(response_graph_vector_rag, end=separator)

After consolidating the code, the complete reproduction script is as follows:

from typing import List, Optional, Tuple

from llama_index import KnowledgeGraphIndex, VectorStoreIndex
from llama_index import QueryBundle
from llama_index import ServiceContext
from llama_index import download_loader
from llama_index import get_response_synthesizer
from llama_index import load_index_from_storage
from llama_index.graph_stores import NebulaGraphStore
from llama_index.indices.knowledge_graph.retrievers import KGRetrieverMode
from llama_index.indices.query.base import BaseQueryEngine
from llama_index.llms import OpenAI
from llama_index.llms.base import LLM
from llama_index.response_synthesizers import ResponseMode
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever, KGTableRetriever
from llama_index.schema import NodeWithScore
from llama_index.storage.storage_context import StorageContext
from llama_index.query_engine import RetrieverQueryEngine, KnowledgeGraphQueryEngine
from llama_index.retrievers import KnowledgeGraphRAGRetriever


class CustomRetriever(BaseRetriever):
    """Custom retriever that performs both Vector search and Knowledge Graph search."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        kg_retriever: KGTableRetriever,
        mode: str = "OR",
    ) -> None:
        """Init params."""
        self._vector_retriever = vector_retriever
        self._kg_retriever = kg_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        kg_nodes = self._kg_retriever.retrieve(query_bundle)
        vector_ids = {n.node.node_id for n in vector_nodes}
        kg_ids = {n.node.node_id for n in kg_nodes}
        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in kg_nodes})
        retrieve_ids = vector_ids.intersection(kg_ids) if self._mode == "AND" else vector_ids.union(kg_ids)
        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes


def prepare_llm() -> LLM:
    return OpenAI(model='text-davinci-002', temperature=0)


def prepare_service_context(llm: LLM) -> ServiceContext:
    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
    return service_context


def prepare_storage_context(mode: str = 'reload') -> Tuple[tuple, StorageContext]:
    space_name = 'guardians'
    edge_types, rel_prop_names = ['relationship'], ['relationship']
    tags = ['entity']
    graph_store = NebulaGraphStore(
        space_name=space_name,
        edge_types=edge_types,
        rel_prop_names=rel_prop_names,
        tags=tags,
    )
    if mode == 'reload':
        storage_context = StorageContext.from_defaults(
            persist_dir='./storage_graph',
            graph_store=graph_store,
        )
    else:
        storage_context = StorageContext.from_defaults(graph_store=graph_store)
    return (space_name, edge_types, rel_prop_names, tags), storage_context


def prepare_data() -> None:
    llm = prepare_llm()
    service_context = prepare_service_context(llm)
    kg_settings, storage_context = prepare_storage_context(mode='prepare')
    space_name, edge_types, rel_prop_names, tags = kg_settings
    WikipediaReader = download_loader('WikipediaReader')
    loader = WikipediaReader()
    documents = loader.load_data(
        pages=['Guardians of the Galaxy Vol. 3'],
        auto_suggest=False,
    )
    kg_index = KnowledgeGraphIndex.from_documents(
        documents=documents,
        storage_context=storage_context,
        max_triplets_per_chunk=10,
        service_context=service_context,
        space_name=space_name,
        edge_types=edge_types,
        rel_prop_names=rel_prop_names,
        tags=tags,
        include_embeddings=True,
    )
    kg_index.storage_context.persist(persist_dir='./storage_graph')
    vector_index = VectorStoreIndex.from_documents(
        documents=documents,
        service_context=service_context,
    )
    vector_index.storage_context.persist(persist_dir='./storage_vector')


def get_query_engine(method: str, llm: LLM) -> Optional[BaseQueryEngine]:
    if method == 'graph-rag-prebuilt':
        service_context = prepare_service_context(llm=llm)
        kg_settings, storage_context = prepare_storage_context()
        space_name, edge_types, rel_prop_names, tags = kg_settings
        kg_index = load_index_from_storage(
            storage_context=storage_context,
            service_context=service_context,
            max_triplets_per_chunk=10,
            space_name=space_name,
            edge_types=edge_types,
            rel_prop_names=rel_prop_names,
            tags=tags,
            include_embeddings=True,
        )
        query_engine = kg_index.as_query_engine(
            include_text=False,
            retriever_mode='keyword',
            response_mode='tree_summarize',
            verbose=True,
        )
    elif method == 'graph-rag-existing':
        service_context = prepare_service_context(llm=llm)
        _, storage_context = prepare_storage_context(mode='no-index')
        graph_rag_retriever = KnowledgeGraphRAGRetriever(
            storage_context=storage_context,
            service_context=service_context,
            llm=llm,
            verbose=True,
        )
        query_engine = RetrieverQueryEngine.from_args(
            retriever=graph_rag_retriever,
            service_context=service_context,
        )
    elif method == 'vector-rag':
        service_context = prepare_service_context(llm=llm)
        storage_context_vector = StorageContext.from_defaults(persist_dir='./storage_vector')
        vector_index = load_index_from_storage(
            service_context=service_context,
            storage_context=storage_context_vector,
        )
        query_engine = vector_index.as_query_engine()
    elif method == 'text-to-cypher':
        service_context = prepare_service_context(llm=llm)
        _, storage_context = prepare_storage_context()
        query_engine = KnowledgeGraphQueryEngine(
            storage_context=storage_context,
            service_context=service_context,
            llm=llm,
            verbose=True,
        )
    elif method in ('graph-vector-rag', 'vector-graph-rag'):  # fixed: `== 'a' or 'b'` was always truthy
        service_context = prepare_service_context(llm=llm)
        storage_context_vector = StorageContext.from_defaults(persist_dir='./storage_vector')
        vector_index = load_index_from_storage(
            service_context=service_context,
            storage_context=storage_context_vector,
        )
        kg_settings, storage_context = prepare_storage_context()
        space_name, edge_types, rel_prop_names, tags = kg_settings
        kg_index = load_index_from_storage(
            storage_context=storage_context,
            service_context=service_context,
            max_triplets_per_chunk=10,
            space_name=space_name,
            edge_types=edge_types,
            rel_prop_names=rel_prop_names,
            tags=tags,
            include_embeddings=True,
        )
        # create custom retriever
        vector_retriever = VectorIndexRetriever(index=vector_index)
        kg_retriever = KGTableRetriever(
            index=kg_index,
            retriever_mode=KGRetrieverMode.KEYWORD,
            include_text=False,
        )
        custom_retriever = CustomRetriever(vector_retriever, kg_retriever)
        # create response synthesizer
        response_synthesizer = get_response_synthesizer(
            service_context=service_context,
            response_mode=ResponseMode.TREE_SUMMARIZE,
        )
        query_engine = RetrieverQueryEngine(
            retriever=custom_retriever,
            response_synthesizer=response_synthesizer,
        )
    else:
        print('Invalid method!')
        query_engine = None
    return query_engine


def print_response(response: str) -> None:
    result = response.split('.')
    for line in result:
        sentence = f'{line.strip()}.'
        if len(sentence.strip()) > 1:
            print(sentence)


if __name__ == '__main__':
    # prepare_data()
    llm = prepare_llm()
    # query_engine = get_query_engine(method='vector-rag', llm=llm)
    # query_engine = get_query_engine(method='text-to-cypher', llm=llm)
    # query_engine = get_query_engine(method='graph-rag-prebuilt', llm=llm)
    # query_engine = get_query_engine(method='graph-rag-existing', llm=llm)
    query_engine = get_query_engine(method='graph-vector-rag', llm=llm)
    if query_engine is not None:
        response = query_engine.query('Tell me about Peter Quill.')
        print_response(response.response)

Source-code implementation details

How Text2Cypher works

  • The service_context object supplies the LLM service context (another LLM can be substituted for the OpenAI-backed one)
  • The storage_context object supplies the graph-store context
    • Implemented by the NebulaGraphStore class (./llama_index/graph_stores/nebulagraph.py)
    • Inherits from the GraphStore class (./llama_index/graph_stores/types.py)
  • The query_engine object exposes the interface for querying the graph database
    • Implemented by the KnowledgeGraphQueryEngine class (./llama_index/query_engine/knowledge_graph_query_engine.py)
    • A KnowledgeGraphQueryEngine is instantiated with the service_context and storage_context objects as parameters
    • When talking to the LLM it uses a prompt template customized for NebulaGraph (./llama_index/query_engine/knowledge_graph_query_engine.py)
  • The query_engine object's query method handles prompt construction and the interaction with both the LLM and the KG, in two phases (see the sketch below):
    • translate the natural-language question into a graph query and run it against the graph database to get results (Retrieve)
    • turn the query result into a natural-language answer (Response Synthesize)
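
A minimal sketch of the two phases, assuming the nl2kg_query_engine and graph_store objects built in the reproduction steps above. generate_query appears in the reproduction code; that GraphStore exposes a query method for running raw graph queries is my reading of the source and worth double-checking.

question = "Tell me about Peter Quill?"

# Phase 1 (Retrieve): the LLM renders the question as a NebulaGraph Cypher
# query; generate_query exposes just this step so the query can be inspected.
graph_query = nl2kg_query_engine.generate_query(question)
records = graph_store.query(graph_query)  # raw rows from NebulaGraph
print(graph_query, records)

# Phase 2 (Response Synthesize): query() chains both phases and lets the LLM
# phrase the query result as a natural-language answer.
response = nl2kg_query_engine.query(question)
print(response)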

Anatomy of the Text2Cypher prompt

Task statement: generate a graph-database query from the natural-language question.

Supporting information: the relevant graph schema; the natural-language question.

Constraints: only the relationship types and properties in the provided schema may be used.

Special notes: the rules of NebulaGraph's Cypher dialect (prose description plus worked examples).

  • Compared with the customized prompt template for the Neo4j graph database, the special notes could also add:
    • do not include explanations or apologies in the answer
    • do not answer any question that asks for anything other than constructing a Cypher statement
    • do not include anything in the answer besides the generated Cypher statement
# ./llama_index/query_engine/knowledge_graph_query_engine.py

DEFAULT_NEBULAGRAPH_NL2CYPHER_PROMPT_TMPL = """
Generate NebulaGraph query from natural language.
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
---
{schema}
---
Note: NebulaGraph speaks a dialect of Cypher, comparing to standard Cypher:

it uses double equals sign for comparison: == rather than =
it needs explicit label specification when referring to node properties, i.e.
v is a variable of a node, and we know its label is Foo, v.foo.name is correct
while v.name is not.

For example, see this diff between standard and NebulaGraph Cypher dialect:
```diff
< MATCH (p:person)-[:directed]->(m:movie) WHERE m.name = 'The Godfather'
< RETURN p.name;
---
> MATCH (p:`person`)-[:directed]->(m:`movie`) WHERE m.`movie`.`name` == 'The Godfather'
> RETURN p.`person`.`name`;
```

Question: {query_str}

NebulaGraph Cypher dialect query:
"""
DEFAULT_NEBULAGRAPH_NL2CYPHER_PROMPT = PromptTemplate(
    DEFAULT_NEBULAGRAPH_NL2CYPHER_PROMPT_TMPL,
    prompt_type=PromptType.TEXT_TO_GRAPH_QUERY,
)
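
To apply the Neo4j-style constraints listed above, a customized template could be passed in place of the default. This is only a sketch: the graph_query_synthesis_prompt parameter is my assumption about the KnowledgeGraphQueryEngine constructor in this llama_index version, so verify it against the source before relying on it.

from llama_index.prompts import PromptTemplate
from llama_index.query_engine import KnowledgeGraphQueryEngine

# Hypothetical custom template: the default NebulaGraph instructions plus the
# Neo4j-style output constraints discussed above.
CUSTOM_NL2CYPHER_TMPL = """
Generate NebulaGraph query from natural language.
Use only the provided relationship types and properties in the schema.
Do not include explanations or apologies in your response.
Do not include anything other than the generated Cypher statement.
Schema:
---
{schema}
---
Question: {query_str}

NebulaGraph Cypher dialect query:
"""

nl2kg_query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    # assumed parameter name for overriding the NL-to-Cypher template
    graph_query_synthesis_prompt=PromptTemplate(CUSTOM_NL2CYPHER_TMPL),
    verbose=True,
)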

How Graph RAG works

Graph RAG for LlamaIndex Built KG

For a KG that LlamaIndex built from existing documents, the relevant index was saved at build time; it is wrapped by the KnowledgeGraphIndex class (./llama_index/indices/knowledge_graph/base.py), which inherits from the BaseIndex class (./llama_index/indices/base.py).

  • Calling the index object's as_query_engine() yields a RetrieverQueryEngine object (./llama_index/query_engine/retriever_query_engine.py), which inherits from the BaseQueryEngine class (./llama_index/indices/query/base.py)
    • The core is still a Retriever plus a Response Synthesizer
    • The Retriever role is played by a KGTableRetriever object (./llama_index/indices/knowledge_graph/retrievers.py), inheriting from the BaseRetriever class (./llama_index/indices/base_retriever.py)
    • The concrete Response Synthesizer is selected by the response_mode parameter; this case uses TreeSummarize (./llama_index/response_synthesizers/tree_summarize.py), inheriting from BaseSynthesizer (./llama_index/response_synthesizers/base.py)
  • Retrieval (see the _retrieve method of the KGTableRetriever class in ./llama_index/indices/knowledge_graph/retrievers.py; a sketch follows this list):
    • distill keywords related to the question (done by the LLM)
    • find the entities matching those keywords
    • expand a SubGraph from the matched entities to a fixed depth
    • convert the SubGraph into context (stringification and similar processing)
  • Answer synthesis: see the get_response method of the TreeSummarize class in ./llama_index/response_synthesizers/tree_summarize.py
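
A sketch of the retrieval half in isolation, assuming the kg_index built earlier: as_retriever() hands back the same kind of KGTableRetriever that as_query_engine() wires in, so the extracted SubGraph context can be inspected before any answer is synthesized.

kg_retriever = kg_index.as_retriever(
    retriever_mode='keyword',  # question -> keywords -> matching entities
    include_text=False,        # return SubGraph triplets, not source chunks
)
nodes = kg_retriever.retrieve('Tell me about Peter Quill.')
for node_with_score in nodes:
    # each retrieved node carries the stringified SubGraph used as context
    print(node_with_score.node.get_content())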

Graph RAG for Existing KG

The flow is largely the same as Graph RAG for LlamaIndex Built KG; the differences are:

  • In the former, the query_engine (including its retriever) is built directly from the index, and its retriever is a KGTableRetriever object
  • In the latter, the retriever and query_engine must be built by hand, and the retriever is a KnowledgeGraphRAGRetriever object
  • The two retrievers are implemented differently; see ./llama_index/indices/knowledge_graph/retrievers.py

Note: related entities can be found either by keyword extraction or by embedding lookup, controlled by the retriever_mode parameter of KnowledgeGraphRAGRetriever. The supported options are listed below, followed by a short usage sketch:

  • keyword
  • embedding (not yet implemented)
  • keyword_embedding (not yet implemented)
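
A small sketch, assuming the storage_context, service_context, and llm objects from the reproduction steps; passing retriever_mode explicitly is shown for illustration, and only 'keyword' is usable until the other modes are implemented.

from llama_index.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    retriever_mode='keyword',  # 'embedding' / 'keyword_embedding' not yet implemented
    verbose=True,
)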

Sources