A comprehensive RAG cheat sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds.

Basic RAG

Mainstream RAG involves retrieving documents from an external knowledge base and passing them, together with the user's query, to a large language model (LLM) to generate a response.
In other words, RAG consists of three components: a retrieval component, an external knowledge base, and a generation component.

LlamaIndex Basic RAG example:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# load data
documents = SimpleDirectoryReader(input_dir="...").load_data()

# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex.from_documents(documents=documents)

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

# Use your Default RAG
response = query_engine.query("A user's query")

Success Requirements for RAG

For a RAG system to be deemed a success (in the sense of providing useful and relevant answers to user queries), there are really only two high-level requirements:

  • Retrieval must find the documents that are most relevant to the user's query.
  • Generation must make good use of the retrieved documents to effectively answer the user's query.

Advanced RAG

Once the requirements for success are clear, we can say that building advanced RAG is really about applying more sophisticated techniques and strategies (to the retrieval or generation components) to ensure these requirements are ultimately met.
Furthermore, we can categorize these sophisticated techniques into two kinds: those that more or less independently address one of the two success requirements, and those that address both simultaneously.

Retrieval

Next, we briefly describe several of the more advanced techniques that help satisfy the first requirement:

Chunk-Size Optimization

Since LLMs have limited context lengths, documents must be chunked when building the external knowledge base. Chunks that are too large or too small can create problems for the generation component and lead to inaccurate responses.

LlamaIndex Chunk Size Optimization example:

import numpy as np

from llama_index.param_tuner.base import ParamTuner, RunResult
from llama_index.evaluation import SemanticSimilarityEvaluator, BatchEvalRunner
from llama_index.evaluation.eval_utils import get_responses

### Recipe
### Perform hyperparameter tuning as in traditional ML via grid-search
### 1. Define an objective function that ranks different parameter combos
### 2. Build ParamTuner object
### 3. Execute hyperparameter tuning with ParamTuner.tune()

# 1. Define objective function
def objective_function(params_dict):
    chunk_size = params_dict["chunk_size"]
    docs = params_dict["docs"]
    top_k = params_dict["top_k"]
    eval_qs = params_dict["eval_qs"]
    ref_response_strs = params_dict["ref_response_strs"]

    # build RAG pipeline
    index = _build_index(chunk_size, docs)  # helper function (sketched below)
    query_engine = index.as_query_engine(similarity_top_k=top_k)

    # perform inference with the RAG pipeline on the provided questions `eval_qs`
    pred_response_objs = get_responses(
        eval_qs, query_engine, show_progress=True
    )

    # perform evaluations of predictions by comparing them to the reference
    # responses `ref_response_strs`
    evaluator = SemanticSimilarityEvaluator(...)
    eval_batch_runner = BatchEvalRunner(
        {"semantic_similarity": evaluator}, workers=2, show_progress=True
    )
    eval_results = eval_batch_runner.evaluate_responses(
        eval_qs, responses=pred_response_objs, reference=ref_response_strs
    )

    # get semantic similarity metric
    mean_score = np.array(
        [r.score for r in eval_results["semantic_similarity"]]
    ).mean()

    return RunResult(score=mean_score, params=params_dict)

# 2. Build ParamTuner object
# (docs, eval_qs and ref_response_strs are prepared beforehand)
param_dict = {"chunk_size": [256, 512, 1024]}
fixed_param_dict = {
    "top_k": 2,
    "docs": docs,
    "eval_qs": eval_qs[:10],
    "ref_response_strs": ref_response_strs[:10],
}
param_tuner = ParamTuner(
    param_fn=objective_function,
    param_dict=param_dict,
    fixed_param_dict=fixed_param_dict,
    show_progress=True,
)

# 3. Execute hyperparameter search
results = param_tuner.tune()
best_result = results.best_run_result
best_chunk_size = results.best_run_result.params["chunk_size"]
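
The snippet above leaves `_build_index` undefined; a minimal sketch of such a helper (our own assumption of what it looks like, following the chunking setup used elsewhere in this post, not part of the original recipe) could be:

from llama_index import ServiceContext, VectorStoreIndex

# hypothetical helper: build a VectorStoreIndex whose ServiceContext
# chunks the documents at the given chunk_size
def _build_index(chunk_size, docs):
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size)
    return VectorStoreIndex.from_documents(
        docs, service_context=service_context
    )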

Structured External Knowledge

In complex cases, the external knowledge base may need more structure than a basic vector index provides, so that recursive retrieval or routed retrieval can be performed over distinctly different external knowledge sources.

LlamaIndex Recursive Retrieval example:

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

### Recipe
### Build a recursive retriever that retrieves using small chunks
### but passes associated larger chunks to the generation stage

# load data
documents = SimpleDirectoryReader(
    input_file="..."
).load_data()

# build parent chunks via NodeParser
node_parser = SentenceSplitter(chunk_size=1024)
base_nodes = node_parser.get_nodes_from_documents(documents)

# define smaller child chunks
sub_chunk_sizes = [256, 512]
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=20) for c in sub_chunk_sizes
]
all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)
    # also add the original (parent) node to the collection
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)

# define a VectorStoreIndex with all of the nodes
service_context = ServiceContext.from_defaults()  # default LLM and embeddings
vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context
)
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

# build RecursiveRetriever that follows child chunks back to their parents
all_nodes_dict = {n.node_id: n for n in all_nodes}
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

# build RetrieverQueryEngine using the recursive retriever
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context=service_context
)

# perform inference with advanced RAG (i.e. query engine)
response = query_engine_chunk.query(
    "Can you tell me about the key concepts for safety finetuning"
)

We provide guides that show how other advanced techniques can be applied in complex cases to ensure accurate retrieval. Here is a selection of links to them:

Generation

As in the previous section, we provide examples of advanced techniques whose purpose is to ensure that the retrieved documents align well with the LLM generator.

Information Compression

LLMs are not only constrained by limited context lengths; too much irrelevant information (i.e. noise) in the retrieved documents can also degrade the quality of the generated response.

LlamaIndex Information Compression example:

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.postprocessor import LongLLMLinguaPostprocessor

### Recipe
### Define a Postprocessor object, here LongLLMLinguaPostprocessor
### Build QueryEngine that uses this Postprocessor on retrieved docs

# Define Postprocessor that compresses the retrieved context
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reordering
    },
)

# Define VectorStoreIndex
documents = SimpleDirectoryReader(input_dir="...").load_data()
index = VectorStoreIndex.from_documents(documents)

# Define QueryEngine
retriever = index.as_retriever(similarity_top_k=2)
retriever_query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[node_postprocessor]
)

# Use your advanced RAG
response = retriever_query_engine.query("A user's query")

Result Re-Rank

LLMs exhibit the so-called "lost in the middle" phenomenon, in which the model attends mainly to the two ends of the input prompt. Because of this, it helps to re-rank the retrieved documents before passing them to the generation component.

LlamaIndex Re-Ranking For Better Generation example:

import os

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.postprocessor.cohere_rerank import CohereRerank

### Recipe
### Define a Postprocessor object, here CohereRerank
### Build QueryEngine that uses this Postprocessor on retrieved docs

# Build CohereRerank post retrieval processor
api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=2)

# Build QueryEngine (RAG) using the post processor
documents = SimpleDirectoryReader("...").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)

# Use your advanced RAG
response = query_engine.query(
    "What did Sam Altman do in this essay?"
)

Retrieval & Generation

In this section, we look at advanced methods that exploit the interplay between retrieval and generation, aiming to improve retrieval while also producing more accurate responses to user queries.

Generator-Enhanced Retrieval

These methods use the LLM's inherent reasoning abilities to refine the user's query before retrieval takes place, so as to pinpoint exactly what is needed to generate a useful response.

LlamaIndex Generator-Enhanced Retrieval example:

from llama_index.llms import OpenAI
from llama_index.query_engine import FLAREInstructQueryEngine
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
)

### Recipe
### Build a FLAREInstructQueryEngine which has the generator LLM play
### a more active role in retrieval by prompting it to elicit retrieval
### instructions on what it needs to answer the user query.

# Build FLAREInstructQueryEngine
documents = SimpleDirectoryReader("...").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
index_query_engine = index.as_query_engine(similarity_top_k=2)
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
flare_query_engine = FLAREInstructQueryEngine(
    query_engine=index_query_engine,
    service_context=service_context,
    max_iterations=7,
    verbose=True,
)

# Use your advanced RAG
response = flare_query_engine.query(
    "Can you tell me about the author's trajectory in the startup world?"
)

Iterative Retrieval-Generator RAG

In some complex cases, multi-step reasoning may be required before a useful and relevant answer to the user's query can be given.

LlamaIndex Iterative Retrieval-Generator example:

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.query_engine import RetryQueryEngine
from llama_index.evaluation import RelevancyEvaluator

### Recipe
### Build a RetryQueryEngine which performs retrieval-generation cycles
### until it either achieves a passing evaluation or a max number of
### cycles has been reached

# Build RetryQueryEngine
documents = SimpleDirectoryReader("...").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
base_query_engine = index.as_query_engine()

# evaluator to critique retrieval-generation cycles
query_response_evaluator = RelevancyEvaluator()

retry_query_engine = RetryQueryEngine(
    base_query_engine, query_response_evaluator
)

# Use your advanced RAG
retry_response = retry_query_engine.query("A user's query")

Evaluation Metrics for RAG

Evaluating RAG systems is undoubtedly important. In "Retrieval-Augmented Generation for Large Language Models: A Survey", the authors identify seven evaluation aspects, which are reflected in the top-right portion of the cheat sheet.
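
As a minimal sketch (our own illustration, not taken from the survey), two of these aspects, answer faithfulness and answer relevancy, can be scored with llama-index's built-in LLM-as-judge evaluators; `query_engine` is assumed to be built as in the Basic RAG example above:

from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# evaluators that use an LLM as judge (default ServiceContext/LLM assumed)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

query = "A user's query"
response = query_engine.query(query)

# faithfulness: is the response grounded in the retrieved context?
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)

# relevancy: do the response and retrieved context actually address the query?
relevancy_result = relevancy_evaluator.evaluate_response(
    query=query, response=response
)

print(faithfulness_result.passing, relevancy_result.passing)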

The llama-index library includes several evaluation abstractions, as well as an integration with RAGAs, to help builders understand, through the lens of these evaluation aspects, the degree to which their RAG system meets the success requirements. Below is a selection of evaluation notebook guides:

We hope that after reading this blog post you feel more confident and better equipped to apply these sophisticated techniques to build advanced RAG systems!

Reference

A Cheat Sheet and Some Recipes For Building Advanced RAG