RAG Evaluation Toolkit on a Banking Supervisory Process Agent
Before starting
Giskard is an open-source framework for testing all ML models, from LLMs to tabular models. Don’t hesitate to give the project a star on GitHub ⭐️ if you find it useful!
In this notebook, you’ll learn how to create a test dataset for a RAG pipeline and use this dataset to test the model.
In this example, we use the OpenAI client, which is the default; however, please note that our platform supports a variety of language models. For details on configuring a different model, visit our 🤖 Setting up the LLM Client page.
In this tutorial we will use Giskard LLM RAG Evaluation Toolkit to automatically detect issues of a Retrieval Augmented Generation (RAG) pipeline. We will test a model that answers questions about the Banking Supervision report from the ECB.
Use-case:
QA over the Banking Supervision report
Foundational model: gpt-3.5-turbo
Context: Banking Supervision report
Outline:
Create a test dataset for the RAG pipeline
Automatically evaluate the RAG pipeline and provide a report with recommendations
Install dependencies and setup notebook
Let’s install the required dependencies. We will use giskard[llm] to create the test dataset and llama-index to build the RAG pipeline. Additionally, we will use PyMuPDF to load the Banking Supervision report.
[ ]:
!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF
Now, we download the Banking Supervision report from the ECB website.
[ ]:
!wget "https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guides202401_manual.en.pdf" -O "banking_supervision_report.pdf"
Now, we can import all of the required libraries and classes.
[ ]:
import os
import warnings
import openai
import pandas as pd
from llama_index.core import VectorStoreIndex
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from giskard import Model, scan
from giskard.rag import (
    AgentAnswer,
    KnowledgeBase,
    QATestset,
    RAGReport,
    evaluate,
    generate_testset,
)
from giskard.rag.metrics.ragas_metrics import (
    ragas_context_precision,
    ragas_context_recall,
)
Now, let’s set the OpenAI API Key environment variable and some visual options.
[ ]:
# Set the OpenAI API Key environment variable.
OPENAI_API_KEY = "..."
openai.api_key = OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# Set pandas options
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings("ignore")
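Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has already been exported in the environment; `read_api_key` is a hypothetical helper introduced here for illustration, not part of Giskard or the OpenAI client:

```python
import os


def read_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    key = env.get(name, "").strip()
    # Treat a missing value and the "..." placeholder as "not configured".
    if not key or key == "...":
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key


# Example with an explicit mapping (hypothetical placeholder value).
print(read_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the first model call.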
Build RAG Agent on the Banking Supervision report
We will use llama-index to build the RAG pipeline. We will use the VectorStoreIndex to create an index of the Banking Supervision report. We will then use the as_chat_engine method to create a chat engine from the index.
[ ]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./banking_supervision_report.pdf")
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
With the Banking Supervision report loaded by the PyMuPDF reader, we can now create a VectorStoreIndex. We will also use the SentenceSplitter to split the report into chunks of 512 tokens to ensure that the context is not too large.
[5]:
splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
chat_engine = index.as_chat_engine(llm=llm)
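To give an intuition for sentence-aware chunking, here is a simplified sketch. It is not how llama-index's SentenceSplitter is implemented: it counts whitespace-separated words instead of tokens, applies no overlap, and `split_into_chunks` is a hypothetical name used only for this illustration.

```python
import re


def split_into_chunks(text, chunk_size=512):
    """Greedily pack whole sentences into chunks of at most `chunk_size` words.

    Words stand in for tokens here; a single sentence longer than the budget
    becomes its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "The ECB supervises banks. JSTs do the daily work. NCAs provide staff."
for chunk in split_into_chunks(sample, chunk_size=8):
    print(chunk)
```

Keeping sentences intact means a retrieved chunk is less likely to start or end mid-thought, which tends to help the model when it answers from that chunk.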
Let’s test the Agent
We can now simply chat with our agent using the chat method of the chat_engine. Under the hood, this uses the VectorStoreIndex to retrieve the most relevant chunks of the report and the gpt-3.5-turbo model to answer the question.
[4]:
str(chat_engine.chat("What is SSM?"))
[4]:
'SSM stands for Single Supervisory Mechanism.'
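The retrieval step can be pictured as a nearest-neighbour search over embeddings. The sketch below is only a conceptual illustration: the three-dimensional vectors are made-up numbers, whereas a real pipeline obtains embeddings from an embedding model, and `top_k` is a hypothetical helper rather than a llama-index function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (made-up numbers, for illustration only).
chunk_vecs = {
    "ssm_definition": [0.9, 0.1, 0.0],
    "stress_tests": [0.1, 0.9, 0.2],
    "fees": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], chunk_vecs, k=2))
```

The chunks selected this way are what the language model receives as context, so a retrieval miss at this stage cannot be repaired by the generation step.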
Scan LLM vulnerabilities
As a first step, we will run a scan on the chatbot model. This will help us identify potential vulnerabilities in the model that the agent is built on. To do so, we need to define a function that takes a dataframe with a question column and returns the chatbot’s answers. This function is then wrapped in a Giskard Model object.
[ ]:
def model_predict(df: pd.DataFrame):
    return [chat_engine.chat(question).response for question in df["question"]]


giskard_model = Model(
    model=model_predict,
    model_type="text_generation",
    name="Banking Supervision Question Answering",
    description="A model that answers questions about ECB Banking Supervision report",
    feature_names=["question"],
)
We can now pass the model to the scan function to get a report of potential vulnerabilities. You can pass a custom dataset and features to the scan function to get a more accurate report, but for this example we will use the defaults. If you want to share the report with your team, you can use the to_html or to_json methods to save it.
[8]:
scan_report = scan(giskard_model)
display(scan_report)
Generate a test set for RAG on the Banking Supervision report
We will now generate a test set for RAG on the Banking Supervision report. We first split the loaded report into chunks of 512 tokens and build a KnowledgeBase from them.
[ ]:
text_nodes = splitter(documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)
We can now generate a test set with 100 questions.
[ ]:
testset = generate_testset(
    knowledge_base=knowledge_base,
    num_questions=100,
    agent_description="A chatbot answering questions about banking supervision procedures and methodologies.",
    language="en",
)
To avoid losing the test set, we can save it to a JSONL file and safely load it later. Note that the documents in the KnowledgeBase must be the same as the ones used to generate the testset in order to evaluate the agent’s performance on this test set.
[11]:
# Save the testset
testset.save("banking_supervision_testset.jsonl")
# Load the testset
testset = QATestset.load("banking_supervision_testset.jsonl")
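JSONL stores one JSON object per line, which keeps the file easy to inspect and to append to. The sketch below illustrates the format with the standard library only; it does not reproduce Giskard's own save and load logic, and the two records and the file name are made-up examples.

```python
import json
import os
import tempfile

# Made-up records, loosely mirroring test set fields.
records = [
    {"id": "q1", "question": "What is SSM?", "metadata": {"question_type": "simple"}},
    {"id": "q2", "question": "What does a JST do?", "metadata": {"question_type": "simple"}},
]

path = os.path.join(tempfile.mkdtemp(), "example_testset.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```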
Let’s take a look at the first 5 questions in the test set. The questions are representative of what the agent may be asked and cover several parts of the Banking Supervision report.
[12]:
testset.to_pandas().head(5)
[12]:
id | question | reference_answer | reference_context | conversation_history | metadata |
---|---|---|---|---|---|
35202be3-9120-4bd1-9b3b-722d3b307e1c | What is the role of Joint Supervisory Teams (JSTs) in the supervision of Significant Institutions (SIs)? | The day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which comprise staff from NCAs and the ECB and are supported by the horizontal and specialised expertise divisions of DG/HOL and similar staff at the NCAs. The JST analyses the supervisory reporting, financial statements and internal documentation of supervised entities, holds regular and ad hoc meetings with the su... | Document 76: This can involve on-site interventions at supervised institutions, if needed. \nDepending on a specific bank’s risk profile assessment, the ECB may impose a wide \nrange of supervisory measures. \n2.3.1 \nJoint Supervisory Teams \nThe day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which \ncomprise staff from NCAs and the ECB and are supported by the hor... | [] | {'question_type': 'simple', 'seed_document_id': 76, 'topic': 'Others'} |
1beb42a0-ff1a-42e9-91c6-fe11774e909d | What happens if an urgent supervisory decision is necessary to prevent significant damage to the financial system? | The ECB may adopt a supervisory decision which would adversely affect the rights of the addressee without giving it the opportunity to comment on the decision prior to its adoption. In this case, the hearing is postponed, and a clear justification is provided in the decision as to why the postponement is necessary. The hearing is then organised as soon as possible after the adoption of the dec... | Document 34: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n21 \nFigure 4 \nDecision-making process \n \n*The deadline for submitting comments/objections in a written procedure is five working days, while the deadline for non-objection \nprocedures is a maximum of ten working days. \n**The applicable legal deadlines for each specific case must be taken into account. ... | [] | {'question_type': 'simple', 'seed_document_id': 34, 'topic': 'Single Supervisory Mechanism'} |
562d7352-b2ee-4191-b6eb-96f0fca7b01c | What is required of banks and investment firms in the EU that are subsidiaries of third-country groups according to Article 21b of Directive 2013/36/EU? | Article 21b of Directive 2013/36/EU requires banks and investment firms in the EU that are subsidiaries of third-country groups to set up a single intermediate EU parent undertaking if the third-country group has two or more institutions established within the EU with a combined total asset value of at least €40 billion. | Document 169: Supervisory Manual – Supervision of significant institutions \n \n97 \ntransactions which go beyond the contractual obligations of a sponsor institution or \nan originator institution under Article 248(1) of Regulation (EU) No 575/2013. \nBased on the notifications received from significant institutions: \n• \nif the institution declares that there is implicit support, the JST ch... | [] | {'question_type': 'simple', 'seed_document_id': 169, 'topic': 'Others'} |
a9955bdc-165d-42ed-a259-53bef0d5e0ea | What are the purposes of macroprudential extensions in stress tests? | Macroprudential extensions in stress tests focus on system-wide effects rather than on individual banks and are run in a top-down manner. They capture important feedback effects or network effects, which can occur through adverse changes in the state of the environment triggered by a stress scenario with a negative impact on lending or through lending or funding links between institutions. | Document 125: These tasks are undertaken, where \nappropriate, in collaboration with other divisions of the ECB, the EBA and/or NCAs. \nMicroprudential stress tests are often complemented by macroprudential extensions \nthat focus on system-wide effects rather than on individual banks and which are run \nin a top-down manner, meaning that they do not involve the supervised entities. In \nparti... | [] | {'question_type': 'simple', 'seed_document_id': 125, 'topic': 'European Banking Supervision'} |
a7c255f1-9fd8-48d8-8a6a-5afa995dae21 | What happens if a quorum of 50% is not met during an emergency Supervisory Board meeting? | If a quorum of 50% in the Supervisory Board for emergency situations is not met, the meeting will be closed and an extraordinary meeting will be held soon afterwards. | Document 38: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n24 \n• \nif an NCA which is concerned by the decision has different views regarding the \nobjection, the NCA may request mediation; \n• \nif no request for mediation is submitted, the Supervisory Board may amend the \ndraft decision in order to incorporate the comments of the Governing Council; \n• \nif the ... | [] | {'question_type': 'simple', 'seed_document_id': 38, 'topic': 'Single Supervisory Mechanism'} |
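A quick way to check coverage is to count questions per topic and per question type. The sketch below does this with the metadata of the five rows shown above, copied by hand into a list; with the full test set the same counting could be applied to the metadata column.

```python
from collections import Counter

# Metadata of the five questions shown above.
metadata = [
    {"question_type": "simple", "seed_document_id": 76, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 34, "topic": "Single Supervisory Mechanism"},
    {"question_type": "simple", "seed_document_id": 169, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 125, "topic": "European Banking Supervision"},
    {"question_type": "simple", "seed_document_id": 38, "topic": "Single Supervisory Mechanism"},
]

topic_counts = Counter(m["topic"] for m in metadata)
type_counts = Counter(m["question_type"] for m in metadata)

print(topic_counts.most_common())
print(type_counts)
```

A topic or question type with very few questions is a sign that results for that slice should be read with caution.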
Evaluate and Diagnose the Agent
We can now evaluate the agent’s performance on the test set using the RAG Evaluation Toolkit (RAGET). We will use the evaluate function together with the ragas_context_recall and ragas_context_precision metrics. The result is a RAGReport object that summarizes the agent’s performance.
[ ]:
def answer_fn(question: str, history: list[dict] = None) -> AgentAnswer:
    if history:
        answer = chat_engine.chat(
            question,
            chat_history=[
                ChatMessage(
                    role=(
                        MessageRole.USER
                        if msg["role"] == "user"
                        else MessageRole.ASSISTANT
                    ),
                    content=msg["content"],
                )
                for msg in history
            ],
        )
    else:
        answer = chat_engine.chat(question, chat_history=[])
    return AgentAnswer(
        message=answer.response, documents=[source.content for source in answer.sources]
    )


rag_report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_context_precision],
)
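To build intuition for the two context metrics, the sketch below computes simplified versions from hand-written labels. This is an approximation for illustration, not the Ragas implementation: in Ragas the labels themselves are judged by an LLM, while here they are hard-coded, and both function names are hypothetical.

```python
def context_recall(claim_supported):
    """Fraction of reference-answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_precision(chunk_relevant):
    """Mean of precision@k, taken at the ranks where a relevant chunk appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Two of three reference claims are supported by the retrieved context.
print(round(context_recall([True, True, False]), 3))
# Relevant chunks are retrieved at ranks 1 and 3; rank 2 is noise.
print(round(context_precision([True, False, True]), 3))
```

Recall drops when the retriever misses information needed for the reference answer, while precision drops when relevant chunks are ranked below irrelevant ones.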
Now, we can save the report and load it later to display it.
[15]:
# Save the RAG report
rag_report.save("banking_supervision_report")
# Load the RAG report
rag_report = RAGReport.load("banking_supervision_report")
We can now display the report.
[16]:
# RAG report
display(rag_report.to_html(embed=True))
RAGET question types
RAGET uses 6 question types, each of which assesses a few RAG components. This makes it possible to localize weaknesses in the RAG agent and give targeted feedback to the developers.
Question type | Description | Example | Targeted RAG components |
---|---|---|---|
Simple | Simple questions generated from an excerpt of the knowledge base | What is the purpose of the holistic approach in the SREP? | |
Complex | Questions made more complex by paraphrasing | In what capacity and with what frequency do NCAs contribute to the formulation and scheduling of supervisory activities, especially concerning the organization of on-site missions? | |
Distracting | Questions made to confuse the retrieval part of the RAG with a distracting element from the knowledge base but irrelevant to the question | Under what conditions does the ECB levy fees to cover the costs of its supervisory tasks, particularly in the context of financial conglomerates requiring cross-sector supervision? | |
Situational | Questions including user context to evaluate the ability of the generation to produce a relevant answer according to the context | As a bank manager looking to understand the appeal process for a regulatory decision made by the ECB, could you explain what role the ABoR plays in the supervisory decision review process? | |
Double | Questions with two distinct parts to evaluate the capabilities of the query rewriter of the RAG | What role does the SSM Secretariat Division play in the decision-making process of the ECB’s supervisory tasks, and which directorates general are involved in the preparation of draft decisions for supervised entities in the ECB Banking Supervision? | |
Conversational | Questions asked as part of a conversation: the first message describes the context of the question that is asked in the last message; this also tests the rewriter | What are these sources? | |
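Because each question type targets specific components, grouping results by question type helps to see where the agent struggles. A minimal sketch of that grouping, using made-up outcomes; `correctness_by_question_type` is a hypothetical helper and does not reflect how the RAG report is computed internally.

```python
from collections import defaultdict


def correctness_by_question_type(results):
    """Map each question type to the share of correctly answered questions."""
    counts = defaultdict(lambda: [0, 0])  # question_type -> [n_correct, n_total]
    for question_type, is_correct in results:
        counts[question_type][0] += int(is_correct)
        counts[question_type][1] += 1
    return {qt: correct / total for qt, (correct, total) in counts.items()}


# Made-up outcomes, for illustration only.
results = [
    ("simple", True),
    ("simple", True),
    ("double", False),
    ("double", True),
    ("conversational", False),
]

scores = correctness_by_question_type(results)
# List the weakest question types first.
for question_type, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{question_type}: {score:.0%}")
```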