RAG Evaluation Toolkit on a Banking Supervisory Process Agent

Before starting

Giskard is an open-source framework for testing all ML models, from LLMs to tabular models. Don’t hesitate to give the project a star on GitHub ⭐️ if you find it useful!

In this notebook, you’ll learn how to create a test dataset for a RAG pipeline and use this dataset to test the model.

In this example, we illustrate the procedure using the OpenAI client, which is the default; however, please note that our platform supports a variety of language models. For details on configuring different models, visit our 🤖 Setting up the LLM Client page.

In this tutorial, we will use the Giskard LLM RAG Evaluation Toolkit to automatically detect issues in a Retrieval Augmented Generation (RAG) pipeline. We will test a model that answers questions about the Banking Supervision report from the ECB.

Use-case:

QA over the Banking Supervision report
Foundational model: gpt-3.5-turbo
Context: Banking Supervision report

Outline:

Create a test dataset for the RAG pipeline
Automatically evaluate the RAG pipeline and provide a report with recommendations

Install dependencies and set up the notebook

Let’s install the required dependencies. We will be using giskard[llm] to create the test dataset and llama-index to build the RAG pipeline. Additionally, we will use PyMuPDF to load the Banking Supervision report.

[ ]:

!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF

Now, we download the Banking Supervision report from the ECB website.

[ ]:

!wget "https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guides202401_manual.en.pdf" -O "banking_supervision_report.pdf"

Now, we can import all of the required libraries and classes.

[ ]:

import os
import warnings

import openai
import pandas as pd
from llama_index.core import VectorStoreIndex
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader

from giskard import Model, scan
from giskard.rag import (
    AgentAnswer,
    KnowledgeBase,
    QATestset,
    RAGReport,
    evaluate,
    generate_testset,
)
from giskard.rag.metrics.ragas_metrics import (
    ragas_context_precision,
    ragas_context_recall,
)

Now, let’s set the OpenAI API Key environment variable and some visual options.

[ ]:

# Set the OpenAI API Key environment variable.
OPENAI_API_KEY = "..."
openai.api_key = OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Set pandas options
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings("ignore")

Build RAG Agent on the Banking Supervision report

We will use llama-index to build the RAG pipeline: a VectorStoreIndex over the Banking Supervision report, from which the as_chat_engine method creates a chat engine. We start by loading the report with the PyMuPDF reader and configuring the LLM.

[ ]:

loader = PyMuPDFReader()
documents = loader.load(file_path="./banking_supervision_report.pdf")
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

Now that the report is loaded, we can create a VectorStoreIndex from it. We use the SentenceSplitter to split the report into chunks of 512 tokens so that the context passed to the model is not too large.

[5]:

splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
chat_engine = index.as_chat_engine(llm=llm)
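
To make the chunking step more concrete, here is a minimal standard-library sketch of sentence-aware splitting. It is not how SentenceSplitter is implemented: it counts whitespace-separated words instead of tokenizer tokens, ignores chunk overlap, and the function name split_into_chunks is hypothetical.

```python
import re


def split_into_chunks(text: str, max_words: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Simplification: words stand in for tokens, and a single sentence longer
    than max_words is kept as its own oversized chunk rather than being cut.
    """
    # Naive sentence boundary: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "The ECB supervises significant institutions. "
    "NCAs supervise less significant institutions. "
    "Joint Supervisory Teams carry out day-to-day supervision."
)
for chunk in split_into_chunks(sample, max_words=12):
    print(chunk)
```

Keeping sentences whole means each retrieved chunk reads as coherent text, at the cost of chunks that are somewhat shorter than the budget.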

Let’s test the Agent

We can now simply chat with our agent using the chat_engine and its chat method. Under the hood, this will use the VectorStoreIndex to retrieve the most relevant chunks of the report and the gpt-3.5-turbo model to answer the question.

[4]:

str(chat_engine.chat("What is SSM?"))

[4]:

'SSM stands for Single Supervisory Mechanism.'
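
Conceptually, the retrieval step ranks chunks by their similarity to the question and keeps the best ones. The sketch below illustrates that idea with a bag-of-words cosine similarity from the standard library. This is a deliberate simplification: the actual index compares embedding vectors, and the helper names (bag_of_words, cosine, retrieve) are hypothetical. The sample chunks are short excerpts adapted from the report.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(
        chunks, key=lambda c: cosine(query, bag_of_words(c)), reverse=True
    )
    return ranked[:top_k]


chunks = [
    "The day-to-day supervision of SIs is primarily conducted off-site by the JSTs.",
    "Macroprudential extensions focus on system-wide effects rather than on individual banks.",
    "The deadline for submitting comments in a written procedure is five working days.",
]
top = retrieve("Who conducts the day-to-day supervision of SIs?", chunks, top_k=1)
print(top[0])
```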

Scan LLM vulnerabilities

As a first step, we will run a scan on the chatbot model. This will help us identify the potential vulnerabilities in the model that the agent is built on. To do so, we need to define a function that will take a dataframe with a question column and return the answer from the chatbot. This will then be used to create a Giskard Model object.

[ ]:

def model_predict(df: pd.DataFrame):
    return [chat_engine.chat(question).response for question in df["question"]]


giskard_model = Model(
    model=model_predict,
    model_type="text_generation",
    name="Banking Supervision Question Answering",
    description="A model that answers questions about ECB Banking Supervision report",
    feature_names=["question"],
)

We can now forward the model to the scan function to get a report with the potential vulnerabilities. You can pass a custom dataset and features to the scan function to get a more accurate report but for this example, we will use the default one. If you want to share the report with your team, you can use the to_html or to_json methods to save the report.

[8]:

scan_report = scan(giskard_model)
display(scan_report)

Generate a test set for RAG on the Banking Supervision report

We will now generate a test set for RAG on the Banking Supervision report. We first build the knowledge base by splitting the documents loaded earlier into chunks of 512 tokens with the same splitter.

[ ]:

text_nodes = splitter(documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)

We can now generate a test set with 100 questions.

[ ]:

testset = generate_testset(
    knowledge_base=knowledge_base,
    num_questions=100,
    agent_description="A chatbot answering questions about banking supervision procedures and methodologies.",
    language="en",
)

To avoid losing the test set, we can save it to a JSONL file and safely load it later. Note that the documents in the KnowledgeBase must be the same as the ones used to generate the test set; otherwise, the agent’s performance on this test set cannot be evaluated reliably.

[11]:

# Save the testset
testset.save("banking_supervision_testset.jsonl")

# Load the testset
testset = QATestset.load("banking_supervision_testset.jsonl")
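
JSONL stores one JSON object per line, which makes the file easy to inspect and append to. The following standard-library sketch shows the round trip on a single illustrative record. The field names follow the columns exposed by testset.to_pandas(), but the exact schema written by QATestset.save is an assumption here, and the file name example_testset.jsonl and the record values are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Illustrative record; field names follow the test set columns, but the exact
# schema written by QATestset.save is an assumption here.
records = [
    {
        "id": "example-1",
        "question": "What is SSM?",
        "reference_answer": "SSM stands for Single Supervisory Mechanism.",
        "reference_context": "Document 0: ...",  # placeholder context
        "conversation_history": [],
        "metadata": {"question_type": "simple", "seed_document_id": 0},
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_testset.jsonl"

    # Write one JSON object per line.
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the file back, skipping blank lines.
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```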

Let’s take a look at the first 5 questions in the test set to check that they are relevant to the agent’s use case and that they cover different parts of the Banking Supervision report.

[12]:

testset.to_pandas().head(5)

[12]:

	question	reference_answer	reference_context	conversation_history	metadata
id
35202be3-9120-4bd1-9b3b-722d3b307e1c	What is the role of Joint Supervisory Teams (JSTs) in the supervision of Significant Institutions (SIs)?	The day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which comprise staff from NCAs and the ECB and are supported by the horizontal and specialised expertise divisions of DG/HOL and similar staff at the NCAs. The JST analyses the supervisory reporting, financial statements and internal documentation of supervised entities, holds regular and ad hoc meetings with the su...	Document 76: This can involve on-site interventions at supervised institutions, if needed. \nDepending on a specific bank’s risk profile assessment, the ECB may impose a wide \nrange of supervisory measures. \n2.3.1 \nJoint Supervisory Teams \nThe day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which \ncomprise staff from NCAs and the ECB and are supported by the hor...	[]	{'question_type': 'simple', 'seed_document_id': 76, 'topic': 'Others'}
1beb42a0-ff1a-42e9-91c6-fe11774e909d	What happens if an urgent supervisory decision is necessary to prevent significant damage to the financial system?	The ECB may adopt a supervisory decision which would adversely affect the rights of the addressee without giving it the opportunity to comment on the decision prior to its adoption. In this case, the hearing is postponed, and a clear justification is provided in the decision as to why the postponement is necessary. The hearing is then organised as soon as possible after the adoption of the dec...	Document 34: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n21 \nFigure 4 \nDecision-making process \n \nThe deadline for submitting comments/objections in a written procedure is five working days, while the deadline for non-objection \nprocedures is a maximum of ten working days. \n*The applicable legal deadlines for each specific case must be taken into account. ...	[]	{'question_type': 'simple', 'seed_document_id': 34, 'topic': 'Single Supervisory Mechanism'}
562d7352-b2ee-4191-b6eb-96f0fca7b01c	What is required of banks and investment firms in the EU that are subsidiaries of third-country groups according to Article 21b of Directive 2013/36/EU?	Article 21b of Directive 2013/36/EU requires banks and investment firms in the EU that are subsidiaries of third-country groups to set up a single intermediate EU parent undertaking if the third-country group has two or more institutions established within the EU with a combined total asset value of at least €40 billion.	Document 169: Supervisory Manual – Supervision of significant institutions \n \n97 \ntransactions which go beyond the contractual obligations of a sponsor institution or \nan originator institution under Article 248(1) of Regulation (EU) No 575/2013. \nBased on the notifications received from significant institutions: \n• \nif the institution declares that there is implicit support, the JST ch...	[]	{'question_type': 'simple', 'seed_document_id': 169, 'topic': 'Others'}
a9955bdc-165d-42ed-a259-53bef0d5e0ea	What are the purposes of macroprudential extensions in stress tests?	Macroprudential extensions in stress tests focus on system-wide effects rather than on individual banks and are run in a top-down manner. They capture important feedback effects or network effects, which can occur through adverse changes in the state of the environment triggered by a stress scenario with a negative impact on lending or through lending or funding links between institutions.	Document 125: These tasks are undertaken, where \nappropriate, in collaboration with other divisions of the ECB, the EBA and/or NCAs. \nMicroprudential stress tests are often complemented by macroprudential extensions \nthat focus on system-wide effects rather than on individual banks and which are run \nin a top-down manner, meaning that they do not involve the supervised entities. In \nparti...	[]	{'question_type': 'simple', 'seed_document_id': 125, 'topic': 'European Banking Supervision'}
a7c255f1-9fd8-48d8-8a6a-5afa995dae21	What happens if a quorum of 50% is not met during an emergency Supervisory Board meeting?	If a quorum of 50% in the Supervisory Board for emergency situations is not met, the meeting will be closed and an extraordinary meeting will be held soon afterwards.	Document 38: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n24 \n• \nif an NCA which is concerned by the decision has different views regarding the \nobjection, the NCA may request mediation; \n• \nif no request for mediation is submitted, the Supervisory Board may amend the \ndraft decision in order to incorporate the comments of the Governing Council; \n• \nif the ...	[]	{'question_type': 'simple', 'seed_document_id': 38, 'topic': 'Single Supervisory Mechanism'}
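
A quick way to check coverage is to count questions per topic and per question type from the metadata column. The sketch below does this for the five rows shown above, with their metadata copied by hand into a plain list so that it runs without the notebook state. With the real test set, the same dictionaries would presumably come from the metadata column of testset.to_pandas().

```python
from collections import Counter

# Metadata of the five rows displayed above, copied by hand.
metadata_rows = [
    {"question_type": "simple", "seed_document_id": 76, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 34, "topic": "Single Supervisory Mechanism"},
    {"question_type": "simple", "seed_document_id": 169, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 125, "topic": "European Banking Supervision"},
    {"question_type": "simple", "seed_document_id": 38, "topic": "Single Supervisory Mechanism"},
]

# Count how many questions fall under each topic and each question type.
topic_counts = Counter(row["topic"] for row in metadata_rows)
type_counts = Counter(row["question_type"] for row in metadata_rows)

for topic, count in topic_counts.most_common():
    print(f"{topic}: {count}")
print(dict(type_counts))
```

A topic with very few questions, or a large "Others" bucket, is a hint that the generated test set may not exercise some parts of the knowledge base.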

Evaluate and Diagnose the Agent

We can now evaluate the agent’s performance on the test set using the RAG Evaluation Toolkit (RAGET). The evaluate function takes a function that answers a question (optionally given the conversation history), the test set, and the knowledge base. In addition to the default evaluation, we pass the ragas_context_recall and ragas_context_precision metrics, which assess the retrieved documents. The result is a RAGReport object summarizing the agent’s performance.

[ ]:

def answer_fn(question: str, history: list[dict] = None) -> AgentAnswer:
    if history:
        answer = chat_engine.chat(
            question,
            chat_history=[
                ChatMessage(
                    role=(
                        MessageRole.USER
                        if msg["role"] == "user"
                        else MessageRole.ASSISTANT
                    ),
                    content=msg["content"],
                )
                for msg in history
            ],
        )
    else:
        answer = chat_engine.chat(question, chat_history=[])

    return AgentAnswer(
        message=answer.response, documents=[source.content for source in answer.sources]
    )


rag_report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_context_precision],
)

Now, we can save the report and load it later to display it.

[15]:

# Save the RAG report
rag_report.save("banking_supervision_report")

# Load the RAG report
rag_report = RAGReport.load("banking_supervision_report")

We can now display the report.

[16]:

# RAG report
display(rag_report.to_html(embed=True))

RAGET question types

RAGET generates 6 different question types, each targeting a subset of the RAG components. This makes it possible to localize weaknesses in the RAG agent and give targeted feedback to the developers.

Question type	Description	Example	Targeted RAG components
Simple	Simple questions generated from an excerpt of the knowledge base	What is the purpose of the holistic approach in the SREP?	`Generator`, `Retriever`
Complex	Questions made more complex by paraphrasing	In what capacity and with what frequency do NCAs contribute to the formulation and scheduling of supervisory activities, especially concerning the organization of on-site missions?	`Generator`
Distracting	Questions made to confuse the retrieval part of the RAG with a distracting element from the knowledge base but irrelevant to the question	Under what conditions does the ECB levy fees to cover the costs of its supervisory tasks, particularly in the context of financial conglomerates requiring cross-sector supervision?	`Generator`, `Retriever`, `Rewriter`
Situational	Questions including user context to evaluate the ability of the generation to produce a relevant answer given the context	As a bank manager looking to understand the appeal process for a regulatory decision made by the ECB, could you explain what role the ABoR plays in the supervisory decision review process?	`Generator`
Double	Questions with two distinct parts to evaluate the capabilities of the query rewriter of the RAG	What role does the SSM Secretariat Division play in the decision-making process of the ECB’s supervisory tasks, and which directorates general are involved in the preparation of draft decisions for supervised entities in the ECB Banking Supervision?	`Generator`, `Rewriter`
Conversational	Questions asked as part of a conversation: the first message describes the context of the question asked in the last message; also tests the rewriter	I am interested in the sources used for the assessment of risks and vulnerabilities in ECB Banking Supervision. - What are these sources?	`Rewriter`, `Routing`
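
To illustrate how this mapping helps localize weaknesses, the sketch below averages per-question-type correctness over the question types that target each component. The mapping is copied from the table above. The correctness values are invented purely for illustration, and the plain average is an assumption rather than RAGET's documented formula; the function name component_scores is hypothetical.

```python
from collections import defaultdict

# Question type -> targeted components, copied from the table above.
TARGETED_COMPONENTS = {
    "simple": ["Generator", "Retriever"],
    "complex": ["Generator"],
    "distracting": ["Generator", "Retriever", "Rewriter"],
    "situational": ["Generator"],
    "double": ["Generator", "Rewriter"],
    "conversational": ["Rewriter", "Routing"],
}


def component_scores(correctness_by_type: dict[str, float]) -> dict[str, float]:
    """Average the correctness of every question type that targets a component."""
    collected: dict[str, list[float]] = defaultdict(list)
    for question_type, score in correctness_by_type.items():
        for component in TARGETED_COMPONENTS[question_type]:
            collected[component].append(score)
    return {name: sum(vals) / len(vals) for name, vals in collected.items()}


# Hypothetical correctness rates, invented purely for illustration.
example = {
    "simple": 0.9,
    "complex": 0.8,
    "distracting": 0.6,
    "situational": 0.7,
    "double": 0.5,
    "conversational": 0.25,
}
for component, score in sorted(component_scores(example).items()):
    print(f"{component}: {score:.2f}")
```

In this made-up example, low scores on double and conversational questions pull down the Rewriter and Routing averages, which would point the developers to those components first.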