LLM Benchmarks

LLM benchmarks are standardized tests designed to measure and compare the capabilities of language models across various tasks and domains. They give researchers and practitioners a consistent framework for evaluating model performance and assessing how well different LLMs handle specific challenges.
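
At their core, most benchmarks follow the same recipe: run the model on a fixed set of prompts and score its outputs against reference answers. The sketch below illustrates the idea with a hypothetical run_model function and a simple exact-match metric; real benchmarks use task-specific scoring such as pass@k for code or judge models for open-ended answers.

```python
# Minimal sketch of how a benchmark score is computed.
# `run_model` is a hypothetical stand-in for any LLM call (API or local model).

benchmark_cases = [
    {"prompt": "What is 17 + 25?", "reference": "42"},
    {"prompt": "What is the capital of France?", "reference": "Paris"},
]

def run_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def exact_match_accuracy(cases) -> float:
    """Fraction of cases where the model's answer matches the reference exactly."""
    correct = sum(
        run_model(case["prompt"]).strip().lower() == case["reference"].lower()
        for case in cases
    )
    return correct / len(cases)

# score = exact_match_accuracy(benchmark_cases)  # e.g. 0.5 means half the answers matched
```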

Types of LLM Benchmarks

Reasoning and Language Understanding

Evaluations of logical inference, text comprehension, and language understanding.

Math Problems

Tasks ranging from basic arithmetic to complex calculus and multi-step mathematical problem-solving.

Coding

Tests of code generation, debugging, and solving programming challenges.

Conversation and Chatbot

Assessments of dialogue engagement, context maintenance, and response helpfulness.

Safety

Evaluations of harmful-content avoidance, bias detection, and resistance to manipulation.

Domain-Specific

Specialized benchmarks for fields such as healthcare, finance, and law.


Creating your own evaluation benchmarks with Giskard

Giskard Hub: AI security vulnerability evaluation

Our state-of-the-art, enterprise-grade security evaluation datasets. Detect security vulnerabilities by generating synthetic tests.

Giskard Hub: AI business failure evaluation

Our state-of-the-art, enterprise-grade business failure evaluation datasets. Detect business failures by generating synthetic tests.

LLM Scan

Our open-source library for creating security evaluation datasets. Detect security vulnerabilities in LLMs using LLM Scan.
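
As a rough sketch of how a scan is typically run with the open-source giskard Python library (the call_llm helper is hypothetical, and wrapper arguments may vary between library versions; check the official documentation):

```python
import pandas as pd
import giskard

def call_llm(question: str) -> str:
    """Hypothetical helper that sends a prompt to your LLM and returns its answer."""
    raise NotImplementedError

def model_predict(df: pd.DataFrame) -> list:
    # Giskard passes a DataFrame of inputs; return one answer per row.
    return [call_llm(question) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Product QA agent",
    description="Answers customer questions about our product documentation.",
    feature_names=["question"],
)

# Run the scan: probes the model for issues such as prompt injection,
# harmful content generation, and hallucination, then writes an HTML report.
scan_results = giskard.scan(giskard_model)
scan_results.to_html("llm_scan_report.html")
```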

RAGET: RAG Evaluation Toolkit

Our open-source library for creating business evaluation datasets. Detect business failures in RAG-based LLM applications using RAGET.
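
Below is a hedged sketch of how RAGET can generate a synthetic test set and evaluate a RAG agent, based on the giskard.rag module's documented usage; the answer_fn body is a placeholder for your own RAG pipeline, and argument names may differ between versions:

```python
import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset, evaluate

# Knowledge base built from the documents your RAG agent retrieves from.
documents = pd.DataFrame({"text": ["First document content...", "Second document content..."]})
knowledge_base = KnowledgeBase.from_pandas(documents, columns=["text"])

# Generate synthetic questions grounded in the knowledge base.
testset = generate_testset(
    knowledge_base,
    num_questions=30,
    agent_description="A chatbot answering questions about our product documentation.",
)

def answer_fn(question: str, history=None) -> str:
    """Placeholder: replace with a call to your own RAG pipeline."""
    raise NotImplementedError

# Evaluate the agent on the generated test set and write an HTML report.
report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
report.to_html("raget_report.html")
```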