
Giskard-AI/giskard-oss


Evals, Red Teaming and Test Generation for Agentic Systems

Modular, Lightweight, Dynamic and Async-first


Docs · Website · Community


Important

Giskard v3 is a fresh rewrite designed for dynamic, multi-turn testing of AI agents. This release drops heavy dependencies for better efficiency while introducing a more powerful AI vulnerability scanner and enhanced RAG evaluation capabilities. For now, the vulnerability scanner and RAG evaluation still rely on Giskard v2. Giskard v2 remains available but is no longer actively maintained. Follow progress → Read the v3 Announcement · Roadmap

Install

pip install giskard

Requires Python 3.12+.


Giskard is an open-source Python library for testing and evaluating agentic systems. The v3 architecture is a modular set of focused packages, each carrying only the dependencies it needs, built from scratch to wrap anything: an LLM, a black-box agent, or a multi-step pipeline.

Status | Package | Description
✅ Alpha | giskard-checks | Testing & evaluation: scenario API, built-in checks, LLM-as-judge
🚧 In progress | giskard-scan | Agent vulnerability scanner: red teaming, prompt injection, data leakage (successor of v2 Scan)
📋 Planned | giskard-rag | RAG evaluation & synthetic data generation (successor of v2 RAGET)

Giskard Checks: create and apply evals for testing agents

pip install giskard-checks

Giskard Checks is a lightweight library for creating evaluations (evals) that test LLM-based systems, from simple assertions to LLM-as-judge assessments. Unlike traditional unit tests, evals are designed for non-deterministic outputs, where the same input can produce different valid responses.

Use Giskard Checks to:

  • Catch regressions: verify your system still behaves correctly after changes
  • Validate RAG quality: check if answers are grounded in retrieved context
  • Enforce safety rules: ensure outputs conform to your content policies
  • Evaluate multi-turn agents: test full conversations, not just single exchanges

Built-in evals include string matching, comparisons, regex, semantic similarity, and LLM-as-judge checks (Groundedness, Conformity, LLMJudge).
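To see why evals differ from strict unit-test assertions, here is a minimal sketch of the *idea* behind a similarity-based check, written in plain Python with `difflib` (this illustrates the concept only; it is not the giskard-checks API):

```python
import difflib

def similarity_check(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Illustrative eval: pass if the output is 'close enough' to a reference.

    Non-deterministic outputs can't be compared with strict equality, so
    similarity-style checks score closeness against a threshold instead.
    """
    score = difflib.SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return score >= threshold

# Two phrasings of the same answer pass; an unrelated answer fails
print(similarity_check("The capital of France is Paris.",
                       "the capital of france is Paris"))  # True
print(similarity_check("I don't know.", "Paris"))          # False
```

The built-in checks cover this spectrum more robustly, from exact string matching up to LLM-as-judge assessments for open-ended answers.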

Quickstart

from openai import OpenAI
from giskard.checks import Scenario, Groundedness

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content or ""

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer,
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="France is a country in Western Europe. Its capital is Paris.",
        )
    )
)

result = await scenario.run()
result.print_report()

The run() method is async. In a script, wrap it with asyncio.run(). See the full docs for Suites, LLMJudge, multi-turn scenarios, and more.
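In a plain script, the wrapping pattern looks like this (shown with a stub standing in for a real Scenario so the sketch stays self-contained and runnable):

```python
import asyncio

class StubScenario:
    """Stand-in for a giskard Scenario: run() is a coroutine, as in the Quickstart."""
    async def run(self) -> str:
        return "report"

def main() -> str:
    scenario = StubScenario()
    # asyncio.run() starts an event loop and drives the coroutine to completion.
    # In an already-async context (e.g. a Jupyter cell), use `await scenario.run()` directly.
    return asyncio.run(scenario.run())

if __name__ == "__main__":
    print(main())  # prints "report"
```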

Looking for Giskard v2?

Giskard v2 included Scan (automatic vulnerability detection) and RAGET (RAG evaluation test set generation) for both ML models and LLM applications. These features are not yet available in v3.

pip install "giskard[llm]>2,<3"

Scan: automatically detect performance, bias & security issues

Wrap your model and run the scan:

import giskard
import pandas as pd

# Replace my_llm_chain with your actual LLM chain or model inference logic
def model_predict(df: pd.DataFrame):
    """The function takes a DataFrame and must return a list of outputs (one per row)."""
    return [my_llm_chain.run({"query": question}) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="My LLM Application",
    description="A question answering assistant",
    feature_names=["question"],
)

scan_results = giskard.scan(giskard_model)

# `display` is available in notebooks; in a plain script, save the report instead,
# e.g. scan_results.to_html("scan_report.html")
display(scan_results)

Scan Example

RAGET: generate evaluation datasets for RAG applications

Automatically generate questions, reference answers, and context from your knowledge base:

import pandas as pd
from giskard.rag import generate_testset, KnowledgeBase

# Load your knowledge base documents
df = pd.read_csv("path/to/your/knowledge_base.csv")
knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language="en",
    agent_description="A customer support chatbot for company X",
)

RAGET Example

Full v2 docs

👋 Community

We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.

Follow the progress and share feedback: v3 Announcement Β· Roadmap

🌟 Leave us a star! It helps others discover the project and keeps us motivated to build awesome open-source tools. 🌟

❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you get a sponsor badge, can display your company in this README, and have your bug reports prioritized. We also offer one-time sponsorships if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.

💚 Current sponsors

We thank the following companies for sponsoring our project with monthly donations:

Lunary


Biolevate
