LLM-as-a-Judge · BERTScore · Reproducible

Measure the reliability of your LLM chatbot

Open-source Python framework to assess the factual accuracy, semantic consistency and robustness of any LLM-based chatbot — in an automated, comparable and reproducible way.


Send test scenarios, let a judge model grade the answers with the help of computational metrics, and get structured reports — ready to run as a quality gate in your CI pipeline.

Three dimensions of reliability

Factual accuracy

Does the chatbot answer correctly? Responses are compared against a verifiable ground truth, with trap scenarios to surface hallucination.
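
As a rough illustration of comparing a response against a ground truth, the sketch below computes a token-level F1 score. This is a simple lexical stand-in, not the judge model or BERTScore that the framework relies on, and the function names `tokens` and `token_f1` are hypothetical.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between an answer and a reference (illustrative only)."""
    a, g = tokens(answer), tokens(ground_truth)
    if not a or not g:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(a) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A lexical score like this penalizes correct answers that are phrased differently from the reference, which is one reason to pair an embedding-based metric with a judge model.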

Scenario bank

Semantic consistency

Does the same question, asked in different ways, get the same answer? The framework sends rephrasings and compares the outputs with each other.
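
The idea of comparing the outputs with each other can be sketched as a mean pairwise similarity over the answers to all rephrasings. The example assumes a character-level similarity from Python's standard `difflib` as a placeholder for a semantic metric such as BERTScore; `similarity` and `consistency_score` are hypothetical names, not part of the framework.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a placeholder for a semantic metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consistency_score(answers: list[str]) -> float:
    """Mean similarity over every unordered pair of answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        # Zero or one answer: there is nothing to disagree with.
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

Averaging over all pairs, rather than comparing each answer only to the first one, avoids privileging any single phrasing of the question.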

How we evaluate

Robustness

Does quality hold up against noise, typos and adversarial inputs? The variants are compared against the original answer.
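
One way to produce a typo variant of a scenario is to swap adjacent letters at random. The sketch below is an assumption about what such a perturbation could look like, not the framework's own generator; `add_typos` and its `rate` and `seed` parameters are hypothetical. A fixed seed keeps the variants reproducible between runs.

```python
import random


def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate`, using a seeded generator."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)
```

The perturbed question is then sent to the chatbot, and its answer is compared against the answer to the original question.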

Human validation

Get started in 2 minutes

Install

pip install llm-eval-unb

Configure (config.yaml)

provider:
  type: gemini
  model: gemini-2.0-flash
  api_key: ${GEMINI_API_KEY}
judge:
  model: gemini-2.0-flash
dimensions: [factual, consistency, robustness]
output_dir: results/

Run

export GEMINI_API_KEY=...      # your key via environment variable
llm-eval run --config config.yaml
# → results/report.md  +  results/report.json

CI-ready

The same command becomes a quality gate that blocks quality regressions on every pull request. See the gate calibration guide.
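
The general idea of a threshold gate can be sketched as follows. How the tool itself decides to fail is covered in the calibration guide; this example only illustrates the concept and assumes a hypothetical layout for `results/report.json` (a top-level `scores` object with one number per dimension) and made-up thresholds.

```python
import json
from pathlib import Path

# Hypothetical minimum scores per dimension; real values need calibration.
THRESHOLDS = {"factual": 0.8, "consistency": 0.75, "robustness": 0.7}


def failing_dimensions(scores: dict[str, float],
                       thresholds: dict[str, float]) -> list[str]:
    """Return the dimensions whose score is below its threshold.

    A dimension missing from `scores` is treated as 0.0 and therefore fails.
    """
    return sorted(
        dim for dim, minimum in thresholds.items()
        if scores.get(dim, 0.0) < minimum
    )


def main(report_path: str = "results/report.json") -> int:
    """Read the report and return a process exit code (0 = pass, 1 = fail)."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # ASSUMPTION: the report stores per-dimension scores under "scores".
    failed = failing_dimensions(report.get("scores", {}), THRESHOLDS)
    for dim in failed:
        print(f"quality gate failed: {dim} is below {THRESHOLDS[dim]}")
    return 1 if failed else 0
```

In a pipeline, a script like this would end with `sys.exit(main())`, so that a non-zero exit code marks the pull request check as failed.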

Where to go next

Scenario methodology

How the bank was built and how to point to your own bank via scenarios_path.

Human validation

Human × judge agreement protocol and the annotation templates.
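
Agreement between human annotators and a judge model is often summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This page does not say which statistic the protocol uses, so the sketch below is only one common option; `cohen_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's label frequencies, summed
    # over labels (Counter returns 0 for labels a rater never used).
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so perfect agreement is reported by convention here.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates that the judge tracks the human labels closely, while a value near 0 means the agreement is no better than chance.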

Quality gate in CI

Calibrate when and how the gate fails in your pipeline.

Code & issues

Contribute, open issues or explore the source code on GitHub.


Documentation in progress

This is the first version of the site. Quickstart, configuration reference, providers and evaluation details are coming next (issue #71).

llm-eval started as an undergraduate thesis (TCC) in Software Engineering at the University of Brasília (UnB), filling a practical gap: applying LLM evaluation metrics and criteria in a systematic, accessible way.

The guide pages below are currently available in Portuguese; English translations are on the way.