LLM-as-a-Judge · BERTScore · Reproducible

Measure the reliability of your LLM chatbot

Open-source Python framework to assess the factual accuracy, semantic consistency and robustness of any LLM-based chatbot — in an automated, comparable and reproducible way.

Send test scenarios, let a judge model grade the answers with support from computational metrics such as BERTScore, and get structured reports. The same run can serve as a quality gate in your CI pipeline.

Three dimensions of reliability

  •  Factual accuracy


    Does the chatbot answer correctly? Responses are compared against a verifiable ground truth, with trap scenarios to surface hallucination.

    Scenario bank

  •  Semantic consistency


    Does the same question, asked in different ways, get the same answer? The framework sends rephrasings and compares the outputs with each other.

    How we evaluate

  •  Robustness


    Does quality hold up against noise, typos and adversarial inputs? The answers to the perturbed variants are compared against the answer to the original prompt.

    Human validation
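
As a rough illustration of the consistency and robustness comparisons described above, the sketch below scores answers by similarity. It is not llm-eval's implementation: the function names are hypothetical, and difflib's character-level ratio stands in for a semantic metric such as BERTScore only so that the example runs with the standard library alone.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a semantic metric would replace this."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise similarity across answers to rephrasings of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def robustness_score(original: str, variant_answers: list[str]) -> float:
    """Mean similarity between the original answer and each perturbed-input answer."""
    if not variant_answers:
        return 1.0
    total = sum(similarity(original, v) for v in variant_answers)
    return total / len(variant_answers)
```

Consistency compares the rephrasing answers with each other, while robustness anchors every variant to a single reference answer, which mirrors the distinction drawn in the two descriptions above.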

Get started in 2 minutes

pip install llm-eval-unb

# config.yaml
provider:
  type: gemini
  model: gemini-2.0-flash
  api_key: ${GEMINI_API_KEY}
judge:
  model: gemini-2.0-flash
dimensions: [factual, consistency, robustness]
output_dir: results/

export GEMINI_API_KEY=...      # your key via environment variable
llm-eval run --config config.yaml
# → results/report.md  +  results/report.json
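
The `${GEMINI_API_KEY}` placeholder keeps the key out of the config file. How llm-eval resolves it internally is not documented here, so the following is only a sketch of the general pattern, with a hypothetical helper name: substitute `${NAME}` from an environment mapping and fail early when a variable is missing.

```python
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(text: str, env: dict[str, str]) -> str:
    """Replace each ${NAME} with env[NAME]; raise KeyError if NAME is absent."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return PLACEHOLDER.sub(substitute, text)
```

Passing `dict(os.environ)` as `env` checks a config against the current shell, which can catch a forgotten `export` before any API call is made.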

CI-ready

The same command becomes a quality gate that blocks quality regressions on every pull request. See the gate calibration guide.
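
To make the gate idea concrete, here is a minimal sketch of the decision a CI step has to make: compare per-dimension scores against minimum thresholds and return a non-zero exit code on failure. The function names, the score dictionary shape and the thresholds are assumptions for illustration; the actual report schema and gate options are covered in the gate calibration guide.

```python
import sys


def failing_dimensions(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return, in sorted order, the dimensions whose score is below its minimum."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # a missing score counts as 0.0
    )


def gate_exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Report each failing dimension on stderr and return 1 on failure, else 0."""
    failures = failing_dimensions(scores, thresholds)
    for name in failures:
        print(
            f"FAIL {name}: {scores.get(name, 0.0):.2f} < {thresholds[name]:.2f}",
            file=sys.stderr,
        )
    return 1 if failures else 0
```

A pipeline step would end with `sys.exit(gate_exit_code(scores, thresholds))`, since most CI systems treat a non-zero exit code as a failed check. Treating a missing score as 0.0 is a deliberately strict choice, so that a dimension silently dropped from the run cannot pass the gate.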

Where to go next

  •  Scenario methodology


    How the bank was built and how to point to your own bank via scenarios_path.

    Open

  •  Human validation


    Human × judge agreement protocol and the annotation templates.

    Open

  •  Quality gate in CI


    Calibrate when and how the gate fails in your pipeline.

    Open

  •  Code & issues


    Contribute, open issues or explore the source code on GitHub.

    Repository

Documentation in progress

This is the first version of the site. Quickstart, configuration reference, providers and evaluation details are coming next (issue #71).


llm-eval started as an undergraduate thesis (TCC) in Software Engineering at the University of Brasília (UnB), filling a practical gap: applying LLM evaluation metrics and criteria in a systematic, accessible way.

The guide pages linked above are currently available in Portuguese; English translations are on the way.